Here is an example from the internet:
# Simulate from probit model --
simulate <- function(N) {
  x1 <- 2*runif(N)
  x2 <- 5*runif(N)
  ystar <- 7 + 3*x1 - 4*x2 + rnorm(N)
  y <- as.numeric(ystar > 0)
  print(table(y))
  data.frame(y, x1, x2, ystar)
}
> # A simple example of estimation --
> D <- simulate(200)
y
  0   1
 97 103
> m <- glm(y~x1+x2, family=binomial(probit), data=D)
> summary(m)
Call:
glm(formula = y ~ x1 + x2, family = binomial(probit), data = D)
Deviance Residuals:
       Min          1Q      Median          3Q         Max
-2.125e+00  -7.332e-04   2.107e-08   7.812e-04   2.326e+00
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   8.0810     1.6723   4.832 1.35e-06 ***
x1            3.1877     0.6948   4.588 4.47e-06 ***
x2           -4.4544     0.8488  -5.248 1.54e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 277.079 on 199 degrees of freedom
Residual deviance: 40.845 on 197 degrees of freedom
AIC: 46.845
Number of Fisher Scoring iterations: 10
For this one dataset of 200 observations, the estimates (8.08, 3.19, -4.45) came back quite close to the true parameters (7, 3, -4).
Let us look at how well this model predicts in-sample:
> # Predictions using the model --
> predictions <- predict(m)
> summary(predictions)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-13.0800 -4.1720 0.1780 0.3705 5.1250 12.7400
> yhat <- ifelse(predictions>0, 1, 0)
> table(yhat, D$y)
yhat  0  1
   0 93  5
   1  4 98
It does pretty well. For 93 cases, the truth is 0 and the predicted value is 0. Similarly, for 98 cases, the truth is 1 and the predicted value is 1. Only 9 cases are misclassified.
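Note that predict(m) returns the linear predictor, not probabilities, so the cutoff at 0 above corresponds to a fitted probability of 0.5 (since pnorm(0) = 0.5). A minimal sketch of the same classification on the probability scale, reusing the model m and data D from above:
> phat <- predict(m, type="response")  # fitted probabilities
> yhat2 <- ifelse(phat > 0.5, 1, 0)
> mean(yhat2 == D$y)                   # in-sample classification accuracy
This yields exactly the same classification as thresholding the linear predictor at 0.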
Tuesday, April 13, 2010
Music Gallery
Last Updated: Mar 23, 2011
Are You the One? --- Timo Tolkki
As Love Is My Witness --- Westlife
Bad Day --- Daniel Powter
Beauty and the Beast --- Celine Dion
Because You Loved Me --- Celine Dion
Big Girls Don't Cry --- Fergie
Born to Make You Happy --- Britney Spears
Can You Feel the Love Tonight? --- Elton John
Can't Get Over You --- September
Casablanca --- Bertie Higgins
Complicated --- Avril Lavigne
Don't Be Stupid --- Shania Twain
Don't Go Away --- BY2 (Chinese version)
Don't Go Away --- BY2 (English version)
That Don't Impress Me Much --- Shania Twain
(Everything I Do) I Do It for You --- Bryan Adams
Everytime --- Britney Spears
Eyes on Me --- Celine Dion
Falling for You --- Colbie Caillat
From This Moment On --- Shania Twain
Fuckin' Perfect --- P!nk
Girlfriend --- Avril Lavigne
God Is a Girl --- Groove Coverage
Heaven Love --- Ying Wang (Chinese)
How Do I Live --- LeAnn Rimes
How Have You Been --- S.H.E. (Chinese)
Hui Hu Xi de Tong --- Jinru Liang (Chinese)
I'm Your Angel --- Celine Dion and R. Kelly
I'm Yours --- Jason Mraz
I'll Be There for You --- The Rembrandts
I Love You --- S.H.E. (Chinese)
I See You --- Leona Lewis
I Surrender --- Celine Dion
I Wanna Know What Love Is --- Mariah Carey
I Will Always Love You --- Whitney Houston
If That's What It Takes --- Celine Dion
It's All Coming Back to Me Now --- Celine Dion
K-Lovers --- Pin and Liang (Chinese)
Love Story --- Taylor Swift
Loving Loving You --- Rene Liu (Chinese)
Lucky --- Jason Mraz and Colbie Caillat
Make It Possible --- Xiaoming Huang (Chinese)
Miracle --- Celine Dion
Missing You (Heart-Broken) --- Jie Xu (Chinese)
Missing You Suddenly --- Youjia Lin (Chinese; originally by Wuyuetian)
My Heart Will Go On --- Celine Dion
Need You Now --- Lady Antebellum
Nobody But You --- Wonder Girls
Only Love --- Trademark
Our Song --- Taylor Swift
Realize --- Colbie Caillat
Shadows --- Westlife
She's Gone --- Steelheart
That's the Way It Is --- Celine Dion
The Power of Love --- Celine Dion
There You'll Be --- Faith Hill
To Love You More --- Celine Dion
Truly Madly Deeply --- Savage Garden
U Make Me Wanna --- Blue
Up! --- Shania Twain
Wait for You --- Elliott Yamin
What About Now --- Westlife
When a Man Loves a Woman --- Michael Bolton
What Happened to Us --- S.H.E. (Chinese)
When I Fall in Love --- Celine Dion
When I Need You --- Celine Dion
When You're Gone --- Avril Lavigne
Where We Are --- Westlife
You and I --- Celine Dion
You and I Both --- Jason Mraz
You Belong with Me --- Taylor Swift
You Are Not Alone --- Michael Jackson
You're Still the One --- Shania Twain
You Raise Me Up --- Josh Groban
You've Got a Way --- Shania Twain
Monday, April 12, 2010
A New Era: A Beautiful Mind
Look at this guy, who is changing the world in ways beyond your imagination. The world is so beautiful because of beautiful minds like his, minds on the way to making it perfect. Be creative and be valuable to the world. Start making a change now, for the better!
The MIT Genius
http://v.youku.com/v_show/id_XMTQ0MTM5Njg0.html
Thursday, April 8, 2010
R: Linear Regression Example of Faraway (2002)
See p. 23 of Faraway (2002), Practical Regression and Anova using R, at:
http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
> library(faraway)  # assumed here: the gala data ships with Faraway's package
> data(gala)
> gala
> gfit <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
> summary(gfit)
The following computes some of these statistics with hand-written R code, purely for illustration (the function lm() already provides them).
> x <- cbind(1, gala[,-c(1,2)])  # add intercept column, drop Species and Endemics
> y <- gala$Species
> t(x) %*% x
Error: %*% requires numeric matrix/vector arguments
> x <- as.matrix(x)              # cbind() made a data frame; coerce to a matrix
> t(x) %*% x
> xtxi <- solve(t(x) %*% x)      # (X'X)^(-1)
> xtxi
> # Get it using lm()
> gfit <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
> gs <- summary(gfit)
> gs$cov.unscaled
> names(gs)
> names(gfit)
> gfit$fit
> gfit$res
> # Get beta_hat by
> xtxi %*% t(x) %*% y  # or
> solve(t(x) %*% x, t(x) %*% y)
> sqrt(sum(gfit$res^2)/(30-6))   # sigma-hat: n = 30 islands, p = 6 coefficients
> sqrt(diag(xtxi))*60.975        # standard errors (60.975 is sigma-hat from above)
> # R^2
> 1-sum(gfit$res^2)/sum((y-mean(y))^2)
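One small extension (my addition, not in Faraway's notes): rather than hard-coding 60.975, sigma-hat can be computed and reused, assuming x, y, xtxi, and gfit from above are still in the workspace:
> sigma.hat <- sqrt(sum(gfit$res^2)/gfit$df.residual)  # df.residual = 30 - 6 = 24
> se <- sqrt(diag(xtxi)) * sigma.hat                   # standard errors
> cbind(estimate = coef(gfit), se = se)                # compare with summary(gfit)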
Monday, April 5, 2010
R: replacing a for loop with apply
The function "apply" in R can help to save a lot computational time when it is replacing for loops. The following is just an example illustrating this point.
# Version 1: nested for loops
time1 <- proc.time()
I <- 50; B <- 50
for (i in 1:I) {
  for (b in 1:B) { x <- matrix(rnorm(2500), nrow=50); eigen(x) }
}
time1 <- proc.time() - time1

# Version 2: nested apply calls
time2 <- proc.time()
I <- matrix(1:50, nrow=1); B <- matrix(1:50, nrow=1)
eig2 <- function(x1=1) {
  x <- matrix(rnorm(2500), nrow=50)
  eigen(x)
}
eig1 <- function(x2=1) { apply(B, 2, eig2) }  # MARGIN=2: one call per element
apply(I, 2, eig1)
time2 <- proc.time() - time2
time1; time2
# Timings reported in the original post (which used MARGIN=1):
#> time1
#  user  system elapsed
# 20.62    0.11   20.97
#> time2
#  user  system elapsed
#  0.07    0.11    0.17
# Caveat: with MARGIN=1, a 1-row matrix triggers only ONE call to eig1/eig2,
# so the apply version originally performed 1 eigen() instead of 2500;
# with MARGIN=2, as above, both versions do the same work.
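As a more idiomatic variant (my addition, not in the original post), the repeated work can be wrapped in a function and timed with system.time(); replicate(), a convenience wrapper around sapply(), then replaces the explicit loop:
one.rep <- function(...) eigen(matrix(rnorm(2500), nrow=50))
system.time(for (i in 1:2500) one.rep())  # explicit loop
system.time(replicate(2500, one.rep()))   # replicate/sapply
Since almost all of the time is spent inside eigen(), the two versions typically differ little; here the apply family mainly buys clarity rather than raw speed.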
Sunday, April 4, 2010
A plot example
This example is taken from "Introduction to the R Project for Statistical Computing for Use at the ITC" by David Rossiter, p. 133, http://cran.r-project.org/doc/contrib/Rossiter-RIntro-ITC.pdf
n <- 30; mu <- 180; sd <- 20; bin.width <- sd/3
v <- rnorm(n, mu, sd)
hist(v, breaks = seq(mu-5*sd, mu+5*sd, by=bin.width))
x <- seq(mu-5*sd, mu+5*sd, length=120)  # x grid; this line was lost in the original listing
points(x, dnorm(x, mu, sd)*(n*bin.width))
### visualise the variability of small random samples
# sample size
n <- 30
# number of plot rows, columns
rows <- 2; cols <- 2;
reps <- rows*cols
# parameters of the normal distribution
mu <- 180; sd <- 20
# set up graphic display
par(mfrow=c(rows, cols))
# number of s.d.'s for histogram displays
sdd <- 3.5
# compute bin width from s.d. and
# the number of bars for each
bin.width <- sd/3
# scale x-axis
x.min <- mu-(sdd*sd); x.max <- mu+(sdd*sd)
# scale y-axis
y.max <- n*0.5*bin.width/sd
# compute and display each graph
for (i in 1:reps)
{
v <- rnorm(n, mu, sd)
hist(v, xlim=c(x.min,x.max),
ylim=c(0, y.max),breaks = seq(mu-5*sd, mu+5*sd, by=bin.width),
main="", xlab=paste("Sample",i)) ;
x <- seq(x.min, x.max, length=120)
# true normal distribution
points(x,dnorm(x, mu, sd)*(n*bin.width),type="l", col="blue", lty=1, lwd=1.8)
# distribution estimated from sample
points(x,dnorm(x, mean(v), sd(v))*(n*bin.width),type="l", col="red", lty=2, lwd=1.8)
# print sample params.
# and Pr(Type I error)
text(x.min, 0.9*y.max, paste("mean:", round(mean(v),2)),pos=4)
text(x.min, 0.8*y.max, paste("sdev:", round(sd(v),2)),pos=4)
text(x.min, 0.7*y.max,paste("Pr(t):", round((t.test(v, mu=mu))$p.value,2)),pos=4)
}
# clean up
par(mfrow=c(1,1))
rm(n, rows, cols, reps, mu, sd, v, i, sdd, bin.width, x.min, x.max, y.max, x)
tapply for aggregate statistics
The apply family of functions can replace a for loop and lead to faster computing. This example shows how to use tapply() to compute group (aggregate) statistics.
Use tapply for most basic 'by' processing (by processing = aggregate statistics = stratified estimates; statistics computed after cross-classifying the data)
> y <- 1:8
> sex <- c(rep('male',4), rep('female',4))
> treat <- rep(c('A','B'),4)
> sex
[1] "male" "male" "male" "male" "female" "female" "female" "female"
> treat
[1] "A" "B" "A" "B" "A" "B" "A" "B"
> tapply(y, sex, mean)
female male
6.5 2.5
> tapply(y, treat, mean)
A B
4 5
> tapply(y, list(sex,treat), mean)
A B
female 6 7
male 2 3
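A sketch of the same cross-classified means using aggregate(), which returns a data frame rather than the matrix that tapply produces (this alternative is my addition, not part of the original note):
> aggregate(y, by=list(sex=sex, treat=treat), FUN=mean)
     sex treat x
1 female     A 6
2   male     A 2
3 female     B 7
4   male     B 3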
Subsetting data frame in R
The following is taken from “Statistical Computing and Graphics Course Notes” by Frank E. Harrell. http://cran.r-project.org/doc/contrib/Harrell-statcomp-notes.pdf
> # Subset a simple vector
> x1 <- 1:4
> sex <- rep(c('male','female'),2)
> subset(x1, sex=='male')
[1] 1 3
> # Subset a data frame
> d <- data.frame(x1=x1, x2=(1:4)/10, x3=(11:14), sex=sex)
> d
x1 x2 x3 sex
1 1 0.1 11 male
2 2 0.2 12 female
3 3 0.3 13 male
4 4 0.4 14 female
> subset(d, sex=='male')
x1 x2 x3 sex
1 1 0.1 11 male
3 3 0.3 13 male
> subset(d, sex=='male' & x2>0.2)
x1 x2 x3 sex
3 3 0.3 13 male
> subset(d, x1>1, select=-x1)
x2 x3 sex
2 0.2 12 female
3 0.3 13 male
4 0.4 14 female
> subset(d, select=c(x1,sex))
x1 sex
1 1 male
2 2 female
3 3 male
4 4 female
> subset(d, x2<0.3, select=x2:sex)
x2 x3 sex
1 0.1 11 male
2 0.2 12 female
> subset(d, x2<0.3, -(x3:sex))
x1 x2
1 1 0.1
2 2 0.2
> attach(subset(d, sex=='male' & x3==11, x1:x3))
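For comparison (my addition, not from Harrell's notes), the same subsets can be written with base bracket indexing, which is essentially what subset() does for you:
> d[d$sex=='male' & d$x2>0.2, ]
  x1  x2 x3  sex
3  3 0.3 13 male
> d[d$x1>1, c('x2','x3','sex')]
One practical difference: bracket indexing keeps rows where the condition evaluates to NA, while subset() silently drops them.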
Friday, April 2, 2010
Text Processing using LaTeX
Learning LaTeX from Cambridge at http://www-h.eng.cam.ac.uk/help/tpl/textprocessing/LaTeX_intro.html
R: snow Simplified
The 'snow' package provides a high-level interface for using a workstation cluster for parallel computations in R. Here is a friendly user guide by Sigal Blay (sblay@sfu.ca).
http://www.sfu.ca/~sblay/R/snow.html
The following is the text file:
snow Simplified
~~~~~~~~~~~~~~~
Many computationally demanding statistical procedures, such
as bootstrapping and Markov chain Monte Carlo, can be sped
up significantly by using several connected computers in
parallel.
The package snow (an acronym for Simple Network Of
Workstations) provides a high-level interface for using a
workstation* cluster** for parallel computations*** in R.
snow relies on the Master / Slave model of communication in
which one device or process (known as the master) controls
one or more other devices or processes (known as slaves).
In Los Angeles, officials pointed out that such terms as
"master" and "slave" are unacceptable and offensive, and
equipment manufacturers were asked not to use these terms.
Some manufacturers changed the wording to primary / secondary
(Source: CNN).
snow Simplified is an adaptation of an article by Anthony
Rossini, Luke Tierney and Na Li, 'Simple parallel statistical
computing in R'.
It has been made by Sigal Blay ( http://www.sfu.ca/~sblay ).
This work has been made possible by the Statistical Genetics
Working Group at the Department of Statistics and Actuarial
Science, SFU.
-----------------------------------------------------------------
snow implements an interface to three different low level
mechanisms for creating a virtual connection between processes:
Socket
PVM (Parallel Virtual Machine)
MPI (Message Passing Interface)
--- Starting and Stopping clusters ---
The way to initialize slave R processes depends on your system
configuration. If MPI is installed and there are two computation
nodes, use:
> cl <- makeCluster(2, type = "MPI")
Shut down the cluster and clean up any remaining connections
between machines:
> stopCluster(cl)
--- clusterCall(cl, fun, ...) ---
cl: the computer cluster created with makeCluster.
fun: the function to be applied.
clusterCall calls a specified function with identical arguments
on each node in the cluster.
The second argument is a function object.
The arguments to clusterCall are evaluated on the master,
their values transmitted to the slave nodes which execute the
function call.
> myfunc <- function(x=2){x+1}
> myfunc_argument <- 5
> clusterCall(cl, myfunc, myfunc_argument)
is the same as
> clusterCall(cl, function(x=2){x+1}, 5)
and the result:
[[1]]
[1] 6
[[2]]
[1] 6
--- clusterEvalQ(cl, expr) ---
cl: the computer cluster created with makeCluster.
expr: an expression, typically a function call.
clusterEvalQ evaluates a literal expression on each cluster node.
'expr' is captured unevaluated on the master;
the expression is evaluated on the slave nodes.
Finding the host name for each cluster node:
> clusterEvalQ(cl, Sys.getenv("HOST"))
[[1]]
HOST
"stat-db"
[[2]]
HOST
"stat-db1"
Loading the boot package on all cluster nodes:
> clusterEvalQ(cl, library(boot))
clusterEvalQ is the cluster version of the R function 'evalq'.
--- clusterApply(cl, seq, fun, ...) ---
clusterApply takes a cluster, a sequence of arguments (can be a
vector or a list), and a function, and calls the function with
the first element of the list on the first node, with the second
element of the list on the second node, and so on. The list of
arguments must have at most as many elements as there are nodes
in the cluster.
> clusterApply(cl, 1:2, sum, 3)
[[1]]
[1] 4
[[2]]
[1] 5
--- clusterApplyLB(cl, seq, fun, ...) ---
clusterApplyLB is a load-balancing version of clusterApply.
It hands out a balanced workload to the slave nodes when
the length of seq is greater than the number of cluster nodes.
Doesn't work if the cluster type is Socket.
> length(cl)
[1] 2
> clusterApplyLB(cl, 1:3, sum, 3)
[[1]]
[1] 4
[[2]]
[1] 5
[[3]]
[1] 6
Using clusterApplyLB can result in better cluster utilization
than using clusterApply. However, increasing communication can
reduce performance. Furthermore, the node that executes a
particular job is nondeterministic, which can complicate
ensuring reproducibility in simulations.
--- clusterExport(cl, list) ---
Assigns the global values on the master of the variables named in
list to variables of the same names in the global environments of
each node.
> myvar <- 666
> clusterExport(cl, "myvar")
--- clusterSplit(cl, seq) ---
clusterSplit splits seq into one consecutive piece for each
cluster and returns the result as a list with length equal to the
number of cluster nodes. The pieces are chosen to be close to
equal in length.
> clusterSplit(cl, 1:6)
[[1]]
[1] 1 2 3
[[2]]
[1] 4 5 6
--- parApply(cl, X, MARGIN, fun, ...) ---
X: the array to be used.
MARGIN: a vector giving the subscripts which the function will be
applied over. '1' indicates rows, '2' indicates columns,
'c(1,2)' indicates rows and columns.
fun: the function to be applied.
parApply is the parallel version of the R function apply.
> A <- matrix(1:10, 5, 2)
> A
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
> parApply(cl, A, 1, sum)
[1] 7 9 11 13 15
--- parRapply(cl, X, fun, ...) ---
Row parallel 'apply' function for matrix X.
May be slightly more efficient than parApply.
--- parCapply(cl, X, fun, ...) ---
Column parallel 'apply' function for matrix X.
May be slightly more efficient than parApply.
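For example (my addition), the row sums of the matrix A above can be
computed with:
> parRapply(cl, A, sum)
[1]  7  9 11 13 15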
--- parLapply(cl, X, fun, ...) ---
parLapply is the parallel version of the R function 'lapply'.
'lapply' returns a list of the same length as 'X', each element
of which is the result of applying 'fun' to the corresponding
element of 'X'.
Create a list of two sequences of numbers:
> x <- list(alpha = 1:10, beta = exp(-3:3))
> x
$alpha
[1] 1 2 3 4 5 6 7 8 9 10
$beta
[1] 0.04978707 0.13533528 0.36787944 1.00000000 2.71828183
[7] 7.38905610 20.08553692
Calculate quantiles for each sequence - parLapply returns a list:
> parLapply(cl, x, quantile)
$alpha
0% 25% 50% 75% 100%
1.00 3.25 5.50 7.75 10.00
$beta
0% 25% 50% 75% 100%
0.04978707 0.25160736 1.00000000 5.05366896 20.08553692
--- parSapply(cl, X, fun, ..., simplify=TRUE, USE.NAMES=TRUE) ---
parSapply is the parallel version of the R function 'sapply'.
'sapply' is a "user-friendly" version of 'lapply'.
It returns a vector or matrix with 'dimnames' if appropriate.
Calculate quantiles for each sequence - parSapply returns a matrix:
> parSapply(cl, x, quantile)
alpha beta
0% 1.00 0.04978707
25% 3.25 0.25160736
50% 5.50 1.00000000
75% 7.75 5.05366896
100% 10.00 20.08553692
--- parMM(cl, A, B) ---
parMM performs parallel matrix multiplication.
A and B are the matrices to be multiplied.
--- How much faster is using snow? ---
To compare performance, use the command system.time().
Here is an example:
Create a matrix of normally distributed random numbers:
> A <- matrix(rnorm(1000000), 1000)
Check CPU time for non-parallel matrix multiplication:
> system.time(A %*% A)
[1] 5.33 0.02 5.36 0.00 0.00
elapsed time was 5.36 seconds
... And for parallel matrix multiplication:
> system.time(parMM(cl, A, A))
[1] 2.49 2.09 9.16 0.00 0.00
elapsed time was 9.16 seconds.
Even in dedicated clusters with high speed networking,
communication is orders of magnitude slower than computation.
In order to reduce communication costs, eliminate unnecessary data
transfer.
--- Random Number Generation ---
The default random number generators are likely to be correlated:
> clusterCall(cl, runif, 3)
[[1]]
[1] 0.4960831 0.5095497 0.2971236
[[2]]
[1] 0.4960831 0.5095497 0.2971236
A way of addressing this is random seeding:
clusterApply(cl, runif(length(cl), max=100000), set.seed)
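A more robust fix (my addition; it assumes the rlecuyer package is
installed) is snow's built-in support for independent random number
streams:
> library(rlecuyer)
> clusterSetupRNG(cl, type="RNGstream")
> clusterCall(cl, runif, 3)  # the streams on the nodes are now independent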
References:
Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel
statistical computing in R. UW Biostatistics working paper series,
Paper 193, University of Washington.
http://www.bepress.com/uwbiostat/paper193.
Tierney, L., Rossini, A., Li, N., and Sevcikova, H. (2004).
The snow Package: Simple Network of Workstations. Version 0.2-1.
http://cran.r-project.org/doc/packages/snow.pdf
--------------------------------------------------------------------
Workstation - A workstation is a computer intended for individual
use that is faster and more capable than a personal computer,
i.e., it has a faster microprocessor, a large amount of random access
memory (RAM), and special features such as high-speed graphics
adapters.
Source: whatis.com
http://whatis.techtarget.com/
Cluster - In a computer system, a cluster is a group of servers
and other resources that act like a single system and enable high
availability and, in some cases, load balancing and parallel
processing.
Source: whatis.com
http://whatis.techtarget.com/
Parallel Computing / Parallel Processing -
The simultaneous use of more than one computer to execute a program.
Objectives: scalability (the capability to handle larger workloads)
and reduced execution time.
Source: bitpipe
http://www.bitpipe.com/
Matlab is Great; R is Exciting!
Preparing the tutorial "Introductory Econometrics with Examples in R" has gotten me really excited. R is a very powerful language (some people claim it is only an environment, which is true if you use only the built-in functions and available packages; it can actually do much, much more). It can do whatever you can imagine: solving equations, optimization, numerical integration, Monte Carlo, statistical inference, Bayesian analysis, MCMC, the bootstrap, EN, sparse PCA, time series analysis, panel data, spatial analysis, production analysis, graphical analysis, animation, clustering, parallel computing, and more. It can also do many things that you cannot even imagine, such as archetypal analysis, fuzzy knowledge-base learning, and forensic genetics. Just check the online packages at
http://cran.r-project.org/index.html
Getting excited, I've decided to write or collect one example here per day to keep practicing and learning R. I hope this becomes a habit and attracts interested readers as well.