Friday, April 23, 2010

R: Probit Model

Here is an example from the internet:

# Simulate from a probit model --
simulate <- function(N) {
  x1 <- 2*runif(N)
  x2 <- 5*runif(N)
  ystar <- 7 + 3*x1 - 4*x2 + rnorm(N)   # latent variable
  y <- as.numeric(ystar > 0)            # observed binary outcome
  print(table(y))
  data.frame(y, x1, x2, ystar)
}

> # A simple example of estimation --
> D <- simulate(200)
y
  0   1 
 97 103 
> m <- glm(y~x1+x2, family=binomial(probit), data=D)

> summary(m)
Call:
glm(formula = y ~ x1 + x2, family = binomial(probit), data = D)
Deviance Residuals:
       Min          1Q      Median          3Q         Max
-2.125e+00  -7.332e-04   2.107e-08   7.812e-04   2.326e+00

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   8.0810     1.6723   4.832 1.35e-06 ***
x1            3.1877     0.6948   4.588 4.47e-06 ***
x2           -4.4544     0.8488  -5.248 1.54e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 277.079 on 199 degrees of freedom
Residual deviance: 40.845 on 197 degrees of freedom
AIC: 46.845

Number of Fisher Scoring iterations: 10

For this one dataset of 200 observations, the estimated coefficient vector (8.08, 3.19, -4.45) is reasonably close to the true value (7, 3, -4) used in the simulation.

Let us look at how well this model predicts in-sample:
> # Predictions using the model --
> predictions <- predict(m)
> summary(predictions)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-13.0800 -4.1720 0.1780 0.3705 5.1250 12.7400
> yhat <- ifelse(predictions>0, 1, 0)
> table(yhat, D$y)
yhat  0  1
   0 93  5
   1  4 98

It does pretty well. For 93 cases, the truth is 0 and the predicted value is 0. Similarly, for 98 cases, the truth is 1 and the predicted value is 1. Only 9 cases are misclassified.
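Note that predict(m) returns the linear predictor by default; passing type = "response" gives the fitted probabilities instead. A minimal sketch using the fitted model m from above (for a probit link, the probability exceeds 0.5 exactly when the linear predictor is positive, so the classification rule is unchanged):

> # Fitted probabilities rather than the linear predictor
> phat <- predict(m, type = "response")
> yhat2 <- ifelse(phat > 0.5, 1, 0)
> all(yhat2 == yhat)   # phat > 0.5 iff the linear predictor > 0
[1] TRUE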

Tuesday, April 13, 2010

Music Gallery

Last Updated : Mar 23, 2011

Are you the one? --- Timmo Tolkki

As Love Is My Witness --- Westlife

Bad Day --- By Daniel Powter

Beauty and the Beast --- Celine Dion

Because You Loved Me --- Celine Dion

Big Girls Don't Cry --- Fergie

Born to Make you Happy --- Britney Spears

Can you feel the love tonight? --- Elton John

Can't Get Over You --- September

Casablanca --- Bertie Higgins

Complicated --- Avril Lavigne

Don't Be Stupid --- Shania Twain

Don't Go Away --- BY2 (Chinese Version)

Don't Go Away --- BY2 (English Version)

That Don't Impress Me Much --- Shania Twain

(Everything I Do) I Do It For You --- By Bryan Adams

Everytime --- Britney Spears

Eyes on Me --- Celine Dion

Falling for You --- Colbie Caillat

From This Moment On --- Shania Twain

Fuckin' Perfect --- Pink (P!nk)

Girlfriend --- Avril Lavigne

God Is a Girl --- Groove Coverage

Heaven Love --- Ying Wang (Chinese)

How Do I Live --- LeAnn Rimes

How Have You Been --- S.H.E. (Chinese)

Hui Hu Xi de Tong --- Jinru Liang (Chinese)

I'm Your Angel --- Celine Dion and R. Kelly

I'm Yours ---- Jason Mraz

I'll be There for You ---- Rembrandts

I Love You --- S.H.E. (Chinese)

I see you --- Leona Lewis

I Surrender --- Celine Dion

I Wanna Know What Love Is --- Mariah Carey

I Will Always Love you --- Whitney Houston

If That's What Takes --- Celine Dion

It's All Coming Back To Me Now --- Celine Dion

K-Lovers --- Pin and Liang (Chinese)

Love Story --- Taylor Swift

Loving Loving You --- Rene Liu (Chinese)

Lucky --- Jason Mraz and Colbie Caillat

Make It Possible --- By Xiaoming Huang (Chinese)

Miracle ---- Celine Dion

Missing You (Heart-Broken) --- Jie Xu (Chinese)

Missing You Suddenly --- Youjia Lin (Chinese)/Originally by Wuyuetian

My Heart Will Go On --- Celine Dion

Need You now --- Lady Antebellum

Nobody But You --- Wonder Girls

Only Love ---- By Trade Marks

Our Song --- Taylor Swift

Realize ---- By Colbie Caillat

Shadows --- Westlife

She's Gone --- Steel Heart

That's the Way it is! --- Celine Dion

The Power of Love --- Celine Dion

There you'll be ----Faith Hill

To Love You More --- Celine Dion

Truly Madly Deeply --- Savage Garden

U Make Me Wanna --- By Blue-in-one

Up --- Shania Twain

Wait For You --- Elliott Yamin

What About Now --- Westlife

When A Man Loves A Woman --- Michael Bolton

What Happen To Us --- S.H.E. (Chinese)

When I Fall in Love --- Celine Dion

When I need You --- Celine Dion

When You're Gone --- Avril Lavigne

Where We Are --- Westlife

You and I --- Celine Dion

You and I Both --- Jason Mraz

You Belong With Me --- Taylor Swift

You Are Not Alone --- Michael Jackson

You're Still the One! --- Shania Twain

You Raise Me Up --- Josh Groban

You've Got A Way --- Shania Twain

Monday, April 12, 2010

A New Era: A Beautiful Mind

Look at this guy, who is changing the world in a way beyond your imagination. The world is so beautiful because of beautiful minds like this, minds that are working to make it perfect. Be creative and be valuable to the world. Start making a change now, for the better!

The MIT Genius
http://v.youku.com/v_show/id_XMTQ0MTM5Njg0.html

Thursday, April 8, 2010

R: Linear Regression Example of Faraway (2002)

See p. 23 of Faraway (2002), Practical Regression and Anova using R, at:
http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf

> library(faraway)   # the gala data frame ships with the faraway package
> data(gala)
> gala
> gfit <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
> summary(gfit)

The following computes some of these statistics with hand-written R code, purely for illustration (the function lm() already provides all of these estimates):

> x <- cbind(1,gala[,-c(1,2)])
> y <- gala$Species
> t(x) %*% x
Error: %*% requires numeric matrix/vector arguments

> x <- as.matrix(x)
> t(x) %*% x
> xtxi <- solve(t(x) %*% x)
> xtxi

> # Get it using lm()
> gfit <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
> gs <- summary(gfit)
> gs$cov.unscaled
> names(gs)
> names(gfit)
> gfit$fit
> gfit$res

> # Get beta_hat by
> xtxi %*% t(x) %*% y   # or
> solve(t(x) %*% x, t(x) %*% y)

> sqrt(sum(gfit$res^2)/(30-6))   # sigma_hat: n = 30 observations, p = 6 coefficients
> sqrt(diag(xtxi))*60.975        # standard errors of beta_hat, with sigma_hat = 60.975
> # R^2
> 1-sum(gfit$res^2)/sum((y-mean(y))^2)
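As a sanity check, the same quantities can be read off the summary object computed above; a quick sketch:

> # Cross-check the hand-computed values against summary.lm()
> gs$sigma       # residual standard error, matches sqrt(sum(gfit$res^2)/(30-6))
> gs$r.squared   # R^2, matches 1 - sum(gfit$res^2)/sum((y - mean(y))^2)
> coef(gfit)     # beta_hat, matches solve(t(x) %*% x, t(x) %*% y)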

Monday, April 5, 2010

R: replacing for loop by apply

The function "apply" in R can help to save a lot computational time when it is replacing for loops. The following is just an example illustrating this point.

# Version 1: nested for loops
time1 <- proc.time()
I <- 50; B <- 50
for (i in 1:I) {
  for (b in 1:B) {
    x <- matrix(rnorm(2500), nrow = 50)
    eigen(x)
  }
}
time1 <- proc.time() - time1

# Version 2: the same work done via nested apply() calls
time2 <- proc.time()
I <- matrix(1:50, nrow = 1); B <- matrix(1:50, nrow = 1)
eig2 <- function(x1 = 1) {
  x <- matrix(rnorm(2500), nrow = 50)
  eigen(x)
}
eig1 <- function(x2 = 1) { apply(B, 1, eig2) }
apply(I, 1, eig1)
time2 <- proc.time() - time2
time1; time2

#> time1
# user system elapsed
# 20.62 0.11 20.97
#> time2
# user system elapsed
# 0.07 0.11 0.17
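A more compact way to run the same comparison is to wrap each version in system.time(); a sketch, with replicate() as the apply-style counterpart (timings will of course vary by machine):

# Sketch: time 2500 eigendecompositions, for loop vs. replicate()
f <- function() eigen(matrix(rnorm(2500), nrow = 50))
system.time(for (i in 1:2500) f())                              # explicit for loop
system.time(invisible(replicate(2500, f(), simplify = FALSE)))  # apply-family version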

Sunday, April 4, 2010

A plot example

This example is taken from “Introduction to the R Project for Statistical Computing for Use at the ITC” by David Rossiter, p. 133, http://cran.r-project.org/doc/contrib/Rossiter-RIntro-ITC.pdf

n <- 30; mu <- 180; sd <- 20; bin.width <- sd/3
v <- rnorm(n, mu, sd)
hist(v, breaks = seq(mu-5*sd, mu+5*sd, by=bin.width))
x <- seq(mu-5*sd, mu+5*sd, length=120)   # grid for the theoretical density curve
points(x, dnorm(x, mu, sd)*(n*bin.width))

### visualise the variability of small random samples
# sample size
n <- 30
# number of plot rows, columns
rows <- 2; cols <- 2;
reps <- rows*cols
# parameters of the normal distribution
mu <- 180; sd <- 20
# set up graphic display
par(mfrow=c(rows, cols))
# number of s.d.'s for histogram displays
sdd <- 3.5
# compute bin width from s.d. and
# the number of bars for each
bin.width <- sd/3
# scale x-axis
x.min <- mu-(sdd*sd); x.max <- mu+(sdd*sd)
# scale y-axis
y.max <- n*0.5*bin.width/sd

# compute and display each graph
for (i in 1:reps)
{
v <- rnorm(n, mu, sd)
hist(v, xlim=c(x.min, x.max), ylim=c(0, y.max),
     breaks = seq(mu-5*sd, mu+5*sd, by=bin.width),
     main="", xlab=paste("Sample", i))
x <- seq(x.min, x.max, length=120)
# true normal distribution
points(x,dnorm(x, mu, sd)*(n*bin.width),type="l", col="blue", lty=1, lwd=1.8)
# distribution estimated from sample
points(x,dnorm(x, mean(v), sd(v))*(n*bin.width),type="l", col="red", lty=2, lwd=1.8)
# print sample params.
# and Pr(Type I error)
text(x.min, 0.9*y.max, paste("mean:", round(mean(v),2)),pos=4)
text(x.min, 0.8*y.max, paste("sdev:", round(sd(v),2)),pos=4)
text(x.min, 0.7*y.max,paste("Pr(t):", round((t.test(v, mu=mu))$p.value,2)),pos=4)
}

# clean up
par(mfrow=c(1,1))
rm(n, rows, cols, reps, mu, sd, v, i, sdd, bin.width, x.min, x.max, y.max, x)

tapply for aggregate statistics

Using an apply-family function instead of a for loop leads to faster computing. This example shows how to use tapply to compute group (aggregated) statistics.

Use tapply for most basic "by" processing (by processing = aggregate statistics = stratified estimates; statistics computed after cross-classifying the data).

> y <- 1:8
> sex <- c(rep('male',4), rep('female',4))
> treat <- rep(c('A','B'), 4)
> sex
[1] "male" "male" "male" "male" "female" "female" "female" "female"
> treat
[1] "A" "B" "A" "B" "A" "B" "A" "B"
> tapply(y, sex, mean)
female male
6.5 2.5
> tapply(y, treat, mean)
A B
4 5
> tapply(y, list(sex,treat), mean)
       A B
female 6 7
male   2 3
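The same cross-classified means can also be obtained with aggregate(), which returns a data frame rather than a matrix; a quick sketch:

> aggregate(y, by = list(sex = sex, treat = treat), FUN = mean)
     sex treat x
1 female     A 6
2   male     A 2
3 female     B 7
4   male     B 3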

Subsetting data frame in R

The following is taken from “Statistical Computing and Graphics Course Notes” by Frank E. Harrell. http://cran.r-project.org/doc/contrib/Harrell-statcomp-notes.pdf

> # Subset a simple vector
> x1 <- 1:4
> sex <- rep(c('male','female'), 2)
> subset(x1, sex=='male')
[1] 1 3

> # Subset a data frame
> d <- data.frame(x1=x1, x2=(1:4)/10, x3=(11:14), sex=sex)
> d
x1 x2 x3 sex
1 1 0.1 11 male
2 2 0.2 12 female
3 3 0.3 13 male
4 4 0.4 14 female

> subset(d, sex=='male')
x1 x2 x3 sex
1 1 0.1 11 male
3 3 0.3 13 male

> subset(d, sex=='male' & x2>0.2)
x1 x2 x3 sex
3 3 0.3 13 male

> subset(d, x1>1, select=-x1)
x2 x3 sex
2 0.2 12 female
3 0.3 13 male
4 0.4 14 female

> subset(d, select=c(x1,sex))
x1 sex
1 1 male
2 2 female
3 3 male
4 4 female

> subset(d, x2<0.3, select=x2:sex)
x2 x3 sex
1 0.1 11 male
2 0.2 12 female

> subset(d, x2<0.3, -(x3:sex))
x1 x2
1 1 0.1
2 2 0.2

> attach(subset(d, sex=='male' & x3==11, x1:x3))
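For comparison, the same selections can be written with ordinary bracket indexing, which is essentially what subset() does under the hood; a quick sketch using the data frame d from above:

> d[d$sex == 'male' & d$x2 > 0.2, ]
  x1  x2 x3  sex
3  3 0.3 13 male
> d[d$x2 < 0.3, c('x1', 'x2')]
  x1  x2
1  1 0.1
2  2 0.2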

Friday, April 2, 2010

Text Processing using LaTeX

Learning LaTeX from Cambridge at http://www-h.eng.cam.ac.uk/help/tpl/textprocessing/LaTeX_intro.html

R: snow Simplified

The 'snow' package provides a high-level interface for using a workstation cluster for parallel computations in R. Here is a friendly user guide written by Sigal Blay (sblay@sfu.ca).

http://www.sfu.ca/~sblay/R/snow.html

The following is the text file:

snow Simplified
~~~~~~~~~~~~~~~
Many computationally demanding statistical procedures, such
as Bootstrapping and Markov Chain Monte Carlo, can be speeded
up significantly by using several connected computers in
parallel.
The package snow (an acronym for Simple Network Of
Workstations) provides a high-level interface for using a
workstation* cluster** for parallel computations*** in R.
snow relies on the Master / Slave model of communication in
which one device or process (known as the master) controls
one or more other devices or processes (known as slaves).
In Los Angeles, officials pointed out that such terms as
"master" and "slave" are unacceptable and offensive, and
equipment manufacturers were asked not to use these terms.
Some manufacturers changed the wording to primary / secondary
(Source: CNN).
snow Simplified is an adaptation of an article by Anthony
Rossini, Luke Tierney and Na Li, 'Simple parallel statistical
computing in R'.
It has been made by Sigal Blay ( http://www.sfu.ca/~sblay ).
This work has been made possible by the Statistical Genetics
Working Group at the Department of Statistics and Actuarial
Science, SFU.
-----------------------------------------------------------------
snow implements an interface to three different low level
mechanisms for creating a virtual connection between processes:
Socket
PVM (Parallel Virtual Machine)
MPI (Message Passing Interface)
--- Starting and Stopping clusters ---
The way to initialize slave R processes depends on your system
configuration. If MPI is installed and there are two computation
nodes, use:
> cl <- makeCluster(2, type = "MPI")
Shut down the cluster and clean up any remaining connections
between machines:
> stopCluster(cl)
--- clusterCall(cl, fun, ...) ---
cl: the computer cluster created with makeCluster.
fun: the function to be applied.
clusterCall calls a specified function with identical arguments
on each node in the cluster.
The second argument is a function object.
The arguments to clusterCall are evaluated on the master,
their values transmitted to the slave nodes which execute the
function call.
> myfunc <- function(x=2){x+1}
> myfunc_argument <- 5
> clusterCall(cl, myfunc, myfunc_argument)
is the same as
> clusterCall(cl, function(x=2){x+1}, 5)
and the result:
[[1]]
[1] 6
[[2]]
[1] 6
--- clusterEvalQ(cl, expr) ---
cl: the computer cluster created with makeCluster.
expr: an expression, typically a function call.
clusterEvalQ evaluates a literal expression on each cluster node.
'expr' is treated on the master as a character string.
The expression is evaluated on the slave nodes.
Finding the host name for each cluster node:
> clusterEvalQ(cl, Sys.getenv("HOST"))
[[1]]
HOST
"stat-db"
[[2]]
HOST
"stat-db1"
Loading the boot package on all cluster nodes:
> clusterEvalQ(cl, library(boot))
clusterEvalQ is the cluster version of the R function 'evalq'.
--- clusterApply(cl, seq, fun, ...) ---
clusterApply takes a cluster, a sequence of arguments (can be a
vector or a list), and a function, and calls the function with
the first element of the list on the first node, with the second
element of the list on the second node, and so on. The list of
arguments must have at most as many elements as there are nodes
in the cluster.
> clusterApply(cl, 1:2, sum, 3)
[[1]]
[1] 4
[[2]]
[1] 5
--- clusterApplyLB(cl, seq, fun, ...) ---
clusterApplyLB is a load balancing version of clusterApply.
It hands out a balanced workload to the slave nodes when
the length of seq is greater than the number of cluster nodes.
It doesn't work if the cluster type is Socket.
> length(cl)
[1] 2
> clusterApplyLB(cl, 1:3, sum, 3)
[[1]]
[1] 4
[[2]]
[1] 5
[[3]]
[1] 6
Using clusterApplyLB can result in better cluster utilization
than using clusterApply. However, increasing communication can
reduce performance. Furthermore, the node that executes a
particular job is nondeterministic, which can complicate
ensuring reproducibility in simulations.
--- clusterExport(cl, list) ---
Assigns the global values on the master of the variables named in
list to variables of the same names in the global environments of
each node.
> myvar <- 666
> clusterExport(cl, "myvar")
--- clusterSplit(cl, seq) ---
clusterSplit splits seq into one consecutive piece for each
cluster and returns the result as a list with length equal to the
number of cluster nodes. The pieces are chosen to be close to
equal in length.
> clusterSplit(cl, 1:6)
[[1]]
[1] 1 2 3
[[2]]
[1] 4 5 6
--- parApply(cl, X, MARGIN, fun, ...) ---
X: the array to be used.
MARGIN: a vector giving the subscripts which the function will be
applied over. '1' indicates rows, '2' indicates columns,
'c(1,2)' indicates rows and columns.
fun: the function to be applied.
parApply is the parallel version of the R function 'apply'.
> A <- matrix(1:10, 5, 2)
> A
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
> parApply(cl, A, 1, sum)
[1] 7 9 11 13 15
--- parRapply(cl, X, fun, ...) ---
row parallel 'apply' function for matrix X.
May be slightly more efficient than parApply.
--- parCapply(cl, X, fun, ...) ---
Column parallel 'apply' function for matrix X.
May be slightly more efficient than parApply.
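A minimal sketch of both variants, reusing the 5 x 2 matrix A
from the parApply example above:
> parRapply(cl, A, sum)   # row sums
[1]  7  9 11 13 15
> parCapply(cl, A, sum)   # column sums
[1] 15 40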
--- parLapply(cl, X, fun, ...) ---
parLapply is the parallel version of the R function 'lapply'.
'lapply' returns a list of the same length as 'X'. Each element
of which is the result of applying 'fun' to the corresponding
element of 'X'.
Create a list of two sequences of numbers:
> x <- list(alpha = 1:10, beta = exp(-3:3))
> x
$alpha
[1] 1 2 3 4 5 6 7 8 9 10
$beta
[1] 0.04978707 0.13533528 0.36787944 1.00000000 2.71828183
[7] 7.38905610 20.08553692
Calculate quantiles for each sequence - parLapply returns a list:
> parLapply(cl, x, quantile)
$alpha
0% 25% 50% 75% 100%
1.00 3.25 5.50 7.75 10.00
$beta
0% 25% 50% 75% 100%
0.04978707 0.25160736 1.00000000 5.05366896 20.08553692
--- parSapply(cl, X, fun, ..., simplify=TRUE, USE.NAMES=TRUE) ---
parSapply is the parallel version of the R function 'sapply'.
'sapply' is a "user-friendly" version of 'lapply'.
It returns a vector or matrix with 'dimnames' if appropriate.
Calculate quantiles for each sequence - parSapply returns a matrix:
> parSapply(cl, x, quantile)
alpha beta
0% 1.00 0.04978707
25% 3.25 0.25160736
50% 5.50 1.00000000
75% 7.75 5.05366896
100% 10.00 20.08553692
--- parMM(cl, A, B) ---
parMM performs parallel matrix multiplication.
A, B are the matrices to be multiplied.
--- How much faster is using snow? ---
To compare performance, use the command system.time().
Here is an example:
Create a matrix of normally distributed random numbers:
> A <- matrix(rnorm(1000000), 1000)
Check CPU time for serial matrix multiplication:
> system.time(A %*% A)
[1] 5.33 0.02 5.36 0.00 0.00
elapsed time was 5.36 seconds
... And for parallel matrix multiplication:
> system.time(parMM(cl, A, A))
[1] 2.49 2.09 9.16 0.00 0.00
elapsed time was 9.16 seconds.
Even in dedicated clusters with high speed networking,
communication is orders of magnitude slower than computation.
In order to reduce communication costs, eliminate unnecessary data
transfer.
--- Random Number Generation ---
The default random number generators are likely to be correlated:
> clusterCall(cl, runif, 3)
[[1]]
[1] 0.4960831 0.5095497 0.2971236
[[2]]
[1] 0.4960831 0.5095497 0.2971236
A way of addressing this is random seeding:
> clusterApply(cl, runif(length(cl), max=100000), set.seed)
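A more robust alternative is snow's clusterSetupRNG(), which
gives each node an independent random number stream (a sketch;
it assumes the rlecuyer package, which backs the default
"RNGstream" type, is installed):
> clusterSetupRNG(cl)
> clusterCall(cl, runif, 3)   # results now differ across nodes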
References:
Rossini, A., Tierney, L., and Li, N. (2003). Simple parallel
statistical computing in R. UW Biostatistics working paper series,
Paper 193, University of Washington.
http://www.bepress.com/uwbiostat/paper193.
Tierney, L., Rossini, A., Li, N., and Sevcikova, H. (2004).
The snow Package: Simple Network of Workstations. Version 0.2-1.
http://cran.r-project.org/doc/packages/snow.pdf
--------------------------------------------------------------------
Workstation - A workstation is a computer intended for individual
use that is faster and more capable than a personal computer.
i.e. have a faster microprocessor, a large amount of random access
memory (RAM), and special features such as high-speed graphics
adapters.
Source: whatis.com
http://whatis.techtarget.com/
Cluster - In a computer system, a cluster is a group of servers
and other resources that act like a single system and enable high
availability and, in some cases, load balancing and parallel
processing.
Source: whatis.com
http://whatis.techtarget.com/
Parallel Computing / Parallel Processing -
The simultaneous use of more than one computer to execute a program.
Objectives: scalability - the capability to handle larger
workloads - and reduced execution time
Source: bitpipe
http://www.bitpipe.com/

Matlab is Great; R is Exciting!

Preparing for the tutorial "Introductory Econometrics with Examples in R" has gotten me really excited about it. R is a very powerful language (some people claim it is only an environment, which is true if you only use the built-in functions or available packages; it can actually do much, much more than that). It can do whatever you can imagine, e.g. solving equations, optimization, numerical integration, Monte Carlo, statistical inference, Bayesian analysis, MCMC, bootstrap, EN, sparse PCA, time series analysis, panel data, spatial analysis, production analysis, graphical analysis, animation, clustering, parallel computing, .... It can also do many things that you cannot even imagine, such as archetypal analysis, fuzzy knowledge-base learning, forensic genetics, etc. Just check the online packages at
http://cran.r-project.org/index.html

With all this excitement, I've decided to write or collect one example here per day to keep practicing and learning R. I hope this continues as a habit and attracts interested readers as well.