This tutorial follows the Handbook on Impact Evaluation: Quantitative Methods and Practices, chapter 11. The data files we will use can be downloaded from here. The first part of Chapter 11 is covered in Impact Evaluation on a Budget: World Bank Data and R.
Here I assume you saved the file (from the previous tutorial) to the ~/eval/data folder.
Stata:
use ~/eval/data/hh_98.dta
R:
library(foreign)
hh_98 = read.dta('~/eval/data/hh_98.dta')
(If you don’t already have the foreign library installed, you can use the command install.packages("foreign").)
Stata:
describe
R:
ls(hh_98)
dim(hh_98)
sapply(hh_98,class)
The function ls(x) displays the names of the objects within x. If you just enter ls(), R will show you the names of the objects in your current environment (remember you can use ?ls to see the R documentation for the ls() function). The function dim(x) returns the dimensions of object x. When measuring a data.frame, like hh_98, dim() returns the number of rows first, followed by the number of columns. The function sapply(x,FUN) returns a simplified result from applying the function FUN to each element of x. The function class(x) returns the class of object x.
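To make these inspection functions concrete, here is a minimal sketch on a small made-up data frame. The name toy and all of its values are invented for illustration; they are not taken from hh_98.

```r
# Hypothetical stand-in for hh_98; the values are invented.
toy <- data.frame(
  famsize  = c(4, 6, 3),
  educhead = c(0, 5, 2),
  village  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

col_names   <- ls(toy)             # column names, sorted alphabetically
dims        <- dim(toy)            # number of rows first, then columns
col_classes <- sapply(toy, class)  # one class per column, as a named vector

print(col_names)
print(dims)
print(col_classes)
```

Note that ls() sorts the names alphabetically. If you want the columns in the order they appear in the data, names(toy) is an alternative.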
Stata:
describe exp*
R:
summary(hh_98[grep("exp", colnames(hh_98))])
In R, it is possible to do things even if we don’t know the exact name of the object we want to analyze. Starting from the innermost function and working our way out, colnames(hh_98) returns a vector where each element is the name of a column of hh_98. grep("exp", x) returns the indices of the elements of x that contain “exp” (the pattern can also be a regular expression). Placing the resulting vector of indices into hh_98[] returns the matching columns. Finally, summary() returns the following summary of the returned columns:
     expfd             expnfd             exptot
 Min.   :  945.3   Min.   :   89.55   Min.   : 1193
 1st Qu.: 2602.1   1st Qu.:  514.37   1st Qu.: 3254
 Median : 3373.7   Median :  865.31   Median : 4432
 Mean   : 3660.2   Mean   : 1813.08   Mean   : 5473
 3rd Qu.: 4232.5   3rd Qu.: 1710.24   3rd Qu.: 6039
 Max.   :15270.7   Max.   :43411.15   Max.   :47981
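Because grep() takes a regular expression, the pattern can be tightened when a plain substring would match too much. The sketch below uses a made-up vector of column names. Only the first three mirror hh_98, and nonexp_flag is a hypothetical name added to show the difference.

```r
# Hypothetical column names; "nonexp_flag" is invented for this example.
cols <- c("expfd", "expnfd", "exptot", "nonexp_flag")

any_match    <- grep("exp", cols)   # matches "exp" anywhere in the name
prefix_match <- grep("^exp", cols)  # "^" anchors the match to the start

print(cols[any_match])
print(cols[prefix_match])
```

The anchored pattern is the closer analogue of Stata's exp* wildcard, which only matches names that begin with "exp".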
List the first three entries in hh_98:
Stata:
list in 1/3
R:
hh_98[1:3,]
In R, you can access records in a data.frame using matrix notation. The colon (:) separates the beginning and ending of a sequence. By leaving the portion following the comma blank, we tell R to show all columns.
List household size and head’s education for households headed by a female who is younger than 45:
Stata:
list famsize educhead if (sexhead==0 & agehead<45)
R:
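subset(hh_98,sexhead==0 & agehead<45,c(famsize,educhead))
The subset() function is another method of selecting elements. Note the single &, which compares the two conditions row by row; the double && only works on single values and is not suitable for selecting rows. Here’s the matrix form of the same subset:
R:
hh_98[hh_98$sexhead==0 & hh_98$agehead<45,c("famsize","educhead")]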
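As a quick check that the two forms select the same rows, here is a small sketch on invented data. The data frame toy and its values are hypothetical; the column names follow hh_98. The condition uses the element-wise &, which yields one TRUE/FALSE per row.

```r
# Hypothetical data with the same column names as hh_98.
toy <- data.frame(
  sexhead  = c(0, 1, 0, 0),
  agehead  = c(30, 40, 50, 44),
  famsize  = c(4, 5, 3, 6),
  educhead = c(2, 0, 5, 1)
)

# Element-wise AND: one logical value per row.
keep <- toy$sexhead == 0 & toy$agehead < 45

via_subset  <- subset(toy, sexhead == 0 & agehead < 45, c(famsize, educhead))
via_bracket <- toy[keep, c("famsize", "educhead")]

print(keep)
print(via_subset)
```

One difference to keep in mind: if the condition evaluates to NA for some rows, subset() drops those rows, while the bracket form returns them as rows filled with NA.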
Browse or Edit the data:
Stata:
browse
edit
R:
View(hh_98)
hh_98 <- edit(hh_98)  # edit() returns the modified copy; assign it to keep the changes
Display summary statistics for a few variables:
Stata:
sum famsize educhead
sum famsize educhead, d
R:
summary(hh_98[,c("famsize","educhead")])
library(psych)
describe(hh_98[,c("famsize","educhead")])
(If you don’t already have the psych library installed, you can use the command install.packages("psych").)
Using survey weights:
Stata:
sum famsize educhead [aw=weight]
R:
library(survey)
design <- svydesign(id=~nh,weights=~weight,data=hh_98)
svymean(~famsize + educhead,design)
(If you don’t already have the survey library installed, you can use the command install.packages("survey").)
Summarize by groups:
Stata:
sort dfmfd
by dfmfd: sum famsize educhead [aw=weight]
tabstat famsize educhead, statistics(mean sd) by(dfmfd)
R:
library(survey)
svyby(~famsize + educhead, ~dfmfd, design, svymean)
(You only need to call library(survey) once per session.)
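If you want to sanity-check the weighted group means without the survey package, base R's weighted.mean() can be applied to each group. This is only a sketch on invented numbers (toy and its values are hypothetical), and it gives point estimates only; unlike svyby(), it does not produce standard errors.

```r
# Hypothetical data: group indicator, variable of interest, and weight.
toy <- data.frame(
  dfmfd   = c(0, 0, 1, 1),
  famsize = c(4, 6, 3, 5),
  weight  = c(1, 3, 2, 2)
)

# Split the rows by group, then take the weighted mean within each piece.
by_group <- split(toy, toy$dfmfd)
wmeans   <- sapply(by_group, function(g) weighted.mean(g$famsize, g$weight))

print(wmeans)
```

The group means from this shortcut should agree with the means reported by svyby(..., svymean) for the same variable and weights.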
Stata:
tab dfmfd
R:
table(hh_98$dfmfd)
In R, the table() function presents a table similar to the tabulate function in Stata, but only shows the counts grouped by factor. To see both the counts and the shares, as in the Stata output, we can divide by the total count (i.e., the length()); multiply by 100 if you want percentages. I group the counts and shares using a list() so they are displayed together.
R:
list(count=table(hh_98$dfmfd),percent=table(hh_98$dfmfd)/length(hh_98$dfmfd))
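A slightly shorter route to the same shares is prop.table(), sketched here on an invented vector (the values of dfmfd below are made up). Multiplying by 100 gives percentages like those in Stata's tab output.

```r
# Hypothetical group indicator; values are invented.
dfmfd <- c(0, 1, 1, 0, 1)

counts <- table(dfmfd)
shares <- prop.table(counts)      # each count divided by the table total
pct    <- round(100 * shares, 1)  # percentages, rounded to one decimal

print(list(count = counts, percent = pct))
```

The two approaches differ when the variable has missing values: table() leaves out NA by default, so prop.table() divides by the number of non-missing observations, while length() counts every element, missing or not.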
Frequency tables over subsets and for multiple variables:
Stata:
tab sexhead if dfmfd==1
tab educhead sexhead
R:
table(hh_98[hh_98$dfmfd==1,]$sexhead)
table(hh_98$educhead, hh_98$sexhead)
Column and row percentages:
Stata:
tab dfmfd sexhead, col row
R:
mytable <- table(hh_98$dfmfd, hh_98$sexhead)
list(counts = mytable, percent.row = prop.table(mytable,1), percent.col = prop.table(mytable,2), count.row = margin.table(mytable,1), count.col = margin.table(mytable,2))
Stata:
table dfmfd, c(mean famsize mean educhead)
R:
by(hh_98[c("famsize","educhead")], hh_98$dfmfd, colMeans)
Breakdown by two factors:
Stata:
table dfmfd sexhead, c(mean famsize mean educhead)
R:
by(hh_98[c("famsize","educhead")], hh_98[c("dfmfd","sexhead")], colMeans)
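The output of by() is printed group by group, which can be awkward to reuse. If you would rather have the group means as a regular data.frame, aggregate() with a formula is one alternative. The sketch below uses invented data (toy and its values are hypothetical; the column names follow hh_98).

```r
# Hypothetical data with the same column names as hh_98.
toy <- data.frame(
  dfmfd    = c(0, 0, 1, 1, 1),
  sexhead  = c(0, 1, 0, 1, 1),
  famsize  = c(4, 6, 3, 5, 7),
  educhead = c(2, 0, 5, 1, 3)
)

# Mean of famsize and educhead for every dfmfd x sexhead combination.
means <- aggregate(cbind(famsize, educhead) ~ dfmfd + sexhead,
                   data = toy, FUN = mean)

print(means)
```

Each row of means is one combination of the two factors, so the result can be sorted, merged, or exported like any other data.frame.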
In Stata, missing values are represented by “.”. In R, missing values are represented by “NA”.
Stata:
count
count if agehead>50
R:
dim(hh_98)[1]
dim(hh_98[hh_98$agehead>50,])[1]
-or-
length(hh_98[,1])
length(hh_98[hh_98$agehead>50,1])
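Missing values matter for these counts. When the condition is NA for a row, bracket indexing returns that row as a row of NAs, so counting rows this way includes it. Summing the logical condition with na.rm=TRUE avoids that. Here is a minimal sketch on invented data (toy and its values are hypothetical):

```r
# Hypothetical data with one missing age.
toy <- data.frame(
  agehead = c(30, 55, NA, 61),
  famsize = c(4, 6, 3, 5)
)

n_total     <- nrow(toy)                          # all records
n_bracket   <- nrow(toy[toy$agehead > 50, ])      # counts the NA row too
n_over50    <- sum(toy$agehead > 50, na.rm = TRUE) # only rows known to be > 50
n_missing   <- sum(is.na(toy$agehead))            # how many ages are missing

print(c(total = n_total, bracket = n_bracket,
        over50 = n_over50, missing = n_missing))
```

If the variable has no missing values, the bracket-based count and the sum() version give the same answer.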
For information on using weights in R, take a look at the homepage for the survey package: https://r-survey.r-forge.r-project.org/survey/
The following websites are useful for searching for R:
Remember to use ? to look up functions and ?? to search for help within R (e.g., ?by).