R is a rather flexible programming language. As opposed to, say, Stata, it allows you to easily implement methods, approaches and procedures that deviate from standard ones and are tailored to suit your specific needs.
This script is meant to give an introduction to R and to collect useful commands for different programming tasks in one place. As such, it can be used as a lookup table. Even people experienced with a programming language regularly consult Google or code they have previously written as part of their coding routine, even more so when they haven't used a programming language in a while or when switching back and forth between several languages. Not surprisingly, then, you won't memorize much when going through this tutorial. It is only through experience that one can learn a programming language well. However, it is important to be acquainted with some basics before starting, to facilitate the first steps.
Each of the following sections is self-contained. You'll get the most out of the tutorial by typing the commands you see here yourself on your computer, running them, having a look at the results, and implementing simple changes to the code to see how the results change.
You can install R at https://cran.r-project.org. Then, install R Studio at https://posit.co. R Studio is an interface that facilitates working with the programming language R.
R-code is written in a script, located in R Studio by default in the top-left window. This script (or parts thereof) can be run to produce output, and the script can be stored for later re-use. The immediate output is shown in the console window below, while objects you created (variables, matrices, plots, functions, etc.) are shown in the workspace/environment window on the top-right. The window on the bottom-right serves several purposes; it displays plots, contains a list of installed packages (and can be used to install new packages), and it shows the documentation page when help about a command is requested.
Include these two commands at the beginning of your script. The first deletes everything in workspace/environment, the second everything in the console.
rm(list = ls()) # delete everything in workspace/environment
cat("\f") # delete everything in console
To run a chunk of code, select it and press “cmd+enter” on a Mac or click on the “Run” button on top of the script-window.
Comments are chunks of code (or text) that R ignores when running. They start with a hashtag. To comment-out a chunk of code or text, select it and press "cmd+shift+c" on a Mac.
To request help with a command, type e.g. “?sqrt”.
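For instance, both of the following lines open the documentation page for the square-root function (it appears in the bottom-right window):
?sqrt # open documentation for 'sqrt'
help("sqrt") # equivalent command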
To create an object (a variable) called “a” that is equal to three, use one of the following three commands:
a = 3
a <- 3
assign("a", 3)
In the environment window, you can see that a variable was created. To display it in the console, type
show(a) #or:
## [1] 3
print(a)
## [1] 3
To delete it, type
rm(a) # rm stands for remove
Just as with numbers, you can perform operations using variables:
3 + 8
7^2
a = 3
b = 8

a + b # displays the result in console
c = a + b # saves the result to a new variable, 'c'
c
Many commonly used functions are already implemented in R by default. For example:
sqrt(a) # square-root
log(a) # natural log (ln)
exp(a) # exponential
# trigonometric functions:
cos(a)
sin(a)
tan(a)
ceiling(3.4) # returns 4, the smallest integer larger than 3.4
floor(3.4) # returns 3, the largest integer smaller than 3.4
b %% a # computes the modulus; the remainder of b/a
round(342.347, digits = 2) # rounds the number to two digits
# note that one can combine functions:
sqrt(ceiling(8.9))
A variable can also take on the value “Inf” or “-Inf” (positive or negative infinity). Any further operation with this variable gives again “Inf” or “-Inf”. For example,
a = Inf
a
sqrt(a) # gives Inf
-Inf + 4 # gives -Inf
A variable can also take on the value “NA” (not available, i.e. a missing value). (This happens, for example, if one loads an Excel file with an empty cell into R, as the value in such a cell needs to be distinguished from, say, zero.) Again, any operation involving an NA returns NA:
a = NA
a
sqrt(a) # gives NA
NA + 4 # gives NA
R is an open-source language, which means that there are many developers who write their own packages, which contain specific commands that are not included in R by default. To highlight which commands belong to which packages, we will load the packages on the go. However, in actual applications, it is best practice to have an initial section where you load all packages. To load a package, it first needs to be installed. You can do so by using the “Packages” panel in the bottom-right window, or by typing e.g.
install.packages("ggplot2")
To load the package then, we type
library(ggplot2)
R uses different “types” to store the objects you create. Numbers like the ones created above are of type “double”:
a = 3
b = 8
typeof(a) # verifies type of variable/object 'a'
## [1] "double"
Logicals are a type of object that can take on the values “TRUE”, “FALSE” or “NA”. They indicate whether some condition is true or false (or cannot even be evaluated) and are (usually) created using logical operators:
isAlargerB = a > b
show(isAlargerB)
## [1] FALSE

# other logical operators:
a < b
a >= b
a <= b
a == b # equal to
a != b # not equal to

a < Inf # comparisons with Inf work
a < NA # but comparisons with NA yield NA
Logicals can be used in standard computations, whereby “TRUE” indicates a 1 and “FALSE” indicates a 0:
TRUE + 3 # gives 4
TRUE == 1 # gives TRUE
Even though the above shows that 1 is equal to TRUE, the “double”-object 1 and the “logical”-object TRUE are not the same thing because their type differs, i.e. R stores them differently:
typeof(TRUE)
## [1] "logical"
typeof(1)
## [1] "double"
# Check whether TRUE and 1 are logicals: (can check analogously whether an
# object is a 'double')
is.logical(TRUE)
is.logical(1)
Oftentimes (as here), having objects that do pretty much the same thing but are of different types is innocuous. However, it is useful to be aware that there are different types that R uses to store objects because sometimes a command only works for a specific type (or gives different results for different types), in which case one needs to convert the object to the right type.
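For instance, the base-R coercion functions convert an object from one type to another:
as.numeric(TRUE) # converts the logical TRUE to the double 1
as.logical(1) # converts the double 1 to the logical TRUE
as.character(3.5) # converts the double 3.5 to the string "3.5"
as.numeric("3.5") # converts the string "3.5" back to a double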
Based on existing logicals (i.e. variables indicating whether a condition is true or false), further logicals can be created that combine these underlying conditions using special operators. For example:
isXtrue = TRUE
isYtrue = FALSE

# negation of a logical: turns TRUE into FALSE (and FALSE into TRUE)
isXnottrue = !isXtrue

# gives TRUE only if both are true:
areBothTrue = isXtrue & isYtrue
# gives TRUE if only one of them is true and also if both are true:
isOneTrue = isXtrue | isYtrue
# gives TRUE only if only one of them is true (it gives FALSE if both are true):
isExactlyOneTrue = xor(isXtrue, isYtrue)
Up to now, we used functions that are pre-programmed in R or contained in a package we load. The user can of course also define his or her own functions. For example:
fMultiplyXY = function(x, y) {
    x * y
}
fMultiplyXY(3, 4)
## [1] 12
When a function uses many arguments, it can be hard to remember the argument order. In that case, it is useful to specify the argument names when calling the function, because then the order becomes irrelevant:
# These give both the same result:
fMultiplyXY(x = 3, y = 4)
fMultiplyXY(y = 4, x = 3)
One can also specify some arguments to be optional. For example:
# Here, the third argument ('z') is optional. By default, the function takes
# z=2.
fMultiplyXYZ = function(x, y, z = 2) {
    x * y * z
}
fMultiplyXYZ(2, 3)
There are many ways to create a vector in R:
vx = c(1, 2, 3) #or:
vx = 1:3
# by default, this creates a vertical vector (3x1 in these examples); this is
# not apparent when displaying the vector, but you realize it when performing
# matrix algebra
show(vx)
## [1] 1 2 3
# to transpose it, type:
t(vx)
## [,1] [,2] [,3]
## [1,] 1 2 3
vx = c(1:3)
vy = c(3, 2, 5, 1)
# create new vector by stacking two vectors:
c(vx, vy)
seq(-4, 4, by = 0.8) # a vector going from -4 to 4 in increments of 0.8
seq(-4, 4, length.out = 5) # a vector going from -4 to 4 with 5 elements
rep(1, 5) # (5x1)-vector of ones
rep(vx, times = 2) # create new vector by taking vector 'vx' two times
rep(vx, each = 2) # create new vector by taking each element of 'vx' two times
rev(vx) # reverse order of elements in vx
= c("a", "b") # vector of strings
vx = c(TRUE, FALSE) # vector of logicals vx
We can perform standard algebraic operations using vectors:
vx = c(1, 2, 3)
vy = c(1, 3, 3)

vx + vy
vx - vy
vy/3

vx * vy # element-wise multiplication
vx %*% t(vy) # matrix multiplication
We can also use vectors to conduct comparisons:
# element-wise comparison, compares each element in vx to 2, returns vector of
# logicals:
vx >= 2
# element-wise comparison, compares each element in vx to corresponding element
# in vy, again returning a vector of logicals:
vx >= vy
# overall comparison; are the two vectors the same? returns a single logical:
identical(vx, vy)
Many of the functions we applied to scalars can also be applied to vectors. This is not surprising, as R treats a number as a 1x1-vector (type “is.vector(3)” to see). When applied to a vector, these functions manipulate each element individually:
sqrt(vx) # compute square root of each element in vx
exp(vx)
log(vx)
cos(vx)
There are many functions that make sense only for vectors (not numbers):
vx = c(1.2, 3.3, 4.2, 2.7, 5)
vy = c(3.4, 5.5, 3.2, 8.9, 0.8)
length(vx) #number of elements
min(vx) #minimum
max(vx) #maximum
range(vx) #(min, max)
sum(vx) #sum of elements
prod(vx) #product of elements
mean(vx)
median(vx)
sd(vx) # standard deviation
var(vx) # variance
cor(vx, vy) #correlation of two vectors
# the following two require the package 'timeDate':
library(timeDate)
kurtosis(vx)
skewness(vx)
# computing percentiles: (only makes sense for longer vectors, as it supposes
# that the entries are draws from some underlying continuous distribution)
quantile(vx)
quantile(vx, probs = 0.95) #95th percentile
quantile(vx, seq(0, 1, 0.125))
A vector can contain NA as an element. In this case, the above functions, which operate on all elements of a vector, return NA, but one can tell them to ignore the NA-entries:
vx = c(1.2, NA, 4.2, 2.7, 5)
mean(vx) # returns NA
mean(vx, na.rm = TRUE) # returns mean of remaining elements, ignoring NA-entries
We can access a subset of a vector. This is referred to as “indexing”, and there are different ways to do so.
vx = c(1, 5, 7)

vx[2] #second element
vx[-2] #all but second element
vx[c(1, 3)] #first and third element
vx[-c(1, 3)] #all but first and third element
vx[c(FALSE, TRUE, FALSE)] #second element

vx[vx >= 2] # all elements larger or equal to 2

# One can also name the entries of a vector and access them based on these
# names:
names(vx) = c("a", "b", "c") # name vector entries
vx["a"]
vx[c("a", "c")]
Using indexing, we can change specific entries of a vector, leaving the rest untouched:
vx[2] = 4
vx[c(2, 3)] = c(3, 9)
Here are some more commands for vectors:
vx = c(2, 1, 3, 2, 5)
vy = c(2, 3, 4, 1, 5)
# Sorting & Ordering:
sort(vx) # sorts in increasing order
sort(vx, decreasing = TRUE) #sorts in decreasing order
order(vx) #returns indices of sorted vector
order(vx, decreasing = TRUE)
# note: with strings, sort and order work alphabetically
# Finding elements that satisfy some condition:
which(vx == 2) #returns the indices of all entries equal to 2
# 'findlast' (i.e. find largest index (last element) that matches condition):
max(which(vx == 2))
which.max(vx) # returns index of maximum
which.min(vx) # returns index of minimum
# tells whether elements from vy are in vx, i.e. returns a vector of logicals
# of the length of vy:
vy %in% vx
# finds location of elements from vy in vx: (returns NA if an element does not
# exist in vx)
match(vy, vx)
# Set operations:
# (note that a vector can also be interpreted as a set of elements (numbers))
union(vx, vy)
intersect(vx, vy)
vz = c(2, 8, 4, 7, 5)
Reduce(intersect, list(vx, vy, vz)) #intersection of three sets (vectors)
setdiff(vx, vy) # set difference 'vx minus vy'; all elements in vx that are not in vy
# Interpolation:
library(zoo)
vx = c(2, NA, 3, 2, 5)
na.approx(vx) #imputes NA values in vx by linear interpolation

vx = c(1, 2, 3)
vy = c(1, 4, 9)
approx(vx, vy, xout = 2.5) # linear interpolation of a function y = f(x) at x=2.5,
# based on 'observations' (1,1), (2,4), (3,9)
CAUTION! Often, R does not prevent you from performing operations with two vectors of different length, but it recycles the shorter one so that they match in length:
vx = c(1, 3, 4, 2)
vy = c(1, 3, 5)

vx + vy
## [1] 2 6 9 3
vx == vy
## [1] TRUE TRUE FALSE FALSE
# However, some (smart) functions do throw an error when applied to vectors of
# different length, e.g. cor(vx,vy)
Creating a matrix:
# create a matrix by filling-in elements of a vector by column...:
matrix(1:6, ncol = 3, nrow = 2)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
matrix(1:6, ncol = 3, nrow = 2, byrow = TRUE) #... or by row
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
mA = matrix(NA, ncol = 3, nrow = 2) # matrix of NAs
diag(3) # 3x3 identity matrix
diag(c(3, 2)) # 2x2 diagonal matrix with 3 and 2 on diagonal
# but note: when applied to an existing (square) matrix mA, diag(mA) returns its
# diagonal:
mA = matrix(1:9, ncol = 3, nrow = 3)
diag(mA)
# create a matrix by putting the same vector in each column:
vx = c(1, 5, 3, 6)
mA = replicate(5, vx)

# create a matrix as the Cartesian product of two vectors:
vx = seq(0, 0.1, 0.02)
vy = seq(0.1, 0.3, 0.1)
mB = expand.grid(column1 = vx, column2 = vy)
To display the matrix, you can use the commands “show” and “print”, as before. Also, you can click on the matrix in the environment window, which opens up a separate panel in the script-window to show you the matrix.
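The same viewer panel can also be opened from code (this assumes you are working in R Studio):
View(mB) # opens the object in a separate viewer tab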
Indexing:
mA = matrix(1:9, ncol = 3, nrow = 3)
dim(mA) #display number of rows and columns of matrix mA
nrow(mA)
ncol(mA)
colnames(mA) = c("a", "b", "c") #label columns of matrix
rownames(mA) = c("r1", "r2", "r3")
mA[1, 2] # access element (1,2)
mA["r1", "b"] #same, but using column- and row-names
mA[1, ] #access first row, returns vector
mA[c(1, 2), c(1, 3)] # access specified rows and columns
Matrix algebra:
mA = matrix(c(1, 3, 8, 8), ncol = 2, nrow = 2)
mB = matrix(c(1, 2, 7, 1), ncol = 2, nrow = 2)

mA * mB # element-wise multiplication
mA %*% mB # matrix multiplication
t(mA) # transpose
solve(mA) # inverse
library(Matrix)
rankMatrix(mA) # compute rank
eigen(mA) # compute eigenvalues and eigenvectors;
eigen(mA)$values
eigen(mA)$vectors
Many of the functions applied above to numbers and vectors can also be applied to matrices:
sqrt(mA) # compute square-root for each element of matrix
min(mA) # compute minimum out of all elements of matrix
mean(mA) # mean of all elements in matrix
mA >= 2 # element-wise comparison, returns matrix of logicals
A particularly useful function for matrices is “apply”. It allows us to apply a function designed for a vector to each column or row of a matrix separately:
mA = matrix(c(1, 3, 2, 8), ncol = 2, nrow = 2)

# for each column (=2nd dimension) of mA, compute the mean of elements:
apply(mA, 2, mean)
# (i.e. apply to object mA, on dimension 2 (i.e. by columns), the function
# 'mean')

# One can also supply a self-written function to 'apply':
fDemean = function(vx) {
    vx - mean(vx)
}
apply(mA, 2, fDemean)

# or, in short:
apply(mA, 2, function(vx) {
    vx - mean(vx)
})
Arrays are useful for efficiently storing data and accessing subsets of it.
# Create three-dimensional array by using a vector to 'fill it up':
aX = 1:12
dim(aX) = c(2, 3, 2)
print(aX)
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
# Create three-dimensional array of NAs:
aY = array(dim = c(2, 3, 2))
# now we can fill up this array using indexing (see below)
# Name dimensions of array:
dimnames(aX)[[1]] = c("r1", "r2")
dimnames(aX)[[2]] = c("a", "b", "c")
dimnames(aX)[[3]] = c("A", "B")
# Indexing:
aX[1, 3, 2]
## [1] 11
aX[, c(2:3), 1]
## b c
## r1 3 5
## r2 4 6
# Functions:
sqrt(aX) # compute square-root of each element
# for each row and column, compute sum along third dimension:
apply(aX, c(1, 2), sum)
## a b c
## r1 8 12 16
## r2 10 14 18
A conditional is a chunk of code that only gets evaluated if a certain condition is true:
a = 5

if (a == 2) {
    print("a is equal 2")
} else if (a == 3) {
    print("a is not equal 2 but equal 3")
} else {
    print("a is neither equal 2 nor equal 3")
}
## [1] "a is neither equal 2 nor equal 3"
A for loop is a chunk of code that gets executed many times, each time for a different value of a so-called “index-variable”, named “ii” in the following. It is useful if you need to repeat a procedure for, say, different countries in your data sample.
a = 0
for (ii in 1:10) {
    a = a + ii
}
print(a)
## [1] 55
for (ii in c("a", "b")) {
print(ii)
}
## [1] "a"
## [1] "b"
# Note that 1:10 and ' c('a','b') ' are both vectors. We can use many other of
# the abovementioned styles of vectors in a for loop, e.g. ' seq(-2,2,by=0.5) '
A while loop keeps executing a chunk of code as long as some condition is true:
while (a < 130) {
    a = a + 3
}
print(a)
## [1] 130
Useful commands for working with for- and while-loops are “next” and “break”. The former tells R to stop executing the code for the current iteration and proceed to the next one, while the latter tells R to stop the for- or while-loop altogether.
a = 0
for (ii in 1:10) {
    if (ii%%2 == 0) {
        # if ii is even (i.e. the remainder of ii/2 is zero), go to next ii
        next
    }
    a = a + ii
}
print(a)
## [1] 25
while (a < 130) {
    a = a + 3
    if (a == 88 | a == 89) {
        break
    }
}
print(a)
## [1] 88
A string is an object that contains text. It is stored as type “character” in R. There are many commands that are specific for strings.
= "banana"
a typeof(a) # returns 'character'
nchar(a) # number of characters in string
substr(a, 1, 3) # give me first three characters in string
as.character(3) #convert the number 3 to the string '3'
format(3, nsmall = 2) #convert the number 3 to the string '3.00'
# Create a new string out of a number and a string:
paste(1, "mouse", sep = ".") #returns '1.mouse'
# this also works on vectors, returning a vector of strings:
paste(1:3, "mouse", sep = ".")
# if 'sep' is not specified, a space is put as separator:
paste("mouse=", 3)
paste(1:3, collapse = "_") # returns string '1_2_3'
# split and modify strings: e.g. split a string at points where 'i' is located,
# returns vector of strings:
strsplit("Mississippi", "i")
gsub("i", "X", "Mississippi") #replace i with X
# Strings allow for some advanced printing commands:
# prints the specified string, inserting the number at '%s':
sprintf("C(X) = [0,%s]", 3.45)
sprintf("%.0f", 1e+08) # to avoid scientific notation
cat("hello\nworld\n") # prints the string, adding line-breaks at \n
= c("Zimbabwe", "Cameroon", "Kenya", "Rwanda", "Djibouti")
vsCountries # return logical-vector saying which string-entries contain 'w':
grepl("w", vsCountries)
# return the entries themselves, i.e. the names of these countries:
grepl("w", vsCountries)] vsCountries[
Oftentimes, we use strings to denote dates (and times of the day). The package “lubridate” is useful to work with such strings:
library(lubridate)
year("2020-04-01")
month("2020-04-01")
day("2020-04-01")
ymd("2012-03-26")
# It requires the date to be in the format 'YYYY-MM-DD'.
year("01-04-2020")
# To convert another format to this format, use
= as.Date("01.04.2020", "%d.%m.%Y")
step1 # or:
= as.POSIXlt("01-04-2020", format = "%d-%m-%Y")
step1 year(step1)
# for each date, returns the preceding Monday:
= paste("2020-04-", 10:30, sep = "")
vDates cut(as.Date(vDates), "week")
1:4] # get month-abbreviations from January to April
month.abb[
# Some more useful things for working with dates will be discussed later when
# talking about time series methods.
A list is a very general object. In contrast to vectors, its elements can be of different types. For example:
# create list:
= list(1, "a", c(1:3))
lMyList
# look at third element:
3] lMyList[
## [[1]]
## [1] 1 2 3
# add element to list:
append(lMyList, 3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] 1 2 3
##
## [[4]]
## [1] 3
# name the elements in list:
names(lMyList) = c("first", "second", "blabla")
Lists are useful for saving a collection of objects, like matrices or plots, into one object, rather than creating many separate ones. For example, if in each iteration of a loop a plot is created, we can save all these plots into one list using the following commands:
lMyPlots = list() # create empty list before loop

for (ii in 1:10) {
    # create plot for iteration ii, something like this comes out:
    plotHere = list(1, "a", "b")

    # save plot to 'lMyPlots':
    lMyPlots = append(lMyPlots, list(plotHere))

    # and assign a name to this plot:
    nameForThisPlot = paste("plot", ii, sep = "_")

    if (ii == 1) {
        # at the first iteration, lMyPlots is empty, so use this:
        names(lMyPlots) = c(names(lMyPlots), nameForThisPlot)
    } else {
        # later on, we can just use this:
        names(lMyPlots)[ii] = nameForThisPlot
    }
}
(Rather than creating actual plots, here I create a list to stand-in for a plot. As a matter of fact, plots are saved as lists in R (a list of properties that make up the plot).)
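A stored element can later be retrieved from the list by its position or by the name assigned in the loop above, e.g.:
lMyPlots[[3]] # third stored element
lMyPlots[["plot_3"]] # the same element, accessed by its name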
Saving a collection of different objects into one list also enables us to delete a bunch of objects at once:
# remove all objects but these two:
rm(list = ls()[!(ls() %in% c("lMyList", "lMyPlots"))])
# If you want to iteratively (e.g. after each section of your code) delete a
# bunch of objects, use this code:
# first define a vector containing the names of the objects you want to keep at
# this point:
= c("vToKeep", "sMyPath", "mData1", "mData2")
vToKeep # then delete all objects but these three:
rm(list = ls()[!(ls() %in% vToKeep)])
# at some later point, if you want to delete more things, add them to vToKeep:
= c(vToKeep, "obj1", "obj2")
vToKeep # and again delete all objects but the ones stored in vToKeep:
rm(list = ls()[!(ls() %in% vToKeep)])
# (note that vToKeep contains the name of 'vToKeep' itself, otherwise this
# vector would be deleted too)
Another important object in R is a dataframe. Just like a list is akin to a vector which can contain elements of different types, a dataframe is akin to a matrix which can be composed of columns of different types. (Vectors and matrices must have all elements of the same type, e.g. “double” for numbers or “character” for strings.) Not surprisingly, in R, data is mostly stored in dataframes.
dfA = data.frame(a = c(10, 20, 30), b = c("RA", "RB", "RC"))
print(dfA)
## a b
## 1 10 RA
## 2 20 RB
## 3 30 RC
"b"] # access column 'b' dfA[,
## [1] "RA" "RB" "RC"
$b # other way to access column 'b' dfA
## [1] "RA" "RB" "RC"
"c"] = c(8, 9, 4) # add a column 'c'
dfA[,
$d = c(1, 0, 1) # add a column 'd'
dfA
# create new dataframe by deleting column 'd', i.e. keeping only all other
# columns:
= dfA[, c("a", "b", "c")]
dfB
subset(dfA, select = c(a, b, c)) # other way to access all columns but 'd'
## a b c
## 1 10 RA 8
## 2 20 RB 9
## 3 30 RC 4
# the latter approach is useful for taking all columns but a few ones:
subset(dfA, select = -c(c, d)) # take all columns but 'c' and 'd'
## a b
## 1 10 RA
## 2 20 RB
## 3 30 RC
# (note that writing ' dfA[,-c('c','d')] ' gives an error)
To introduce some more commands for dealing with dataframes, we first load a dataset that is pre-stored in R.
# load dataset pre-stored in R:
data(mtcars)
# it contains various measures for 32 different cars
# to verify that the loaded object (dataset) 'mtcars' is a dataframe:
is.data.frame(mtcars)
vsVariables = colnames(mtcars) # store column names (= variables)
vsCars = rownames(mtcars) # store row names (= cars)
ncol(mtcars) # how many columns do we have?
# alternatively: type ' length(vsVariables) '
nrow(mtcars) # how many cars do we have?
# alternatively: type ' length(vsCars) '
mean(mtcars$hp) # compute mean horse power
# Describe data (i.e. each column (= variable)):
summary(mtcars)
# alternative, from package 'psych':
library(psych)
describe(mtcars)
# How many cars have an hp higher than average?
# step 1: create vector of logicals, indicating for each car whether its hp is
# above average:
vIsHPhigh = mtcars$hp > mean(mtcars$hp)
# step 2: compute the sum of this vector (i.e. count how many elements have a
# TRUE (=1) entry)
sum(vIsHPhigh)
# another way: create a sub-dataframe containing only the cars with
# above-average hp, and count its rows:
nrow(mtcars[vIsHPhigh, ])
# Find cars with hp above 190 and 5 gears:
mtcars[mtcars$hp > 190 & mtcars$gear == 5, ]
# if you're interested only in their number of cylinders and horse power:
mtcars[mtcars$hp > 190 & mtcars$gear == 5, c("cyl", "hp")]
# Find cars with 4 or 6 cylinders and hp above 120:
vRightCyl = mtcars$cyl %in% c(4, 6)
vRightHP = mtcars$hp > 120
mtcars[vRightCyl & vRightHP, ]
# Find all Mercedes (their names start with 'Merc'):
grepl("Merc", vsCars), ]
mtcars[
# Sort dataset first according to number of cylinders, then according to
# horsepower:
order(mtcars$cyl, mtcars$hp), ]
mtcars[
# Create dummies for each value of 'cyl' and 'gear', and add to dataframe:
library(fastDummies)
data = dummy_cols(mtcars, select_columns = c("cyl", "gear"))
Let’s take a look at how to add/delete observations/variables to/from an existing dataframe.
= data.frame(country = c("Zimbabwe", "Cameroon", "Kenya", "Rwanda"), year = rep(2010,
df1 4), outcome1 = c(4.5, 3, 4.5, 6))
show(df1)
## country year outcome1
## 1 Zimbabwe 2010 4.5
## 2 Cameroon 2010 3.0
## 3 Kenya 2010 4.5
## 4 Rwanda 2010 6.0
# Add/delete column:
df1 = cbind(df1, outcome2 = c(1, 0, 0, 1)) # add column 'outcome2'
df1 = subset(df1, select = -c(outcome2)) # delete column 'outcome2'
# alternatives: df1 = df1[ , c('country','year','outcome1')] or df1 = df1[ ,
# colnames(df1) != 'outcome2']
# Add/delete row:
= rbind(df1, list("Djibouti", 1990, 9)) # add row
df1
= df1[df1$country != "Djibouti", ] # delete this row again
df1 # alternatives: df1 = df1[-nrow(df1),]
# note: adding a row like this instead makes the entries in columns 'year' and
# 'outcome1' become strings:
= rbind(df1, c("Djibouti", 1990, 9))
df1 # (the reason is that a vector, in contrast to a list, has to have all entries
# of the same type, and so 1990 and 9 become strings which turns the whole
# columns 'year' and 'outcome1' in df1 to strings) and deleting the row does
# not make the problem go away:
= df1[df1$country != "Djibouti", ]
df1
# instead, when you have a column that is a string (type character) but is
# supposed to record numbers, declare the column to be a numeric vector:
$outcome1 = as.numeric(df1$outcome1) df1
Merging two dataframes:
= data.frame(country = rep(c("Zimbabwe", "Ghana"), each = 3), year = rep(c(2010,
df2 2015, 2020), times = 2), outcome1 = c(4.5, 3, 2, 6, 6, 5))
# Note that we have a duplicate entry in df1 and df2: both have Zimbabwe for
# 2010
show(df2)
## country year outcome1
## 1 Zimbabwe 2010 4.5
## 2 Zimbabwe 2015 3.0
## 3 Zimbabwe 2020 2.0
## 4 Ghana 2010 6.0
## 5 Ghana 2015 6.0
## 6 Ghana 2020 5.0
# 'merge' merges the two dataframes based on common column names:
dfBoth = merge(df1, df2, all = TRUE)
Now we have a dataset in which an observation is defined by several variables (here: country and year). They can be cast in long- as well as short format, and there are some particular commands for them:
# Reshape dataset from/to long/short format:
library(reshape2) # contains commands melt and dcast
dfBothShort = dcast(dfBoth, country ~ year) #from long to short format
# note that going from long to short, we get rid of the column names 'year' and
# 'outcome1'
dfBoth = melt(dfBothShort, id.vars = c("country")) # from short to long format
# to supply column names 'year' and 'outcome1' again:
colnames(dfBoth)[-1] = c("year", "outcome1") # (first column is already 'country')
# Do something for given subset:
library(doBy) #contains command 'summaryBy'
# it repeats a command for each entry of a variable (e.g. for each country, or
# each year):
# e.g. mean of outcome1 by country:
summaryBy(outcome1 ~ country, FUN = mean, data = dfBoth)
# number of missings in outcome1 by country:
summaryBy(outcome1 ~ country, FUN = function(x) {
    sum(is.na(x))
}, data = dfBoth)

# show whether outcome1 is above 5 by country and by year:
summaryBy(outcome1 ~ country + year, FUN = function(x) {
    x > 5
}, data = dfBoth)
Analyze missing values:
# Find rows (=observations) which have no missings:
# vector of logicals indicating whether row is complete:
vIsObsComplete = complete.cases(dfBoth)
# vector of logicals indicating whether row has missings:
vObsHasMissings = !vIsObsComplete

# Find rows which have missings in a particular column:
vObsHasMissings = is.na(dfBoth$outcome1)
# (if your missing is coded as '.', write ' dfBoth$outcome1 == '.' ', or first
# re-code the missings to NA)
# Find columns with missings:
vColHasMissings = apply(dfBoth, 2, function(x) {
    any(is.na(x))
})
Let’s see how we can define and change the path. A path is the location on your computer (think of a folder) which R uses to look for datasets you call and to store things you tell it to store.
# Have a look at the folder (path) that R is currently in:
getwd()
# Change the path:
= "/Users/markomlikota/Documents/MySpecialFolder"
sMyPath setwd(sMyPath)
# To change the path to a sub-folder inside 'MySpecialFolder':
setwd(paste(sMyPath, "/MySpecialSubFolder", sep = ""))
# (if you go regularly between your main folder and this subfolder, it makes
# sense to store the latter just as we stored the former):
= paste(sMyPath, "/MySpecialSubFolder", sep = "")
sMySubPath setwd(sMySubPath)
# Paths are computer-specific. But if you collaborate with several people via
# Dropbox, you can define the path relative to the Dropbox folder, so that
# everyone can use the same path definition and hence run the same code:
setwd("~/Dropbox/OurSpecialFolder")
# (provided that everyone has 'OurSpecialFolder' in the same location inside
# their dropbox)
# List all folders in the current path:
dir()
# Create a folder 'abc' in the current path:
dir.create("abc")
# Save all objects from the current workspace to an '.RData'-file:
save.image(file = "workspace.RData")
# Load the workspace from the file:
load("workspace.Rdata")
If you have a big project, it makes sense to break up your code into several R-scripts. From one script, you can call another using the following command:
source("util.R") # load/run external script
In my experience, due to type conversion issues, it’s much easier to work with csv files than with excel (xlsx) files in R. So the best practice is to just convert any xlsx files you need first manually into csv files. You can also do basic adjustments (like deleting unnecessary rows) manually in the xlsx/csv file.
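If you prefer to script the conversion instead, here is a minimal sketch in R; it assumes the packages 'readxl' and 'readr' are installed and that a file 'my_file.xlsx' with a sheet 'data' exists in the current path:
library(readxl)
library(readr)
dfTemp = read_excel("my_file.xlsx", sheet = "data") # read the xlsx file
write_csv(dfTemp, "my_file.csv") # store it as a csv file for later use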
The following commands are useful for reading-in files (datasets) into R:
# Read in .csv file:
# simply:
= read.table("exams.csv", sep = ",")
myData # with more options:
= read.csv("exams.csv", header = TRUE, sep = ",", quote = "\"", dec = ",",
myData fill = TRUE, comment.char = "", skip = 11, nrows = 1083)
# Read in .xlsx file:
library(readxl)
my_data = read_excel("my_file.xlsx", sheet = "data")

# Read in .dta file:
library(haven)
myData = read_dta("GMdata.dta")
# There are many packages to download data directly from websites, e.g.
# 'quantmod' to download data from yahoo or FRED database, or 'imfr' package
# for downloading data from IMF
# Download file from url:
library(RCurl)
download.file("https://api.census.gov/data/1994/cps/basic/jan?tabulate=weight(PWCMPWGT)&col+PEEDUCA&row+PEMLR",
"./Data/1994jan", "libcurl")
# If you had to take an xlsx file and some columns of it are not recognized as
# a numeric by R (but as a factor), this can be useful:
indx = sapply(myData, is.factor)
myData[indx] = lapply(myData[indx], function(x) as.numeric(as.character(x)))
# You might also need to convert columns of type 'character' to type 'numeric':
myData[, -1] = lapply(myData[, -1], function(x) as.numeric(x))
# (here it's done for all but the first column, say because it contains a list
# of country names, which are supposed to be strings)
The following commands are useful for writing a dataframe to a file:
# Write .csv files:
write.table(myData, "exams.csv", sep = ",")
# to preserve special encoding (e.g. cyrillic characters), use this:
library(readr)
write_excel_csv(x = myData, file = "exams.csv", col_names = T)
# Write .xlsx files:
library(openxlsx)
write.xlsx(myData, "exams.xlsx")
# write several sheets: (same package)
= createWorkbook("outputHere")
wb addWorksheet(wb, "alpha")
addWorksheet(wb, "sigma")
addWorksheet(wb, "further")
writeData(wb, sheet = "alpha", mData1)
writeData(wb, sheet = "sigma", mData2)
writeData(wb, sheet = "further", mData3)
saveWorkbook(wb, sFileName, overwrite = TRUE)
Statistical software like R can be used to draw random numbers from some specified distribution. These numbers are not truly random, but appear indistinguishable from actual random numbers. To make the analysis replicable, one must fix the so-called “seed”, which ensures that every time the code is run, the same quasi-random numbers are drawn.
# fix seed
set.seed(50) # choose some (any) number here
# Uniform distribution:
runif(3) # three draws from U(0,1)
# Normal distribution:
rnorm(5, mean = 3, sd = 1) # 5 (independent) draws from Normal
pnorm(0.1) # return standard normal cdf at given point
dnorm(0.1) # return standard normal pdf at given point
qnorm(0.5) # point at which standard normal cdf gives 0.5
# hence, to get critical values for two-sided 95% level test:
qnorm(0.025)
qnorm(0.975)
library(mvtnorm) # for multivariate normal
vMeans = c(0, 5)
mVariance = matrix(data = c(0.4, 0.2, 0.2, 0.9), nrow = 2, ncol = 2)
rmvnorm(1, vMeans, mVariance) # single draw from multivariate Normal
# t-distribution:
rt(5, 32) # 5 draws, df=32
# Chi-squared:
rchisq(5, 9) # 5 draws, df=9
# Inverse Gamma:
library(invgamma)
rinvgamma(5, 10, 2) # 5 draws, parameters (10,2)
# Inverse Wishart:
library(MCMCpack)
nu = 5
mS = matrix(data = c(0.4, 0.2, 0.2, 0.9), nrow = 2, ncol = 2)
riwish(v = nu, S = mS) #single draw
# Multinomial distribution: i.e. sample 2 times (the elements) from vector
# 'vOutcomes':
vOutcomes = c(3.4, 5.3, 3.2)
sample(vOutcomes, 2, replace = TRUE)
# by default, R assumes that all elements have equal probabilities,
# alternatively supply vector of probabilities and type:
vProbabilities = c(0.2, 0.2, 0.6)
sample(vOutcomes, 2, replace = TRUE, prob = vProbabilities)
# Fit Kernel-density-estimate on vector 'vOutcomes':
density(vOutcomes)
For optimization of a scalar-valued function w.r.t. a scalar argument, it is easiest to use “optimize”:
fFunToOpt = function(v) {
    (v - 3)^2
}
o = optimize(fFunToOpt, interval = c(-5, 5))
o$minimum
For optimization of a scalar-valued function w.r.t. a vector-valued argument, one can use “optim”:
# note: the objective passed to 'optim' must return a scalar, so use e.g. the
# sum of squared deviations:
fFunToOptVec = function(v) sum((v - 3)^2)
vTheta0 = c(0, 0, 0) # initial value
optim(vTheta0, fFunToOptVec, method = "BFGS")
# the function supports several different methods, see documentation
First, let’s take a look at how to run and analyze linear regressions in R:
data(mtcars) # load data for examples
# Specifying a linear regression:
# regress hp on intercept, mpg and gear:
reg = lm(mtcars$hp ~ mtcars$mpg + mtcars$gear)
# (here we simply use three vectors) (alternatively, when using a dataframe,
# like here, can write)
reg = lm(hp ~ mpg + gear, data = mtcars)

# Doing it manually is also easy (and one can be sure what it's doing):
mX = mtcars[, c("mpg", "gear")]
mX = as.matrix(cbind(rep(1, nrow(mX)), mX))
mY = mtcars$hp
vBetaHat = solve(t(mX) %*% mX) %*% (t(mX) %*% mY)
# and so on for standard errors, t-values, etc., whatever you need
# drop intercept:
lm(hp ~ -1 + mpg + gear, data = mtcars)
# include dummy for 8 cylinders:
lm(hp ~ mpg + gear + cyl == "8", data = mtcars)
# include dummy for each value of cylinders:
lm(hp ~ mpg + gear + factor(cyl), data = mtcars)
# add mpg-squared:
lm(hp ~ mpg + I(mpg^2) + gear, data = mtcars)
# add as covariates all polynomials of 'mpg' and 'gear' up to order 2:
library(stats)
lm(hp ~ poly(as.matrix(mtcars[, c("mpg", "gear")]), degree = 2, raw = TRUE), data = mtcars)
# the first argument of 'lm' is of the form ' Y ~ x1 + x2 ... + xK ' and is
# called a formula. Sometimes we want to specify this formula in a loop, in
# which case generating it from a string can be useful: e.g. this code can be
# used to run three regressions, first inflcorr1 on geodesic, then inflcorr2 on
# geodesic, then inflcorr3 on geodesic:
for (hh in 1:3) {
formula(paste("inflcorr", hh, " ~ geodesic", sep = ""))
}
# Analyzing the output of a linear regression:
reg$residuals # vector of residuals
reg$coefficients # vector of estimated coefficients
reg$fitted.values # vector of fitted values

summary(reg) # all sorts of summary stats
mySumReg = summary(reg) #store
mySumReg$r.squared # extract r-squared
mySumReg$adj.r.squared
mySumReg$coefficients # estimated coefficients, SEs, t-vals, p-vals
# note: by default, homoskedasticity is assumed. To compute the
# heteroskedasticity-robust variance:
library(sandwich)
mVarHskRob = vcovHC(reg, type = "HC")
# .. whose diagonal gives variances of individual coeff-estimates, and taking
# square roots we get the heterosk-robust SEs:
sqrt(diag(vcovHC(reg, type = "HC")))
# Note: type = 'HC0' has no DF correction while type = 'HC1' has the DF
# correction. Both are valid asymptotically, however a DF correction is
# preferable in small samples
# To print results with robust SEs, use:
library(lmtest)
coeftest(reg, vcov. = mVarHskRob)
# or:
summary(reg, vcov = function(x) vcovHC(x, type = "HC"))
BIC(reg) # compute Bayesian information criterion
# compute Schwarz information criterion (manually):
MSE = mean(reg$residuals^2)
K = length(reg$coefficients) # K = # of regressors
N = length(reg$residuals) # N = # of obs
SIC = (K/N) * log(N) + log(MSE)
# F-test:
# manually: 1. run restricted regression, store R^2 2. run unrestricted
# regression, store R^2, number of regressors, number of restrictions, and
# compute:
F = ((R2_unr - R2_r)/q)/((1 - R2_unr)/(n - k_unr - 1))
# package:
library(car)
linearHypothesis(reg, test = "F", c("mpg=0", "gear=0"), vcov. = vcovHC(reg, type = "HC"))
While there are packages for many more econometrics-related computations in R, I strongly advocate using these standard commands only to run preliminary regressions as part of your data-analysis process, not for the final output in a paper or otherwise important document. If the output is really important, you should take the time to code the estimates, compute the standard errors, conduct hypothesis tests, etc., yourself. Only then do you know exactly how your results have been obtained and what the assumptions behind them are.
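As an illustration of this do-it-yourself approach, the following sketch computes heteroskedasticity-robust (HC0, i.e. without a DF correction) standard errors by hand, continuing the manual OLS example above (objects mX, mY, vBetaHat); it is meant as a template, not a substitute for checking the formulas yourself:
vResid = mY - mX %*% vBetaHat # OLS residuals
mXtXinv = solve(t(mX) %*% mX) # (X'X)^(-1)
mMeat = t(mX) %*% diag(as.vector(vResid)^2) %*% mX # X' diag(e^2) X
mVarRob = mXtXinv %*% mMeat %*% mXtXinv # sandwich formula
vSERob = sqrt(diag(mVarRob)) # heteroskedasticity-robust SEs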
More advanced (cross-sectional) econometric models:
# Probit model:
# regress dummy on having hp above average on mpg and gear:
reg = glm((hp > mean(mtcars$hp)) ~ mpg + gear, family = binomial(link = "probit"),
    data = mtcars)
summary(reg)
# IV Methods:
library(ivreg)
# the following runs a 2SLS regression, where mpg is endo regressor, and we are
# using 'cyl' and 'disp' as IVs:
ivreg(hp ~ mpg + gear | gear + cyl + disp, data = mtcars)
# i.e. hp is dependent variable, next come all covariates in that second stage
# regression, endogenous and exogenous, and then after the | come the exogenous
# covariates and IVs
# Least-Absolute Deviations regression:
library(quantreg)
lad_two = rq(hp ~ mpg + gear, data = mtcars)
summary(lad_two)
# Load data for examples:
library(AER)
data(Fatalities)
# declare as panel data:
library(plm)
pdata = pdata.frame(Fatalities, index = c("state", "year"), drop.index = F, row.names = T)

# compute ratio of number of fatal accidents to population:
pdata[, "fatalToPop"] = pdata$fatal/pdata$pop
# regress it on income, drinking age and the latter's interaction with the
# percentage of young drivers:
# POLS:
reg = plm(fatalToPop ~ income + drinkage + drinkage * youngdrivers, data = pdata,
    model = "pooling")
# FE-Within:
reg = plm(fatalToPop ~ income + drinkage + drinkage * youngdrivers, data = pdata,
    model = "within", index = c("state", "year"))
# compute clustered standard errors:
vcovHC(reg, cluster = "group", type = "sss")
# there are many ways to cluster your standard errors, that's why it's best if
# you code this yourself
# The following functions are useful when doing PD F.E. regressions manually:
# compute FDs of a variable for each unit:
library(doBy)
fFDColumnByUnit = function(mX, indUnit, indCol) {
    mHelp = summaryBy(list(c(colnames(mX)[indCol]), c(colnames(mX)[indUnit])), FUN = function(x) {
        x[-1] - x[-length(x)]
    }, data = mX)
    as.vector(t(as.matrix(mHelp[, -1])))
}
# compute time-demeaned values of a variable for each unit:
fDemeanColumnByUnit = function(mX, indUnit, indCol) {
    mHelp = summaryBy(list(c(colnames(mX)[indCol]), c(colnames(mX)[indUnit])), FUN = function(x) {
        x - mean(x)
    }, data = mX)
    as.vector(t(as.matrix(mHelp[, -1])))
}
# compute FDs of fourth column (unit (state) is specified in first column):
fFDColumnByUnit(pdata, 1, 4)
# compute time-demeaned fourth column:
fDemeanColumnByUnit(pdata, 1, 4)
# Load data for examples:
library(carData)
data(Hartnagel)
# Create xts object, which facilitates working with TS data:
library(xts)
# first create vector of dates:
= seq(as.Date("1931-01-01"), length = nrow(Hartnagel), by = "years")
vDates # then create xts-vector of your TS:
= xts(x = Hartnagel$mtheft, order.by = vDates)
vMTheft # or, multiple variables (=columns) at once, this creates an xts-matrix:
= xts(x = Hartnagel[, c("mtheft", "ftheft")], order.by = vDates)
mTheft
# Download data from FRED Database (and other sources) with package quantmod:
library(quantmod)
# download monthly US CPI data:
= getSymbols("CPIAUCSL", src = "FRED", auto.assign = FALSE)
cpi # gives you an xts object, which we call here 'cpi'
# download quarterly GDP and adjust the dates:
getSymbols("GDP", src = "FRED") # this gives you an xts object 'GDP'
# this is how you adjust dates with xts objects:
= GDP["1960-01-01/2020-01-01"]
GDP # or, for later re-use:
= "1960-01-01"
sStartDate = "2020-01-01"
sEndDate = function(x) {
fMyTimeRange paste(sStartDate, sEndDate, sep = "/")
}= fMyTimeRange(GDP)
GDP
# get quarterly GDPdeflator:
getSymbols("GDPDEF", src = "FRED")
= fMyTimeRange(GDPDEF)
GDPDEF
# Easy manipulations with xts objects:
cpi = apply.quarterly(cpi, last) # keep only one month per quarter (the last one)

diff(cpi, 1) # compute first-differences
diff(cpi, 4) # compute fourth-differences (y_t - y_{t-4})
infl = diff(log(cpi)) * 4 # compute annualized inflation from quarterly CPI

# fit AR(2) model on inflation:
reg = lm(infl ~ lag(infl, 1) + lag(infl, 2)) #similarly: lead()
# note: lag() works only for xts objects! for ts objects it doesn't, but you
# get no error message!!!
# compute ACF and PACF:
acf(infl[-1])
pacf(infl[-1])
# compute cross-correlations of two series (here up to 3 leads and lags):
vInfl = diff(log(GDPDEF))[-1]
vGDPg = diff(log(GDP))[-1]
ccf(as.numeric(vInfl), as.numeric(vGDPg), 3, plot = FALSE)
# note: again package 'lubridate' is useful to work with dates:
library(lubridate)
vDatesCPI = index(cpi) # access vector of dates corresponding to xts-object 'cpi'
month(index(cpi)) # see the month for each date
# For some things, like fitting ARMA models and forecasting with them, it's
# easier to work with 'ts'-objects rather than 'xts'-objects:
# create ts-object:
vInflTS = ts(infl, start = c(1960, 3), frequency = 4)
model = arima(vInflTS, order = c(1, 0, 0)) # estimate AR(1) process
model = arima(vInflTS, order = c(2, 0, 1)) # estimate ARMA(2,1) process
# the middle argument stands for integration order (ARIMA process)

library(forecast)
# forecast up to 10 periods ahead and compute two-sided 90% and 95% CIs:
mForecasts = forecast(model, h = 10, level = c(90, 95))
# plot these forecasts with fanchart:
plot(mForecasts, PI = TRUE, showgap = FALSE, shaded = TRUE)
# Estimate reduced-form VAR:
library(vars)
mData = data.frame(GDP = vGDPg, INFL = vInfl)
modelVAR = VAR(mData, p = 2) # add type='trend' to allow for trend
vResids = residuals(modelVAR)
sumModelVAR = summary(modelVAR)
mSigma = sumModelVAR$covres # variance-covariance matrix of errors
coef(modelVAR) # coefficients (Phi's)
# forecast:
predict(modelVAR, n.ahead = 5, level = c(90, 95))
With R, one is very flexible in creating plots (a primary reason behind its popularity). Before we get to more advanced plotting-techniques, based on the package “ggplot2”, let’s start with some basic plots, using pre-built commands in R. They are useful for checking quickly how something looks in the data-exploratory process (and for inclusion in preliminary reports).
library(carData)
data(Hartnagel)
data(mtcars)
# Plot one variable (=vector):
vY = c(10, 12, 17)
vZ = c(11, 10, 12)

myfirstplot = plot(vY) # plots three dots at 10, 12 and 17 against 1, 2 and 3 (indices)
# add type='l' to get line instead of dots:
plot(vY, type = "l")
lines(vZ, col = "red") # add red line on top
# thicker lines, y-limits, color, title, x- and y-axis-labels:
plot(vY, type = "l", lwd = 4, ylim = c(4, 18), col = "green", main = "My Title Here",
xlab = "my x lab", ylab = "my y lab")
# further options:
= "n" # to remove ticks/numbers for x axis, similar for y
xaxt
# all of these options hold also for the following plots
# Plot one variable (=vector) against another:
# scatter:
plot(mtcars$hp, mtcars$mpg)
# line-plot:
plot(Hartnagel$year, Hartnagel$ftheft, type = "l")
# add regression line:
abline(lm(Hartnagel$ftheft ~ Hartnagel$year), col = "red")
# add vertical line at x=1950:
abline(v = 1950, col = "green")
# can also add horizontal line, or line with specific intercept and slope
# Histogram:
hist(mtcars$hp)
# Boxplot:
boxplot(mtcars$hp)
# boxplot of one column in dataframe by year (other column in dataframe):
boxplot(hp ~ cyl, data = mtcars, xlab = "cyls", ylab = "", main = "Boxplot of HP by #cyls")
# To save plot:
# first specify your path:
= "/Users/markomlikota/Documents/W_Teaching"
sMyPath setwd(sMyPath)
# specify name:
png("blabla.png") # specify width and height here for nice plots
# create the plot (or print it if already created):
boxplot(mtcars$hp)
# tell R that you're done: (no further lines or so will be added)
dev.off()
# We can also plot some functions directly:
plot(dnorm, xlim = c(-10, 10)) # standard-normal pdf
plot(exp, xlim = c(1, 3)) # exponential function
fMyOwn = function(x) {
    (x - 3)^2
}
plot(fMyOwn, xlim = c(-5, 5)) #my own function
# Note that when plotting xts objects, we get a plot with lots of
# default-adjustments (it realizes that x-axis should be time, has a grid,
# etc.):
library(quantmod)
getSymbols("GDPDEF", src = "FRED")
= GDPDEF["1960-01-01/2020-01-01"]
GDPDEF = diff(log(GDPDEF))[-1]
vInfl
plot(vInfl, main = "Inflation (quarterly, annualized, %)")
# Multiple plots (panels) in one:
getSymbols("GDP", src = "FRED")
= GDP["1960-01-01/2020-01-01"]
GDP = diff(log(GDP))[-1]
vGDPg
par(mfrow = c(2, 1)) # 2 x 1 plot
plot(vInfl, main = "Inflation (quarterly, annualized, %)") # specify first
plot(vGDPg, main = "GDP Growth (quarterly, %)") # specify second
par(mfrow = c(1, 1))
# see https://www.r-graph-gallery.com/74-margin-and-oma-cheatsheet.html for
# adjusting margins
# Empty plot: (useful when creating e.g. a plot with 3 x 3 subplots and you
# have only 8 plots)
plot(0, type = "n", axes = FALSE)
# To show an already created plot, or to show plot when created inside a loop:
print(myfirstplot)
More advanced plots can be created with the package “ggplot2”. We’ll start with a plot containing a single line and then discuss multiple lines, scatter plots, histograms and other plot-types. Various plot-options are discussed, and each is introduced as part of the discussion of the plot-type for which it is best explained.
(Sidenote: a further popular package for plotting in R is “plotly”. However, in my view “ggplot2” is better in all aspects but creating interactive plots.)
library(ggplot2)
# Empty plot:
ggplot() + theme_void()
# Based on two vectors:
vX = seq(-2, 5, length.out = 100)
vY = exp(vX)
ggplot() + geom_line(aes(x = vX, y = vY))
# based on a dataframe:
mData = data.frame(col1 = vY, col2 = vX)
ggplot(data = mData) + geom_line(aes(x = col2, y = col1))
# to change appearance of line, specify further attributes of the option
# 'geom_line()':
ggplot(data = mData) + geom_line(aes(x = col2, y = col1), colour = "blue", size = 0.9)
# add further options with pluses:
pp = ggplot(data = mData) + geom_line(aes(x = col2, y = col1), colour = "blue", size = 0.9) +
    labs(x = "my x lab", y = "my y lab", title = "my title here")
# can add further specifications/add-ons by calling already created plot:
pp + theme_bw() + theme(aspect.ratio = 5/8)
# You may want to re-use some options over and over again. Instead of typing
# them every time, store them as a list and call them all together:
plotOptions = list(theme_bw(), theme(aspect.ratio = 5/8))
# print the plot 'pp' with these options:
pp + plotOptions
# store these options in the plot 'pp':
pp = pp + plotOptions
# Save a ggplot:
pp #call it (print it)
ggsave("part2.png") #save it
# advanced options:
ggsave("part2.png", width = 15, height = 15 * (8/5), units = "cm")
# width/height should match previously specified aspect ratio
# Note: for inclusion in latex-document, save plot best as pdf
# Further options:
# have label as latex code:
library(latex2exp)
= TeX("$\\sigma + h_t $ (bla bla)")
sMyXlab + labs(x = sMyXlab)
pp
# adjust ticks and limits of x- and y-axes:
pp + scale_x_continuous(breaks = seq(-3, 7, 2), limits = c(-3, 7)) + scale_y_continuous(limits = c(0,
    180))
# instead of ' scale_y_continuous(limits=c(0,180)) ', can also write '
# ylim(0,180) ' or ' expand_limits(y=c(0,180)) ', and analogously for x. when
# changing only upper limit, type e.g. ' ylim(NA,180) '

# change axis values from decimals to percent, e.g. 0.01 becomes 1%
pp + scale_x_continuous(labels = scales::percent)
# manually change labels of ticks on x-axis:
pp + scale_x_continuous(limits = c(-3, 7), breaks = c(-3, 2, 3, 5, 7), labels = paste(c(-3,
    2, 3, 5, 7), "a"))
# note: we can skip a label, but include a tick by specifying empty labels for
# certain breaks:
= paste(c(-3, 2, 3, 5, 7), "a")
vLabs 2] = ""
vLabs[+ scale_x_continuous(limits = c(-3, 7), breaks = c(-3, 2, 3, 5, 7), labels = vLabs)
pp # have x-axis in units of pi:
= vX/pi
vxDivPi + scale_x_continuous(limits = range(vX), breaks = pi * (-1:2), labels = paste(-1:2,
pp "pi"))
# same using latex-label for pi:
+ scale_x_continuous(limits = range(vX), breaks = pi * (-1:2), labels = TeX(paste(-1:2,
pp "$\\pi$")))
# force all tick-labels to have same number of decimals (this is useful if
# x-axis is -0.04:0.04, and you want to have 0.00 instead of 0):
= function(x) sprintf("%.2f", x)
scaleFUN + scale_x_continuous(breaks = c(-3, 2, 3, 5, 7), lab = scaleFUN)
pp
# remove axis ticks and tick-labels in y-axis:
pp + theme(axis.ticks.y = element_blank(), axis.text.y = element_blank())
# change font size for tick-labels in x-axis and vertical space between axis
# and labels:
pp + theme(axis.text.x = element_text(size = 12, vjust = -1))
# change font size for x-axis label:
pp + theme(axis.title.x = element_text(size = 12))
# rotate x-axis labels:
pp + theme(axis.text.x = element_text(angle = 45, hjust = 1)) # can also specify v-just

# change appearance of title:
pp + theme(plot.title = element_text(family = "sans", size = 18, margin = margin(0,
    0, 30, 0)))
# (instead of margin, can also specify vjust and hjust)
# add margins of 1cm all around plot:
pp + theme(plot.margin = unit(c(1, 1, 1, 1), "cm"))

# remove grid:
pp + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
# remove panel border (the box around the main plot area):
pp + theme(panel.border = element_blank())
# remove panel border but add x- and y-axis lines:
pp + theme(panel.border = element_blank()) + theme(axis.line = element_line(colour = "black",
    size = 0.2))

# shading of specified area (useful for confidence intervals):
pp + geom_ribbon(aes(ymin = rep(0, 100), ymax = vY * 0.9, x = vX), alpha = 0.2, fill = "red")

# Annotate text at given (x,y)-point:
pp + annotate("text", label = "abc", x = -1, y = 150, hjust = 0.1, vjust = 0.9)

# Flip axes:
pp + coord_flip()
# note that the options can be combined, and they have to be if they belong to
# the same option-command; e.g. don't write ' theme(panel.border =
# element_blank()) ' and then ' theme( axis.title.x = element_text(size=12) )
# ', but write ' theme( panel.border = element_blank(), axis.title.x =
# element_text(size=12) ) '
# Create dataframe for examples:
vX = seq(0.01, 0.5, length.out = 10)
mData2 = cbind(vX, exp(vX), vX^2, vX^1.5)
colnames(mData2) = c("myX", "myY1", "myY2", "myY3")
mData2 = as.data.frame(mData2)
# Two lines:
ggplot(mData2, aes(x = myX)) + geom_line(aes(y = myY1), size = 1.1) + geom_line(aes(y = myY2),
size = 1.5)
# Multiple (many) lines:
# to plot several lines, it is best to first 'melt' the dataframe into a long
# format:
library(reshape2)
= melt(mData2, id.vars = "myX")
mDat2 # the different variables 'myY1', 'myY2' and 'myY3' are now distinguished in
# the column 'variable', their values are shown in the column 'value'
# now we can let ggplot take different linetypes for different values of
# 'variable', i.e. for our three different variables/lines:
= ggplot(mDat2, aes(x = myX, y = value, group = variable)) + geom_line(aes(linetype = variable))
pp1 # or colors:
= ggplot(mDat2, aes(x = myX, y = value, group = variable)) + geom_line(aes(color = variable))
pp2 # or line-colors and point-colors and point-shapes:
= ggplot(mDat2, aes(x = myX, y = value, group = variable)) + geom_line(aes(color = variable),
pp3 size = 0.5) + geom_point(aes(color = variable, shape = variable), size = 2)
# set linetypes manually:
+ scale_linetype_manual(values = c("solid", "dashed", "dotdash"))
pp1 # (note: no problem when provided vector is longer than the number of lines you
# have) see http://sape.inf.usi.ch/quick-reference/ggplot2/linetype for
# linetypes available
# to also label manually lines that are defined by different linetypes:
+ scale_linetype_manual(values = c("solid", "dashed", "dotdash"), labels = c("mycurve1",
pp1 "mycurve2", "mycurve3"))
# set colors manually:
library(colorspace)
# take three colors (evenly spaced) from palette 'Blues 2':
= sequential_hcl(3, palette = "Blues 2")
vMyColors # or take 9 colors and then take three ones from the middle:
= sequential_hcl(9, palette = "Blues 2")[c(3, 5, 7)]
vMyColors + scale_color_manual(values = vMyColors)
pp2 # again, add 'labels = ' to specify labels for the lines
# set point-shapes manually:
+ scale_shape_manual(values = c(7, 9, 24))
pp3 # see http://sape.inf.usi.ch/quick-reference/ggplot2/shape for shapes available
# Legend options:
# removes legend:
pp1 + theme(legend.position = "none")
# remove legend title:
pp1 + theme(legend.title = element_blank())

# by default, legend is vertical and outside the plot, on the right side. To
# change its position, keeping it outside, use e.g.
pp1 + theme(legend.justification = "bottom")
# have legend outside the plot, on the bottom, in horizontal order:
pp1 + theme(legend.direction = "horizontal", legend.justification = "left", legend.position = "bottom")

# change legend position inside the plot: ( top right is (1,1), top left is
# (0,1))
pp1 + theme(legend.position = c(0.7, 0.7))

# legend with transparent background:
pp1 + theme(legend.key = element_rect(colour = "transparent", fill = "transparent"))
Special care is required if you have time on the x-axis:
# here with one single curve, but same principle applies with several lines:
mData3 = data.frame(date = index(vGDPg), GDP = vGDPg, INFL = vInfl)

# when x-axis is date, use the option 'scale_x_date()' rather than
# 'scale_x_continuous()':
pp = ggplot(mData3, aes(x = date)) + geom_line(aes(y = GDP), size = 1.1) + scale_x_date(date_labels = "%Y",
    date_breaks = "2 years", limits = as.Date(c("2000-01-01", "2020-01-01")))
pp
# when you have a string-vector containing the dates which is not in the format
# required by R ('YYYY-MM-DD'), change it before plotting using this command:
as.Date(vMyDates, "%d.%m.%Y")
# e.g. for a single date:
as.Date("31.01.2020", "%d.%m.%Y")
# always need to use this R-date-format when specifying something in units of
# the x-axis, e.g. when annotating text:
+ annotate("text", label = "abc", x = as.Date("2003-01-31"), y = 0.1, hjust = 0.1,
pp vjust = 0.9)
data(mtcars)
# Simple scatter of hp against mpg:
ggplot() + geom_point(aes(x = mtcars$mpg, y = mtcars$hp))
# or:
pp4 = ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point(aes(x = mpg, y = hp))
# Have the color of dots differ for different numbers of cylinders:
ggplot(mtcars, aes(x = mpg, y = hp, group = cyl)) + geom_point(aes(color = cyl))
# Note that R thinks of cyl as a continuous variable, even though it takes only
# three values. To fix that, code it as type 'character' rather than 'numeric':
mtcars2 = mtcars # create new dataframe..
mtcars2$cyl = as.character(mtcars2$cyl) # .. in which cyl is a character-column
# now R realizes that cyl only has a discrete number of values:
ggplot(mtcars2, aes(x = mpg, y = hp, group = cyl)) + geom_point(aes(color = cyl))
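Alternatively (a small sketch, not from the original code), you can convert on the fly inside aes() with factor(), which also makes ggplot treat cyl as discrete:
ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point(aes(color = factor(cyl)))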
# Have shapes of scatter-points differ for different number of cylinders: for
# this we really need to have cyl as a character-column, i.e. R needs to
# realize it's a discrete variable, because we cannot let the shapes change
# continuously:
ggplot(mtcars2, aes(x = mpg, y = hp, group = cyl)) + geom_point(aes(shape = cyl),
size = 2)
# of course, we can let points differ by color as well as shape:
ggplot(mtcars2, aes(x = mpg, y = hp, group = cyl)) + geom_point(aes(shape = cyl,
color = cyl), size = 2)
# as before (plot of multiple lines; see above), we can manually specify colors
# and shapes using 'scale_color_manual' and 'scale_shape_manual'
# Let's now have the colors of dots change based on a continuous variable, e.g. 'qsec':
pp5 = ggplot(mtcars, aes(x = mpg, y = hp, group = qsec)) + geom_point(aes(color = qsec))
# As opposed to before, here the colors change continuously. We can specify
# which colors we want by specifying the end-points (= colors) of the scale:
myBlue = sequential_hcl(5, palette = "Blues 2")[1]
pp5 + scale_colour_gradient(low = "white", high = myBlue)
# Add labels to scatter-points:
vMyLabels = rownames(mtcars)
pp5 + geom_text(aes(label = vMyLabels), size = 2, hjust = 1.05, vjust = 0)
# to make it legible, add labels only for the two cars with highest hp:
vIncludeLabel = mtcars$hp >= sort(mtcars$hp, decreasing = TRUE)[2]
vMyLabels[!vIncludeLabel] = NA
pp5 + geom_text(aes(label = vMyLabels), size = 2, hjust = 1.05, vjust = 0)
# i.e. we set the labels of points that we do not want labelled to 'NA'
# Add lines to plot:
# line with specified intercept and slope (e.g. could add 45-degree line):
pp4 + geom_abline(intercept = 100, slope = 2)
# vertical line:
pp4 + geom_vline(xintercept = 18, linetype = "dotted", color = "black", size = 1)
# regression line:
pp4 + geom_smooth(method = lm, se = FALSE, color = "darkred")
# Histogram of hp:
ggplot(mtcars, aes(x = hp)) + geom_histogram(aes(x = hp, y = ..density..), binwidth = 10,
fill = "grey", color = "black")
# note that here we only need to specify the x-variable in aes()
# instead of binwidth, can also specify number of bins, e.g. bins=50
# Histogram + Line:
va = rnorm(500, 2, 1) # 500 draws from N(2,1)
vx = seq(-2, 6, 8/(500 - 1)) # grid
vb = dnorm(vx, mean = 2, sd = 1) # actual pdf of N(2,1), evaluated at grid
mData4 = data.frame(RV = va, stdnorm = vb)
# here we plot the histogram of 'RV', and we plot 'stdnorm' (vb) against vx:
ggplot(mData4, aes(x = RV, y = stdnorm)) + plotOptions + geom_histogram(aes(x = RV,
    y = ..density..), binwidth = 0.5, fill = "grey", color = "black") + geom_line(aes(x = vx,
    y = stdnorm), colour = "black")
# Multiple Histograms in one plot:
# here: a separate histogram of hp for each value of cyl, each with different
# color:
ggplot(mtcars2, aes(x = hp, fill = cyl)) + geom_histogram(aes(x = hp, y = ..density..,
color = cyl), color = "white", alpha = 0.6, position = "identity", bins = 20)
# the color specified separately (here white) is the border color of bars
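As a related sketch (not in the original code), overlapping density estimates are often easier to read than overlapping histograms:
ggplot(mtcars2, aes(x = hp, fill = cyl)) + geom_density(alpha = 0.6)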
# Barplot:
# compute and plot mean hp by cyl:
library(doBy)
mData4 = summaryBy(hp ~ cyl, FUN = mean, data = mtcars)
colnames(mData4) = c("cyl", "meanhp")
# if we let cyl be a numeric column, R thinks it's continuous, so the default
# plot does not look very nice as we have a continuous x-axis...
pp6 = ggplot(data = mData4, aes(x = cyl, y = meanhp)) + geom_bar(stat = "identity",
    color = "white", fill = myBlue)
# ... but we can easily adjust the limits (and breaks and labels) manually:
pp6 + scale_x_continuous(limits = c(2, 10), breaks = c(4, 8), labels = paste(c(4,
    8), "cyls", sep = ""))
# declaring cyl as a character-column makes R realize it is a discrete variable:
mData4$cyl = as.character(mData4$cyl)
# then the default plot has a discrete x-axis:
pp7 = ggplot(data = mData4, aes(x = cyl, y = meanhp)) + geom_bar(stat = "identity",
    color = "white", fill = myBlue)
# but we cannot do much with the x-axis except change the labels:
pp7 + scale_x_discrete(labels = paste(c(4, 6, 8), "cyls", sep = ""))
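A small addition (a sketch, not from the original code): to turn the bars horizontal, add coord_flip():
pp7 + coord_flip()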
# QQPlot: (not part of ggplot2)
vTdistRV = rt(100, 10) # 100 t-dist. RVs with 10 dfs
qqnorm(vTdistRV, xlab = "N(0,1)", ylab = "t(10)", col = "blue")
qqline(vTdistRV, col = "tomato")
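If you prefer to stay within ggplot2, here is a minimal sketch using stat_qq() and stat_qq_line(), which compare the sample against a normal distribution by default:
ggplot(data.frame(x = vTdistRV), aes(sample = x)) + stat_qq(color = "blue") + stat_qq_line(color = "tomato")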
We can also combine several ggplots into one figure in R (though when producing a LaTeX document from the output, it is more flexible to produce the plots individually and assemble them in LaTeX):
# First approach:
library(gridExtra) # need both packages
library(ggpubr)
title = text_grob("True Parameter Values", size = 18, face = "bold", vjust = 1)
# 1 x 2 plots:
grid.arrange(pp1, pp2, ncol = 2, top = title, widths = c(2, 2), layout_matrix = rbind(c(1,
    2)))
# to save:
ggsave("myplot1.pdf", arrangeGrob(pp1, pp2, ncol = 2, top = title, widths = c(2,
    2), layout_matrix = rbind(c(1, 2))))
# 2 x 2 plots:
grid.arrange(pp1, pp2, pp2, pp1, ncol = 2, top = title, widths = c(2, 2), layout_matrix = rbind(c(1,
    2), c(3, 4)))
# can also add heights=c(3,3); note that the width/height ratio determines the
# aspect ratio of the plot
# to save:
ggsave("myplot2.pdf", arrangeGrob(pp1, pp2, pp2, pp1, ncol = 2, top = title, widths = c(2,
    2), layout_matrix = rbind(c(1, 2), c(3, 4))))
# Second approach:
library(patchwork)
# 2 x 3 plots:
ppp = (pp1 | pp1 | pp1)/(pp1 | pp1 | pp1)
ppp + plot_annotation(title = "bla bla")
# can also have a more flexible plot-arrangement:
layout = "
AABB
CCDD
EEEE
EEEE
"
pp = pp1 + pp1 + pp1 + pp1 + pp1 + plot_layout(design = layout)
# if all plots have the same legend, you can collect them all into one single
# legend using the commands:
ppp + plot_layout(guides = "collect") + theme(legend.position = "bottom")
# in this approach, no special command is needed for saving the plots, but you need to
# create a large plot (as here) because with many plots, the default window size is
# usually too small:
ggsave("blabla.png", width = 40, height = 25, units = "cm")
# note: changing the width and height will also change the spacing between
# figures
There are many more things that ggplot2 can do. You can explore the functionality yourself by googling what you need and inspecting the various options one can supply (e.g. “scale_x_continuous()”) and their arguments (e.g. “breaks=”). To create a grid-plot, use “geom_raster()” or “geom_rect()”. To create an area-plot, use “geom_area()”.
Note: when iteratively adding layers to a plot using a loop (e.g. adding one line at a time to an existing plot), you need to use double quotation marks. Otherwise, ggplot creates the plot only once the loop is finished, which may lead to unexpected behavior. See this post: https://stackoverflow.com/questions/52671480/adding-multiple-ribbons-in-ggplot.
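As an illustration of one way to handle this (a sketch under my own assumptions, not necessarily the approach from the linked post): wrap the layer construction in a small helper function so that each iteration captures its own value of the loop variable, with column names passed as quoted strings:
mDatLoop = data.frame(x = 1:10, y1 = rnorm(10), y2 = rnorm(10)) # hypothetical data with several y-columns
addLine = function(p, sCol) {
    # each call has its own environment, so 'sCol' keeps its value when the plot is drawn
    p + geom_line(aes(y = .data[[sCol]], color = sCol))
}
pLoop = ggplot(mDatLoop, aes(x = x))
for (sCol in c("y1", "y2")) pLoop = addLine(pLoop, sCol)
pLoop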
Sometimes it is unavoidable that some code returns an error under certain conditions, but we still want the script to continue running: we attempt thing 1, and if it fails, we fall back to thing 2. We can use “tryCatch” for such code-chunks:
# try computing log-determinant of mMyMatrix, and specify what to do under
# error (here: return NA):
tryCatch(log(det(mMyMatrix)), error = function(e) {
NA
})
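A slightly fuller sketch (my own example, assuming a hypothetical object 'vMyData'): warnings can be caught in the same way, and a 'finally' block runs regardless of whether an error occurred:
tryCatch({
    log(vMyData)  # thing 1: may throw a warning (e.g. for negative values) or an error
}, warning = function(w) {
    message("caught a warning: ", conditionMessage(w))
    NA  # thing 2: fallback value
}, error = function(e) {
    NA
}, finally = {
    message("tryCatch finished")  # always executed
})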
Sometimes, we want to record the time that R needs to execute a certain chunk of code:
# record the time at a certain point: (before your code)
timeStart = Sys.time()

# ... do operations ...
# compute the time elapsed since 'timeStart':
Sys.time() - timeStart
# To record the time needed to execute a single command/function:
system.time(rnorm(10000))
# Let R 'sleep' for 4 seconds, i.e. do nothing:
Sys.sleep(4)
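To time a chunk of several commands at once (a small sketch, not from the original code), you can wrap them in curly braces inside system.time():
system.time({
    vDraws = rnorm(1e+06)
    vCumSum = cumsum(vDraws)
})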
Complex numbers:
# define a complex number:
b = 3 + 2i
# note: the imaginary unit is written with the literal suffix 'i' (as in '2i'); it is
# a good idea not to name one of your own variables 'i', to avoid confusion. Similarly,
# avoid overwriting the built-in constant 'pi'.
Mod(b) # modulus of b
Re(b) # real part of b
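A few further base-R helpers for complex numbers (added here as a short sketch):
Im(b) # imaginary part of b
Conj(b) # complex conjugate of b
Arg(b) # argument (angle) of b
sqrt(-1 + 0i) # square roots of negative numbers require complex input; returns 0+1i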
Some useful packages:
# Download IMF data directly into R:
library(imfr)
# see all datasets:
imf_ids(TRUE)
# Have a look at database 'IFS':
a = imf_codelist(database_id = "IFS")
# See what indicators are available in it:
a1 = imf_codes(codelist = "CL_INDICATOR_IFS")
# Download specified indicators from this database for specified countries, a
# specified year range, at specified frequency:
m_RGDP_Q_raw = imf_data(database_id = "IFS", indicator = c("NGDP_R_SA_XDC", "ENDA_XDC_USD_RATE"),
    country = c("AT", "AU"), start = "2021", end = "2022", freq = "Q", print_url = TRUE)
# Translate between official countrycodes:
library(countrycode)
vsC_cc3 = countrycode(vsC_cc2, "iso2c", "iso3c") # from iso2 to iso3
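For instance (a small sketch with made-up inputs), converting a vector of ISO2 codes to full country names:
countrycode(c("AT", "AU", "DE"), "iso2c", "country.name") # "Austria" "Australia" "Germany"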