R is a rather flexible programming language. It allows you to easily implement methods, approaches and procedures that deviate from standard ones and are tailored to suit your specific needs. (This contrasts with e.g. Stata or EViews.)
This script is meant to give an introduction to R and to collect useful commands for different programming tasks in one place. As such, it can be used as a lookup table. Even people experienced with a programming language regularly consult Google or previously written code as part of their coding routine, even more so when they haven’t used a programming language for a while or when switching back and forth between several languages. Not surprisingly, then, you won’t memorize much when going through this tutorial. It is only through experience that one can learn a programming language well. Nevertheless, this tutorial should facilitate the first steps.
Each of the following sections is self-contained. You’ll get most out of the tutorial by writing the commands you see here by yourself on your computer, running them, having a look at the results, and implementing simple changes to the code to see how the results change. Most of the code can be run, but some is included only for illustration purposes; it cannot be run since the necessary objects are not defined (e.g. some folder, dataset or excel-file).
You can install R at https://cran.r-project.org. Then, install RStudio at https://posit.co. RStudio is an interface that facilitates working with the programming language R. You can execute R-code by writing directly into the console window, located in RStudio by default at the bottom-left. However, this should be done for simple commands only and/or for experimentation. Longer code that you would like to re-use should be written in the script window on the top-left, as it can be stored as an .R-file for later re-use. The script (or parts thereof) can be run to produce output in the console. To do so, select the code you’d like to run and click on the “Run”-button on the top-right of the script-window. (On a Mac, you can do so even easier by selecting the code and pressing cmd+enter.) Objects you created (variables, matrices, plots, functions, etc.) are shown in the workspace/environment window on the top-right. The window on the bottom-right serves several purposes; it displays plots, contains a list of installed packages (and can be used to install new packages), and it shows the documentation page when help about a command is requested.
Include these two commands at the beginning of your script. The first deletes everything in the workspace/environment, the second deletes everything in the console.
rm(list = ls()) # delete everything in workspace/environment
cat("\f") # delete everything in console
To run a chunk of code, select it and press “cmd+enter” on a Mac or click on the “Run” button on top of the script-window.
Comments are chunks of code (or text) that R ignores when running. They start with a hashtag, as above. To comment-out a chunk of code or text, on a Mac, select it and press “cmd+shift+c”.
To request help with a command, type the command with a question mark in front in the console. For example, “?sqrt” shows the help page for the command “sqrt”.
To create an object (a variable) called “a” that is equal to three, use one of the following three commands:
a = 3
a <- 3
assign("a", 3)
In the workspace window, you can see that a variable was created. To display it in the console, type
show(a) #or:
## [1] 3
print(a)
## [1] 3
To delete it, type
rm(a) # rm stands for remove
You can perform basic arithmetic operations with numbers as well as defined objects/variables:
3 + 8
## [1] 11
7^2
## [1] 49
a = 3
b = 8
a + b # displays the result in console
## [1] 11
c = a + b # saves the result to a new variable, 'c'
Many standardly used functions are already implemented by default in R. For example:
sqrt(a) # square-root
log(a) # natural log (ln)
exp(a) # exponential
# trigonometric functions:
cos(a)
sin(a)
tan(a)
ceiling(3.4) # returns 4, the smallest integer larger than 3.4
floor(3.4) # returns 3, the largest integer smaller than 3.4
b%%a # computes the modulus; remainder of b/a
One can easily combine functions:
sqrt(ceiling(8.9))
log(exp(a))
To round a number to a desired number of digits:
round(342.347, digits = 2)
A variable can also take on the value “Inf” or “-Inf” (positive or negative infinity). Any further operation with this variable gives again “Inf” or “-Inf”. For example,
a = Inf
sqrt(a) # gives Inf
-Inf + 4 # gives -Inf
A variable can also take on the value “NA” (not available) or “NaN” (not-a-number). (This happens, for example, if one loads an excel file with an empty cell into R, as the value in such a cell needs to be distinguished from, say, zero.) Again, any operation with an NA returns NA:
a = NA
sqrt(a) # gives NA
NA + 4 # gives NA
(The same results are obtained when a is “NaN” instead of “NA”.)
Objects you create are stored by R as different “types”. Numbers like the ones created above are of type “double” and “numeric”:
a = 3
b = 8
typeof(a) # tells you the type of variable/object 'a'
## [1] "double"
is.double(a) # verifies whether a is of type double
## [1] TRUE
is.numeric(a) # verifies whether a is a numeric
## [1] TRUE
The difference between a double and a numeric is typically irrelevant. (You can think of “numeric” being a larger category that encompasses both “double” and “integer” objects.)
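As an aside (an addition, not from the original text): the integer type can be made explicit with the suffix “L”. A quick sketch:

```r
a <- 3L # the suffix L stores the number as an integer
b <- 3  # without it, R stores a double

typeof(a) # "integer"
typeof(b) # "double"

is.numeric(a) # TRUE; integers are numeric
is.numeric(b) # TRUE; doubles are numeric too
```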
# to verify whether a is Inf or -Inf:
is.infinite(a)
# to verify whether a is NA:
is.na(a)
# to verify whether a is NaN:
is.nan(a)
Throughout this tutorial, we’ll discuss the types of different objects we create. It is important to be aware that there are different types that R uses to store objects because sometimes a command only works for a specific type (or gives different results for different types), in which case one needs to convert the object to the right type.
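To illustrate the point with a small (hypothetical) example: arithmetic fails on a number that is stored as text until it is converted to the right type:

```r
a <- "3" # the number 3, but stored as a string (type 'character')
# a + 1  # would throw an error: non-numeric argument to binary operator
as.numeric(a) + 1 # after conversion to a number, this returns 4
```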
A logical is an object-type that can take on the values “TRUE”, “FALSE” or “NA”. It indicates whether some condition is true or false (or cannot even be evaluated).
Logicals are (usually) created using logical operators:
isAlargerB = a > b
show(isAlargerB)
## [1] FALSE
Other logical operators:
a < b
a >= b
a <= b
a == b # equal to
a != b # not equal to
Comparisons with Inf and NA:
a < Inf # comparisons with Inf work
a < NA # but comparisons with NA yield NA
Logicals can be used in standard computations, whereby “TRUE” indicates a 1 and “FALSE” indicates a 0:
TRUE + 3 # gives 4
TRUE == 1 # gives TRUE
Nevertheless, the “double”-object 1 and the “logical”-object TRUE are not the same thing because their type differs, i.e. R stores them differently:
a = TRUE
b = 1
typeof(a)
## [1] "logical"
typeof(b)
## [1] "double"
To check whether some object is a logical or a double, type:
isAlargerC <- 1
typeof(isAlargerB)
typeof(isAlargerC)
is.logical(isAlargerB)
is.logical(isAlargerC)
is.double(isAlargerC)
To convert numbers of type double to logicals or vice versa:
as.logical(1) # convert number/double 1 into logical TRUE
as.double(TRUE) # opposite: convert logical TRUE into number/double 1
as.numeric(TRUE) # same
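A note added here beyond the source: when converting doubles to logicals, zero maps to FALSE and any non-zero number maps to TRUE:

```r
as.logical(0)  # FALSE
as.logical(2)  # TRUE; any non-zero number converts to TRUE
as.logical(-1) # TRUE
```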
Based on existing logicals (i.e. variables indicating whether a condition is true or false), further logicals can be created that combine these underlying conditions using special operators. For example:
isXtrue = TRUE
isYtrue = FALSE

# negation of a logical: turns TRUE into FALSE (and FALSE into TRUE)
isXnottrue = !isXtrue

# gives TRUE only if both are true:
areBothTrue = isXtrue & isYtrue
# gives TRUE if only one of them is true and also if both are true:
isOneTrue = isXtrue | isYtrue
# gives TRUE only if exactly one of them is true (it gives FALSE if both are true):
isExactlyOneTrue = xor(isXtrue, isYtrue)
(Note that these commands work even if you defined isXtrue and isYtrue as the double-types 1 and 0, respectively. Therefore, in this particular case, the object type is irrelevant.)
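A minimal sketch of this point, defining the two indicators as doubles rather than logicals:

```r
isXtrue = 1 # double 1 instead of logical TRUE
isYtrue = 0 # double 0 instead of logical FALSE

isXtrue & isYtrue     # FALSE
isXtrue | isYtrue     # TRUE
xor(isXtrue, isYtrue) # TRUE
```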
Further below, we will discuss vectors of logicals (i.e. vectors whose elements are logicals rather than numbers (of type double)). Moreover, logicals will be important when we talk about indexing different objects in which data is stored (a vector, matrix, array, list or dataframe).
A string is an object that contains text. It is stored as type “character” in R.
a = "banana"
typeof(a) # returns 'character'
## [1] "character"
is.character(a)
## [1] TRUE
There are many commands that are specific to strings:
nchar(a) # number of characters in string
substr(a, 1, 3) # returns first three characters in string
# String Conversion:
as.character(3) # convert the number 3 to the string '3'
as.double("3") # opposite: convert the string '3' to the number (double) 3
as.numeric("3") # same
format(3, nsmall = 2) # convert the number 3 to the string '3.00'
# The function 'paste()' creates a string by combining a number (double) and a
# string, e.g.:
paste(1, "mouse", sep = ".") #returns '1.mouse'
paste(1, "mouse") # if option 'sep' is not specified, a space is put as separator
# String modification:
gsub("i", "X", "Mississippi") #replace i with X
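A few more base-R string helpers that fit this list (added here for completeness):

```r
toupper("banana")    # 'BANANA'
tolower("BaNaNa")    # 'banana'
trimws("  banana  ") # 'banana'; removes leading/trailing whitespace
```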
Strings allow for advanced printing commands, beyond “print()”:
print(a)
print(paste(a, "ice cream"))
# prints the specified string, inserting the number at '%s':
sprintf("C(X) = [0,%s]", 3.45)
sprintf("%.0f", 1e+08) # to avoid scientific notation
cat("hello\nworld\n") # prints the string, adding line-breaks at \n
Further below, we will discuss vectors of strings (i.e. vectors whose elements are strings rather than numbers (of type double) or logicals). Moreover, strings will be important when we talk about indexing different objects in which data is stored (a vector, matrix, array, list or dataframe). They will also be important when we talk about representing dates.
R is an open-source language, which means that there are many developers who write their own packages that define specific commands (functions) not included in R by default. To highlight which commands belong to which packages, we will load the packages only once these commands are needed. However, in actual applications, it is best practice to have an initial section where you load all packages. To load a package, it first needs to be installed. You can install a package by using the “Packages” panel in bottom-right window, or by typing e.g.
install.packages("ggplot2") # to install package 'ggplot2'
A package needs to be installed only once on a given computer (though you may want to re-install packages when a new version is available). In contrast, you have to load the package every time a script is run anew. To load the package, type
library(ggplot2) # to load package 'ggplot2'
Up to now, we used functions that are pre-programmed in R or contained in a package we load. The user can of course also define own functions. For example:
fMultiplyXY = function(x, y) {
    x * y
}
fMultiplyXY(3, 4)
## [1] 12
When a function uses many arguments, it can be hard to remember the argument order. In that case, it is useful to specify the argument names when calling the function, because then the order becomes irrelevant:
# Both give same result:
fMultiplyXY(x = 3, y = 4)
fMultiplyXY(y = 4, x = 3)
One can also specify some arguments to be optional. For example:
# Here, the third argument ('z') is optional. By default, the function takes
# z=2.
fMultiplyXYZ = function(x, y, z = 2) {
    x * y * z
}
fMultiplyXYZ(2, 3) # uses z=2
fMultiplyXYZ(2, 3, 2) # also uses z=2
fMultiplyXYZ(2, 3, 4) # uses z=4
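A function can also return several objects at once by wrapping them in a list. The following sketch uses a hypothetical function “fDivide” (not from the original text):

```r
fDivide = function(x, y) {
    # return both the integer quotient and the remainder:
    list(quotient = x%/%y, remainder = x%%y)
}

res = fDivide(7, 2)
res$quotient  # 3
res$remainder # 1
```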
Vectors allow us to store many scalar-elements in a single object. Thereby, it is important that all scalars are of the same type (double, logical or string). Later, we will discuss lists, which do not have this requirement.
There are many ways to create a vector in R:
vx = 1:3 # or:
vx = c(1, 2, 3)
show(vx)
## [1] 1 2 3
By default, these commands create a vertical vector (3x1 in these examples). This is not apparent when displaying the vector, but you realize it when performing matrix algebra (see below).
To transpose the vector vx, type:
t(vx)
## [,1] [,2] [,3]
## [1,] 1 2 3
Note that this vector is of type “double” (and not of type “logical” or “character” (i.e. string)):
typeof(vx)
## [1] "double"
is.double(vx)
## [1] TRUE
is.logical(vx)
## [1] FALSE
is.character(vx)
## [1] FALSE
Nevertheless, it is also a vector (as opposed to a matrix), and it is also of type “numeric” (the difference between a double and a numeric is typically irrelevant):
is.vector(vx)
## [1] TRUE
is.matrix(vx)
## [1] FALSE
is.numeric(vx)
## [1] TRUE
We can analogously also create vectors of strings or logicals:
vy = c("a", "b") # vector of strings
vz = c(TRUE, FALSE) # vector of logicals
typeof(vy)
## [1] "character"
typeof(vz)
## [1] "logical"
# both give true:
is.vector(vy)
is.vector(vz)
# both give false:
is.matrix(vy)
is.matrix(vz)
# both give false:
is.numeric(vy)
is.numeric(vz)
# to go back and forth between double and logicals, as before:
as.logical(c(1, 0))
as.double(c(TRUE, FALSE))
as.numeric(c(TRUE, FALSE))
# to go back and forth between double and strings, as before:
as.character(c(1, 3))
as.double(c("1", "3"))
as.numeric(c("1", "3"))
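Caution (an added remark): converting a string that does not represent a number yields NA, together with a warning:

```r
as.numeric(c("1", "x")) # returns c(1, NA), with a warning
```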
Advanced ways to create vectors:
seq(-4, 4, by = 0.8) # a vector going from -4 to 4 in increments of 0.8
seq(-4, 4, length.out = 5) # a vector going from -4 to 4 with 5 elements
rep(1, 5) # (5x1)-vector of ones
rep(TRUE, 5) # (5x1)-vector with elements TRUE
rep("a", 5) # (5x1)-vector with elements 'a'
rep(vx, times = 2) # create new vector by taking vector 'vx' two times
# (analogously for vectors of strings or logicals)
rep(vx, each = 2) # create new vector by taking each element of 'vx' two times
# (analogously for vectors of strings or logicals)
rev(vx) # reverse order of elements in vx ('flip vx around')
# (analogously for vectors of strings or logicals)
vy = c(3, 2, 5, 1)
c(vx, vy) # create new vector by stacking two vectors
# (analogously for vectors of strings or logicals)
We can perform standard algebraic operations using vectors:
vx = c(1, 4, 3)
vy = c(1, 5, 7)

vx + vy
vx - vy
vy/3
vx * vy # element-wise multiplication
vx %*% t(vy) # matrix multiplication

# Careful: these are not the same:
2 * 1:6
(2 * 1):6
We can also use vectors to conduct comparisons, yielding logicals:
# Element-wise comparison
vx >= 2 # compares each element in vx to 2, returns vector of logicals
vx >= vy # compares each element in vx to corresponding element in vy, returns vector of logicals
# Overall comparison of two vectors
identical(vx, vy) # returns single logical
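Two related helpers, all() and any(), collapse a vector of logicals into a single logical (an addition to the source’s comparison list):

```r
vx = c(1, 4, 3)
vy = c(1, 5, 7)

all(vx <= vy) # TRUE only if every element-wise comparison is TRUE
any(vx == vy) # TRUE if at least one comparison is TRUE
```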
Many of the functions we applied to scalars can also be applied to vectors. (This is not surprising once you realize that R treats a number like a 1x1-vector; type “is.vector(3)” to see.) When applied to a vector, these functions are applied to each element individually:
sqrt(vx) # compute square root of each element in vx
exp(vx)
log(vx)
cos(vx)
There are many functions that make sense only for vectors (not numbers):
vx = c(1.2, 3.3, 4.2, 2.7, 5)
vy = c(3.4, 5.5, 3.2, 8.9, 0.8)
length(vx) #number of elements
min(vx) #minimum
max(vx) #maximum
range(vx) #(min, max)
sum(vx) #sum of elements
prod(vx) #product of elements
mean(vx)
median(vx)
sd(vx) # standard deviation
var(vx) # variance
cov(vx, vy) # covariance of two vectors
cor(vx, vy) # correlation of two vectors
# the following two require the package 'timeDate':
library(timeDate)
kurtosis(vx)
skewness(vx)
# computing percentiles: (this only makes sense for longer vectors, as it
# supposes that the entries are draws from some underlying continuous
# distribution)
quantile(vx)
quantile(vx, probs = 0.95) #95th percentile
quantile(vx, seq(0, 1, 0.125))
A vector can contain NA elements. In this case, the above functions return NA. However, one can tell the functions to ignore the NA-entries:
vx = c(1.2, NA, 4.2, 2.7, 5)
mean(vx) # returns NA
mean(vx, na.rm = TRUE) # returns mean of remaining elements, ignoring NA-entries
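To locate the NA-entries before ignoring them (a small addition to the source):

```r
vx = c(1.2, NA, 4.2, 2.7, 5)

is.na(vx)             # logical vector flagging the NA-entries
which(is.na(vx))      # 2; position of the NA-entry
sum(vx, na.rm = TRUE) # 13.1; sum of the remaining elements
```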
CAUTION! Some functions in R do not prevent you from performing operations with two vectors of different length, but they recycle the shorter one so that the two match in length:
vx = c(1, 3, 4, 2)
vy = c(1, 3, 5)

# These all give nonsensical output:
vx + vy
## [1] 2 6 9 3
vx == vy
## [1] TRUE TRUE FALSE FALSE
vx == c(1, 2)
## [1] TRUE FALSE FALSE TRUE
# However, some (smart) functions do throw an error when applied to vectors of
# different length, e.g.:
cor(vx, vy)
## Error in cor(vx, vy): incompatible dimensions
We can access a subset of a vector. This is referred to as “indexing”, and there are different ways to do so.
vx = c(5, 1, 7)
“Direct” indexing:
vx[2] # take second element
vx[-2] # take all but second element
vx[c(1, 3)] # take first and third element
vx[-c(1, 3)] # take all but first and third element
Indexing using logicals:
vx[c(FALSE, TRUE, FALSE)] # take again second element (all but first and third element)
vx[vx >= 2] # take all elements larger or equal to 2
Indexing using entry names (=strings):
names(vx) = c("a", "b", "c") # name vector entries
vx["b"] # take again second element (= element 'b')
vx[c("a", "c")] # take first and third element
Using indexing, we can change specific entries of a vector, leaving the rest untouched:
vx[2] = 4 # change second element to 4
vx[c(2, 3)] = c(3, 9) # change second and third elements to 3 and 9, respectively
# i.e. change sub-vector containing second and third element to vector c(3,9)
vx = vx[-1] # delete first element

vx = c(2, 1, 3, 2, 5)
vy = c(2, 3, 4, 1, 5)
sort(vx) # sorts in increasing order
sort(vx, decreasing = TRUE) #sorts in decreasing order
order(vx) # like sort(), but returns indices of sorted vector
order(vx, decreasing = TRUE)
# note: with strings, sort and order work alphabetically
Identify elements that match certain condition in a single vector:
which(vx == 2) # returns the indices of all entries equal to 2
max(which(vx == 2)) # finds largest index (last element) that matches condition
which.max(vx) # returns index of maximum (largest entry)
which.min(vx) # returns index of minimum (smallest entry)
Identify elements from one vector in another vector:
# Are elements from vy in vx? returns vector of logicals of the length of vy
vy %in% vx
# Finds location of elements from vy in vx:
match(vy, vx)
# (returns NA if an element does not exist in vx)
We can also perform set operations with vectors, as they denote essentially a set of elements (numbers, logicals or strings):
vz = c(2, 8, 4, 7, 5)
union(vx, vy)
intersect(vx, vy) # intersection of two sets (vectors)
Reduce(intersect, list(vx, vy, vz)) #intersection of three sets (vectors)
setdiff(vx, vy) # 'vx complement vy'; all elements in vx that are not in vy
# here the order matters!
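To see that the order matters in setdiff(), compare (illustrative values):

```r
va = c(2, 1, 3)
vb = c(3, 4)

setdiff(va, vb) # elements of va not in vb: 2 1
setdiff(vb, va) # elements of vb not in va: 4
```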
String creation and modification:
# Previously, we introduced the function 'paste()'. It creates a string by
# combining a number (double) and a string, e.g.:
paste(1, "mouse", sep = ".")
paste(1, "mouse")
# paste() also works on vectors, returning vector of strings:
paste(1:3, "mouse", sep = ".") # returns '1.mouse' '2.mouse' '3.mouse'
paste(1:3, collapse = ".") # returns '1.2.3'
paste(1:3, collapse = "_") # returns '1_2_3'
paste(1:3, collapse = "+") # returns '1+2+3'
# When we split strings, we obtain a vector of strings. e.g. split a string at
# points where 'i' is located:
strsplit("Mississippi", "i")
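Note that strsplit() returns a list (lists are discussed further below); unlist() flattens it into a plain vector of strings:

```r
lParts = strsplit("Mississippi", "i")
unlist(lParts) # c('M', 'ss', 'ss', 'pp')
```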
Identification of elements in strings:
vsCountries = c("Zimbabwe", "Cameroon", "Kenya", "Rwanda", "Djibouti")
# return logical-vector saying which string-entries contain 'w':
grepl("w", vsCountries)
# this can be used for indexing:
vsCountries[grepl("w", vsCountries)]
Working with dates:
library(lubridate)
sDate <- "2020-04-01"
year(sDate) # show year corresponding to the date in sDate
## [1] 2020
month(sDate) # show month
## [1] 4
day(sDate) # show day
## [1] 1
ymd("2012-03-26") # parse a string in year-month-day format into a date
## [1] "2012-03-26"
# Note that the date needs to be in format 'YYYY-MM-DD'. e.g. this gives a
# wrong result:
year("01-04-2020")
## [1] 1
# To convert another format to this format, use
sNewDate <- as.Date("01.04.2020", "%d.%m.%Y")
# or:
sNewDate <- as.Date("01-04-2020", format = "%d-%m-%Y")
# or:
sNewDate <- as.POSIXlt("01-04-2020", format = "%d-%m-%Y")
year(sNewDate)
## [1] 2020
# Identify weeks from a vector of dates; for each date in vDates, find the
# preceding Monday:
vDates = paste("2020-04-", 10:30, sep = "")
cut(as.Date(vDates), "week")
## [1] 2020-04-06 2020-04-06 2020-04-06 2020-04-13 2020-04-13 2020-04-13
## [7] 2020-04-13 2020-04-13 2020-04-13 2020-04-13 2020-04-20 2020-04-20
## [13] 2020-04-20 2020-04-20 2020-04-20 2020-04-20 2020-04-20 2020-04-27
## [19] 2020-04-27 2020-04-27 2020-04-27
## Levels: 2020-04-06 2020-04-13 2020-04-20 2020-04-27
# Get month-abbreviations from January to April:
month.abb[1:4]
## [1] "Jan" "Feb" "Mar" "Apr"
Some more useful things for working with dates will be discussed later when talking about time series methods and about plotting.
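Dates also support arithmetic and sequences, which is handy for time series work (a brief sketch added here):

```r
d1 = as.Date("2020-04-01")
d2 = as.Date("2020-04-15")

d2 - d1                  # time difference of 14 days
seq(d1, d2, by = "week") # '2020-04-01' '2020-04-08' '2020-04-15'
Sys.Date()               # today's date
```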
Like vectors, matrices store a collection of scalar-elements of the same type (double, logicals or strings). Unlike vectors, they do so along two (rather than one) dimension.
mA = matrix(NA, ncol = 3, nrow = 2) # matrix of NAs (to be filled-in)

# create a matrix by filling-in elements of a vector:
vx = c(4.3, 5, 6, 3, 4.5, 9.2)
# by column:
mA <- matrix(vx, ncol = 3, nrow = 2)
# by row:
mA <- matrix(vx, ncol = 3, nrow = 2, byrow = TRUE)
As you can verify, mA is of type “double” (and “numeric”) as opposed to “logical” or “character” (string):
typeof(mA)
is.double(mA)
is.numeric(mA)
It is also a matrix rather than a vector:
is.matrix(mA)
is.vector(mA)
We can use the following command to vectorize the matrix mA (i.e. turn it into a vector by stacking its columns on top of each other) or to ensure that a \(n \times 1\) matrix is treated as a vector by R:
as.vector(mA)
## [1] 4.3 3.0 5.0 4.5 6.0 9.2
mB <- matrix(1:4, ncol = 3, nrow = 4)
is.vector(mB)
## [1] FALSE
mB_ <- as.vector(mB)
is.vector(mB_)
## [1] TRUE
Further ways to create a matrix:
diag(3) # 3x3 identity matrix
diag(c(3, 2)) # 2x2 diagonal matrix with 3 and 2 on diagonal
# But note: when applied to an existing square matrix, diag() returns its
# diagonal:
mA = matrix(1:9, ncol = 3, nrow = 3)
diag(mA)

# Create a matrix by putting the same vector in each column:
vx = c(1, 5, 3, 6)
mA = replicate(5, vx)

# Create a matrix as the Cartesian product of two vectors:
vx = seq(0, 0.1, 0.02)
vy = seq(0.1, 0.3, 0.1)
mB = expand.grid(column1 = vx, column2 = vy)
To display the matrix, you can use the commands “show” and “print”, as before. Also, in contrast to scalars and vectors, you can click on the matrix in the environment window, which opens up a separate panel in the script-window to show you the matrix.
We can, in principle, also create a matrix of logicals or strings:
matrix(c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE), ncol = 3, nrow = 2)
matrix(c("a", "b", "c", "d", "e", "f"), ncol = 3, nrow = 2)
replicate(3, c(TRUE, FALSE))
replicate(3, c("a", "b"))
This is rarely done in practice. However, matrices of logicals or strings can arise when applying a function to a matrix (of numbers). For example, matrices of logicals arise when comparing elements of a matrix to some number.
mA = matrix(1:9, ncol = 3, nrow = 3)
dim(mA) #display number of rows and columns of matrix mA
nrow(mA)
ncol(mA)
colnames(mA) = c("a", "b", "c") #label columns of matrix
rownames(mA) = c("r1", "r2", "r3")
# Again, three ways of indexing possible:
# 1. 'Direct':
mA[1, 2] # access element (1,2)
mA[1, ] # access first row, returns vector
mA[, 1] # access first column, returns vector
mA[c(1, 2), c(1, 3)] # access specified rows and columns
# 2. Using logicals:
mA[c(TRUE, FALSE, FALSE), c(FALSE, TRUE, FALSE)]
mA[c(TRUE, FALSE, FALSE), ]
# 3. Using row/column names:
mA["r1", "b"]
mA["r1", ]
# And, with two dimensions available, we can mix these indexing approaches:
mA[1, "b"]
mA[c(TRUE, FALSE, FALSE), "b"]
# ...
mA = matrix(c(1, 3, 8, 8), ncol = 2, nrow = 2)
mB = matrix(c(1, 2, 7, 1), ncol = 2, nrow = 2)

mA * mB # element-wise multiplication
mA %*% mB # matrix multiplication
t(mA) # transpose
solve(mA) # inverse
solve(mA, tol = 1e-23) # if almost singular

library(Matrix)
rankMatrix(mA) # compute rank
# (often, the result needs to be converted via as.numeric())
eigen(mA) # compute eigenvalues and eigenvectors;
eigen(mA)$values
eigen(mA)$vectors
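Related to the inverse (an added remark): solve() with a second argument solves the linear system mA %*% x = vb directly, which is numerically preferable to computing the inverse first. A sketch with illustrative values:

```r
mA = matrix(c(2, 0, 0, 4), ncol = 2, nrow = 2)
vb = c(2, 8)

solve(mA, vb) # solves mA %*% x = vb for x; here x = c(1, 2)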
Many of the functions applied above to numbers and vectors can also be applied to matrices: e.g.
sqrt(mA) # compute square-root for each element of matrix
min(mA) # compute minimum out of all elements of matrix
mean(mA) # mean of all elements in matrix
mA >= 2 # element-wise comparison, returns matrix of logicals
Some functions only make sense when their single input is a matrix (not a vector or a scalar):
cov(mA) # covariance between columns
cor(mA) # correlation between columns
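For the common cases of row/column sums and means, base R also provides fast shortcuts (added here for reference):

```r
mA = matrix(c(1, 3, 2, 8), ncol = 2, nrow = 2)

colMeans(mA) # 2 5
rowMeans(mA) # 1.5 5.5
colSums(mA)  # 4 10
rowSums(mA)  # 3 11
```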
A particularly useful function for matrices is “apply”. It allows us e.g. to apply a function designed for a vector to each column or row of a matrix separately:
mA = matrix(c(1, 3, 2, 8), ncol = 2, nrow = 2)
# for each column (=2nd dimension) of mA, compute the mean of elements:
apply(mA, 2, mean)
# (i.e. apply to object mA, along dimension 2 (i.e. by columns), the function
# 'mean')
# One can also supply a self-written function to 'apply':
fDemean = function(vx) {
    vx - mean(vx)
}
apply(mA, 2, fDemean)

# or, in short:
apply(mA, 2, function(vx) {
    vx - mean(vx)
})

# This is useful even for functions applied to each element of a matrix:
apply(mA, c(1, 2), function(x) {
    min(2, x) # since min(mA, 2) doesn't give the hoped-for result
})
Arrays generalize matrices, as they store a collection of scalar-elements of the same type (double, logicals or strings) along a general number of dimensions (more than 2). They are useful for efficiently storing data that differ along several dimensions (e.g. countries, years, variables, …) and accessing subsets of it.
# Create three-dimensional array of NAs:
aY = array(dim = c(2, 3, 2))

# Create three-dimensional array by using a vector to 'fill it up':
aX = c(1:12)
dim(aX) = c(2, 3, 2)
print(aX)
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
Indexing of arrays:
aX[1, 3, 2]
aX[, c(2:3), 1]
aX[1, c(2:3), ] # returns matrix (2 x 2)
aX[1, c(2:3), , drop = FALSE] # returns array (1 x 2 x 2)
# Can index similarly by using logical-vectors for each dimension, or by using
# the names of the elements in the different dimensions:
# Name dimensions of array:
dimnames(aX)[[1]] = c("r1", "r2")
dimnames(aX)[[2]] = c("a", "b", "c")
dimnames(aX)[[3]] = c("A", "B")
Functions that operate on scalars can be used on arrays as well:
# Functions:
sqrt(aX) # compute square-root of each element
There are (to my knowledge) no built-in functions that require a single array as an argument.
We can use the function “apply” to apply a function along different dimensions of an array:
# for each row and column, compute sum along third dimension:
apply(aX, c(1, 2), sum)
# for each row, compute minimum of resulting matrix:
apply(aX, 1, min)
A conditional is a piece of code that only gets evaluated if a certain condition is true:
a = 5

if (a < 2) {
    print("a is smaller than 2")
}

if (a < 2) {
    print("a is smaller than 2")
} else if (a > 2) {
    print("a is larger than 2")
}
## [1] "a is larger than 2"

if (a == 2) {
    print("a equals 2")
} else if (a == 3) {
    print("a does not equal 2 but 3")
} else {
    print("a neither equals 2 nor 3")
}
## [1] "a neither equals 2 nor 3"
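For element-wise conditions on whole vectors, the vectorized counterpart ifelse() is often more convenient than an if-block (a small addition to the source):

```r
va = c(1, 5, 3)
ifelse(va > 2, "big", "small") # 'small' 'big' 'big'
```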
A for loop is a chunk of code that gets executed many times, each time for a different value of a so-called “index-variable”, named “ii” in the following. It is useful if you need to repeat a procedure for, say, different countries in your data sample.
a = 0
for (ii in 1:10) {
    a = a + ii
}
print(a)
## [1] 55
for (ii in c("a", "b")) {
print(ii)
}
## [1] "a"
## [1] "b"
# Note that 1:10 and c('a','b') are both vectors. We can also use the other
# ways to create vectors from above, e.g.:
for (ii in seq(-3, 3, by = 1.5)) {
print(ii)
}
## [1] -3
## [1] -1.5
## [1] 0
## [1] 1.5
## [1] 3
A while loop keeps executing a chunk of code only as long as some condition is true:
while (a < 130) {
    a = a + 3
}
print(a)
## [1] 130
Useful commands for working with for- and while-loops are “next” and “break”. The former tells R to stop executing the code for the current iteration and proceed to the next one, while the latter tells R to stop the for- or while-loop altogether.
a = 0
for (ii in 1:10) {
    if (ii%%2 == 0) {
        # if ii is even (remainder of ii/2 is zero), go to next iteration
        # (i.e. skip this one)
        next
    }
    a = a + ii
}
print(a)
## [1] 25
while (a < 130) {
    a = a + 3
    if (a == 88 | a == 89) {
        break
    }
}
print(a)
## [1] 88
A list is like a vector whose elements can be of different types. For example:
# Create list:
list(1, 2, 3, "a")
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] "a"
# In contrast, this creates a vector of characters; i.e. each number is
# converted to a character
c(1, 2, 3, "a")
## [1] "1" "2" "3" "a"
lMyList = list(1, "a", c(1:3))
typeof(lMyList)
## [1] "list"
is.list(lMyList)
## [1] TRUE
Lists are useful for saving a collection of objects, like matrices or plots, into one object, rather than creating many separate ones. They are also useful for keeping the workspace tidy in longer codes. Both of these things are discussed below under “Workspace Management”.
Indexing:
# Again, as before with vectors, there are three ways to index this list:
# 'Directly'
lMyList[3] # take third element

# Using logicals:
lMyList[c(FALSE, FALSE, TRUE)]

# Using named elements:
names(lMyList) = c("first", "second", "third") # name the elements in list
lMyList["third"]
names(lMyList) # return the names
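An important subtlety worth adding here: single brackets return a sub-list, while double brackets return the element itself:

```r
lMyList = list(1, "a", c(1:3))

lMyList[1]   # single brackets: a list containing the first element
lMyList[[1]] # double brackets: the first element itself (here, a number)

typeof(lMyList[1])   # 'list'
typeof(lMyList[[1]]) # 'double'
```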
Modify list:
# change first element:
lMyList[1] = 2
# add (append) element to list:
append(lMyList, "7")
Another important object in R is a dataframe. Just like a list is akin to a vector that can contain elements of different types, a dataframe is akin to a matrix that can contain columns of different types. This makes dataframes suitable to store datasets:
dfA = data.frame(a = c(10, 20, 30), b = c("RA", "RB", "RC"))
print(dfA)
## a b
## 1 10 RA
## 2 20 RB
## 3 30 RC
# note that dataframes are lists:
typeof(dfA)
## [1] "list"
# but not every list is a data frame:
is.data.frame(dfA)
## [1] TRUE
is.data.frame(lMyList)
## [1] FALSE
# Access column-/row-names:
colnames(dfA)
rownames(dfA)
# Modify row names:
rownames(dfA) = c("first", "second", "third")
# How many rows & columns do we have?
dim(dfA)
nrow(dfA)
dim(dfA)[1] # same
ncol(dfA)
dim(dfA)[2] #same
Indexing dataframes is akin to indexing matrices:
# access second column, i.e. column 'b':
dfA[, 2]
dfA[, "b"] # same
dfA$b # same
# note, if a column had a number as its name, we would need to write e.g.
# dfA$`2023`; that's why it's a good idea to stick to strings as row and column
# names (use e.g. 'y2023')
dfA[, c(FALSE, TRUE)] # same

# access element (3,2), i.e. third element in column 'b', i.e. element in row
# 'third' and column 'b':
dfA[3, 2]
dfA["third", "b"] # same
dfA[3, "b"] # same
dfA$b[3] # same
dfA[3, c(FALSE, TRUE)] # same
dfA[c(FALSE, FALSE, TRUE), c(FALSE, TRUE)] # same
# (and many more possibilities!)
Modification:
# Adding/deleting rows:
dfA[4, ] = c(9, "RA") # add row
rownames(dfA)[nrow(dfA)] = "fourth" # name the newly added row
dfA = rbind(dfA, c(5, "RC")) # same; add another row

dfA = dfA[-5, ] # delete row 5
dfA = dfA[-4, ] # delete row 4

# Adding/deleting columns:
dfA[, "c"] = c(8, 9, 4) # add a column 'c'
dfA$d = c(1, 0, 1) # add a column 'd'
dfA = data.frame(dfA, e = c(3, 3, 4)) # add a column 'e'
dfA = cbind(dfA, c(3, 3, 4)) # add another column (yet unnamed) ...
colnames(dfA)[ncol(dfA)] = "f" # ... and name it 'f'

# delete column 'd' by keeping all others:
dfA[, c("a", "b", "c")]
subset(dfA, select = c(a, b, c)) # same
dfA[, colnames(dfA) %in% c("a", "b", "c")] # same
# delete column 'd' explicitly:
subset(dfA, select = -c(d))
dfA[, !colnames(dfA) %in% c("d")]
# (note that writing ' dfA[,-c('c','d')] ' gives an error)
To introduce further commands to deal with dataframes, we load a dataset (dataframe) that is pre-stored in R. It contains various measures for 32 different cars.
# Load dataset:
data(mtcars)
# Describe data (i.e. each column/variable):
summary(mtcars)
# alternative:
library(psych)
describe(mtcars)
# Store column-/row-names as vectors:
vsVariables = colnames(mtcars)
vsCars = rownames(mtcars)
mean(mtcars$hp) # compute mean horse power
min(mtcars$hp) # compute minimum horse power
# How many cars have an hp higher than average?
# step 1: create vector of logicals, indicating for each car whether its hp is
# above average
# step 2: sum up its elements (number of ones/TRUEs)
vIsHPhigh = mtcars$hp > mean(mtcars$hp)
sum(vIsHPhigh)
# alternative: create a sub-dataframe containing only the cars with
# above-average hp, and count its rows:
nrow(mtcars[vIsHPhigh, ])
# Find cars with hp above 190 and 5 gears:
mtcars[mtcars$hp > 190 & mtcars$gear == 5, ]
# if interested only in their number of cylinders and horse power:
mtcars[mtcars$hp > 190 & mtcars$gear == 5, c("cyl", "hp")]
# Find cars with 4 or 6 cylinders and hp above 120:
vRightCyl = mtcars$cyl %in% c(4, 6)
vRightHP = mtcars$hp > 120
mtcars[vRightCyl & vRightHP, ]
# Find all Mercedes (their names start with 'Merc'):
grepl("Merc", vsCars), ]
mtcars[
# Sort dataset first according to number of cylinders, then according to
# horsepower:
mtcars[order(mtcars$cyl, mtcars$hp), ]
# Create dummies for each value of 'cyl' and 'gear', and add to dataframe:
library(fastDummies)
data = dummy_cols(mtcars, select_columns = c("cyl", "gear"))
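If you'd rather avoid the extra package, the same kind of dummies can be created in base R; a sketch for 'cyl' only (the column names 'cyl_4' etc. mirror those produced by dummy_cols):

```r
data(mtcars)
# create one 0/1 column per distinct value of 'cyl':
for (v in sort(unique(mtcars$cyl))) {
    mtcars[, paste("cyl_", v, sep = "")] = as.numeric(mtcars$cyl == v)
}
head(mtcars[, c("cyl", "cyl_4", "cyl_6", "cyl_8")])
```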
Let’s see how to merge different datasets, transform them into different formats and apply a function to different observations at once (different units, years, etc.). For this, consider the following example involving (made up) grades for famous basketball players:
dfExamMidterm2022 <- data.frame(familyName = c("Jokic", "Doncic", "Jovic", "Bogdanovic",
    "Bogdanovic"), firstName = c("Nikola", "Luka", "Nikola", "Bogdan", "Bojan"),
    year = rep(2022, 5), grade1 = c(4.5, 3, 4.5, 6, 5))
dfExamFinal2022 <- data.frame(familyName = c("Jokic", "Doncic", "Jovic", "Bogdanovic",
    "Bogdanovic"), firstName = c("Nikola", "Luka", "Nikola", "Bogdan", "Bojan"),
    year = rep(2022, 5), grade2 = c(5, 3.5, 5, 5.5, 4.5))
dfExamMidterm2023 <- data.frame(familyName = c("Jokic", "Doncic", "Jovic", "Bogdanovic"),
    firstName = c("Nikola", "Luka", "Nikola", "Bogdan"), year = rep(2023, 4),
    grade1 = c(5.5, 4, 4, 4.5))
# Combine datasets:
# Could combine two datasets with the same observations (rows) but different
# variables (columns) by 'cbind':
cbind(dfExamMidterm2022, grade2 = dfExamFinal2022$grade2)
# but this doesn't work if one of them has different observations (rows):
cbind(dfExamMidterm2022, grade2 = dfExamFinal2022$grade2, grade3 = dfExamMidterm2023$grade1)
## Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 5, 4
# Can combine datasets also using the command 'merge', which requires two
# datasets (not three):
dfExams2022 = merge(dfExamMidterm2022, dfExamFinal2022)
dfExamsAll = merge(dfExams2022, dfExamMidterm2023, all = TRUE) # all=TRUE ensures that we keep all rows (observations)
# without this option, we would only keep observations that exist in both
# dataframes:
merge(dfExams2022, dfExamMidterm2023) # (in this case, there are no such observations)
dfExamsMidterms = merge(dfExamMidterm2022, dfExamMidterm2023, all = TRUE)
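By default, 'merge' matches on all columns whose names appear in both dataframes. The keys can also be given explicitly via the 'by' argument, and 'all.x = TRUE' keeps all rows of the first dataframe only; a sketch on two small made-up dataframes:

```r
dfL = data.frame(name = c("Jokic", "Doncic"), grade1 = c(4.5, 3))
dfR = data.frame(name = c("Jokic", "Jovic"), grade2 = c(5, 5))
# keep all rows of dfL; Doncic gets NA for grade2:
merge(dfL, dfR, by = "name", all.x = TRUE)
```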
# Reshape dataset from/to long/short format:
library(reshape2) # contains commands melt and dcast
# dfExams2022 is in so-called 'short format'.
# put dfExams2022 into long format:
dfExams2022_long = melt(dfExams2022, id.vars = c("familyName", "firstName", "year"),
    measure.vars = c("grade1", "grade2"))
# note that any variable not specified as id-variable or measure-variable gets
# dropped:
melt(dfExams2022, id.vars = c("familyName", "firstName"), measure.vars = c("grade1",
"grade2"))
# this can create trouble if the specified id-variables do not perfectly define
# an observation:
melt(dfExams2022, id.vars = c("familyName"), measure.vars = c("grade1", "grade2"))
# put dfExams2022_long back into short format:
dfExams2022_short = dcast(dfExams2022_long, firstName + familyName + year ~ variable)
# dfExamsMidterms is in 'long format'.
# put it into short format:
dfExamsMidterms_short = dcast(dfExamsMidterms, firstName + familyName ~ year)
# note that the name of the variable 'grade1' got lost, but we typically know
# what variable we are looking at
# put it back into long format:
dfExamsMidterms_long = melt(dfExamsMidterms_short, id.vars = c("familyName", "firstName"),
    measure.vars = c("2022", "2023"))
# compared to the original dfExamsMidterms, this dataframe has an entry for the
# player who didn't take the exam in 2023: Bojan Bogdanovic. it also has
# different column names, but we can supply the original ones again:
colnames(dfExamsMidterms_long)[3:4] = c("year", "grade1")
# dfExamsAll is in 'short format' regarding the grades, but in 'long format'
# regarding the years.
# put it into short format with years across columns, for midterm grades:
dfExamsMidterms = dcast(dfExamsAll, firstName + familyName ~ year, value.var = "grade1")
# for final grades:
dfExamsFinals = dcast(dfExamsAll, firstName + familyName ~ year, value.var = "grade2")
# (can do that only using one variable at a time)
# put it into long format with grade-types across rows:
dfExamsAll_long = melt(dfExamsAll, id.vars = c("familyName", "firstName", "year"),
    measure.vars = c("grade1", "grade2"))
# put the resulting dataset into short format with years across columns:
dfExamsAll_v2 = dcast(dfExamsAll_long, firstName + familyName + variable ~ year)
# comparing dfExamsAll_v2 with dfExamsAll, we transformed a dataset with years
# across rows and grade-types across columns into one with years across columns
# and grade-types across rows
# Do something for given subset of observations:
dfExams2022_long
# for each 'familyName' and 'firstName' by rows, and for each value for 'year'
# by columns, show mean of remaining variables: (i.e. show mean of grades in
# each year by player)
acast(dfExamsAll_long, familyName + firstName ~ year, mean)
# same, but this time ignore the NA entries when computing the mean: (i.e. for
# 2023, return the midterm grade, as the final grade is not available yet)
acast(dfExamsAll_long, familyName + firstName ~ year, function(x) {
mean(x, na.rm = TRUE)
})
# compute number of observations per player and year:
acast(dfExamsAll_long, familyName + firstName ~ year, length)
# compute number of non-missings per player and year:
acast(dfExamsAll_long, familyName + firstName ~ year, function(x) {
sum(is.na(x) == FALSE)
})
# Alternative:
library(doBy)
# contains command 'summaryBy', which repeats a command for each entry of a
# variable (e.g. for each country, or each year, or both)
# do the same as above:
summaryBy(value ~ familyName + firstName + year, FUN = mean, data = dfExamsAll_long)
summaryBy(value ~ familyName + firstName + year, FUN = function(x) {
    mean(x, na.rm = TRUE)
}, data = dfExamsAll_long)
summaryBy(value ~ familyName + firstName + year, FUN = length, data = dfExamsAll_long)
summaryBy(value ~ familyName + firstName + year, FUN = function(x) {
    sum(is.na(x) == FALSE)
}, data = dfExamsAll_long)
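Base R offers a similar command, 'aggregate', which needs no extra package; a sketch on a small made-up dataframe:

```r
dfX = data.frame(player = c("A", "A", "B"), grade = c(4, 5, 6))
# mean grade per player:
aggregate(grade ~ player, data = dfX, FUN = mean)
```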
Deal with missing values:
# Find observations/rows with missings:
# vector of logicals indicating whether row is complete:
vIsObsComplete = complete.cases(dfExamsAll_v2)
# vector of logicals indicating whether row has missings:
vObsHasMissings = !vIsObsComplete
# Find rows which have missings in a particular column:
vObsHasMissings = is.na(dfExamsAll_v2$`2023`)
# (if your missing is coded as '.', write ' dfExamsAll_v2$`2023` == "." ', or
# first re-code the missings to NA)
# remove rows/observations with missings:
dfExamsAll_v2[vIsObsComplete, ]
# Find variables/columns with missings:
# vector of logicals indicating whether column has missings:
vColHasMissings = apply(dfExamsAll_v2, 2, function(x) {
    any(is.na(x))
})
# vector of logicals indicating whether column is complete:
vColIsComplete = !vColHasMissings
# remove variables with missings:
dfExamsAll_v2[, vColIsComplete]
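A base-R shortcut for removing all rows with missings is 'na.omit'; a sketch on a made-up dataframe:

```r
dfX = data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
na.omit(dfX)  # keeps only the first row, as it is the only complete one
```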
A path is the location on your computer (think of a folder) which R uses to look for datasets you call and to store things you tell it to store. Here’s how we can define and change the path (note that, in order to run this code, you need to insert paths that work for your own computer):
# Have a look at the folder (path) that R is currently in:
getwd() # 'get working directory'
# Change the path:
setwd("/Users/markomlikota/Documents/MySpecialFolder")
# If you regularly go between different paths/folders, it makes sense to define
# them first:
= "/Users/markomlikota/Documents/MySpecialFolder"
sMyPath setwd(sMyPath)
# Change the path to a sub-folder inside 'MySpecialFolder':
setwd(paste(sMyPath, "/MySubFolder", sep = ""))
# again, it might make sense to store this path:
sMySubPath = paste(sMyPath, "/MySubFolder", sep = "")
setwd(sMySubPath)
# Paths are computer-specific. However, if you collaborate with others via
# Dropbox, you can define the path relative to the Dropbox folder, so that
# everyone can use the same path definition and hence run the same code:
setwd("~/Dropbox/OurSpecialFolder")
# (This works provided that everyone has 'OurSpecialFolder' in the same
# location inside their Dropbox.)
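As an alternative to 'paste', the command 'file.path' joins path components with the correct separator (the folder names here are the made-up ones from above):

```r
sMyPath = "/Users/markomlikota/Documents/MySpecialFolder"
sMySubPath = file.path(sMyPath, "MySubFolder")  # joins with "/"
print(sMySubPath)
## [1] "/Users/markomlikota/Documents/MySpecialFolder/MySubFolder"
```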
We can also use R to create and delete folders in our path (i.e. on our computer):
# List all folders in the current path:
dir()
# Create a folder 'abc' in the current path:
dir.create("abc")
# Verify whether folder 'abc' exists in current path:
dir.exists("abc")
# Delete folder 'abc' in current path: (need 'recursive = TRUE' to delete a
# folder and its contents)
unlink("abc", recursive = TRUE)
The following commands are useful for writing a dataframe to a file:
dfA = data.frame(a = c(10, 20, 30), b = c("RA", "RB", "RC"))
dfB = data.frame(a = c(10, 20, 30), b = c("RA", "RB", "RC"))
# Write .csv files:
write.table(dfA, "myfile.csv", sep = ",")
# to preserve special encoding (e.g. cyrillic characters), use this:
library(readr)
write_excel_csv(x = dfA, file = "myfile.csv", col_names = T)
# Write .xlsx files:
library(openxlsx)
write.xlsx(dfA, "myfile.xlsx")
# write several sheets: (same package)
= createWorkbook("whatever") # need to open a named workbook; name is irrelevant here
wb addWorksheet(wb, "alpha")
addWorksheet(wb, "beta")
writeData(wb, sheet = "alpha", dfA)
writeData(wb, sheet = "beta", dfB)
saveWorkbook(wb, "myfile.xlsx", overwrite = TRUE)
The following commands are useful for reading-in files (datasets) into R:
# Read in .csv file:
# simply:
= read.table("myfile.csv", sep = ",")
dfData # with more options:
= read.csv("myfile.csv", header = TRUE, sep = ",", quote = "\"", dec = ",",
dfData fill = TRUE, comment.char = "", skip = 1, nrows = 4)
# Read in .xlsx file:
library(readxl)
= read_excel("myfile.xlsx", sheet = "alpha") dfData
# Read in .dta file:
library(haven)
= read_dta("myfile.dta")
myData # (this command cannot be ran, as we don't have 'myfile.dta' in our path)
# There are many packages to download data directly from some
# databases/websites. e.g. 'quantmod' to download data from yahoo or FRED
# database, or 'imfr' package for downloading data from IMF (more on them
# later)
# Download file from some url:
library(RCurl)
download.file("https://api.census.gov/data/1994/cps/basic/jan?tabulate=weight(PWCMPWGT)&col+PEEDUCA&row+PEMLR",
"./Data/1994jan", "libcurl")
In my experience, due to type conversion issues, it’s much easier to work with csv files than with excel (xlsx) files in R. So you might want to convert any xlsx files you need first manually into csv files. (You can also do basic adjustments (like deleting unnecessary rows) manually in the xlsx/csv file, though this is discouraged, as you would like others to be able to replicate your whole data-manipulation process and therefore verify your work.)
# If you have to load an xlsx file and some of its columns are not recognized
# as a numeric by R (but as a factor), this can be useful:
indx = sapply(dfData, is.factor)
dfData[indx] = lapply(dfData[indx], function(x) as.numeric(as.character(x)))
# You might also need to convert columns of type 'character' to type 'numeric':
dfData[, -1] = lapply(dfData[, -1], function(x) as.numeric(x))
# (here it's done for all but first column, say because it contains a list of
# countrynames, which are supposed to be strings)
Statistical software like R can be used to draw random numbers from some specified distribution. These numbers are not truly random, but appear indistinguishable from actual random numbers. To make the analysis replicable, one must fix the so-called “seed”, which ensures that every time the code is run, the same quasi-random numbers are drawn.
# Fix seed:
set.seed(50) # choose some (any) number here
# ------------------------------------------------------------------------- #
# PRE-DEFINED UNIVARIATE DISTRIBUTIONS
# ------------------------------------------------------------------------- #
# Uniform distribution:
runif(3) # generate 3 (independent) draws from U(0,1)
runif(3, 0, 1) # same
runif(3, -3, 8) # generate 3 (independent) draws from U(-3,8)
dunif(1.3, -3, 8) # evaluate pdf of U(-3,8) at x=1.3
punif(1.3, -3, 8) # evaluate cdf of U(-3,8) at x=1.3
qunif(0.2, -3, 8) # compute 20th percentile (point corresponding to cdf of 0.2)
# see ?runif for documentation for all these commands pertaining to the Uniform
# distribution
# note that these functions can also be applied to vectors
# Normal distribution:
rnorm(3) # generate 3 (independent) draws from standard Normal
rnorm(3, 2, 1) # generate 3 (independent) draws from Normal with mean 2 and standard deviation 1
dnorm(1.3, 2, 1) # return pdf at x=1.3
pnorm(1.3, 2, 1) # return cdf at x=1.3
qnorm(0.2, 2, 1) # compute 20th percentile
# hence, to get critical values for two-sided 95% level test:
qnorm(0.025) # 2.5th percentile of N(0,1)
qnorm(0.975) # 97.5th percentile of N(0,1)
# see ?rnorm for documentation for all these commands pertaining to the Normal
# distribution
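Note that the q- and p-functions are inverses of each other, which makes for a quick sanity check:

```r
pnorm(qnorm(0.975))  # cdf evaluated at the 97.5th percentile gives back 0.975
qnorm(pnorm(1.96))   # and vice versa: gives back 1.96
```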
# t-distribution:
rt(3, 32) # 3 draws, df=32
# dt, pt, qt exist too; see ?rt
# Chi-squared:
rchisq(3, 9) # 3 draws, df=9
# dchisq, pchisq, qchisq exist too; see ?rchisq
# Inverse Gamma:
library(invgamma)
rinvgamma(3, 10, 2) # 3 draws from IG with shape 10 and rate 2
# can also specify shape and scale; see ?rinvgamma
# dinvgamma, pinvgamma, qinvgamma exist too; see ?rinvgamma
# ------------------------------------------------------------------------- #
# PRE-DEFINED MULTIVARIATE DISTRIBUTIONS
# ------------------------------------------------------------------------- #
# Multivariate Normal:
library(mvtnorm)
vMeans = c(0, 2)
mVariance = matrix(data = c(0.4, 0.2, 0.2, 0.9), nrow = 2, ncol = 2)
rmvnorm(3, vMeans, mVariance) # 3 draws from multivariate Normal with specified mean and variance
# dmvnorm exists too; see ?rmvnorm
# Inverse Wishart:
library(MCMCpack)
nu = 5
mS = matrix(data = c(0.4, 0.2, 0.2, 0.9), nrow = 2, ncol = 2)
riwish(v = nu, S = mS) # single draw
# diwish exists too; see ?riwish
# ------------------------------------------------------------------------- #
# MULTINOMIAL DISTRIBUTION (MANUALLY DEFINED UNIVARIATE DISTRIBUTION)
# ------------------------------------------------------------------------- #
# Define a discrete distribution with the following possible realizations
# (outcomes):
vOutcomes = c(3.4, 5.3, 3.2)
# Sample 3 times from these elements/outcomes (with replacement):
sample(vOutcomes, 3, replace = TRUE)
# by default, R assumes that all elements have equal probabilities
# alternatively supply vector of probabilities:
vProbabilities = c(0.2, 0.2, 0.6)
sample(vOutcomes, 3, replace = TRUE, prob = vProbabilities)
# Fit Kernel-density-estimate on vector 'vOutcomes':
density(vOutcomes) # assuming equal probabilities
density(vOutcomes, weights = vProbabilities) # supplying vector of probabilities
# Load data for examples:
data(mtcars)
Run a linear regression:
# Regress hp on intercept, mpg and gear:
lm(hp ~ mpg + gear, data = mtcars)
##
## Call:
## lm(formula = hp ~ mpg + gear, data = mtcars)
##
## Coefficients:
## (Intercept) mpg gear
## 249.28 -10.58 29.84
# Alternatively, we can simply use three vectors (not columns in a dataframe):
lm(mtcars$hp ~ mtcars$mpg + mtcars$gear)
##
## Call:
## lm(formula = mtcars$hp ~ mtcars$mpg + mtcars$gear)
##
## Coefficients:
## (Intercept) mtcars$mpg mtcars$gear
## 249.28 -10.58 29.84
# Doing it manually is also easy (and one can be sure what the code is doing):
= mtcars[, c("mpg", "gear")]
mX = as.matrix(cbind(rep(1, nrow(mX)), mX))
mX = mtcars$hp
mY = solve(t(mX) %*% mX) %*% (t(mX) %*% mY)
vBetaHat print(vBetaHat)
## [,1]
## rep(1, nrow(mX)) 249.28298
## mpg -10.58447
## gear 29.84493
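Continuing the manual approach, the (homoskedasticity-based) standard errors can be computed from the residuals as well; a sketch (compare with the SEs reported by summary(lm(hp ~ mpg + gear, data = mtcars))):

```r
data(mtcars)
mX = as.matrix(cbind(1, mtcars[, c("mpg", "gear")]))  # regressors incl. intercept
mY = mtcars$hp
vBetaHat = solve(t(mX) %*% mX) %*% (t(mX) %*% mY)
vResid = mY - mX %*% vBetaHat
N = nrow(mX)
K = ncol(mX)
s2 = sum(vResid^2)/(N - K)            # residual variance (with DF correction)
mVarBeta = s2 * solve(t(mX) %*% mX)   # variance-covariance matrix of estimates
vSE = sqrt(diag(mVarBeta))            # standard errors
print(vSE)
```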
# Drop intercept:
lm(hp ~ -1 + mpg + gear, data = mtcars)
# Include dummy for 8 cylinders:
lm(hp ~ mpg + gear + I(cyl == 8), data = mtcars)
# Include dummy for each value of cylinders:
lm(hp ~ mpg + gear + factor(cyl), data = mtcars)
# Add mpg-squared:
lm(hp ~ mpg + I(mpg^2) + gear, data = mtcars)
# add dummy for wt larger than mean:
lm(hp ~ -1 + mpg + gear + (wt > mean(wt)), data = mtcars)
# add as covariates all polynomials of 'mpg' and 'gear' up to order 2:
library(stats)
lm(hp ~ poly(as.matrix(mtcars[, c("mpg", "gear")]), degree = 2, raw = TRUE), data = mtcars)
Analyze the output of a linear regression:
# Re-run first regression from above, but store to object:
reg = lm(hp ~ -1 + mpg + gear, data = mtcars)
# Extract vector of residuals:
reg$residuals
# Extract vector of estimated coefficients:
reg$coefficients
# Extract vector of fitted values:
reg$fitted.values
# Get all sorts of summary stats:
summary(reg)
# Store this object:
mySumReg = summary(reg)
# This allows you to extract even more statistics.
# Extract R-squared:
mySumReg$r.squared
# Extract Adjusted R-squared:
mySumReg$adj.r.squared
# Extract matrix with estimated coefficients, SEs, t-vals and p-vals in
# columns:
mySumReg$coefficients
By default, the “lm()” command assumes homoskedasticity. The following commands can be used to adjust standard errors in case of heteroskedasticity:
library(sandwich)
# Compute adjusted variance-covariance matrix of coefficients:
= vcovHC(reg, type = "HC")
mVarHskRob # Get robust SEs of individual coefficients by taking square root of diagonal:
sqrt(diag(vcovHC(reg, type = "HC")))
# Note: type = 'HC0' has no DF correction while type = 'HC1' has the DF
# correction. Both are valid asymptotically, but a DF correction is preferable
# in small samples.
# Print results with robust SEs:
library(lmtest)
coeftest(reg, vcov. = mVarHskRob)
# alternatively:
summary(reg, vcov = function(x) vcovHC(x, type = "HC"))
Information criteria:
# Compute Bayesian information criterion:
BIC(reg)
# Compute Schwarz information criterion (manually):
MSE = mean(reg$residuals^2)
K = length(reg$coefficients) # K = number of regressors
N = length(reg$residuals) # N = number of obs
SIC = (K/N) * log(N) + log(MSE)
By default, a t-test of the null hypothesis that a coefficient equals zero is conducted for each coefficient. Computing (manually) the t-statistic for testing whether a single coefficient equals some other value is trivial once the estimated parameter value and its SE are computed.
We can also run other tests manually (e.g. F-test, Wald-test, etc.). For the F-test, we can also use the following package:
# Test whether mpg and gear are jointly insignificant in our baseline
# regression, using adjusted standard errors:
library(car)
linearHypothesis(reg, test = "F", c("mpg=0", "gear=0"), vcov. = vcovHC(reg, type = "HC"))
# Manual alternative: 1. run restricted regression, store R^2 (R2r) 2. run
# unrestricted regression, store R^2 (R2u), number of regressors (k+1), number
# of restrictions (q), and compute:
Fstat = ((R2u - R2r)/q)/((1 - R2u)/(n - k - 1))
# Apparently, this package can also be used to run Wald-tests;
?linearHypothesis
While there are packages for many more econometrics-related computations in R, I strongly advocate using these standard commands only to run preliminary regressions as part of your initial data analysis, but not for the final output in a paper or otherwise important document. If the output is really important, you should take your time to code the estimates, compute the standard errors, conduct hypothesis tests, etc., yourself. Only then do you know exactly how your results have been obtained and what the assumptions behind them are.
Running regressions in a loop:
# The first argument of 'lm' is of the form ' Y ~ x1 + x2 ... + xK ' and is
# called a formula. Sometimes we want to specify this formula in a loop, in
# which case generating it from a string can be useful. The following code
# will regress hp on three different dummies for the number of cylinders, one
# at a time:
# Create dummies for each value of 'cyl' and add to dataframe:
library(fastDummies)
= dummy_cols(mtcars, select_columns = c("cyl"))
mtcars
# Regress hp on the number of cylinders (without intercept), (i.e. this
# computes mean horse power by different numbers of cylinders)
for (hh in c(4, 6, 8)) {
= lm(formula(paste("hp ~ -1 + cyl_", hh, sep = "")), data = mtcars)
reg print(reg)
}
# Note: we can store all these regressions into a single list! See also
# section on 'Workspace Management'
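As noted, the fitted regressions can be collected in a list, which keeps the whole regression objects available for later use; a sketch (the dummies are built manually here to keep the example package-free):

```r
data(mtcars)
# create one dummy column per number of cylinders:
for (hh in c(4, 6, 8)) {
    mtcars[, paste("cyl_", hh, sep = "")] = as.numeric(mtcars$cyl == hh)
}
# run one regression per dummy and store each lm-object in a named list:
lRegs = list()
for (hh in c(4, 6, 8)) {
    lRegs[[paste("cyl_", hh, sep = "")]] = lm(formula(paste("hp ~ -1 + cyl_",
        hh, sep = "")), data = mtcars)
}
lRegs[["cyl_8"]]$coefficients  # access one regression's results by name
```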
Probit Model:
# Regress dummy on having hp above average on mpg and gear:
reg = glm((hp > mean(mtcars$hp)) ~ mpg + gear, family = binomial(link = "probit"),
    data = mtcars)
summary(reg)
IV Regression:
# Run an IV regression with mpg being endogenous regressor gear being exogenous
# regressor and cyl and disp being IVs:
library(ivreg)
ivreg(hp ~ mpg + gear | gear + cyl + disp, data = mtcars)
# i.e. hp is dependent variable, after '~' come all regressors (endogenous and
# exogenous), and then after '|' come the exogenous regressors and IVs
Least-Absolute Deviations (LAD) regression:
library(quantreg)
lad_two = rq(hp ~ mpg + gear, data = mtcars)
summary(lad_two)
# Load data for examples:
library(AER)
data(Fatalities)
# Preliminary Things
# Declare dataframe 'Fatalities' as panel data:
library(plm)
pdata = pdata.frame(Fatalities, index = c("state", "year"), drop.index = F, row.names = T)
# Compute ratio of number of fatal accidents to population:
"fatalToPop"] = pdata$fatal/pdata$pop
pdata[,
# Run regressions: regress fatal accidents per capita on income, drinking age
# and the latter's interaction with the percentage of young drivers:
# Pooled OLS:
plm(fatalToPop ~ income + drinkage + drinkage * youngdrivers, data = pdata, model = "pooling")
# Random Effects:
plm(fatalToPop ~ income + drinkage + drinkage * youngdrivers, data = pdata, model = "random")
# Fixed Effects - Within:
plm(fatalToPop ~ income + drinkage + drinkage * youngdrivers, data = pdata, model = "within",
index = c("state", "year"))
# Fixed Effects - First-Difference:
plm(fatalToPop ~ income + drinkage + drinkage * youngdrivers, data = pdata, model = "fd",
index = c("state", "year"))
# Compute clustered standard errors for FE-W regression:
regFEW = plm(fatalToPop ~ income + drinkage + drinkage * youngdrivers, data = pdata,
    model = "within", index = c("state", "year"))
vcovHC(regFEW, cluster = "group", type = "sss")
# Note: there are many ways to cluster your standard errors... That's why it's
# best if you code this yourself.
# Print output with adjusted SEs:
coeftest(regFEW, vcovHC(regFEW, cluster = "group", type = "sss"))
# Conduct F-test for income, drinkage and drinkage*youngdrivers being jointly
# insignificant:
linearHypothesis(regFEW, test = "F", c("income=0", "drinkage=0", "drinkage:youngdrivers=0"),
vcov. = vcovHC(regFEW, cluster = "group", type = "sss"))
# Joint significance test on the individual (not time) fixed effects of a panel
# regression
regPOLS = plm(fatalToPop ~ income + drinkage + drinkage * youngdrivers, data = pdata,
    model = "pooling")
pFtest(regFEW, regPOLS)
# The following functions are useful when doing PD F.E. regressions manually:
# compute FDs of a variable for each unit:
library(doBy)
fFDColumnByUnit = function(mX, indUnit, indCol) {
    mHelp = summaryBy(list(c(colnames(mX)[indCol]), c(colnames(mX)[indUnit])), FUN = function(x) {
        x[-1] - x[-length(x)]
    }, data = mX)
    as.vector(t(as.matrix(mHelp[, -1])))
}
# compute time-demeaned values of a variable for each unit:
fDemeanColumnByUnit = function(mX, indUnit, indCol) {
    mHelp = summaryBy(list(c(colnames(mX)[indCol]), c(colnames(mX)[indUnit])), FUN = function(x) {
        x - mean(x)
    }, data = mX)
    as.vector(t(as.matrix(mHelp[, -1])))
}
# Apply: compute FDs of fourth column (the unit (here: state) is specified in
# first column):
fFDColumnByUnit(pdata, 1, 4)
# compute time-demeaned fourth column:
fDemeanColumnByUnit(pdata, 1, 4)
# Load data for examples:
data("USAccDeaths")
data("UKDriverDeaths")
The above discussion on linear regressions also applies for time series data. However, there are benefits of declaring your time series data as an “xts” object, as this facilitates certain manipulations.
Create “xts” object:
library(xts)
# First put both time series from matrix to vector format:
vUKdeaths = as.vector(t(UKDriverDeaths))
vUSdeaths = as.vector(t(USAccDeaths))
# Now create xts objects from these numeric vectors: first create vector of
# dates:
= seq(as.Date("1969-01-01"), length = length(vUKdeaths), by = "months")
vDatesUK # then create xts-vector of your TS:
= xts(x = vUKdeaths, order.by = vDatesUK)
vUKdeaths
# Same for USdeaths:
= seq(as.Date("1973-01-01"), length = length(vUSdeaths), by = "months")
vDatesUS = xts(x = vUSdeaths, order.by = vDatesUS)
vUSdeaths
# Combine the two datasets by dates:
merge(vUSdeaths, vUKdeaths, all = TRUE)
# Here's how to create an xts-matrix, with multiple variables at once:
library(carData)
data(Hartnagel)
= seq(as.Date("1931-01-01"), length = nrow(Hartnagel), by = "years")
vDates = xts(x = Hartnagel[, c("mtheft", "ftheft")], order.by = vDates)
mTheft
# Let's combine the two monthly time series above, by date:
mDeaths = merge(vUSdeaths, vUKdeaths, all = TRUE)
colnames(mDeaths) = c("US", "UK")
mDeaths = mDeaths[is.na(mDeaths$US) == FALSE, ] # trim to non-missing dates
# note that this stays an xts object
Easy manipulations with “xts” objects:
# Turn monthly into quarterly series by keeping only the last month of each
# quarter:
vUKdeathsQ = apply.quarterly(vUKdeaths, last)
# similarly for taking averages
# Compute differences:
diff(vUKdeaths, 1) # first differences
diff(vUKdeaths, 4) # fourth differences (y_t - y_{t-4})
# Compute lags & leads:
lag(vUKdeaths, 1) # first lag
lag(vUKdeaths, 2) # second lag
# Use lags in regression: fit AR(2) model on UK deaths:
reg = lm(vUKdeaths ~ lag(vUKdeaths, 1) + lag(vUKdeaths, 2))
# Note: lag() works only for xts objects! for ts objects it doesn't, but you
# get no error message!!!
# compute percentage growth (log differences):
vUSdeathsGrowth = diff(log(mDeaths$US))[-1]
vUKdeathsGrowth = diff(log(mDeaths$UK))[-1]
# (we drop the first observation as it is equal to NA; can only start growth rates
# from second entry)
# Compute ACF and PACF:
acf(vUKdeaths[-1])
pacf(vUKdeaths[-1])
# Compute cross-correlations of two series (here up to 3 leads and lags):
ccf(as.numeric(vUSdeathsGrowth), as.numeric(vUKdeathsGrowth), 3, plot = FALSE)
# Working with dates in xts objects:
library(lubridate)
# access vector of dates corresponding to xts-object 'vUSdeathsGrowth':
index(vUSdeathsGrowth)
# see the month for each date:
month(index(vUSdeathsGrowth))
# Note: plot of xts object is different than simply plotting a vector:
plot(vUSdeathsGrowth)
plot(as.numeric(vUSdeathsGrowth))
# (R recognizes by default that the x-axis are dates, that we want a line,
# etc.) But more on plotting in the separate section.
With the package “quantmod”, we can download time series data directly from the FRED Database (and other sources, e.g. Yahoo Finance). It comes in “xts” format:
library(quantmod)
# download monthly US CPI data:
getSymbols("CPIAUCSL", src = "FRED", auto.assign = FALSE)
# by default, it is stored by creating an xts object of the same name, CPIAUCSL
# download monthly US CPI data and store it as object 'cpi':
= getSymbols("CPIAUCSL", src = "FRED", auto.assign = FALSE)
cpi
# download quarterly GDP deflator:
getSymbols("GDPDEF", src = "FRED")
# adjust the dates:
= GDPDEF["1960-01-01/2025-01-01"]
GDPDEF # if several series have to be trimmed to same dates, it makes sense to define
# this as a function:
= "1960-01-01"
sStartDate = "2025-01-01"
sEndDate = function(x) {
fMyTimeRange paste(sStartDate, sEndDate, sep = "/")]
x[
}
# Compute annualized inflation:
vInfl = 100 * 4 * diff(log(GDPDEF))[-1]
# compute GDP growth for same dates:
getSymbols("GDP", src = "FRED")
GDP = fMyTimeRange(GDP)
vGDPg = 100 * 4 * diff(log(GDP))[-1]
For some things, like fitting ARMA models and forecasting with them, it’s easier to work with “ts”-objects rather than “xts”-objects:
# Create ts-object:
vUSdeathsGrowthTS = ts(vUSdeathsGrowth, start = c(1973, 2), frequency = 12)
# ARIMA estimation:
# Estimate AR(1) process
model = arima(vUSdeathsGrowthTS, order = c(1, 0, 0))
# Estimate ARMA(2,1) process
model = arima(vUSdeathsGrowthTS, order = c(2, 0, 1))
# the middle argument stands for integration order (ARIMA process)
summary(model)
# Forecasting:
library(forecast)
# Forecast up to 10 periods ahead and compute two-sided 90% and 95% CIs:
mForecasts = forecast(model, h = 10, level = c(90, 95))
# Plot these forecasts with fanchart:
plot(mForecasts, PI = TRUE, showgap = FALSE, shaded = TRUE)
Work with reduced-form VARs:
library(vars)
mData = data.frame(US = vUSdeathsGrowth, UK = vUKdeathsGrowth)
# Estimate a VAR(p):
modelVAR = VAR(mData, p = 2) # add type='trend' to allow for trend
summary(modelVAR)
vResids = residuals(modelVAR)
sumModelVAR = summary(modelVAR)
mSigma = sumModelVAR$covres # variance-covariance matrix of errors
coef(modelVAR) # coefficients (Phi's)
# Forecast with VAR(p):
predict(modelVAR, n.ahead = 5, level = c(90, 95))
# Could plot forecasts with the command below, but it's a better idea to create
# nice plots yourself! fanchart( predict(modelVAR, n.ahead = 5, level = c(90, 95)) )
Simulate ARIMA models:
arima.sim(list(ar = c(0.7, 0.2)), n = 50 * 4, rand.gen = rnorm)
# can also supply innovations manually; see ?arima.sim
# but could also code the whole ARIMA simulation manually...
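A manual simulation of, say, an AR(2) process could look as follows (a sketch; the coefficients match the arima.sim call above):

```r
set.seed(50)
n = 200
vPhi = c(0.7, 0.2)  # AR coefficients
vEps = rnorm(n)     # innovations
vY = numeric(n)     # simulated series, initialized at zero
for (t in 3:n) {
    vY[t] = vPhi[1] * vY[t - 1] + vPhi[2] * vY[t - 2] + vEps[t]
}
plot(vY, type = "l")
```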
A primary reason behind the popularity of R is the fact that one can create plots very flexibly. Before we discuss more advanced plotting-techniques based on the package “ggplot2”, let’s start with some basic plots that can be constructed using pre-built commands in R. They are useful for checking quickly how something looks in the data-exploratory process (and for inclusion in preliminary reports).
# Generate data to be plotted:
set.seed(50)
vx = seq(0.1, 7, by = 0.1)
vNormPDF = dnorm(vx, 4, 2)
vChi2PDF = dchisq(vx, 3)
vtPDF = dt(vx, 10)
library(mvtnorm)
vMeans = c(2, 5, 7)
mVariance = diag(c(2, 1, 3))
mVariance[1, 2] = 1
mVariance[2, 1] = 1
mNormDraws = rmvnorm(30, vMeans, mVariance)
# Load further data from actual dataset:
library(carData)
data(Hartnagel)
data(mtcars)
Plot single line:
# default:
plot(vNormPDF)
# to get line instead of dots:
plot(vNormPDF, type = "l")
# for line and dots:
plot(vNormPDF, type = "o")
# Note that by default, index is on the x-axis (1,2,...). To specify x-axis:
plot(vx, vNormPDF, type = "l")
# specify y-limits, and x-limits:
plot(vx, vNormPDF, type = "l", ylim = c(0, 0.25), xlim = c(-2, 9))
# specify title, x-axis-label and y-axis-label:
plot(vx, vNormPDF, type = "l", main = "My Title Here", xlab = "my x lab", ylab = "my y lab")
# note: can also be used to delete default labs: (default has no title)
plot(vx, vNormPDF, type = "l", xlab = "", ylab = "")
# specify color of markers (line and/or dots): (most standard colors are
# supported)
plot(vx, vNormPDF, type = "l", col = "blue")
plot(vx, vNormPDF, col = "blue")
# Make line thicker:
plot(vx, vNormPDF, type = "l", lwd = 4)
# Remove ticks for y-axis:
plot(vx, vNormPDF, type = "l", yaxt = "n")
# Remove ticks for x-axis:
plot(vx, vNormPDF, type = "l", xaxt = "n")
# (these commands make more sense if the corresponding axis-label is also
# deleted)
With package “latex2exp”, we can add math to labels (x-lab, y-lab, title):
library(latex2exp)
sXlab = TeX("$\\theta_t $ (I made it up)")
sTitle = TeX("$f(\\theta_t)$")
plot(vx, vNormPDF, type = "l", xlab = sXlab, ylab = "", main = sTitle)
# The latex-part of the expression starts and ends with a dollar sign. The
# code is just as in latex, except that one has to use two backslashes to call
# latex-symbols.
# Taken together, nice single-line plot:
plot(vx, vNormPDF, type = "l", col = "blue", lwd = 4, main = "my pretty plot of N(4,4)",
xlab = "", ylab = "", yaxt = "n", ylim = c(0, 0.22), xlim = c(-1, 8))
We can easily add further lines to such an existing plot:
# Initial plot:
plot(vx, vNormPDF, type = "l", col = "blue")
# Add another line (chi^2-distribution, in red):
lines(vx, vChi2PDF, type = "l", col = "red")
# Note that the y-axis is not adjusted automatically so that the new line fits.
# Add mathematically defined lines:
abline(v = 3, col = "green") # add vertical line at x=3
abline(h = 0.08, col = "gray") # add horizontal line at y=0.08
abline(a = -0.2, b = 0.1, col = "orange") # add line with specific intercept and slope
# Taken together, nice lines plot:
plot(vx, vNormPDF, type = "l", col = "blue", lwd = 4, main = "my pretty plot of N(4,4) and Chi2(3)",
xlab = "", ylab = "", ylim = c(0, 0.3), xlim = c(-1, 8))
lines(vx, vChi2PDF, type = "l", col = "red", lwd = 4)
Note that we can also directly plot a function, without first manually specifying a grid of x-values and evaluating the function at those values:
# Plot standard normal pdf:
plot(dnorm, xlim = c(-2, 8))
# Plot pdf of N(4,4):
plot(function(x) {
    dnorm(x, 4, 2)
}, xlim = c(-2, 8))
# Plot exponential function:
plot(exp, xlim = c(1, 3))
Of course, all of the options discussed above can also be applied for these plots. They also work for all of the plots discussed below (scatter, histogram, boxplot, …).
Saving a plot as a .png or .pdf file:
# 1. specify your path:
sMyPlotPath = "/Users/markomlikota/Dropbox/myfolder"
setwd(sMyPlotPath)
# 2. specify name:
png("blabla.png") # specify width and height here for nice plots
# 3. create the plot (or print an existing, already saved one with 'print()'):
plot(vx, vNormPDF, type = "l", col = "blue", lwd = 4, main = "my pretty plot of N(4,4) and Chi2(3)",
xlab = "", ylab = "", ylim = c(0, 0.3), xlim = c(-1, 8))
lines(vx, vChi2PDF, type = "l", col = "red", lwd = 4)
# 4. tell R that you're done: (no further lines or so will be added)
dev.off()
Note that it is not possible, without rather complicated adjustments, to save such a plot as an object. This is different for plots created via “ggplot2”.
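If needed, base R does offer a partial workaround (a sketch; 'pBase' is a made-up name): recordPlot() snapshots the current graphics device and replayPlot() re-draws it later. The recorded object remains tied to the device state, however, so it is far less portable than a ggplot object.

```r
plot(vx, vNormPDF, type = "l", col = "blue")   # draw a base plot
pBase = recordPlot()                           # snapshot the current device
plot.new()                                     # ...device contents change...
replayPlot(pBase)                              # re-draw the recorded plot
```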
Scatter-Plot:
# To obtain a scatter plot, simply specify two vectors (as before) and use dots
# rather than a line as markers:
plot(mNormDraws[, 1], mNormDraws[, 2])
# Change marker type:
plot(mNormDraws[, 1], mNormDraws[, 2], pch = 8)
plot(mNormDraws[, 1], mNormDraws[, 2], pch = 15)
# (see e.g.
# https://www.math.ucla.edu/~anderson/rw1001/library/base/html/points.html for
# further marker-types)
# add further cloud of points in different color:
points(mNormDraws[, 1], mNormDraws[, 3], col = "red")
# add regression line for first cloud of points:
abline(lm(mNormDraws[, 2] ~ mNormDraws[, 1]))
# for second:
abline(lm(mNormDraws[, 3] ~ mNormDraws[, 1]), col = "red")
# Should enlarge x- and y-axis to make this more appealing.
# Taken together, nice scatter plot:
plot(mNormDraws[, 1], mNormDraws[, 2], col = "blue", pch = 4, lwd = 2, main = "my pretty scatterplot of Normal draws",
xlab = "", ylab = "", ylim = c(2, 11), xlim = c(-2, 5))
points(mNormDraws[, 1], mNormDraws[, 3], col = "red", pch = 4, lwd = 2)
abline(lm(mNormDraws[, 2] ~ mNormDraws[, 1]), col = "blue")
abline(lm(mNormDraws[, 3] ~ mNormDraws[, 1]), col = "red")
Histogram:
# Default:
hist(mNormDraws[, 1])
# Add color:
hist(mNormDraws[, 1], col = "blue")
# Add border color:
hist(mNormDraws[, 1], col = "blue", border = "white")
# Remove fill-color:
hist(mNormDraws[, 1], col = NULL, border = "blue")
# Use density rather than frequency on y-axis: (so that area of bins sums up to
# 1)
hist(mNormDraws[, 1], freq = FALSE)
# Specify number of bins by specifying location of bin-breaks:
hist(mNormDraws[, 1], breaks = seq(-2, 6, by = 0.5))
# Specify number of bins:
hist(mNormDraws[, 1], breaks = 8)
# (doesn't work perfectly; R uses it only as suggestion; see ?hist)
# Taken together, nice histogram:
hist(mNormDraws[, 1], breaks = seq(-1, 5, by = 0.5), col = "blue", border = "white",
main = "my pretty histogram of Normal draws", xlab = "", ylab = "")
Boxplot:
# Boxplot of first vector of Normal draws:
boxplot(mNormDraws[, 1])
# of first two:
boxplot(mNormDraws[, 1], mNormDraws[, 2])
# of all three: (i.e. of all columns in supplied matrix)
boxplot(mNormDraws)
# Naming columns in that matrix changes x-tick-labels:
colnames(mNormDraws) <- c("first", "second", "third")
boxplot(mNormDraws)
# With dataframes, one can directly choose to get boxplots of one variable as a
# function of (values of) another:
boxplot(hp ~ cyl, data = mtcars, xlab = "cyls", ylab = "", main = "Boxplot of HP by #cyls")
# (rather than first manually creating vectors that show hp for each value of
# cyl)
QQ-Plot:
qqnorm(vtPDF, xlab = "N(0,1)", ylab = "t(10)", col = "blue")
qqline(vtPDF, col = "red")
Display multiple plots as panels in a single figure:
library(dyn)
par(mfrow = c(1, 3)) # 1 x 3 plot
hist(mNormDraws[, 1], breaks = seq(-1, 8, by = 1), col = "blue", border = "white",
main = "hist var1", xlab = "", ylab = "")
hist(mNormDraws[, 2], breaks = seq(-1, 8, by = 1), col = "blue", border = "white",
main = "hist var2", xlab = "", ylab = "")
plot(mNormDraws[, 1], mNormDraws[, 2], col = "blue", pch = 4, lwd = 2, main = "scatter",
xlab = "var1", ylab = "var2", ylim = c(2, 11), xlim = c(-2, 5))
abline(lm(mNormDraws[, 2] ~ mNormDraws[, 1]), col = "blue")
par(mfrow = c(1, 1))
# Adjust margins and specify overarching title:
par(mfrow = c(1, 3), oma = c(2, 2, 2, 2), mar = c(2, 2, 2, 2))
# (see https://www.r-graph-gallery.com/74-margin-and-oma-cheatsheet.html)
hist(mNormDraws[, 1], breaks = seq(-1, 8, by = 1), col = "blue", border = "white",
main = "hist var1", xlab = "", ylab = "")
hist(mNormDraws[, 2], breaks = seq(-1, 8, by = 1), col = "blue", border = "white",
main = "hist var2", xlab = "", ylab = "")
plot(mNormDraws[, 1], mNormDraws[, 2], col = "blue", pch = 4, lwd = 2, main = "scatter",
xlab = "var1", ylab = "var2", ylim = c(2, 11), xlim = c(-2, 5))
abline(lm(mNormDraws[, 2] ~ mNormDraws[, 1]), col = "blue")
mtext("Bivariate Normal Draws", outer = TRUE, cex = 1.5)
par(mfrow = c(1, 1))
# Empty plot: (useful when creating e.g. a plot with 3 x 3 subplots and you
# have only 8 plots)
plot(c(0, 0), type = "n", ylab = "", xlab = "", axes = FALSE)
Plotting xts objects:
# When plotting xts objects, we get a plot with lots of default-adjustments (R
# realizes that the x-axis refers to time, returns a grid, etc.):
# To illustrate this, we download data on GDP and inflation directly from the
# FRED database:
library(quantmod)
# Inflation (based on GDP deflator):
getSymbols("GDPDEF", src = "FRED")
GDPDEF = GDPDEF["1960-01-01/2025-01-01"]
vInfl = 100 * 4 * diff(log(GDPDEF))[-1]
# GDP growth:
getSymbols("GDP", src = "FRED")
GDP = GDP["1960-01-01/2025-01-01"]
vGDPg = 100 * 4 * diff(log(GDP))[-1]
plot(vInfl, main = "Inflation (quarterly, annualized, %)")
# delete y-axis ticks on the right:
plot(vInfl, main = "Inflation (quarterly, annualized, %)", yaxis.right = FALSE)
# Plot two series on same graph:
plot(cbind(vInfl, vGDPg), main = "Inflation & GDP Growth")
# (better than adding the second series manually:)
# plot(vInfl, main = "Inflation & GDP Growth", yaxis.right = FALSE)
# lines(vGDPg, col = "red", yaxis.right = FALSE)
With the package “ggplot2”, one has much more flexibility in creating plots.
We’ll start with a plot containing a single line and then discuss multiple lines, scatter plots, histograms and other plot-types. Various plot-options are discussed, and each is introduced as part of the discussion of the plot-type for which it is best explained.
(A further popular package for plotting in R is “plotly”. However, in my view “ggplot2” is better in all aspects except for creating interactive plots.)
Plot single line:
library(ggplot2)
mData = data.frame(xvar = vx, norm = vNormPDF)
# Plot single line based on two vectors:
ggplot() + geom_line(aes(x = vx, y = vNormPDF))
# Plot single line based on a dataframe:
ggplot(data = mData) + geom_line(aes(x = xvar, y = norm))
# To change appearance of line, specify further attributes in the option
# 'geom_line()'. e.g. change color, size and linetype:
ggplot(data = mData) + geom_line(aes(x = xvar, y = norm), color = "blue", size = 0.9,
linetype = "dashed")
# see http://sape.inf.usi.ch/quick-reference/ggplot2/linetype for linetypes
# available. note: can specify linetype also as number (1 is solid, 2 is dashed,
# etc.)
# Add further options with pluses:
ggplot(data = mData) + geom_line(aes(x = xvar, y = norm), color = "blue", size = 0.9) +
labs(x = "my x-lab", y = "my y-lab", title = "my title") + theme_bw() + theme(aspect.ratio = 5/8)
Save plot to object:
pp1 = ggplot(data = mData) + geom_line(aes(x = xvar, y = norm), color = "blue", size = 0.9) +
    labs(x = "my x-lab", y = "my y-lab", title = "my title")
# Then, can add further options to already existing plot:
pp1 + theme_bw() + theme(aspect.ratio = 5/8)

# And can save to new object (or override existing one):
pp = pp1 + theme_bw() + theme(aspect.ratio = 5/8)

# You may want to re-use some options over and over again. Instead of typing
# them every time, store them as a list and call them at once all together:
plotOptions = list(theme_bw(), theme(aspect.ratio = 5/8))

# Show the plot 'pp1' with these options:
pp1 + plotOptions
# Add these options to plot 'pp1':
pp1 = pp1 + plotOptions
Save plot as .png or .pdf file:
# 1. create plot or print/call existing one:
pp
# 2. save:
ggsave("myprettyplot.png")
# save with advanced options:
ggsave("myprettyplot.png", width = 15, height = 15 * (8/5), units = "cm")
# note: width/height should match previously specified aspect ratio
# Note: for inclusion in latex-document, save plot best as pdf
As before, can have latex-code for labels:
library(latex2exp)
sXlab = TeX("$\\theta_t $ (I made it up)")
sTitle = TeX("$ f(\\theta_t) $")
pp1 + labs(x = sXlab, y = "", title = sTitle)
There are many more options to customize the x-axis, y-axis and title:
# Adjust limits and ticks of x-axis and limits of y-axis:
pp + scale_x_continuous(breaks = seq(-1, 8, 3), limits = c(-1, 8)) + scale_y_continuous(limits = c(0, 0.25))
# Equivalent to the latter option:
pp + ylim(c(0, 0.25))
# Also equivalent:
pp + expand_limits(y = c(0, 0.25))
# When changing only upper limit, type:
pp + expand_limits(y = c(NA, 0.25))
# (analogously for only lower limit)

# Manually specify labels of ticks on x-axis:
pp + scale_x_continuous(limits = c(-1, 8), breaks = c(0, 3, 4, 6), labels = paste(c(0, 3, 4, 6), "a", sep = ""))
# note: by specifying empty labels for certain breaks, we can have ticks
# without labels:
vLabs = paste(c(0, 3, 4, 6), "a", sep = "")
vLabs[2] = ""
pp + scale_x_continuous(limits = c(-1, 8), breaks = c(0, 3, 4, 6), labels = vLabs)
# Change axis values from decimals to percent (e.g. 0.01 becomes 1%): (just for
# illustration; it doesn't make any sense here)
pp + scale_y_continuous(labels = scales::percent)

# Have x-axis in units of pi:
vxDivPi = vx/pi
pp + scale_x_continuous(limits = c(-1, 8), breaks = pi * (0:3), labels = paste(0:3, "pi"))
# (and analogously for any other transformation)

# Force all tick-labels to have the same, specified number of decimals:
scaleFUN = function(x) sprintf("%.3f", x)
pp + scale_y_continuous(limits = c(0, 0.25), breaks = seq(0, 0.25, by = 0.05), labels = scaleFUN)
# (without this option, we'd have two decimals in this case) (the option is
# also useful in some cases where R by default gives different ticks with
# different numbers of decimals)
# Remove axis ticks and tick-labels:
pp + theme(axis.ticks.y = element_blank(), axis.text.y = element_blank())

# To remove axis-label, can use + labs(y = "") as above, or:
pp + theme(axis.title.y = element_blank())

# Change font size for x-axis label:
pp + theme(axis.title.x = element_text(size = 12))

# Change font size of tick labels and space between axis and tick labels:
pp + theme(axis.text.x = element_text(size = 12, vjust = -1))

# Rotate x-axis labels:
pp + theme(axis.text.x = element_text(angle = 30, hjust = 0.5, vjust = 0.8))

# Change font size and family of title:
pp + theme(plot.title = element_text(family = "sans", size = 14))
# Change also margins:
pp + theme(plot.title = element_text(family = "sans", size = 14, margin = margin(0, 0, 2, 0)))
# (instead of margin, can also specify vjust and hjust)
If you have time on the x-axis, use the option “scale_x_date()” rather than “scale_x_continuous()”:
# Obtain quarterly GDP growth and inflation for illustration:
library(quantmod)
getSymbols("GDPDEF", src = "FRED")
GDPDEF = GDPDEF["1960-01-01/2025-01-01"]
vInfl = 100 * 4 * diff(log(GDPDEF))[-1]

getSymbols("GDP", src = "FRED")
GDP = GDP["1960-01-01/2025-01-01"]
vGDPg = 100 * 4 * diff(log(GDP))[-1]

mDataTS = data.frame(date = index(vGDPg), GDPgrowth = as.numeric(vGDPg), inflation = as.numeric(vInfl))
# sidenote: without 'as.numeric()', the column names would be given by the
# column names of the xts-objects 'vGDPg' and 'vInfl' and cannot be overwritten
# Plot GDP growth:
ppTS = ggplot(mDataTS) + geom_line(aes(x = date, y = GDPgrowth), size = 0.9) +
    scale_x_date(date_labels = "%Y", date_breaks = "2 years", limits = as.Date(c("2000-01-01", "2020-01-01")))
# Sometimes, the breaks don't appear as desired. e.g.
ppTS = ggplot(mDataTS) + geom_line(aes(x = date, y = GDPgrowth), size = 0.9) +
    scale_x_date(date_labels = "%Y", date_breaks = "5 years", limits = as.Date(c("2000-01-01", "2025-01-01")))
# In such cases, it's a good idea to set 'breaks' rather than 'date_breaks',
# for which it's useful to load package 'lubridate':
library(lubridate)
ppTS = ggplot(mDataTS) + geom_line(aes(x = date, y = GDPgrowth), size = 0.9) +
    scale_x_date(date_labels = "%Y", breaks = seq(ymd("2000/01/01"), ymd("2025/01/01"), by = "5 years"),
        limits = as.Date(c("2000-01-01", "2025-01-01")))
# The comments from the section on string-dates above apply; if your dates are
# not in the format required by R ('YYYY-MM-DD'), you must first change them to
# this format using the command 'as.Date()'. e.g.
as.Date("31.01.2020", "%d.%m.%Y")
Further useful options:
# Remove grid:
pp + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

# Remove panel border (the box around the main plot area):
pp + theme(panel.border = element_blank())

# Remove panel border but add x- and y-axis lines:
pp + theme(panel.border = element_blank()) + theme(axis.line = element_line(color = "black", size = 0.2))

# Add margins of 1cm all around plot:
pp + theme(plot.margin = unit(c(1, 1, 1, 1), "cm"))

# Shading of specified area: (useful for confidence intervals)
pp + geom_ribbon(aes(ymin = vNormPDF * 0.9, ymax = vNormPDF * 1.1, x = vx), alpha = 0.2, fill = "blue")
# Note: can add several layers:
pp + geom_ribbon(aes(ymin = vNormPDF * 0.9, ymax = vNormPDF * 1.1, x = vx), alpha = 0.2, fill = "blue") +
    geom_ribbon(aes(ymin = vNormPDF * 0.8, ymax = vNormPDF * 1.2, x = vx), alpha = 0.2, fill = "blue")

# Annotate text at given (x,y)-point:
pp + annotate("text", label = "abc", x = 2, y = 0.07)
# (could specify hjust and vjust for further customization)
# note: with date on x-axis, need to put wrapper 'as.Date()' around date:
ppTS + annotate("text", label = "abc", x = as.Date("2003-01-31"), y = 0.1)

# Flip axes:
pp + coord_flip()
# Note that the options can be combined, and they must be if they belong to the
# same option-command; e.g. don't write
theme(panel.border = element_blank())
# and then
theme(axis.title.x = element_text(size = 12))
# but write
theme(panel.border = element_blank(), axis.title.x = element_text(size = 12))
# Note that this can create issues when using a list of pre-defined options.
# When the same option is called several times, the order in which the calls
# appear can change the outcome.
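A small sketch of this order-dependence, assuming the plot object 'pp' from above:

```r
# later theme() calls override conflicting properties of earlier ones;
# here the second call wins, so the tick-labels end up with size 9:
pp + theme(axis.text.x = element_text(size = 12)) + theme(axis.text.x = element_text(size = 9))
# specifying everything in a single theme() call makes the intended result explicit:
pp + theme(axis.text.x = element_text(size = 9))
```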
# Taken together, nice single-line plots:
plotOptions = list(theme_bw(), theme(aspect.ratio = 5/8, axis.title.x = element_text(size = 12),
    axis.title.y = element_text(size = 12), axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12), panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
    plot.title = element_text(family = "sans", size = 14)))
ggplot(data = mData) + plotOptions + geom_line(aes(x = xvar, y = norm), color = "blue",
size = 0.9) + labs(x = sXlab, y = "", title = sTitle) + scale_x_continuous(breaks = seq(-1,
8, 3), limits = c(-1, 8)) + scale_y_continuous(limits = c(0, 0.25)) + geom_ribbon(aes(ymin = vNormPDF *
0.9, ymax = vNormPDF * 1.1, x = vx), alpha = 0.2, fill = "blue") + geom_ribbon(aes(ymin = vNormPDF *
0.8, ymax = vNormPDF * 1.2, x = vx), alpha = 0.2, fill = "blue")
ggplot(mDataTS) + plotOptions + geom_line(aes(x = date, y = GDPgrowth), color = "blue",
size = 0.9) + labs(x = "", y = "", title = "US GDP Growth (quarterly, annualized)") +
scale_y_continuous(breaks = seq(-30, 30, by = 10)) + scale_x_date(date_labels = "%Y",
breaks = seq(ymd("2000/01/01"), ymd("2025/01/01"), by = "5 years"), limits = as.Date(c("2000-01-01",
"2025-01-01")))
Scatter-Plot:
mData = data.frame(norm1 = mNormDraws[, 1], norm2 = mNormDraws[, 2])
# A scatter plot is created analogously to a line plot. We just use the option
# geom_point rather than geom_line.
# Based on two vectors:
ggplot() + geom_point(aes(x = mNormDraws[, 1], y = mNormDraws[, 2]))
# Based on data frame:
ggplot(data = mData) + geom_point(aes(x = norm1, y = norm2))
# Change marker appearance:
pp1 = ggplot(data = mData) + geom_point(aes(x = norm1, y = norm2), size = 1.5, shape = 4,
    color = "blue")
# see http://sape.inf.usi.ch/quick-reference/ggplot2/shape for shapes available
# Some markers (21 - 25) allow for filling:
pp2 = ggplot(data = mData) + geom_point(aes(x = norm1, y = norm2), size = 1.5, shape = 21,
    color = "blue", fill = "blue")
# Can of course add all other options from above, e.g.
pp1 = pp1 + plotOptions + labs(x = "my x-lab", y = "my y-lab", title = "my nice scatter")
# Add labels to scatter-points:
vMyLabels = paste("a", 1:nrow(mData), sep = "") # some arbitrary labels

pp1 + geom_text(aes(x = norm1, y = norm2, label = vMyLabels), size = 2, color = "purple",
    hjust = 1, vjust = -0.5)
# add labels only to some points (e.g. only first 4):
vMyLabels[5:length(vMyLabels)] = "" # can also set to NA
pp1 + geom_text(aes(x = norm1, y = norm2, label = vMyLabels), size = 2, color = "purple",
    hjust = 1, vjust = -0.5)
# Add lines to plot:
# line with specified intercept and slope (e.g. could add 45-degree line):
pp1 + geom_abline(intercept = 3, slope = 2, size = 1, color = "black", linetype = "dotted")
# vertical line:
pp1 + geom_vline(xintercept = 2, size = 1, color = "black", linetype = "dotted")
# regression line:
pp1 + geom_smooth(aes(x = norm1, y = norm2), method = lm, se = FALSE, size = 1, color = "black",
    linetype = "dotted")
# regression line with 90% confidence bands:
pp1 + geom_smooth(aes(x = norm1, y = norm2), method = lm, se = TRUE, level = 0.9,
    size = 1, color = "black", linetype = "dotted")
# (note: single linear regression with homoskedasticity...)
# Taken together, nice scatter plot:
ggplot(data = mData) + geom_point(aes(x = norm1, y = norm2), size = 2, shape = 21,
color = "blue", fill = "blue") + geom_smooth(aes(x = norm1, y = norm2), method = lm,
se = FALSE, size = 0.5, color = "darkblue", linetype = "solid") + labs(x = "first Normal",
y = "second Normal", title = "Scatter of Bivariate Normal Draws") + plotOptions
Plot With Multiple Lines:
# Rather than supplying several vectors, it is easier to supply a dataframe
# with all the lines one wishes to plot:
mData = data.frame(xvar = vx, norm = vNormPDF, chi2 = vChi2PDF, t = vtPDF)
# Note: the names of the different variables will be the line-labels. They are
# a bit more difficult to change than the labels of the x- and y-axes.
# ------------------------------------------------------------------------- #
# TWO LINES
# ------------------------------------------------------------------------- #
ggplot(mData, aes(x = xvar)) + geom_line(aes(y = norm)) + geom_line(aes(y = t))
# we should distinguish the different lines by color or shape or linetype; e.g.
ggplot(mData, aes(x = xvar)) + geom_line(aes(y = norm), color = "blue") + geom_line(aes(y = t),
color = "red")
# ------------------------------------------------------------------------- #
# MANY (>=3) LINES
# ------------------------------------------------------------------------- #
# In this case, it's best to first put the dataframe into long format, (as
# there could be many lines, and adding each one separately like above would be
# cumbersome)
library(reshape2)
mDat = melt(mData, id.vars = "xvar")
# the different variables are now distinguished in the column 'variable', and
# their values are shown in the column 'value'
# Plot the three lines, distinguishing them by color:
ggplot(mDat, aes(x = xvar, y = value, group = variable)) + geom_line(aes(color = variable),
size = 0.9) # could specify linetype (for all lines)
# distinguish by linetype:
ggplot(mDat, aes(x = xvar, y = value, group = variable)) + geom_line(aes(linetype = variable),
size = 0.9) # could specify color (for all lines)
# distinguish by both:
ggplot(mDat, aes(x = xvar, y = value, group = variable)) + geom_line(aes(color = variable,
linetype = variable))
# distinguish by line-color, point-color and point-shapes:
ggplot(mDat, aes(x = xvar, y = value, group = variable)) + geom_line(aes(color = variable),
size = 0.5) + geom_point(aes(color = variable, shape = variable), size = 1)
# Of course, we can also just add the same markers for all lines:
ggplot(mDat, aes(x = xvar, y = value, group = variable)) + geom_line(aes(color = variable),
size = 0.9) + geom_point(aes(color = variable), shape = 1)
# Set linetypes manually:
ppL = ggplot(mDat, aes(x = xvar, y = value, group = variable)) + geom_line(aes(linetype = variable),
    size = 0.9) + plotOptions
ppL + scale_linetype_manual(values = c("solid", "dashed", "dotdash"))
# see http://sape.inf.usi.ch/quick-reference/ggplot2/linetype for linetypes
# available. note: can specify linetype also as number (1 is solid, 2 is dashed,
# etc.)
# Note: the provided vector may be longer than the number of lines you have.
# That's why you can create one long vector with your preferred linetypes and
# then re-use it, even across plots with differing numbers of lines.
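A minimal sketch of this re-use (assumes the plot object 'ppL' from above; the vector name is made up):

```r
# one long vector of preferred linetypes, defined once:
vMyLinetypes = c("solid", "dashed", "dotdash", "dotted", "longdash", "twodash")
# only as many linetypes as there are lines (here: 3) are actually used:
ppL + scale_linetype_manual(values = vMyLinetypes)
```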
# When setting linetypes manually, can specify/change line labels:
ppL + scale_linetype_manual(values = c("solid", "dashed", "dotdash"), labels = c("mycurve1",
    "mycurve2", "mycurve3"))
# Set colors manually:
ppC = ggplot(mDat, aes(x = xvar, y = value, group = variable)) + geom_line(aes(color = variable),
    size = 0.9) + plotOptions
ppC + scale_color_manual(values = c("blue", "red", "purple"))

# again can specify/change line labels:
ppC + scale_color_manual(values = c("blue", "red", "purple"), labels = c("mycurve1",
    "mycurve2", "mycurve3"))
# more choice of colors is given by package 'colorspace':
library(colorspace)
# take three colors (evenly spaced) from palette 'Blues 2':
vMyColors = sequential_hcl(3, palette = "Blues 2")
ppC + scale_color_manual(values = vMyColors)
# to avoid a too-light first color, construct a vector with more colors, and then
# take the few ones you need from the beginning and middle of the colorspace:
vMyColors = sequential_hcl(7, palette = "Blues 2")[c(1, 3, 5)]
ppC + scale_color_manual(values = vMyColors)
# Similarly, we can specify the colors of the markers as well as lines:
ppCM = ggplot(mDat, aes(x = xvar, y = value, group = variable)) + geom_line(aes(color = variable),
    size = 0.5) + geom_point(aes(color = variable), size = 1) + plotOptions
ppCM + scale_color_manual(values = vMyColors)

# Set point-shapes manually:
ppM = ggplot(mDat, aes(x = xvar, y = value, group = variable)) + geom_line(aes(color = variable),
    size = 0.5) + geom_point(aes(color = variable, shape = variable), size = 1) +
    plotOptions
ppM + scale_shape_manual(values = c(7, 9, 24))
ppM + scale_shape_manual(values = c(7, 9, 24), labels = c("mycurve1", "mycurve2",
    "mycurve3"))
# see http://sape.inf.usi.ch/quick-reference/ggplot2/shape for shapes available
Legend options:
# Take plot for illustration:
ppC = ppC + scale_color_manual(values = vMyColors)

# Remove legend title:
ppC + theme(legend.title = element_blank())

# Remove legend altogether:
ppC + theme(legend.position = "none")

# Change position and direction: (by default, legend is outside the plot
# (middle-right) and vertical)
# put to middle-left:
ppC + theme(legend.position = "left")
# put to bottom-right:
ppC + theme(legend.justification = "bottom")
# make horizontal and put to bottom of plot:
ppC + theme(legend.direction = "horizontal", legend.position = "bottom", legend.justification = "left")
# put inside the plot:
ppC + theme(legend.position = c(0.8, 0.75))
# (top right is (1,1), top left is (0,1))

# In some cases, it's hard to put the legend inside the plot without
# crossing the lines/markers. e.g.
ppC + theme(legend.position = c(0.5, 0.5))
# Make at least the legend background transparent:
ppC + theme(legend.position = c(0.5, 0.5), legend.background = element_rect(fill = alpha("white", 0)))
# Taken together, nice multiple-line plot:
ggplot(mDat, aes(x = xvar, y = value, group = variable)) + geom_line(aes(color = variable),
size = 0.9) + plotOptions + scale_color_manual(values = vMyColors) + theme(legend.title = element_blank(),
legend.position = c(0.8, 0.75)) + labs(x = "", y = "", title = "Three Important Distributions")
Scatter-plot with multiple variables:
# Analogous to above, just with geom_point instead of geom_line option.
# Create dataframe with variables (vectors) to plot:
mData = as.data.frame(mNormDraws)
colnames(mData) = c("norm1", "norm2", "norm3")
# Melt to long format:
mDat = melt(mData, id.vars = "norm1")
# Note: variable taken here as id.var will be on x-axis. In a scatter plot,
# this could in principle be any variable. (In contrast, under the line plot,
# it was clear what the x-axis is.)
# Plot, distinguishing variables by markers:
ggplot(mDat, aes(x = norm1, y = value, group = variable)) + geom_point(aes(shape = variable),
size = 2, color = "blue")
# Plot, distinguishing variables by colors:
ggplot(mDat, aes(x = norm1, y = value, group = variable)) + geom_point(aes(color = variable),
size = 2)
# Plot, distinguishing variables by both:
ggplot(mDat, aes(x = norm1, y = value, group = variable)) + geom_point(aes(color = variable,
shape = variable), size = 2)
# Add options:
ppMS = ggplot(mDat, aes(x = norm1, y = value, group = variable)) + geom_point(aes(color = variable,
    shape = variable), size = 2) + plotOptions
# Specify marker-shapes and colors manually:
ppMS + scale_color_manual(values = vMyColors) + scale_shape_manual(values = c(1, 4))
# Also re-label variables:
ppMS + scale_color_manual(values = vMyColors, labels = c("1st Normal", "2nd Normal")) +
    scale_shape_manual(values = c(1, 4), labels = c("1st Normal", "2nd Normal"))
# (note: just doing it in one of the two options will produce two legends)
# Note: when distinguishing by color, by default, R will choose a round marker
# that is filled (as above). e.g.
ggplot(mDat, aes(x = norm1, y = value, group = variable)) + geom_point(aes(color = variable),
size = 2)
# When you change the markers manually, like here, then they will not be
# filled. To add filling, must let filling also change by variable. e.g. this
# produces unfilled triangles:
ggplot(mDat, aes(x = norm1, y = value, group = variable)) + geom_point(aes(color = variable),
size = 2, shape = 2)
# to get filled triangles:
ggplot(mDat, aes(x = norm1, y = value, group = variable)) + geom_point(aes(color = variable,
fill = variable), size = 2, shape = 24)
# Nice, final scatter-plot with multiple variables:
ggplot(mDat, aes(x = norm1, y = value, group = variable)) + geom_point(aes(color = variable),
size = 2) + scale_color_manual(values = c("blue", "red"), labels = c("2nd Normal",
"3rd Normal")) + plotOptions + scale_y_continuous(limits = c(0, 12), breaks = c(0,
4, 8, 12)) + theme(legend.title = element_blank(), legend.position = c(0.8, 0.15)) +
labs(x = "1st Normal", y = "", title = "Multivariate Normal Draws (k=3)")
“Grouped” Scatter-Plot:
# Above, we plotted three variables in a single scatter plot, i.e. we plotted
# two variables against a third.
# In the same way, we could have plotted, say, 5 variables by plotting 4
# variables against a fifth one, yielding 4 'clouds' of points.
# Sometimes, we indeed want to plot two variables, but distinguish the points
# in the scatter by a third variable. The number of 'clouds' of points is then
# determined by how many values this third variable has.
# ------------------------------------------------------------------------- #
# DISCRETE CASE
# ------------------------------------------------------------------------- #
# Plot hp against mpg, with different colors for different numbers of cylinders:
ggplot(mtcars, aes(x = mpg, y = hp, group = cyl)) + geom_point(aes(color = cyl))
# Note that R thinks of cyl as a continuous variable, even though it takes only
# three values. To fix that, code it as type 'character' rather than
# 'numeric':
mtcars2 = mtcars # create new dataframe..
mtcars2$cyl = as.character(mtcars2$cyl) #.. in which cyl is of type character
ppMS = ggplot(mtcars2, aes(x = mpg, y = hp, group = cyl)) + geom_point(aes(color = cyl))

# We can change the colors as above; e.g.
ppMS + scale_color_manual(values = vMyColors, labels = c("4 cyl.", "6 cyl.", "8 cyl."))
# Also, could of course distinguish points for different numbers of cylinders
# by markers rather than colors (or both).
# In this case, cyl really needs to be of type character, because, in contrast
# to colors, markers cannot vary continuously.
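An alternative sketch: code cyl as a factor rather than a character, which additionally lets you fix the order of the legend entries via 'levels' (the dataframe name 'mtcars3' is made up):

```r
mtcars3 = mtcars
mtcars3$cyl = factor(mtcars3$cyl, levels = c(4, 6, 8))  # discrete, ordered as specified
ggplot(mtcars3, aes(x = mpg, y = hp, group = cyl)) + geom_point(aes(shape = cyl, color = cyl))
```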
# ------------------------------------------------------------------------- #
# CONTINUOUS CASE
# ------------------------------------------------------------------------- #
# Plot hp against mpg, and distinguish points by color based on value of qsec:
ppMS = ggplot(mtcars, aes(x = mpg, y = hp, group = qsec)) + geom_point(aes(color = qsec))

# To specify colors, need to use scale_color_gradient rather than
# scale_color_manual:
ppMS + scale_color_gradient(low = "white", high = "blue")

# The analogous holds for the filling (if marker-shapes are specified that can
# be filled):
ppMS = ggplot(mtcars, aes(x = mpg, y = hp, group = qsec)) + geom_point(aes(color = qsec,
    fill = qsec), shape = 21) + scale_color_gradient(low = "white", high = "blue") +
    scale_fill_gradient(low = "white", high = "blue")
# Nice, final grouped scatter-plot:
ggplot(mtcars, aes(x = mpg, y = hp, group = qsec)) + geom_point(aes(color = qsec,
fill = qsec), size = 3, shape = 21) + scale_color_gradient(low = "white", high = "blue") +
scale_fill_gradient(low = "white", high = "blue") + plotOptions + scale_y_continuous(limits = c(0,
300), breaks = seq(0, 300, by = 100)) + theme(legend.position = "none") + labs(title = "A Nice Grouped-Scatter Plot")
Histogram:
# ------------------------------------------------------------------------- #
# SINGLE HISTOGRAM
# ------------------------------------------------------------------------- #
# Based on single vector:
ggplot() + geom_histogram(aes(x = mNormDraws[, 2]))
# Based on column in dataframe:
ggplot(data = mData) + geom_histogram(aes(x = norm2))
# note that here we only need to specify the x-variable in aes()
# Change y-axis from frequency to density:
ggplot() + geom_histogram(aes(x = mNormDraws[, 2], y = ..density..))
ggplot(data = mData) + geom_histogram(aes(x = norm2, y = ..density..))
# Specify number of bins directly:
ggplot(data = mData) + geom_histogram(aes(x = norm2, y = ..density..), bins = 10)
# Specify number of bins with binwidth:
ggplot(data = mData) + geom_histogram(aes(x = norm2, y = ..density..), binwidth = 0.5)
# Change color and filling:
ggplot(data = mData) + geom_histogram(aes(x = norm2, y = ..density..), color = "black",
fill = "grey", alpha = 0.2)
# Add line: (here: add actual pdf of N(5,1))
vThisNormPDF = dnorm(vx, 5, 1)
ggplot() + geom_histogram(aes(x = mNormDraws[, 2], y = ..density..)) + geom_line(aes(x = vx,
    y = vThisNormPDF))
# Could also do this with a dataframe, but then we need the same number of
# draws as 'elements' in the vector that represents the line (here: function
# evaluations)
# ------------------------------------------------------------------------- #
# MULTIPLE HISTOGRAMS IN SINGLE PLOT
# ------------------------------------------------------------------------- #
ggplot(data = mData) + geom_histogram(aes(x = norm2, y = ..density..), color = "blue",
fill = "blue", alpha = 0.2) + geom_histogram(aes(x = norm3, y = ..density..),
color = "red", fill = "red", alpha = 0.2)
# A separate histogram of one variable for each value of some third variable:
# e.g. a separate histogram of hp for each value of cyl, each with different
# color:
ggplot(mtcars2, aes(x = hp, fill = cyl)) + geom_histogram(aes(x = hp, y = ..density..,
color = cyl), position = "identity")
# Should make bars a bit transparent:
ggplot(mtcars2, aes(x = hp, fill = cyl)) + geom_histogram(aes(x = hp, y = ..density..,
color = cyl), position = "identity", alpha = 0.6)
# Taken together, nice histogram plots:
vThisNormPDF = dnorm(vx, 5, 1)
ggplot() + plotOptions + geom_histogram(aes(x = mNormDraws[, 2], y = ..density..),
color = "blue", fill = "blue", alpha = 0.2) + geom_line(aes(x = vx, y = vThisNormPDF),
color = "purple") + labs(x = "", y = "", title = "Histogram of N(5,1) draws") +
theme(axis.ticks.y = element_blank(), axis.text.y = element_blank())
ggplot(mtcars2, aes(x = hp, fill = cyl)) + plotOptions + geom_histogram(aes(x = hp,
y = ..density.., color = cyl), position = "identity", alpha = 0.6) + theme(legend.title = element_blank(),
legend.position = c(0.8, 0.75)) + labs(x = "", y = "", title = "Histogram of hp by cyl.") +
theme(axis.ticks.y = element_blank(), axis.text.y = element_blank())
Bar-Plot:
# Compute mean of Normal draws for each column of mNormDraws (dimension of
# multiv. Normal):
mDataBP = data.frame(norms = c(1, 2, 3), means = apply(mNormDraws, 2, mean))
# Illustrate:
ggplot(data = mDataBP, aes(x = norms, y = means)) + geom_bar(stat = "identity")
# Adjust appearance of bars:
ggplot(data = mDataBP, aes(x = norms, y = means)) + geom_bar(stat = "identity", color = "blue",
fill = "blue", alpha = 0.2)
# To change x-axis, it is best to first make R realize that it is a discrete
# rather than continuous variable:
mDataBP$norms = as.character(mDataBP$norms)
ggplot(data = mDataBP, aes(x = norms, y = means)) + geom_bar(stat = "identity", color = "blue",
fill = "blue", alpha = 0.2) + scale_x_discrete(labels = c("1st", "2nd", "3rd"))
# Taken together, nice bar-plot:
ggplot(data = mDataBP, aes(x = norms, y = means)) + plotOptions + geom_bar(stat = "identity",
color = "blue", fill = "blue", alpha = 0.2) + scale_x_discrete(labels = c("1st",
"2nd", "3rd")) + labs(x = "", y = "", title = "Mean of Normal Draws")
Combine several plots as panels in single figure:
# Note: when producing a latex document with these plots, it is a better idea
# to produce the plots individually and assemble them in latex. (more
# flexibility)
# Three plots for illustration:
plotOptionsCommonHere = list(plotOptions, ylim(c(0, 6)), scale_x_continuous(limits = c(-2,
    10), breaks = seq(-2, 10, by = 2)))
pp1 = ggplot(data = mData) + plotOptionsCommonHere + geom_histogram(aes(x = norm1),
    color = "blue", fill = "blue", alpha = 0.2) + labs(x = "", y = "", title = "norm1")
pp2 = ggplot(data = mData) + plotOptionsCommonHere + geom_histogram(aes(x = norm2),
    color = "blue", fill = "blue", alpha = 0.2) + labs(x = "", y = "", title = "norm2")
pp3 = ggplot(data = mData) + plotOptionsCommonHere + geom_histogram(aes(x = norm3),
    color = "blue", fill = "blue", alpha = 0.2) + labs(x = "", y = "", title = "norm3")
# It might be useful to create an empty plot: (e.g. when creating a 3x3 figure
# but having only 8 plots)
ggplot() + theme_void()
# ------------------------------------------------------------------------- #
# FIRST APPROACH
# ------------------------------------------------------------------------- #
library(gridExtra)
library(ggpubr)
# (need both packages)
# Create 1 x 2 plot:
grid.arrange(pp1, pp2, ncol = 2)
# To save:
ggsave("myplot1.pdf", arrangeGrob(pp1, pp2, ncol = 2))
# Add title:
title = text_grob("Draws from two Normals", size = 14, face = "bold", vjust = 5)
grid.arrange(pp1, pp2, ncol = 2, top = title)
# Can specify widths:
grid.arrange(pp1, pp2, ncol = 2, top = title, widths = c(2, 2))
# Can also specify ordering of plots (useful when many more than just 2 plots
# are added):
grid.arrange(pp1, pp2, ncol = 2, top = title, widths = c(2, 2), layout_matrix = rbind(c(1,
2)))
# All of these options need to be specified also in 'arrangeGrob()' when saving
# the plot.
# Note: it would make sense to delete y-axis ticks on the right plot as they
# are common to both and already shown in the left plot.
# ------------------------------------------------------------------------- #
# SECOND APPROACH
# ------------------------------------------------------------------------- #
library(patchwork)
# Create 1 x 2 plot as above:
(pp1 | pp2)
# Here we have the advantage that we can save the resulting plot and call it
# later:
ppTotal = (pp1 | pp2)
ppTotal
# Add title:
ppTotal + plot_annotation(title = "Draws from two Normals")
# With this approach, we do not need a special command for saving the plots,
# but can use 'ggsave()'. However, one needs to create a large plot, because
# with many plots, the default window size is usually too small:
ggsave("myplot2.pdf", width = 40, height = 25, units = "cm")
# Note: changing the width and height will also change the spacing between
# figures.
# 2 x 3 plots:
(pp1 | pp1 | pp1)/(pp1 | pp1 | pp1)
# Can have even more flexible plot-arrangement:
= "
layout AABB
AABB
EEEE
EEEE
"
+ pp1 + pp1 + plot_layout(design = layout)
pp1 ggsave("myplot2.pdf", width = 40, height = 25, units = "cm")
# For plots with legends: If all plots have the same legend, you can collect
# them all into one single legend using the following commands:
ppTotal + plot_layout(guides = "collect") & theme(legend.position = "bottom")
There are many more things that ggplot2 can do. For example, to create a grid-plot, use “geom_raster()” or “geom_rect()”; to create an area-plot, use “geom_area()”. A google search usually quickly shows how to use these commands.
A final, important note: when iteratively adding layers to a ggplot in a loop (e.g. adding one line at a time to an existing plot), you need to use double quotation marks. Otherwise, ggplot creates the plot only once the loop is finished, which may lead to unexpected behavior. See this post: https://stackoverflow.com/questions/52671480/adding-multiple-ribbons-in-ggplot.
If you have a big project, it makes sense to break up your code into several R-scripts. From one script, you can call another using the following command:
source("mySeparateScript.R") # load/run external script
You can also store all objects (the whole current workspace) to an .RData-file, which you can load in other scripts:
# Save all objects from the current workspace to an '.RData'-file:
save.image(file = "myWorkspace.RData")
# Load the workspace from the file:
load("myWorkspace.RData")
To keep the workspace tidy/lean even in a longer code, one can use a list to store a collection of objects in a single object. For example, if a plot is created in each iteration of a loop, we can save all these plots into one list using the following commands:
lMyPlots = list() # create empty list before loop
for (ii in 1:10) {
    # create plot for iteration ii, something like this comes out: (as a matter
    # of fact, typically, a plot is saved as a list (containing its
    # attributes); see below)
    plotHere = list(1, "a", "b")
    # save plot to 'lMyPlots':
    lMyPlots = append(lMyPlots, list(plotHere))
    # and assign a name to this plot:
    nameForThisPlot = paste("plot", ii, sep = "_")
    if (ii == 1) {
        # at the first iteration, lMyPlots is empty, so use this:
        names(lMyPlots) = c(names(lMyPlots), nameForThisPlot)
    } else {
        # later on, we can just use this:
        names(lMyPlots)[ii] = nameForThisPlot
    }
}
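As a side note, the if/else for the names can be avoided: assigning into a list by name appends the element and sets its name in one step. A minimal sketch, using the same placeholder 'plot' as above:

```r
lMyPlots = list()  # create empty list before loop

for (ii in 1:10) {
    # placeholder for the object created in iteration ii:
    plotHere = list(1, "a", "b")
    # assigning by name appends the element and names it at the same time:
    lMyPlots[[paste("plot", ii, sep = "_")]] = plotHere
}

names(lMyPlots)  # "plot_1", "plot_2", ..., "plot_10"
```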
The same code can be used to store different regressions into a single list. See the section “Econometrics/Linear Regressions” for how to run regressions in a loop.
Saving a collection of different objects into a single list also enables us to delete a bunch of objects at once. This way, at the end of some longer section of code in our script, we can delete all objects that aren’t necessary for subsequent computations:
a <- 4
b <- 3
# remove all objects but these two:
rm(list = ls()[!(ls() %in% c("lMyList", "lMyPlots"))])
# If you want to iteratively (e.g. after each section of your code) delete a
# bunch of objects, use this code:
a <- 4
b <- 3
# Define a vector containing the names of the objects you want to keep:
= c("vToKeep", "lMyList", "a")
vToKeep # Delete all objects but these:
rm(list = ls()[!(ls() %in% vToKeep)])
# At a later point, once you created some new objects ...:
b <- 3
c <- 5
# ... add the newly created objects that you want to keep to vToKeep ...:
vToKeep = c(vToKeep, "b")
# ... and delete all the rest:
rm(list = ls()[!(ls() %in% vToKeep)])
# (note that vToKeep contains the name of 'vToKeep' itself, otherwise this
# vector would be deleted too)
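The same filtering of 'ls()' can be written a bit more compactly with 'setdiff()'; this sketch is equivalent to the '%in%'-based version above:

```r
a <- 4
b <- 3
vToKeep = c("vToKeep", "a")

# 'setdiff(ls(), vToKeep)' returns exactly the object names that are NOT in
# vToKeep, i.e. the ones to delete:
rm(list = setdiff(ls(), vToKeep))
```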
Sometimes, it is inevitable that some code returns an error under some conditions, but we still want the code to continue running; i.e. we attempt one thing, and if it doesn’t work, we try another. For such code-chunks, we can use “tryCatch”:
# Try evaluating 'fLL()' (log-likelihood of some model) at vTheta, and return
# -99 if it doesn't work:
tryCatch(fLL(vTheta), error = function(e) {
-99
})
# Try computing log-determinant of mA, and return NA if it doesn't work:
tryCatch(log(det(mA)), error = function(e) {
NA
})
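'tryCatch' can also handle warnings separately from errors, and run clean-up code via 'finally' in either case. A small self-contained sketch (the function name is made up for illustration):

```r
# Return NA on both warnings and errors; 'finally' runs in every case:
fSafeLog = function(x) {
    tryCatch(log(x),
        warning = function(w) NA,  # e.g. log() of a negative number warns
        error = function(e) NA,    # e.g. log() of a character string errors
        finally = message("log attempted"))
}

fSafeLog(1)    # 0
fSafeLog(-1)   # NA (log(-1) is NaN with a warning)
fSafeLog("a")  # NA (log("a") throws an error)
```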
Sometimes, we want to record the time that R needs to execute a certain chunk of code:
# Record the time at a given point: (typically before the code starts that you
# want to time)
timeStart = Sys.time()
# ... do operations ...
# Compute the time elapsed since 'timeStart':
timeTotal = Sys.time() - timeStart
# To record the time needed to execute a single command/function: (here we
# record the time needed to draw 100 times from N(0,1))
system.time(rnorm(100))
# Let R 'sleep' for 4 seconds, i.e. do nothing:
Sys.sleep(4)
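Note that the difference of two 'Sys.time()' values is a 'difftime' object whose unit (seconds, minutes, ...) is chosen automatically. To get a plain number in a fixed unit, combine 'difftime()' with 'as.numeric()':

```r
timeStart = Sys.time()
Sys.sleep(1)  # stand-in for some longer computation
timeEnd = Sys.time()

# force the unit to seconds and convert to a plain number:
timeSecs = as.numeric(difftime(timeEnd, timeStart, units = "secs"))
timeSecs  # slightly above 1
```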
We can use linear interpolation to impute missing values (NAs) in a vector:
library(zoo)
vx = c(2, NA, 3, 2, 5)
na.approx(vx) # imputes NA values in vx by linear interpolation
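If you would rather avoid the extra package, interior NAs can also be imputed with base R's 'approx()' by interpolating the non-missing values over their positions:

```r
vx = c(2, NA, 3, 2, 5)
vIdx = seq_along(vx)

# interpolate the non-missing values over their positions, and evaluate at all
# positions (interior NAs only; leading/trailing NAs would remain NA):
vxImputed = approx(vIdx[!is.na(vx)], vx[!is.na(vx)], xout = vIdx)$y
vxImputed  # 2.0 2.5 3.0 2.0 5.0
```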
We can also use linear interpolation to evaluate a function at some intermediate value for its argument:
vx = c(1, 2, 3)
vy = c(1, 4, 9)
# We can interpret this as a function y=f(x), for which we have 'observations'
# (1,1), (2,4), (3,9), and find f(2.5) by linear interpolation:
approx(vx, vy, xout = 2.5)
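'approxfun()' is a close relative of 'approx()': it returns the interpolating function itself, which is handy when you need to evaluate it at many points:

```r
vx = c(1, 2, 3)
vy = c(1, 4, 9)

# 'approxfun' returns a function that linearly interpolates between the points:
fInterp = approxfun(vx, vy)
fInterp(2.5)          # same result as approx(vx, vy, xout = 2.5)$y
fInterp(c(1.5, 2.5))  # can be evaluated at several points at once
```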
For optimization of a function w.r.t. a scalar argument, it is easiest to use “optimize”:
fFunToOpt = function(v) {
    (v - 3)^2
}
myOptimRes = optimize(fFunToOpt, interval = c(-5, 5))
myOptimRes$minimum # minimizer ('argmin')
myOptimRes$objective # function value at minimum ('min')
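'optimize' minimizes by default; to maximize instead, set 'maximum = TRUE'. A sketch with a function (my own example) whose maximum is at v = 3:

```r
fFunToMax = function(v) {
    -(v - 3)^2
}
myMaxRes = optimize(fFunToMax, interval = c(-5, 5), maximum = TRUE)
myMaxRes$maximum    # maximizer ('argmax'), close to 3
myMaxRes$objective  # function value at the maximum, close to 0
```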
For optimization of a (scalar-valued) function w.r.t. a vector-valued argument, one can use “optim”:
fFunToOpt = function(v) {
    (v[1] - 3)^2 + v[2]^3
}
vTheta0 = c(0, 0) # initial value
myOptimRes = optim(vTheta0, fFunToOpt, method = "BFGS")
myOptimRes$par # minimizer ('argmin')
myOptimRes$value # function value at minimum ('min')
# 'optim' supports several different methods, see ?optim
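If the gradient is available analytically, you can pass it to 'optim' via the 'gr' argument, which typically speeds up and stabilizes gradient-based methods like BFGS. A sketch with a simple quadratic function (my own example, chosen so that a unique minimum exists):

```r
fQuad = function(v) {
    (v[1] - 3)^2 + (v[2] + 1)^2
}
fQuadGrad = function(v) {
    c(2 * (v[1] - 3), 2 * (v[2] + 1))  # analytical gradient of fQuad
}

vTheta0 = c(0, 0)  # initial value
myOptimRes = optim(vTheta0, fQuad, gr = fQuadGrad, method = "BFGS")
myOptimRes$par          # minimizer, close to (3, -1)
myOptimRes$convergence  # 0 indicates successful convergence
```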
# Define a complex number:
b = 3 + 2i

Mod(b) # modulus of b
Re(b) # real part of b
# Note that complex literals like '2i' always carry a numeric prefix. It is
# nevertheless a good idea to avoid creating a variable 'i' yourself, as it is
# easily confused with the imaginary unit. Similarly, avoid overwriting 'pi',
# which is defined by default in R.
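A few more base-R operations on complex numbers. Note in particular that 'sqrt()' of a negative real number is NaN; to get the complex root, the input itself must be complex:

```r
b = 3 + 2i

Arg(b)   # argument (angle) of b
Conj(b)  # complex conjugate: 3-2i
Im(b)    # imaginary part: 2

sqrt(as.complex(-1))  # 0+1i; sqrt(-1) on a real input would give NaN
```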
In the course of this tutorial, we encountered many useful packages. Here are some more.
Package “imfr”: download data directly from IMF databases:
library(imfr)
# see all datasets:
imf_ids(TRUE)
# Have a look at database 'IFS':
a = imf_codelist(database_id = "IFS")
# See what indicators are available in it:
a1 = imf_codes(codelist = "CL_INDICATOR_IFS")
# Download specified indicators from this database for specified countries, a
# specified year range, at specified frequency:
m_RGDP_Q_raw = imf_data(database_id = "IFS", indicator = c("NGDP_R_SA_XDC", "ENDA_XDC_USD_RATE"),
    country = c("AT", "AU"), start = "2021", end = "2022", freq = "Q", print_url = TRUE)
Package “countrycode”: translate between different official country codes:
library(countrycode)
= c("AT", "AU")
vsC_cc2
# from iso2 to iso3:
countrycode(vsC_cc2, "iso2c", "iso3c")
## [1] "AUT" "AUS"
# from iso2 to country names (in different languages):
countrycode(vsC_cc2, "iso2c", "country.name")
## [1] "Austria" "Australia"
countrycode(vsC_cc2, "iso2c", "country.name.de")
## [1] "Österreich" "Australien"
countrycode(vsC_cc2, "iso2c", "country.name.fr")
## [1] "Autriche" "Australie"