This workshop is designed to prepare students for learning statistical analysis in \(\mathcal{R}\). The materials start with the very first step: installing \(\mathcal{R}\) and \(\mathcal{R}\)-Studio. After getting comfortable with the \(\mathcal{R}\) and \(\mathcal{R}\)-Studio interfaces, the students learn about different types of objects. One of the benefits of using \(\mathcal{R}\), a free and open source programming language, is access to the world of \(\mathcal{R}\) packages, which are developed by scholars and engineers to facilitate computational analysis, from advanced regression analysis and big-data management tools to data visualization and text analysis. There are even \(\mathcal{R}\) packages for qualitative analysis. Therefore, the next topic is installing and loading \(\mathcal{R}\) packages. We will continue with methods for reading and loading data. The first part concludes by covering the basics of data visualization, including an introduction to the \({\tt ggplot2}\) package.
In the second session, we will learn the basics of modern data management using \({\tt dplyr}\) and \({\tt tidyr}\). We will wrap up the workshop by learning how to prepare replicable research using R Markdown and GitHub.
\(\mathcal{R}\) is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the \(\mathcal{S}\) language and environment developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. \(\mathcal{R}\) can be considered as a different implementation of \(\mathcal{S}\). There are some important differences, but much code written for \(\mathcal{S}\) runs unaltered under \(\mathcal{R}\).
\(\mathcal{R}\) provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The \(\mathcal{S}\) language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of \(\mathcal{R}\)’s strengths is the ease with which well-designed publication quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. \(\mathcal{R}\)-Studio is a free and open source Integrated Development Environment (IDE) for \(\mathcal{R}\), a programming language for statistical computing and graphics.
\(\mathcal{R}\)-Studio is available in two editions: \(\mathcal{R}\)-Studio Desktop, where the program is run locally as a regular desktop application; and \(\mathcal{R}\)-Studio Server, which allows accessing \(\mathcal{R}\)-Studio using a web browser while it is running on a remote Linux server. Prepackaged distributions of \(\mathcal{R}\)-Studio Desktop are available for Microsoft Windows, Mac OS X, and Linux.
\(\mathcal{R}\)-Studio is written in the C++ programming language and uses the Qt framework for its graphical user interface. Work on \(\mathcal{R}\)-Studio started around December 2010, and the first public beta version (v0.92) was officially announced in February 2011.
For downloading \(\mathcal{R}\), go to https://packages.othr.de/cran/, choose "install R for the first time", and use the download link for your operating system (Linux, macOS, or Windows).
For downloading \(\mathcal{R}\)-Studio, go to http://www.rstudio.com/products/RStudio, where both the Desktop (recommended) and Server versions are available. Now, install \(\mathcal{R}\) and \(\mathcal{R}\)-Studio on your laptops! Please go to the links above and follow the instructions for downloading and installing \(\mathcal{R}\) and \(\mathcal{R}\)-Studio.
Now open the \(\mathcal{R}\) software! After \(\mathcal{R}\) is started, there is a console waiting for input. You can enter commands one at a time at the command prompt (>) or run a set of commands from a source file. Type the following simple calculations and commands, and see what the results are:
print('Hello, world')
2+2
print(2+2)
print('2+2')
Now, type the command below:
print(Hello, world!)
Did you get any error message? Why does that happen?
Now, on the File tab, choose New script, and redo what you did in the Console window. In the script window, to run commands you can either use Ctrl + R (Cmd + R on a Mac) or right-click and choose Run line or selection.
Can you tell the difference between running a command in the \(\mathcal{R}\) console and in the \(\mathcal{R}\)-Editor (\(\mathcal{R}\) Script)?
Now open the \(\mathcal{R}\)-Studio software and repeat what you did in \(\mathcal{R}\). (You can choose to use either \(\mathcal{R}\)-Studio or \(\mathcal{R}\) for the remaining part of this workshop based on your preferences. But, for now, we want to get familiar with the interface of both! To benefit from the advantages of embedding \(\mathcal{R}\) in documents such as HTML, it is preferred to use \(\mathcal{R}\)-Studio.)
Results of calculations can be stored in objects using the assignment operators:
An arrow (<-) formed by a less-than character and a hyphen, without a space between them!
The equal character (=).
These objects can then be used in other calculations. To print an object, just enter its name. There are some restrictions when giving an object a name; for instance, names are case-sensitive, so x and X below are two different objects:
x=1
print(x)
x<-1
print(x)
x<-2
print(x)
X=2
x+X
xX=x+X
print(xX)
y='a'
print(y)
w="a"
print(w)
z<-'Which one do you prefer? R or R-Studio'
print(z)
How can you drop/delete/remove an object? Try the \({\tt rm()}\) command!
rm(x)
print(x)
rm(z)
print(z)
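The error from printing a removed object can be anticipated by checking whether the object still exists. The following is a minimal sketch; the object name `tmp_obj` is hypothetical and used only for illustration, while `exists()` and `rm()` are base \(\mathcal{R}\) functions.

```r
tmp_obj <- 5                  # hypothetical object, created only for this illustration
before <- exists("tmp_obj")   # TRUE: the object is in the workspace
rm(tmp_obj)                   # remove it
after <- exists("tmp_obj")    # FALSE: the object is gone
print(c(before, after))
```

This is why printing an object after removing it returns an "object not found" error.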
Now, create two small vectors with data. The following applies the function \({\tt c()}\) to combine three values into a vector (all numeric in V1; two numbers and a character in V2).
V1=c(1,2,3)
V2=c(1,2,'a')
print(V1)
print(V2)
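Notice that the numbers in V2 are printed in quotation marks. A short sketch of why, assuming the same two vectors as above: a vector can hold only one type, so mixing numbers with a character converts everything to character.

```r
V1 <- c(1, 2, 3)
V2 <- c(1, 2, "a")
class_V1 <- class(V1)   # "numeric"
class_V2 <- class(V2)   # "character": the numbers were converted to text
print(class_V1)
print(class_V2)
```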
Then, make one vector which includes \(1,2,3,4,\) and \(5\). Store this vector, and name it \(x1\).
c(1,2,3,4,5)
x1<-c(1,2,3,4,5)
Now use the colon operator to create the vector of integers from 6 to 10, \({\tt x2<-c(6:10)}\), and name it \(x2\).
x2<-c(6:10)
print(x1)
print(x2)
All text after the pound sign “#” within the same line is considered a comment.
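A small illustration of both uses of a comment, on its own line and at the end of a line of code. The object name `total` is hypothetical.

```r
# This whole line is a comment and is ignored by R
total <- 2 + 2  # everything after the pound sign is ignored; total is 4
print(total)
```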
\(\mathcal{R}\) provides extensive documentation. For example, entering \({\tt ?c}\) or \({\tt help(c)}\) at the prompt gives the documentation of the function \({\tt c}\) in \(\mathcal{R}\). Give it a try!
?c
help(c)
You learned about \(\mathcal{R}\)’s official help command. However, I personally hardly use it. What do I do? I use the best search tool available: Google! This workshop, and this course in general, gives you the first push and the initial momentum for learning \(\mathcal{R}\). I help you get past the steep part of its learning curve. I am teaching this course to help you figure out problems and find answers to your questions. Afterward, however, you should learn to search for and find answers online, and develop the ability to troubleshoot and debug your own code. Another benefit of \(\mathcal{R}\) is that many scholars use it, so it is very likely that what you want to do has already been asked and answered. \(\tt http://stackoverflow.com/\), for instance, is full of threads answering questions about coding issues in \(\mathcal{R}\).
Try: Let’s give it a try here. Generate a sequence of numbers between 1 and 100, with an increment of 5.
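One possible approach, among several that a search will turn up, uses the base \(\mathcal{R}\) function `seq()`. This is only a sketch; the object name `s` is hypothetical.

```r
# Start at 1, stop at or before 100, and move in steps of 5
s <- seq(from = 1, to = 100, by = 5)
print(s)
length(s)   # number of values generated
```

Because the steps start at 1, the last value is 96 rather than 100.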
Vectors can be combined via the function \({\tt c}\). For example, the following two vectors \(n\) and \(s\) are combined into a new vector containing elements from both vectors.
n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
c(n, s)
Arithmetic operations of vectors are performed member-by-member, i.e., member-wise.
For example, suppose we have two vectors a and b.
a = c(1, 3, 5, 7)
b = c(1, 2, 4, 8)
Then, if we multiply a by 5, we would get a vector with each of its members multiplied by 5.
5 * a
And if we add \(a\) and \(b\) together, the sum would be a vector whose members are the sum of the corresponding members from \(a\) and \(b\).
a + b
Similarly for subtraction, multiplication, and division, we get new vectors via member-wise operations.
a - b
a * b
a / b
Recycling Rule: If two vectors are of unequal length, the shorter one will be recycled in order to match the longer vector. For example, the following vectors \(u\) and \(v\) have different lengths, and their sum is computed by recycling values of the shorter vector \(u\).
u = c(10, 20, 30)
v = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
u + v
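To make the recycling visible, the sketch below repeats \(u\) by hand and compares the result with the automatic version. The vector `w` is hypothetical and shows what happens when the longer length is not a multiple of the shorter one.

```r
u <- c(10, 20, 30)
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

auto   <- u + v                  # u is recycled three times
manual <- rep(u, times = 3) + v  # the same thing written out explicitly
print(auto)                      # 11 22 33 14 25 36 17 28 39
identical(auto, manual)          # TRUE

# Hypothetical vector of length 4: 4 is not a multiple of 3,
# so R still recycles u but also prints a warning
w <- c(1, 2, 3, 4)
partial <- u + w
print(partial)                   # 11 22 33 14
```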
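\(\mathcal{R}\) has several operators to perform tasks. The first table below lists the arithmetic operators, and the second lists the comparison operators.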
Operator | Description |
---|---|
+ | Addition |
- | Subtraction |
\(*\) | Multiplication |
/ | Division |
^ | Exponent |
Operator | Description |
---|---|
< | Less than |
> | Greater than |
<= | Less than or equal to |
>= | Greater than or equal to |
== | Equal to |
!= | Not equal to |
Some examples:
x = 10
y<- 16
# The operators <- and = can be used, almost interchangeably, to assign to variable in the same environment.
x<y
x>y
x<= 12
y>=25
y==14
x==10
x!=10
We will need to use Boolean operations like \({\tt AND}\) and \({\tt OR}\) (see the note at the end):
Operator | Description |
---|---|
! | Logical NOT |
& | Logical AND |
| | Logical OR |
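A few examples of these operators, reusing the values \(x = 10\) and \(y = 16\) from the comparison examples. The result names (`and_res`, `or_res`, `not_res`, `vec_res`) are hypothetical.

```r
x <- 10
y <- 16

and_res <- (x < y) & (y > 20)   # TRUE and FALSE gives FALSE
or_res  <- (x < y) | (y > 20)   # TRUE or FALSE gives TRUE
not_res <- !(x == 10)           # NOT TRUE gives FALSE

# The operators also work member by member on vectors
vec_res <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)   # TRUE FALSE FALSE

print(c(and_res, or_res, not_res))
print(vec_res)
```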
There are various ways to construct a matrix. When we construct a matrix directly with data elements, the matrix content is filled along the column orientation by default. For example, in the following code snippet, the content of \(B\) is filled along the columns consecutively.
B = matrix(
c(2, 4, 3, 1, 5, 7),
nrow=3,
ncol=2)
B # B has 3 rows and 2 columns
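If you would rather fill the matrix row by row, `matrix()` has a `byrow` argument. The sketch below builds the matrix both ways from the same data; the names `B_bycol` and `B_byrow` are hypothetical.

```r
values <- c(2, 4, 3, 1, 5, 7)

# Default: fill down the first column, then the second
B_bycol <- matrix(values, nrow = 3, ncol = 2)
# byrow = TRUE: fill across the first row, then the second, then the third
B_byrow <- matrix(values, nrow = 3, ncol = 2, byrow = TRUE)

B_bycol   # first column is 2, 4, 3
B_byrow   # first row is 2, 4
```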
Transpose: We construct the transpose of a matrix by interchanging its columns and rows with the function \({\tt t()}\).
t(B) # transpose of B
Bt<-t(B) # store the transpose under a new name, so that B keeps its 3 rows for the examples below
Combining Matrices: The columns of two matrices having the same number of rows can be combined into a larger matrix. For example, suppose we have another matrix \(C\) also with 3 rows.
C = matrix(
c(7, 4, 2),
nrow=3,
ncol=1)
C # C has 3 rows
Then we can combine the columns of \(B\) and \(C\) with \({\tt cbind()}\).
cbind(B, C)
Similarly, we can combine the rows of two matrices if they have the same number of columns with the \({\tt rbind()}\) function.
D = matrix(
c(6, 2),
nrow=1,
ncol=2)
D # D has 2 columns
rbind(B, D)
In addition to numerical values, as we will discuss more in class, variables can be categorical. For example, we can categorize regime types into three groups: democracy, anocracy, and autocracy. We can save these types of variables into a vector as well:
RegimeType=c("democracy", "anocracy", "autocracy")
You can check the type of a variable using the \({\tt typeof()}\) command:
typeof(RegimeType)
## [1] "character"
These types of variables are modeled in regression analysis as factor/dummy variables. We need to tell \(\mathcal{R}\) to treat these variables as factor variables:
as.factor(RegimeType)
## [1] democracy anocracy autocracy
## Levels: anocracy autocracy democracy
# To save the new format of the variable/object:
RegimeType=as.factor(RegimeType)
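Once a variable is a factor, a few base \(\mathcal{R}\) functions help to inspect it. The sketch below uses a hypothetical vector `regime` with one repeated value; it is not part of the workshop data.

```r
# Hypothetical regime types for four countries
regime <- factor(c("democracy", "anocracy", "autocracy", "democracy"))

lv     <- levels(regime)       # categories, sorted alphabetically
counts <- table(regime)        # number of observations in each category
codes  <- as.integer(regime)   # integer codes R stores internally: 3 1 2 3

print(lv)
print(counts)
print(codes)
```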
Importing data into \(\mathcal{R}\) is fairly simple. For STATA, use the \({\tt foreign}\) package. For SPSS and SAS, I would recommend the Hmisc package for ease and functionality. See the Quick-\(\mathcal{R}\) section on these packages for information on obtaining and installing them. Before working with some examples of importing data, we need to learn about \(\mathcal{R}\)-packages and Working Directories.
\(\mathcal{R}\)-packages are reproducible and reusable collections of \(\mathcal{R}\) code written, tested, and confirmed by the \(\mathcal{R}\) community. To use an \(\mathcal{R}\)-package, you first need to download and install it. Assume we want to install the package foreign; type the following syntax and run it to see the results:
install.packages("foreign")
You can also install multiple packages with one command. The following syntax installs both the foreign and ggplot2 packages.
install.packages(c("foreign", "ggplot2"))
Take-Home: You can install \(\mathcal{R}\)-packages through the menu as well, how?
You need to install \(\mathcal{R}\)-packages only once, but you need to load them every time you open \(\mathcal{R}\)/\(\mathcal{R}\)-Studio. To call/load the installed packages from your \(\mathcal{R}\)-library, you can use the following syntax:
library(foreign)
library(ggplot2)
Take-Home Exercise: How can we load multiple packages at once? Hint: One way is using a for-loop.
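One possible sketch of the for-loop hint. To keep it runnable on any installation, it loads three packages that ship with \(\mathcal{R}\) (`stats`, `utils`, `tools`); replace them with the packages you actually need. The names `pkgs` and `pkg` are hypothetical.

```r
# Packages to load; these three come with every R installation
pkgs <- c("stats", "utils", "tools")

for (pkg in pkgs) {
  # character.only = TRUE tells library() that pkg holds the package name as text
  library(pkg, character.only = TRUE)
}

search()   # the loaded packages appear in the search path
```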
## Working Directories
There are two ways to set your working directory:
1. Through the menu
* In Windows: go to the File menu, select Change Working Directory, and select the appropriate folder/directory
* In Macs: go to the Misc menu, select Change Working Directory, and select the appropriate folder/directory
2. With the \({\tt setwd()}\) command:
setwd("...")
where “…” is the path to the folder, e.g., in Windows:
setwd("C:/Users/User Name/Documents/FOLDER")
in Macs:
setwd("/Users/User Name/Documents/FOLDER")
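Before and after changing the working directory, it helps to check where \(\mathcal{R}\) is currently looking. The sketch below uses the temporary folder returned by `tempdir()` as a stand-in for your own folder, so that it runs on any computer; `old_wd` and `target` are hypothetical names.

```r
old_wd <- getwd()        # remember the current working directory
target <- tempdir()      # stand-in for the folder you want to work in

if (dir.exists(target)) {
  setwd(target)          # change only if the folder really exists
}
getwd()                  # confirm the change
list.files()             # list the files R can now see in this folder

setwd(old_wd)            # go back to where we started
```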
Download the ECDC dataset on the geographic distribution of COVID-19 cases worldwide from my Dropbox page: Corona Data
Set your working directory accordingly, and load the CSV file you downloaded as follows:
setwd("C:\\Users\\Babak-Lenovo2017\\Downloads")
Corona=read.csv("Corona_WorldData_Total_Apr15_ecdc.csv")
MyData=read.csv("Corona_WorldData_Total_Apr15_ecdc.csv")
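You can see all the objects that you created in the Environment pane (under “Data” and “Values”) at the top right of \(\mathcal{R}\)-Studio. You can also open a dataset in a spreadsheet-like viewer with \({\tt View()}\), for example \({\tt View(Corona)}\) (make sure the V is uppercase).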
I personally prefer to keep my datasets on a cloud repository such as Dropbox or GitHub. You can directly load the data from these cloud services to your \(\mathcal{R}\)-Studio environment. I will give you an assignment about this!
Let’s start by getting some more detailed information about the data that you loaded:
head(Corona)
names(Corona)
nrow(Corona)
ncol(Corona)
summary(Corona)
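Two more base \(\mathcal{R}\) functions are often useful at this stage: `dim()` and `str()`. The sketch below runs them on a small made-up data frame, `toy`, so that it works without the downloaded file; its column names and values are invented for illustration.

```r
# Made-up data frame, used only for illustration
toy <- data.frame(
  Country = c("A", "B", "C"),
  cases   = c(120, 4500, 30),
  deaths  = c(3, 210, 0)
)

dim(toy)             # number of rows and number of columns
str(toy)             # compact overview: each variable, its type, first values
sapply(toy, class)   # the class of every column
```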
There are different methods to access each variable in a dataset:
Corona$Country
Corona[,"Country"]
Corona[,1]
Corona[,c("Country","ISOcode")]
Corona[,1:2]
Corona[,c(1,3)]
The operator \(\$\) is generally used in \(\mathcal{R}\) to access the elements of a list object! Here, we use it to reach a variable in a dataset, or, as \(\mathcal{R}\) calls it, a data.frame.
Similarly, you can access a specific observation:
Corona[1,]
Corona[c(1,2),]
Now, let’s see how we can report the summary of a variable, not a data.frame:
summary(Corona$cases)
summary(Corona[,3:4] )
summary(Corona[,c("cases","deaths")] )
Now, assume that you want to add a new variable to your data.frame. For example, you decided to work with the log of death cases. To create this new variable:
Corona$deaths_log=log10(1+Corona$deaths)
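Why add 1 before taking the logarithm? Presumably because a country can have zero deaths, and the logarithm of zero is not a finite number. A short sketch with made-up counts; the vector `deaths` is hypothetical.

```r
deaths <- c(0, 9, 99)              # made-up death counts, including a zero

plain   <- log10(deaths)           # the first value is -Inf
shifted <- log10(1 + deaths)       # 0 1 2: the zero becomes 0 instead of -Inf

print(plain)
print(shifted)
```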
Sometimes, we need to create categorical variables from continuous ones according to some rules and conditions. For example, I want to categorize the countries into those with fewer and those with more than 1000 deaths:
Corona$deaths_1k <- ifelse(Corona$deaths> 1000, 1, 0)
To get some idea of how the ifelse function works here, run the syntax below:
Corona$deaths> 1000
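The same logic on a tiny made-up vector makes each step easy to follow. `ifelse()` checks the condition for every element and returns the second argument where it is TRUE and the third where it is FALSE. The names `deaths`, `cond`, and `flag` are hypothetical.

```r
deaths <- c(50, 1000, 2500, NA)   # made-up counts, including a missing value

cond <- deaths > 1000             # FALSE FALSE TRUE NA
flag <- ifelse(cond, 1, 0)        # 0 0 1 NA

print(cond)
print(flag)
```

Note that exactly 1000 is not "more than 1000", and that a missing value stays missing.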
Assume that I only want to study the countries in the sample that have more than 1000 deaths. This is called subsetting:
subset(Corona, Corona$deaths>1000 )
# If you want to save its results in a new object (data.frame), then:
Corona_1000=subset(Corona, Corona$deaths>1000 )
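Subsetting can also be done with square brackets. The sketch below compares the two approaches on a made-up data frame, `toy`, that contains a missing value; all names and numbers are invented for illustration.

```r
# Made-up data frame with one missing death count
toy <- data.frame(
  Country = c("A", "B", "C"),
  deaths  = c(50, 2500, NA)
)

by_subset  <- subset(toy, deaths > 1000)          # drops rows where the condition is NA
by_bracket <- toy[which(toy$deaths > 1000), ]     # which() also drops the NA
with_na    <- toy[toy$deaths > 1000, ]            # without which(), an all-NA row appears

nrow(by_subset)    # 1
nrow(by_bracket)   # 1
nrow(with_na)      # 2
```

With missing data, `subset()` or `which()` is usually the safer choice.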
There are two main approaches to creating plots in \(\mathcal{R}\). Some users rely on \(\mathcal{R}\)’s built-in commands to create their plots, and others use a well-known package: \({\tt ggplot2}\). First, we focus on the former approach, and later, we will cover the latter.
In class, I showed you several scatter plots. Creating a scatter plot is pretty much straightforward in \(\mathcal{R}\). Assume that we are interested in studying the association between the number of Corona cases in each country and the number of deaths that it caused:
plot(Corona$cases,Corona$deaths)
This is the most basic plot that you can get. However, the plots that I showed in class are nicer and cleaner, and you must always make sure that you spend enough time preparing nice and informative plots for your papers and presentations. Now, run the code below and see what changes:
plot(Corona$cases,Corona$deaths, col="red")
Now, this:
plot(Corona$cases,Corona$deaths, pch=19)
plot(Corona$cases,Corona$deaths, pch=17)
On this webpage, you can see different types of markers.
Now, mix these two additional options.
plot(Corona$cases,Corona$deaths, pch=17, col="red")
You can add some other options to modify other elements of a plot:
plot(Corona$cases,Corona$deaths, pch=17, col="red",
     xlab="Corona cases", ylab="Deaths",
     main="Corona cases vs. deaths")
Adding to the debates about R vs. STATA and Python vs. R, there has been a discussion among \(\mathcal{R}\) users about \(\mathcal{R}\)’s built-in \({\tt plot()}\) command and the well-known \({\tt ggplot2}\) package. I don’t have any preference and use whichever fits the job that I want to get done, so feel free to use the one that you are more comfortable with.
To use \({\tt ggplot2}\), you first need to install it, if you haven’t done so yet. Then, we can load it and use it to create our plots.
library(ggplot2)
# Create the log-transformed variables used in the plot
Corona$cases_log=log10(1+Corona$cases)
Corona$deaths_log=log10(1+Corona$deaths)
ggplot(data = Corona, aes(x = cases_log, y = deaths_log)) +
  geom_point(alpha = 0.4, color = "maroon")
In addition to scatterplots, there are other types of informative plots that can be helpful in our class and your research. In class, I often talk about the distribution of a variable. You can visualize the distribution of a variable by plotting its histogram:
hist(Corona$popdata2018, breaks=10)
Scholars sometimes use a logarithmic transformation to make the changes in data smoother and decrease the gaps between observations. Now, let’s plot the logarithm of population and see how its distribution changes:
# Create the log-transformed population variable first
Corona$popdata2018_log=log10(Corona$popdata2018)
hist(Corona$popdata2018_log, breaks=10)
# Let's try one of my favorite colors: maroon
hist(Corona$popdata2018_log, breaks=10,
     col='maroon')
Let’s try it in \({\tt ggplot}\):
ggplot(Corona,
aes(x = popdata2018_log)) +
geom_histogram()
# With 10 bins instead of the default 30
ggplot(Corona,
aes(x = popdata2018_log)) +
geom_histogram(bins = 10, col='gold', fill='maroon')
## Play with the number of bins and see how the plots change
Another popular plot of a variable’s distribution is the density plot.
ggplot(Corona,
aes(x = popdata2018)) +
geom_density()
# try the next one:
ggplot(Corona,
aes(x = popdata2018_log)) +
geom_density(col='maroon', lwd=2)
A plot should always be annotated properly. That is, its \(x\)- and \(y\)-axes should be labeled properly, and its title should be informative. Here is how we can change the default, and often uninformative, labels to a cleaner and more informative version:
ggplot(Corona,
aes(x = deaths_log)) +
geom_density(col='maroon', lwd=2)+
labs(title = "Distribution of corona deaths",
#subtitle = "You may need to add subtitles",
x = "Corona deaths",
y = "Density",
caption="Data sources: https://www.ecdc.europa.eu/en")
Before running regression models, it is always helpful to check how the variables of interest (for example, the dependent and independent variables) are associated. Assume that you want to see how population size is associated with whether a country has more than 1000 deaths. One option is a scatterplot:
plot(Corona$popdata2018_log, Corona$deaths_1k)
This plot is not very informative. In such cases, with a categorical variable, it is usually better to use a boxplot:
boxplot(Corona$popdata2018_log~Corona$deaths_1k,
col='gold')
# Make Plots Great Again (MPGA):
boxplot(Corona$popdata2018_log~Corona$deaths_1k,
col='gold',
names=c("less than 1k","more than 1k"),
xlab='Deaths', ylab='Population')
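For completeness, here is a sketch of how a similar boxplot might look in \({\tt ggplot2}\). It uses a small made-up data frame, `toy`, rather than the Corona data, so the column names and values are assumptions for illustration only.

```r
library(ggplot2)

# Made-up data: log population and a 0/1 indicator for more than 1000 deaths
toy <- data.frame(
  pop_log   = c(6.1, 6.8, 7.2, 7.9, 8.3, 8.6),
  deaths_1k = c(0, 0, 0, 1, 1, 1)
)

# factor() tells ggplot2 to treat the 0/1 indicator as a category
p <- ggplot(toy, aes(x = factor(deaths_1k), y = pop_log)) +
  geom_boxplot(fill = "gold") +
  scale_x_discrete(labels = c("0" = "less than 1k", "1" = "more than 1k")) +
  labs(x = "Deaths", y = "Population (log10)")
print(p)
```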
Note: There are differences between the element-wise logical operators (& and |) and the logical operators that evaluate a single condition (&& and ||). We skip them here, but you can find out more with a simple search on the Internet!