This document is meant to show you some useful code that we will be using in this class. However, it is not meant to be a comprehensive list of tips and tricks for R users. If you want a deeper dive into R, please check the external resources that have been added on the course website.
For this class, make sure you have the most updated version of R (you
can check your R version by typing version
on the console).
If your version is older than R 4.2.1
, please download and
install the latest R version.
Make sure that you are starting with a clean space when you open
RStudio. This will save memory and also minimize potential problems! For
this, go to Tools -> Global Options -> General
. As
you can see from the image below, make sure you are not loading your
previous .RData into your space every time you open R Studio (make sure
that option is not selected). Tip: If you want your
previous R Scripts to be loaded the next time you open RStudio, you can
check the two options above Workspace.
It’s always good practice to have some annotations on your code, especially on the header, so future you remembers what past you was doing (sometimes they don’t get along). This is personal preference, but for simple projects, I tend to have a descriptive header as following:
################################################################################
### Title: My Project
### Author: Magdalena Bennett
### Created on: 2022/08/04
### Last edit: 2022/08/16
### Edit: [MB] - Created script
### Edit: [MB] - Included additional data
################################################################################
# Clears memory
rm(list = ls())
# Clears console
cat("\014")
This is particularly helpful when you are working with collaborators, so you can keep track of the main edits.
Another slightly more sofisticated but incredibly useful way of keep track of changes in R scripts is using Github. This goes beyond the scope of this document, but I highly suggest you look into it if you plan of coding more frequently (Tip: Look into Github Desktop if you’ve never used git).
Note that I also include code to clear the memory and the console, to make sure I’m starting from scratch when running code. This is very useful to make sure your code is reproducible, because others will most likely start with a clear memory as well.
One good practice is to load all required packages for your code at the top of your script. For example, for this document I will be using the following packages:
library(tidyverse)
library(patchwork)
library(vtable)
library(stargazer)
library(AER)
Make sure your packages are installed, or the previous code will give you an error. If one or more of the packages are not installed, you need to install them only once as following:
install.packages("AER")
Do not include this line of code on your R script. If you do, every time you run the code, you will be re-installing the package and that is not necessary (also it may lead to errors or inconsistencies if the package is updated).
For this course, we will almost always be using the
tidyverse
package, which include an array of other packages
as well (e.g. dplyr
, ggplot2
, etc.). If you
load the tidyverse
, there’s no need to include these other
ones. Also, make sure you load the packages you need, and not every
package you have ever used. Keep your code clean.
How you load your data depends on how your data is stored (csv? json? dta?). In this course, we will usually use CSV data (Comma-Separated Values), which you can load as following:
d <- read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/UsefulRCode/data/cereal.csv")
In this case, we are loading the data directly from a web repository,
but you can also load it with local data (e.g. changing the file to
"C:/Documents/DS/cereal.csv"
).
Note: Always make sure you look at your dataset after you load it, to make sure you are reading it correctly.
If you are using other type of data, there other packages that might
be useful, such as haven
, readxl
, or
rjson
, among others.
Now that we have setup our session and have our data loaded, let’s
start wrangling the data (e.g. setting it up for analysis). We will be
using the cereal.csv
data we previously loaded, and you can
check out more information about this dataset here.
We can look at some of the data with plots. For example, a histogram:
ggplot(data = d, aes(x = rating)) +
geom_histogram(color = "skyblue", fill = "white", lwd = 1.1, bins = 30) +
xlab("Rating") + ylab("Count") +
ggtitle("Rating for Cereals") +
theme_minimal()
In ggplot
, we always load the data and pass the
aesthetics (e.g. variables you want to plot). There are many options to
personalize your plots! Make sure they are readable and provide the
information you want to convey. For that, titles, captions, axis labels,
scale, etc., are always important. In this case, looks ARE
important (in order to convey information).
Besides histograms, we can have density plots
(geom_density()
), scatter plots
(geom_point()
), and line plots (geom_line
),
among many, many others.
For example, let’s look at a scatter plot between sugars
and rating
:
ggplot(data = d, aes(y = rating, x = sugars)) +
geom_point(color = "skyblue") +
xlab("Sugars") + ylab("Ratings") +
ggtitle("Rating vs Sugars") +
theme_minimal()
And now let’s look at the same plot, but for cereals below and above
100 calories per portion. Notice how I create a new variable
(over100ca
), and use it to color different points
(Question: What happens if I don’t use
factor(over100ca)
and use over100ca
directly?):
d %>% mutate(over100ca = ifelse(calories>100,1,0)) %>%
ggplot(data = ., aes(y = rating, x = sugars, color = factor(over100ca))) +
geom_point() +
xlab("Sugars") + ylab("Ratings") +
# scale_X_manual() allows you to include specific colors, titles, and labels!
scale_color_manual("Over 100 calories?", values = c("skyblue","purple"),
label = c("No","Yes")) +
ggtitle("Rating vs Sugars") +
theme_minimal()
We can also show two plots side-by-side using the
patchwork
package:
hist1 <- ggplot(data = d, aes(x = rating)) +
geom_histogram(color = "skyblue", fill = "white", lwd = 1.1, bins = 30) +
xlab("Rating") + ylab("Count") +
ggtitle("Rating for Cereals") +
theme_minimal()
scat1 <- d %>% mutate(over100ca = ifelse(calories>100,1,0)) %>%
ggplot(data = ., aes(y = rating, x = sugars, color = factor(over100ca))) +
geom_point() +
xlab("Sugars") + ylab("Ratings") +
scale_color_manual("Over 100 calories?", values = c("skyblue","purple"),
label = c("No","Yes")) +
ggtitle("Rating vs Sugars") +
theme_minimal() +
theme(legend.position = c(0.8,0.9))
hist1 + scat1
Some arguments that might be useful to have in mind:
color
: Refers to the outline color of objects. If
fill
is not defined, usually both take the same
argument.fill
: Refers to what color objects are filled
(e.g. bars). In a scatter plot, it only works if the marker has a
different outline.pch
: In a scatter plot, defines the type of marker you
are using (e.g. circles, squares, triangles). Check out the different
options herelwd
: Line width for a line or a bar plot, for
example.lty
: Line type (1: solid, 2: dashed, etc.).In terms of aesthetics, aes()
:
x
: X-axis variabley
: Y-axis variablecolor
, fill
: Variable that will define the
color of lines or markers (e.g. if you want to distinguish between two
groups)Also, please always remember to label your axis!!
One thing that you might want to do when inspecting your data is to know how many observations/groups you have available. For example, how many different manufacturers do we have and do they make higher or lower calories cereal?
For this, the group_by
function could be very useful.
What it does is group your dataset by whatever variable(s) you want, and
then do operations at that group-level (e.g. mutate or summarize):
d %>% group_by(mfr) %>% summarize(n_cereals = n(),
mean_calories = mean(calories))
mfr | n_cereals | mean_calories |
---|---|---|
A | 1 | 100.00000 |
G | 22 | 111.36364 |
K | 23 | 108.69565 |
N | 6 | 86.66667 |
P | 9 | 108.88889 |
Q | 8 | 95.00000 |
R | 8 | 115.00000 |
From the previous code, you can see that now we are creating two
new variables: n_cereals
, which is the number of different
cereals in our dataset by manufacturer, and
mean_calories
, which captures the average calories of
cereals made by that manufacturer.
One important task (and usually the one that takes the most time) is cleaning your data. In this section, we will review some of the main functions you will need to remove observations, create new variables, change existing ones, etc.
Some useful functions:
filter(exp)
: This function helps us select a subset
of observations that comply with the logical expression exp
within.
is.na(var)
: Applied to a variable var
,
returns a logic vector (TRUE
or FALSE
),
depending on whether that observation is missing or not.
mutate(var = exp)
: This is the main function we will
use to create new variables or replace existing ones (if we use a
variable name that is already in the datasete).
mean(var, na.rm = TRUE)
: Returns the mean of a
variable var
, excluding missing values from the calculation
(that’s why we use na.rm=TRUE
). If you don’t want to
exclude them, use na.rm=FALSE
(it’s a good way to double
check whether you have missing values or not).
ifelse(test, yes, no)
: Returns a value with the same
dimensions as test
(a logical expression), but filled with
the yes
elements when test==TRUE
and
no
elements when test==FALSE
.
For merging datasets, you will need at least two dataframes, and a
common variable that you want to use for merging
(e.g. id
).
The following figure shows the more typical functions for merging
datasets, available on the tidyverse
:
As an example, let’s load two datasets: (1) one at the zipcode
level, and (2) another one at the individual level.
zipcode <- read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/UsefulRCode/data/zipcode.csv")
individual <- read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/UsefulRCode/data/individual.csv")
head(individual)
personid | zipcode | gender |
---|---|---|
1 | 10025 | F |
2 | 10025 | F |
3 | 10025 | M |
4 | 11222 | F |
5 | 11222 | M |
6 | 10027 | M |
We will merge both datasets by their zipcode
(Note: make sure both variables are named the same and have the same
format!), and keep all observations in the individual
dataset:
individual <- left_join(individual, zipcode, by = "zipcode")
head(individual)
personid | zipcode | gender | n |
---|---|---|---|
1 | 10025 | F | 100892 |
2 | 10025 | F | 100892 |
3 | 10025 | M | 100892 |
4 | 11222 | F | 67339 |
5 | 11222 | M | 67339 |
6 | 10027 | M | 111865 |
We will not go into depth in terms of how to analyze data (check slides and code provided for each class), but one relevant tip for assignments is how to present your results in an appropriate way.
For that, we will use the stargazer
package to format
tables.
Besides what we will see in class, you can use different packages to export tables in a more “format-friendly” way. Here, I’ll review just a couple of packages that can be useful for this purpose.
We will mention this package in class as a way to quickly inspect
your data, but you can also create summary tables with the
sumtable()
function.
cereal <- read.csv("https://raw.githubusercontent.com/maibennett/sta235/main/exampleSite/content/UsefulRCode/data/cereal.csv")
sumtable(cereal, digits = 2, title = "Summary statistics for Cereal Data")
Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
---|---|---|---|---|---|---|---|
type | 77 | ||||||
… C | 74 | 96% | |||||
… H | 3 | 4% | |||||
calories | 77 | 106.88 | 19.48 | 50 | 100 | 110 | 160 |
protein | 77 | 2.55 | 1.09 | 1 | 2 | 3 | 6 |
fat | 77 | 1.01 | 1.01 | 0 | 0 | 2 | 5 |
sodium | 77 | 159.68 | 83.83 | 0 | 130 | 210 | 320 |
fiber | 77 | 2.15 | 2.38 | 0 | 1 | 3 | 14 |
carbo | 77 | 14.6 | 4.28 | -1 | 12 | 17 | 23 |
sugars | 77 | 6.92 | 4.44 | -1 | 3 | 11 | 15 |
potass | 77 | 96.08 | 71.29 | -1 | 40 | 120 | 330 |
vitamins | 77 | 28.25 | 22.34 | 0 | 25 | 25 | 100 |
shelf | 77 | 2.21 | 0.83 | 1 | 1 | 3 | 3 |
weight | 77 | 1.03 | 0.15 | 0.5 | 1 | 1 | 1.5 |
cups | 77 | 0.82 | 0.23 | 0.25 | 0.67 | 1 | 1.5 |
rating | 77 | 42.67 | 14.05 | 18.04 | 33.17 | 50.83 | 93.7 |
You might also want to have nice looking tables for a report, so renaming your variables might be a good idea (for the table). For example, you can do it as following:
cereal_select <- cereal %>% select(name, type, calories, protein, fat, sodium, sugars)
names(cereal_select) <- c("Cereal Name", "Type of Cereal", "Calories (Kca)", "Protein (grs)", "Fat (grs)", "Sodium (mgrs)", "Sugars (grs)")
tab <- sumtable(cereal_select, digits = 2,
title = "Summary statistics for Cereal Data",
out = "return")
tab %>% kbl() %>%
kable_paper("hover", full_width = F)
Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
---|---|---|---|---|---|---|---|
Type of Cereal | 77 | ||||||
… C | 74 | 96% | |||||
… H | 3 | 4% | |||||
Calories (Kca) | 77 | 106.88 | 19.48 | 50 | 100 | 110 | 160 |
Protein (grs) | 77 | 2.55 | 1.09 | 1 | 2 | 3 | 6 |
Fat (grs) | 77 | 1.01 | 1.01 | 0 | 0 | 2 | 5 |
Sodium (mgrs) | 77 | 159.68 | 83.83 | 0 | 130 | 210 | 320 |
Sugars (grs) | 77 | 6.92 | 4.44 | -1 | 3 | 11 | 15 |
You can also export the table to a CSV file using the arguments
out = "csv"
and file
, or an HTML file to keep
the formatting.
sumtable(cereal, out = "csv", file = "C:/DS/sumtable_cereal.csv", digits = 2,
title = "Summary statistics for Cereal Data")
sumtable(cereal, out = "html", file = "C:/DS/sumtable_cereal.html", digits = 2,
title = "Summary statistics for Cereal Data")
Check out more examples here
to customize labels, groups, etc.
You can also use the stargazer
package to create similar
tables. For example:
stargazer(cereal, title = "Summary statistics for Cereal Data", header = FALSE,
type = "html")
Statistic | N | Mean | St. Dev. | Min | Max |
calories | 77 | 106.883 | 19.484 | 50 | 160 |
protein | 77 | 2.545 | 1.095 | 1 | 6 |
fat | 77 | 1.013 | 1.006 | 0 | 5 |
sodium | 77 | 159.675 | 83.832 | 0 | 320 |
fiber | 77 | 2.152 | 2.383 | 0.000 | 14.000 |
carbo | 77 | 14.597 | 4.279 | -1.000 | 23.000 |
sugars | 77 | 6.922 | 4.445 | -1 | 15 |
potass | 77 | 96.078 | 71.287 | -1 | 330 |
vitamins | 77 | 28.247 | 22.343 | 0 | 100 |
shelf | 77 | 2.208 | 0.833 | 1 | 3 |
weight | 77 | 1.030 | 0.150 | 0.500 | 1.500 |
cups | 77 | 0.821 | 0.233 | 0.250 | 1.500 |
rating | 77 | 42.666 | 14.047 | 18.043 | 93.705 |
You can also use this package for regression results. For example, if we have three models:
lm0 <- lm(rating ~ sugars, data = d)
lm1 <- lm(rating ~ sugars + fat, data = d)
lm2 <- lm(rating ~ sugars + fat + fiber + cups, data = d)
stargazer(lm0, lm1, lm2, header = FALSE,
title="Regressions results for Cereal Ratings")
Dependent variable: | |||
rating | |||
(1) | (2) | (3) | |
sugars | -2.401*** | -2.213*** | -1.967*** |
(0.237) | (0.235) | (0.143) | |
fat | -3.066*** | -3.511*** | |
(1.036) | (0.634) | ||
fiber | 2.906*** | ||
(0.301) | |||
cups | -0.882 | ||
(3.093) | |||
Constant | 59.284*** | 61.089*** | 54.310*** |
(1.948) | (1.953) | (3.344) | |
Observations | 77 | 77 | 77 |
R2 | 0.577 | 0.622 | 0.867 |
Adjusted R2 | 0.571 | 0.612 | 0.859 |
Residual Std. Error | 9.196 (df = 75) | 8.755 (df = 74) | 5.269 (df = 72) |
F Statistic | 102.349*** (df = 1; 75) | 60.836*** (df = 2; 74) | 117.065*** (df = 4; 72) |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Same as before, you can also export these tables, but stargazer does have a more limited capability:
stargazer(lm0, lm1, lm2,
title="Regressions results for Cereal Ratings",
out = "C:/DS/reg_cereals.html")
If you want to keep the format of the table and be able to edit it in your document, one way is to export html into a .doc file, but you have to make small modifications to the p-value legends:
notes ="***: p<a><</a>0.01; **: p<a><</a>0.05; *: p<a><</a>0.1"
stargazer(lm0, lm1, lm2,
title="Regressions results for Cereal Ratings", type = "html",
out = "C:/DS/reg_cereals.doc",
notes = notes, notes.append=F)
There are other packages out there that could also be useful
(e.g. flextable
or gtsummary
), but in my
opinion, the two presented here are some of the more beginner-friendly.
I encourage you, though, to explore the different arguments and options
to personalize your tables appropriately! Remember that the information
you present needs to convey a message, so it needs to be clear.
One important thing to learn is how to find solutions when your code
is not running. Here are some tips:
Start from a clean environment: Run your entire code. Does it throw an error? Identify which line.
Is this an error or a warning?: Both of them are important, but warnings allow you to run your code, while errors don’t. Read warnings to make sure that is something that you shouldn’t be concerned about.
Read your error message: Sometimes the solution is on the error itself! If it tells you that the function can’t run because there are missing values, inspect your data to see what they are.
Google is your friend: If you can’t identify the problem from the error message itself, google it! Most likely someone else has ran into the same issue and have posted the question on StackOverflow, for example. Use that collective knowledge to help you find answers.
I still don’t know what’s wrong: If you’ve already read, googled, tried different solutions, and still feel stuck (e.g. > 10 min), reach out to the instruction team! We are more than happy to guide you to a solution. Make sure to let us know what you have tried, but don’t spend too much time stuck.
With coding, you learn something new every day. New packages come out, functions sometimes change, new problems require different skills, but the idea is that you learn how to teach yourself how to code, regardless of the language or software. It can be frustrating, but it is an incredible useful skill and your hard work will certainly pay!
There are a lot of things we can do in R, and in this class we are going to dive into a lot of them. Make sure you check the weekly R scripts that are posted with each class and play around with them. Add some of these tricks, try new functions, search for errors, and be curious!
The instruction team is always here to help you, so be sure to reach out if you are having any issues.
Happy coding!