STATS 60 Rlab Session 2

Yuchen Wu

2020/7/1

Project proposal

Agenda for today

Lists

In all the data structures so far, the elements have to be of the same type. To store elements of different types in one data structure, we can use a list, which we create with list(). We can think of a list as a collection of key-value pairs. Keys should be strings.

person <- list(name = "John Doe", age = 26)
person
## $name
## [1] "John Doe"
## 
## $age
## [1] 26
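
To see why a list is needed here, it helps to compare it with c(), which coerces mixed elements to one common type. A minimal sketch (the name mixed is illustrative and not from the slides):

```r
# c() builds an atomic vector, so the number is coerced to a string
mixed <- c("John Doe", 26)
class(mixed)
mixed[2]             # the string "26", no longer a number

# list() keeps each element's own type
person <- list(name = "John Doe", age = 26)
class(person$name)
class(person$age)
```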

Lists

The str function can be used to inspect what is inside person:

str(person)
## List of 2
##  $ name: chr "John Doe"
##  $ age : num 26

To access the name element of person, we have 2 options:

person[["name"]]
## [1] "John Doe"
person$name
## [1] "John Doe"
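
Both options return the element itself. Single brackets behave differently: they return a smaller list that still wraps the element. A short sketch of the difference (the variable key is illustrative):

```r
person <- list(name = "John Doe", age = 26)

# [[ ]] and $ extract the element itself
class(person[["name"]])   # "character"

# [ ] returns a sub-list that still contains the element
class(person["name"])     # "list"

# [[ ]] also accepts a position, or a variable holding the key
key <- "age"
person[[key]]
person[[2]]
```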

Lists

The elements of a list can be anything, even another data structure! Let’s add the names of John’s children to the person object:

person$children <- c("Ross", "Robert")
str(person)
## List of 3
##  $ name    : chr "John Doe"
##  $ age     : num 26
##  $ children: chr [1:2] "Ross" "Robert"

To see the keys associated with a list, use the names() function:

names(person)
## [1] "name"     "age"      "children"

What is a data frame?

A special type of list:
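
Concretely, each element of the list is one column, and all columns have the same length. A quick check on the built-in mtcars data, offered as a sketch:

```r
data(mtcars)

is.list(mtcars)         # a data frame is built on a list
is.data.frame(mtcars)

# each list element is one column, and all columns have the same length
lengths(mtcars)

# list-style access therefore works on columns
mtcars$mpg[1:3]
mtcars[["mpg"]][1:3]
```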

Data Frame

data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Data Frame

We can use the “Help” pane in the bottom-right corner of RStudio (or run ?mtcars in the console) to look up the meaning of the variable names:

Data Frame

View(mtcars)

Data Frame

head(mtcars)     # returns the first 6 rows of the data set; tail(mtcars) returns the last 6
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
dim(mtcars)
## [1] 32 11
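
A few more ways to look at a data frame, shown as a sketch on mtcars. The row range, the column choice and the threshold of 30 are arbitrary and only for illustration:

```r
data(mtcars)

nrow(mtcars)                    # number of rows (cars)
ncol(mtcars)                    # number of columns (variables)
summary(mtcars$mpg)             # quartiles, minimum, maximum and mean

# rows and columns are selected as [rows, columns]
mtcars[1:3, c("mpg", "cyl")]

# keep only the rows that satisfy a condition
efficient <- subset(mtcars, mpg > 30)
efficient
```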

Data Frame

Instead of using built-in data sets, we can also have R read from local files:

# First, set your working directory
setwd("~/Desktop/")
# Then read the file, using a path relative to the working directory
carSpeeds <- read.csv(file = "data/car-speeds.csv")
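
If no such file is at hand, the same mechanics can be tried with a temporary file. This is a sketch only: the temporary CSV is a hypothetical stand-in for a real local file, and cars_from_file is an illustrative name.

```r
# Hypothetical stand-in for a local file: write mtcars to a temporary CSV
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, file = tmp, row.names = TRUE)

# file.exists() is a quick check before reading a path typed by hand
file.exists(tmp)

# Read it back; the first column holds the car names, so use it as row names
cars_from_file <- read.csv(file = tmp, row.names = 1)

dim(cars_from_file)
head(cars_from_file, n = 3)
```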

We can also read data from a website!

df <- read.csv("https://stats60.github.io/Rlab/worldbank_data_tidy.csv",
               stringsAsFactors = FALSE)

Words vs. pictures

“The simple graph has brought more information to the data analyst’s mind than any other device.” - John Tukey
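
The plotting code in the rest of these slides uses a data frame df with columns mpg, weight and cylinders, and its construction is not shown. A plausible reconstruction, assuming it is derived from mtcars (the values printed later in the slides match mtcars) and that cylinders is stored as a factor. Note that this replaces the df read from the website above.

```r
data(mtcars)

# Assumption: df is mtcars reduced to three renamed columns;
# cylinders is a factor so it can be mapped to color and shape
df <- data.frame(
    mpg       = mtcars$mpg,
    weight    = mtcars$wt,
    cylinders = factor(mtcars$cyl)
)
head(df, n = 3)
str(df)
```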

library(ggplot2)
p_base <- ggplot(data = df, mapping = aes(y = mpg, x = weight))
p_scatter <- p_base + geom_point(aes(col = cylinders), size = 2)
p_scatter

Two classes of variables in statistics

Barplots: counts for a categorical variable

What is the distribution of cylinders in my dataset?

ggplot(data = mtcars) +
    geom_bar(aes(x = factor(cyl))) +
    ggtitle("Count by cylinders") +
    xlab("No. of cylinders")

Histograms: counts for a continuous variable

What is the distribution of miles per gallon in my dataset?

p_hist <- ggplot(data = mtcars) + 
    geom_histogram(aes(x = mpg), breaks = seq(10, 35, 5)) +
    ggtitle("Histogram of miles per gallon")
p_hist

ggplot(data = mtcars) + 
    geom_histogram(aes(x = mpg)) +
    ggtitle("Histogram of miles per gallon")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
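
The message appears because no bin specification was supplied, so ggplot2 falls back to 30 bins. One way to address it, as a sketch, is to set binwidth explicitly. The value 2.5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# binwidth is in the units of the x variable (miles per gallon here)
p_hist_bw <- ggplot(data = mtcars) +
    geom_histogram(aes(x = mpg), binwidth = 2.5) +
    ggtitle("Histogram of miles per gallon (binwidth = 2.5)")
p_hist_bw
```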

Scatterplots: continuous variable vs. continuous variable

What is the relationship between mpg and weight?

ggplot(data = df) + 
    geom_point(mapping = aes(y = mpg, x = weight), size = 2) + 
    ggtitle("Miles per gallon vs. weight")

Lineplots: continuous variable vs. time variable

What is the relationship between mpg and time?
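
mtcars has no time variable, so this sketch swaps in the economics data set that ships with ggplot2, which has a date column and an unemploy column. It is only meant to show the shape of a geom_line() call and is not the plot from the original slide.

```r
library(ggplot2)

# economics is bundled with ggplot2; date is the time variable
p_line <- ggplot(data = economics) +
    geom_line(mapping = aes(x = date, y = unemploy)) +
    ggtitle("Number of unemployed over time")
p_line
```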

Boxplots

For each number of cylinders, what does the distribution of mpg look like?

ggplot(data = df) + 
    geom_boxplot(aes(x = cylinders, y = mpg)) +
    ggtitle("Distribution of mpg by cylinders")

Violinplots

For each number of cylinders, what does the distribution of mpg look like?

ggplot(data = df) + 
    geom_violin(aes(x = cylinders, y = mpg)) +
    ggtitle("Distribution of mpg by cylinders")

Heatmaps: categorical variable vs. categorical variable

How often does each combination of cylinder count and gear count occur in the dataset?
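
One possible way to draw such a heatmap, sketched with mtcars directly: count the pairs with table() and draw one tile per pair with geom_tile(). The names pair_counts and p_heat are illustrative.

```r
library(ggplot2)

# Count how often each (cyl, gear) pair occurs; Freq holds the counts
pair_counts <- as.data.frame(table(cyl = mtcars$cyl, gear = mtcars$gear))
pair_counts

# One tile per pair, shaded by its count
p_heat <- ggplot(data = pair_counts) +
    geom_tile(aes(x = cyl, y = gear, fill = Freq)) +
    ggtitle("Count by cylinders and gears") +
    xlab("No. of cylinders") +
    ylab("No. of gears")
p_heat
```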

Summary

Data visualization in R: 2 broad approaches

base R

Data visualization in R: 2 broad approaches

ggplot2
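
As a rough side-by-side, here is the same scatterplot of mpg against weight drawn with each approach. The sketch uses mtcars directly, so the weight column is called wt.

```r
library(ggplot2)

# base R: a single function call draws the plot immediately
plot(mtcars$wt, mtcars$mpg,
     xlab = "weight", ylab = "mpg",
     main = "Miles per gallon vs. weight")

# ggplot2: build a plot object from data, aesthetics and a geom, then print it
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
    geom_point() +
    ggtitle("Miles per gallon vs. weight")
p
```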

3 essential elements of graphics: data, geometries, aesthetics

Data: Dataset we are using for the plot

##     mpg weight cylinders
## 1  21.0  2.620         6
## 2  21.0  2.875         6
## 3  22.8  2.320         4
## 4  21.4  3.215         6
## 5  18.7  3.440         8
## 6  18.1  3.460         6
## 7  14.3  3.570         8
## 8  24.4  3.190         4
## 9  22.8  3.150         4
## 10 19.2  3.440         6

3 essential elements of graphics: data, geometries, aesthetics

Geometries: Visual elements used for our data

Geom: point

3 essential elements of graphics: data, geometries, aesthetics

Aesthetics: Define which data columns control which visual properties of the geom (for example position, color, size)

3 different aesthetics:

Examples of other aesthetics

p_base + geom_point(aes(size = cylinders, alpha = weight))

Examples of other aesthetics

p_base + geom_point(aes(col = cylinders, shape = cylinders), size = 3)

ggplot2 code

ggplot()

ggplot2 code

ggplot() +
    geom_histogram(data = df, mapping = aes(x = mpg))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot2 code

ggplot() +
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg))

ggplot2 code

ggplot() +
    geom_point(data = df, 
               mapping = aes(x = weight, y = mpg, col = cylinders),
               shape = 15)

Layers: Combining multiple plots into one graphic

We can have more than one layer in a graphic.

Each layer contains (essentially):

ggplot2 code

ggplot() +
    geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_point(data = df, mapping = aes(x = cylinders, y = mpg), 
               position = "jitter")

ggplot2 code

When layers share attributes, we only have to type them once:

ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
    geom_boxplot() +
    geom_point(position = "jitter")

ggplot2 code

ggplot(df, aes(x = cylinders, y = mpg)) +
    geom_boxplot() +
    geom_point(position = "jitter")

Scales

Examples of scales (Source: A Layered Grammar of Graphics)

Scales example: colors

Manually chosen colors

p_scatter + scale_color_manual(values = c("gold2", "darkorange", "firebrick"))
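
When the values are unnamed, colors are matched to the factor levels in order. A sketch of the same scale with named values, which makes the pairing explicit. It assumes df is built from mtcars as in the code below, since the slides do not show its construction; cyl_colors and p_named are illustrative names.

```r
library(ggplot2)

# Assumption: df mirrors the three columns used throughout the slides
df <- data.frame(mpg = mtcars$mpg, weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Names in `values` refer to the levels of cylinders
cyl_colors <- c("4" = "gold2", "6" = "darkorange", "8" = "firebrick")

p_named <- ggplot(df, aes(x = weight, y = mpg)) +
    geom_point(aes(col = cylinders), size = 2) +
    scale_color_manual(values = cyl_colors)
p_named
```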

Facets

p_scatter + facet_wrap(~cylinders)

Themes

A theme refers to all non-data ink, such as the background, grid lines and fonts

ggplot2’s default theme

p_scatter

Minimal theme

p_scatter + theme_minimal()

More pre-set themes

Classic theme

p_scatter + theme_classic()

More pre-set themes

Dark theme

p_scatter + theme_dark()
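
Pre-set themes can be adjusted further with theme(). A small sketch, again assuming df is derived from mtcars; the particular settings are arbitrary and p_themed is an illustrative name.

```r
library(ggplot2)

# Assumption: df mirrors the three columns used throughout the slides
df <- data.frame(mpg = mtcars$mpg, weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

p_themed <- ggplot(df, aes(x = weight, y = mpg)) +
    geom_point(aes(col = cylinders), size = 2) +
    ggtitle("Miles per gallon vs. weight") +
    theme_minimal() +
    # theme() overrides individual settings of the pre-set theme
    theme(legend.position = "bottom",
          plot.title = element_text(face = "bold"))
p_themed
```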

We’ve only scratched the surface!

R Graph Gallery: an excellent source of inspiration and code snippet examples