STATS 60 Rlab Session 3

Yuchen Wu

2020/7/8

Recap of week 2: data visualization

ggplot2 syntax

library(ggplot2)
ggplot()

ggplot2 syntax

ggplot() +
    geom_violin(data = mtcars, 
                mapping = aes(x = factor(cyl), y = hp))

ggplot2 syntax

ggplot() +
    geom_violin(data = mtcars, 
                mapping = aes(x = factor(cyl), y = hp)) +
    geom_jitter(data = mtcars, 
               mapping = aes(x = factor(cyl), y = hp))

ggplot2 syntax

ggplot(data = mtcars, 
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_jitter()

ggplot2 syntax

ggplot(data = mtcars, 
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_jitter() +
    labs(title = "Horsepower vs. Cylinder", x = "Cylinder", 
         y = "Horsepower")

ggplot2 syntax

ggplot(data = mtcars, 
       mapping = aes(x = factor(cyl), y = hp)) +
    geom_violin() +
    geom_jitter() +
    labs(title = "Horsepower vs. Cylinder", x = "Cylinder", 
         y = "Horsepower") +
    theme_classic()

Agenda for today

Why do we need to transform data?

We rarely get data in exactly the form we need!

Transforming data in R is made easy by the dplyr package (“official” cheat sheet available here).

The 5 basic dplyr verbs

Toy example: Student scores

library(dplyr)
scores
##     Name Gender English Math Science History Spanish
## 1 Andrew      M      60   96      80      56      77
## 2   John      M      66   55      56      64      77
## 3   Mary      F      92   63      70      62      98
## 4   Jane      F      80   76      89      55      40
## 5    Bob      M      80   80      82      48      50
## 6    Dan      M      58   52      79      90      61

select: pick subset of variables/columns by name

History teacher: “I just want their names and History scores”

select: pick subset of variables/columns by name

History teacher: “I just want their names and History scores”

  1. Take the scores dataset.
  2. Select the Name and History columns.

select: pick subset of variables/columns by name

History teacher: “I just want their names and History scores”

  1. Take the scores dataset.
  2. Select the Name and History columns.
scores %>%
    select(Name, History)
##     Name History
## 1 Andrew      56
## 2   John      64
## 3   Mary      62
## 4   Jane      55
## 5    Bob      48
## 6    Dan      90

mutate: create new columns based on old ones

Form teacher: “What are their total scores?”

mutate: create new columns based on old ones

Form teacher: “What are their total scores?”

  1. Take the scores dataset.
  2. Add a new column by mutating existing columns: for each row, Total = English + Math + Science + History + Spanish

mutate: create new columns based on old ones

Form teacher: “What are their total scores?”

  1. Take the scores dataset.
  2. Add a new column by mutating existing columns: for each row, Total = English + Math + Science + History + Spanish
scores <- scores %>%
    mutate(Total = English + Math + Science + History + Spanish)
scores
##     Name Gender English Math Science History Spanish Total
## 1 Andrew      M      60   96      80      56      77   369
## 2   John      M      66   55      56      64      77   318
## 3   Mary      F      92   63      70      62      98   385
## 4   Jane      F      80   76      89      55      40   340
## 5    Bob      M      80   80      82      48      50   340
## 6    Dan      M      58   52      79      90      61   340

arrange: reorder rows

Form teacher: “Can I have the students in order of overall performance?”

arrange: reorder rows

Form teacher: “Can I have the students in order of overall performance?”

  1. Take the scores dataset.
  2. Arrange rows by the Total column.

arrange: reorder rows

Form teacher: “Can I have the students in order of overall performance?”

  1. Take the scores dataset.
  2. Arrange rows by the Total column.
scores %>%
    arrange(Total)
##     Name Gender English Math Science History Spanish Total
## 1   John      M      66   55      56      64      77   318
## 2   Jane      F      80   76      89      55      40   340
## 3    Bob      M      80   80      82      48      50   340
## 4    Dan      M      58   52      79      90      61   340
## 5 Andrew      M      60   96      80      56      77   369
## 6   Mary      F      92   63      70      62      98   385

arrange: reorder rows

Form teacher: “No no, better students on top please…”

arrange: reorder rows

Form teacher: “No no, better students on top please…”

  1. Take the scores dataset.
  2. Arrange rows by the Total column, but in descending order.

arrange: reorder rows

Form teacher: “No no, better students on top please…”

  1. Take the scores dataset.
  2. Arrange rows by the Total column, but in descending order.
scores %>%
    arrange(desc(Total))
##     Name Gender English Math Science History Spanish Total
## 1   Mary      F      92   63      70      62      98   385
## 2 Andrew      M      60   96      80      56      77   369
## 3   Jane      F      80   76      89      55      40   340
## 4    Bob      M      80   80      82      48      50   340
## 5    Dan      M      58   52      79      90      61   340
## 6   John      M      66   55      56      64      77   318

arrange: reorder rows

Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”

arrange: reorder rows

Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”

  1. Take the scores dataset.
  2. Arrange rows first by the Total column in descending order, then by the Name column (in ascending order).

arrange: reorder rows

Form teacher: “Can I have them in descending order of total scores, but if students tie, then by alphabetical order?”

  1. Take the scores dataset.
  2. Arrange rows first by the Total column in descending order, then by the Name column (in ascending order).
scores %>%
    arrange(desc(Total), Name)
##     Name Gender English Math Science History Spanish Total
## 1   Mary      F      92   63      70      62      98   385
## 2 Andrew      M      60   96      80      56      77   369
## 3    Bob      M      80   80      82      48      50   340
## 4    Dan      M      58   52      79      90      61   340
## 5   Jane      F      80   76      89      55      40   340
## 6   John      M      66   55      56      64      77   318

filter: pick observations by their values

History teacher: “I want to see which students scored less than 60 for history”

filter: pick observations by their values

History teacher: “I want to see which students scored less than 60 for history”

  1. Take the scores dataset.
  2. Filter for rows where the value in the History column is less than 60.

filter: pick observations by their values

History teacher: “I want to see which students scored less than 60 for history”

  1. Take the scores dataset.
  2. Filter for rows where the value in the History column is less than 60.
scores %>%
    filter(History < 60)
##     Name Gender English Math Science History Spanish Total
## 1 Andrew      M      60   96      80      56      77   369
## 2   Jane      F      80   76      89      55      40   340
## 3    Bob      M      80   80      82      48      50   340

More on filters

Other ways to make comparisons:

More on filters

Other ways to make comparisons:

Combining comparisons:

More filter examples

Dan’s parents: “I just want Dan’s scores”

More filter examples

Dan’s parents: “I just want Dan’s scores”

scores %>% 
    filter(Name == "Dan")
##   Name Gender English Math Science History Spanish Total
## 1  Dan      M      58   52      79      90      61   340

More filter examples

Dan’s parents: “I just want Dan’s scores”

scores %>% 
    filter(Name == "Dan")
##   Name Gender English Math Science History Spanish Total
## 1  Dan      M      58   52      79      90      61   340

Language teacher: “I want to know which students score < 50 for either English or Spanish”

More filter examples

Dan’s parents: “I just want Dan’s scores”

scores %>% 
    filter(Name == "Dan")
##   Name Gender English Math Science History Spanish Total
## 1  Dan      M      58   52      79      90      61   340

Language teacher: “I want to know which students score < 50 for either English or Spanish”

scores %>% 
    filter(English < 50 | Spanish < 50)
##   Name Gender English Math Science History Spanish Total
## 1 Jane      F      80   76      89      55      40   340

summarize: get summaries of data

Academic: “I want to know the correlation between math and science scores”

summarize: get summaries of data

Academic: “I want to know the correlation between math and science scores”

  1. Take the scores dataset.
  2. Summarize the dataset by taking the correlation of the Math and Science columns.

summarize: get summaries of data

Academic: “I want to know the correlation between math and science scores”

  1. Take the scores dataset.
  2. Summarize the dataset by taking the correlation of the Math and Science columns.
scores %>%
    summarize(corr = cor(Math, Science))
##        corr
## 1 0.5470561

summarize: get summaries of data

Science teacher: “I want to know the mean and standard deviation of the scores for science”

summarize: get summaries of data

Science teacher: “I want to know the mean and standard deviation of the scores for science”

  1. Take the scores dataset.
  2. Summarize the dataset by taking the mean of the Science column and the standard deviation of the science column.

summarize: get summaries of data

Science teacher: “I want to know the mean and standard deviation of the scores for science”

  1. Take the scores dataset.
  2. Summarize the dataset by taking the mean of the Science column and the standard deviation of the science column.
scores %>%
    summarize(Science_mean = mean(Science), 
              Science_sd = sd(Science))
##   Science_mean Science_sd
## 1           76   11.54123

Chaining multiple dplyr commands using %>%

Science teacher: “I want to know which students scored > 80 for Science, but I just want names”

Chaining multiple dplyr commands using %>%

Science teacher: “I want to know which students scored > 80 for Science, but I just want names”

  1. Take the scores dataset.
  2. Filter for the rows which have Science > 80.
  3. Select the Name column.

Chaining multiple dplyr commands using %>%

Science teacher: “I want to know which students scored > 80 for Science, but I just want names”

  1. Take the scores dataset.
  2. Filter for the rows which have Science > 80.
  3. Select the Name column.
scores %>%
    filter(Science > 80) %>%
    select(Name)
##   Name
## 1 Jane
## 2  Bob

group_by: use dplyr verbs on a group-by-group basis

Academic: “I want to know if the boys scored better than the girls in Spanish”

group_by: use dplyr verbs on a group-by-group basis

Academic: “I want to know if the boys scored better than the girls in Spanish”

  1. Take the scores dataset.
  2. Group the dataset rows by gender.

group_by: use dplyr verbs on a group-by-group basis

Question: How many males and females are there in the data set?

  1. Take the scores dataset.
  2. Group the dataset rows by gender.
scores %>%
    group_by(Gender) %>%
    count()
## # A tibble: 2 x 2
## # Groups:   Gender [2]
##   Gender     n
##   <chr>  <int>
## 1 F          2
## 2 M          4

transmute: create new columns based on old ones, discard old ones

Form teacher: “I just want the mean score for each student”

scores %>% 
    transmute(mean = (English + Math + Science + History + Spanish) / 5)

Practice 1

Language teacher: “I want to know which students scored < 70 for both Spanish, but I just want names”

Practice 1

Language teacher: “I want to know which students scored < 70 for both Spanish, but I just want names”

  1. Take the scores dataset.
  2. Filter for the rows which have Spanish < 70.
scores %>%
    filter(Spanish < 70) %>%
    select(Name)
##   Name
## 1 Jane
## 2  Bob
## 3  Dan

Practice 2

Language teacher: “I want to know which students scored < 70 for both English and Spanish, but I just want names”

  1. Take the scores dataset.
  2. Filter for the rows which have English < 70 and Spanish < 70.
  3. Select the Name column.
scores %>%
    filter(English < 70 & Spanish < 70) %>%
    select(Name)
##   Name
## 1  Dan

Practice 4

History teacher: “I want the names of students with their history scores, with the entries sorted by name”

Practice 3

History teacher: “I want the names of students with their history scores, with the entries sorted by name”

  1. Take the scores dataset.
  2. Arrange the dataset rows by the name column.
  3. Select the Name and History columns.
scores %>%
    arrange(Name) %>%
    select(Name, History)
##     Name History
## 1 Andrew      56
## 2    Bob      48
## 3    Dan      90
## 4   Jane      55
## 5   John      64
## 6   Mary      62

TRUE/FALSE examples

3 > 2
## [1] TRUE
3 < 2
## [1] FALSE
3 == 2
## [1] FALSE
c(1, 2, 3, 1) == c(3, 2, 1, 2)
## [1] FALSE  TRUE FALSE FALSE
c(1, 2, 3, 1) == 1
## [1]  TRUE FALSE FALSE  TRUE

Be careful with NAs!

1 == NA
## [1] NA
NA == NA
## [1] NA
is.na(NA)
## [1] TRUE

More on %>%

Functions: R’s workhorse

A function is a named block of code which

(Source: codehs.gitbooks.io)

We use functions in R all the time

We’ve already seen a number of functions in R! For example,

is.character("123")
## [1] TRUE

The function is.character takes the input given to it in the parentheses and returns TRUE or FALSE, depending on whether the input is of type character or not.

Others we’ve seen: str(), head(), rm(), ggplot(), select(), …

We can see what a function does by typing in ? followed by the function name in the R console.

?is.character

Function syntax

The most important syntax in R is the function call. All R syntax has function calls underlying it.

A function call consists of:

function_name(<inputs to the function>,
              <arguments which change 
              how the function operates>)

Function example

function_name(<inputs to the function>,
              <arguments which change 
              how the function operates>)
x <- c(-5, -3, -1, 1, 3, NA)
mean(x)
## [1] NA

Function example

function_name(<inputs to the function>,
              <arguments which change 
              how the function operates>)
x <- c(-5, -3, -1, 1, 3, NA)
mean(x, na.rm = TRUE)
## [1] -1

Function calls read “inside out”

abs(x): If x is positive, return x. If x is negative, return x without the negative sign.

mean(abs(x), na.rm = TRUE)
## [1] 2.6

Function calls read “inside out”

abs(x): If x is positive, return x. If x is negative, return x without the negative sign.

mean(abs(x), na.rm = TRUE)
## [1] 2.6

A deeper look at functions

Question: How do we find out what a function does? What inputs does it accept, what does it output, etc…

A deeper look at functions

Question: How do we find out what a function does? What inputs does it accept, what does it output, etc…

First answer: Google it! Google “R <function name>”

A deeper look at functions

Question: How do we find out what a function does? What inputs does it accept, what does it output, etc…

First answer: Google it! Google “R <function name>”

A (probably) better answer: Documentation in R itself!

sample(): Description

sample(): Usage

What comes after the = sign: default value for that argument

sample(): Arguments

sample(): Details

sample(): Value

How does R know which arguments we are referring to?

sample(x = 1:10, size = 10)
##  [1]  3  1  6 10  5  9  2  4  7  8
sample(1:10, 10, TRUE)
##  [1] 10  5 10  4  4  9  4  6  4  3
sample(1:10, TRUE, size = 5)
## [1] 10  1  5  9  5