STATS 60 Rlab Session 4

Yuchen Wu

2020/7/15

Recap of session 3

ALL of the dplyr functions from session 3 (e.g. select() and filter()) take:

  1. A dataset, and
  2. Instructions on what to do with the dataset.

The dataset is either:

  1. The first argument within the function’s parentheses, e.g.
select(df, day)

  2. Passed to the function through a “pipe” %>%, e.g.
df %>% select(day)
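
The two calling styles above can be compared directly. This is a minimal sketch: the toy dataset df and its columns (day, temp) are made up for illustration and are not the dataset used in the lab.

```r
library(dplyr)

# Hypothetical toy dataset; the column names are assumptions for illustration.
df <- tibble(day = c("Mon", "Tue", "Wed"), temp = c(20, 22, 19))

# Style 1: the dataset is the first argument inside the parentheses.
by_argument <- select(df, day)

# Style 2: the dataset is passed in through the pipe.
by_pipe <- df %>% select(day)

# Compare the two results; both styles run the same select() call.
all.equal(by_argument, by_pipe)
```

The pipe simply places whatever is on its left into the first argument of the function on its right, which is why both styles give the same dataset.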

ALL of these functions return a dataset!

You can do three things with this returned dataset:

  1. Nothing, in which case it prints to screen.
  2. Save it by assigning it to a variable.
  3. Don’t save it, but pass it on to another function using a “pipe” %>%
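
The three options might look like this in practice. It is a sketch using the built-in mtcars dataset; the variable name small_cars is an arbitrary choice for illustration.

```r
library(dplyr)

# 1. Do nothing with the result: it prints to the screen.
select(mtcars, wt, mpg)

# 2. Save it by assigning it to a variable (nothing is printed).
small_cars <- select(mtcars, wt, mpg)

# 3. Don't save it; pass it on to another function with the pipe.
low_mpg <- select(mtcars, wt, mpg) %>% filter(mpg < 15)
low_mpg
```

Options 2 and 3 can be combined, as in the last step: the piped result is itself a dataset, so it can be assigned to a variable too.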

%>% syntax with dplyr

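Take the mtcars dataset, select just the wt and mpg columns, then keep only the rows with mpg < 15

library(dplyr)  # provides select(), filter() and the %>% pipe

mtcars %>% 
    select(wt, mpg) %>%    # keep only the wt and mpg columns
    filter(mpg < 15)       # keep only the rows where mpg is below 15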

Agenda for today

tidyr::gather()

E.g. a dataset of the number of cases for each country, with one column per year

df
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

How can we make a line plot of the number of cases by year for each country?

We probably want something like the following, which needs a year column and a cases column:

ggplot(df) +
    geom_line(aes(x = year, y = cases, group = country))

Problem: The column names `1999` and `2000` are values of the variable year, so df has no year or cases column to plot.

Solution: Reshape dataset using tidyr’s gather()

(Source: R for Data Science)

df %>% gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

df %>% gather(`1999`, `2000`, key = "year", value = "cases") %>%
    ggplot() +
    geom_line(aes(x = as.numeric(year), y = cases, col = country))
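
To try the reshape yourself, here is a self-contained sketch that rebuilds df from the printed output above. It also shows pivot_longer(), which replaces gather() in tidyr 1.0.0 and later (gather() still works). The variable names long_gather and long_pivot are arbitrary.

```r
library(dplyr)
library(tidyr)

# Rebuild the example dataset from the printed output above.
df <- tribble(
  ~country,      ~`1999`, ~`2000`,
  "Afghanistan",     745,    2666,
  "Brazil",        37737,   80488,
  "China",        212258,  213766
)

# The reshape from the slides: the column names become values of "year",
# and the cell contents become values of "cases".
long_gather <- df %>%
  gather(`1999`, `2000`, key = "year", value = "cases")

# The same reshape with pivot_longer(), the newer tidyr interface.
long_pivot <- df %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")

# Both results have the columns country, year and cases;
# sorting makes the row order match the gather() result.
long_pivot %>% arrange(year, country)
```

In both versions year comes back as a character column, which is why the plotting code wraps it in as.numeric().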

tidyr::separate()

E.g. a dataset of rate (cases / population) for each country and year, stored as text

df
## # A tibble: 6 x 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

How can we get cases and population into columns of their own?

Solution: Use tidyr’s separate()

(Source: R for Data Science)

df %>% separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
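
Note that cases and population are still character (<chr>) columns in the output above. A possible follow-up, sketched here with df rebuilt from the printed output, is to pass convert = TRUE so that separate() guesses numeric types. The variable name df_split is arbitrary.

```r
library(dplyr)
library(tidyr)

# Rebuild the example dataset from the printed output above.
df <- tribble(
  ~country,      ~year, ~rate,
  "Afghanistan", 1999,  "745/19987071",
  "Afghanistan", 2000,  "2666/20595360",
  "Brazil",      1999,  "37737/172006362",
  "Brazil",      2000,  "80488/174504898",
  "China",       1999,  "212258/1272915272",
  "China",       2000,  "213766/1280428583"
)

# convert = TRUE asks separate() to pick better column types,
# so cases and population come back as numbers instead of text.
df_split <- df %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# With numeric columns, the rate can be computed directly.
df_split %>% mutate(rate = cases / population)
```

Without convert = TRUE, the division in the last step would fail because the columns would still be text.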

Where does your data live?

Filepath example (Mac)

(Source: iDB)

Filepath example (Windows)

File paths

Working directories in R

How can I change my working directory in RStudio?

  1. You can issue the command setwd("<path of new directory>")
  2. In the menu bar, click Session > Set Working Directory, then click one of the options in the sub-menu
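
Option 1 can be sketched in the console as follows. Here tempdir() is only a stand-in for "<path of new directory>" so that the example runs on any machine; substitute the folder where your data lives. The variable names old_dir and new_dir are arbitrary.

```r
# Where is R currently looking for files?
old_dir <- getwd()
old_dir

# Stand-in for "<path of new directory>"; replace with your own folder.
new_dir <- tempdir()

# Change the working directory and confirm the change.
setwd(new_dir)
getwd()

# Files in the working directory can now be referred to by name alone.
list.files()

# Go back to where we started.
setwd(old_dir)
```

Checking getwd() after setwd() is a quick way to confirm that R is looking where you expect before you try to read a file.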

Today’s dataset: Drought in California

Data source: United States Drought Monitor (USDM)

USDM: data download

USDM: data selection

The data in Excel