STATS 60 Rlab Session 5

Yuchen Wu

2020/7/22

Recap of Session 4

tidyr::gather()

E.g. dataset of no. of cases for each country

df
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

How to make a line plot of no. of cases by year for each country?

Probably want something like

ggplot(df) +
    geom_line(aes(x = year, y = cases, group = country))

Problem: Column names are values of the variable year.

Solution: Reshape dataset:

df %>% gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
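
With the data in long form, the line plot sketched earlier becomes possible. The following is a minimal sketch, assuming the tidyr, tibble and ggplot2 packages are installed; the name `long` and the `colour = country` aesthetic are additions for illustration and do not come from the slides.

```r
library(tidyr)
library(ggplot2)

# Rebuild the example dataset from the slides
df <- tibble::tibble(
  country = c("Afghanistan", "Brazil", "China"),
  `1999`  = c(745, 37737, 212258),
  `2000`  = c(2666, 80488, 213766)
)

# Reshape: one row per (country, year) pair
long <- df %>% gather(`1999`, `2000`, key = "year", value = "cases")

# year is a character column, so it is treated as a discrete x axis;
# group = country draws one line per country
p <- ggplot(long) +
  geom_line(aes(x = year, y = cases, group = country, colour = country))
p
```

In recent tidyr releases gather() is marked as superseded by pivot_longer(); gather() still works, so the slides' version is kept here.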

tidyr::separate()

E.g. dataset of rate (cases / population) for each country

df
## # A tibble: 6 x 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

How to get cases and population into columns of their own?

Solution: Use tidyr’s separate()

df %>% separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
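
Note that cases and population come out as character columns (<chr>), so they cannot be used in arithmetic yet. One way to handle this, shown as a sketch that assumes tidyr and tibble are installed, is the convert argument of separate(); the name `out` and the per-10,000 calculation are illustrative only.

```r
library(tidyr)

# Rebuild the example dataset from the slides
df <- tibble::tibble(
  country = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year    = c(1999, 2000, 1999, 2000, 1999, 2000),
  rate    = c("745/19987071", "2666/20595360", "37737/172006362",
              "80488/174504898", "212258/1272915272", "213766/1280428583")
)

# convert = TRUE asks separate() to guess a better type for the new columns
out <- df %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# Arithmetic now works, e.g. cases per 10,000 people
out$cases / out$population * 10000
```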

Agenda for today

Factors

Why use factor variables instead of character variables?

Reason 1: Character variables don’t protect you from typos

x <- c("Dec", "Apr", "Jam", "Mar")

Reason 2: Character variables don’t sort in a useful way

x <- c("Dec", "Apr", "Jan", "Mar")
sort(x)
## [1] "Apr" "Dec" "Jan" "Mar"

Factor variables can fix both of these easily.

How to convert a character variable to a factor variable?

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x <- c("Dec", "Apr", "Jam", "Mar")

y1 <- factor(x, levels = month_levels, ordered = TRUE)
y1
## [1] Dec  Apr  <NA> Mar 
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

sort(y1)
## [1] Mar Apr Dec
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
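
The typo "Jam" became NA without any warning, so it is worth checking which inputs were dropped. The sketch below uses only base R; it relies on the built-in constant month.abb in place of the hand-typed month_levels, and the name `bad` is hypothetical.

```r
# month.abb is a built-in constant: "Jan", "Feb", ..., "Dec"
month_levels <- month.abb
x <- c("Dec", "Apr", "Jam", "Mar")

y1 <- factor(x, levels = month_levels, ordered = TRUE)

# Values that did not match any level were turned into NA
bad <- x[is.na(y1)]
bad

# Sorting follows the level order, not the alphabet, and drops the NA
sort(y1)

# Ordered factors also support comparisons against a level
y1 < "Jun"
```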

Why use factor variables instead of numerical variables?

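# Numeric codes standing in for categories
data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)
fdata <- factor(data)
fdata
##  [1] 1 2 2 3 1 2 3 3 1 2 3 3 1
## Levels: 1 2 3
# Option 1: attach labels when creating the factor
rdata <- factor(data, labels = c("I", "II", "III"))
rdata
##  [1] I   II  II  III I   II  III III I   II  III III I  
## Levels: I II III
# Option 2: relabel the levels of an existing factor
levels(fdata) <- c("I", "II", "III")
fdata
##  [1] I   II  II  III I   II  III III I   II  III III I  
## Levels: I II III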
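
The slide does not spell out the reason, so the following is one possible motivation rather than the author's stated one: a factor makes it explicit that the values are categories, which gives per-category counts and guards against accidental arithmetic. A small base R sketch:

```r
data  <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)
rdata <- factor(data, labels = c("I", "II", "III"))

# Count observations per category
table(rdata)

# Arithmetic on the raw codes runs silently,
# even though an "average category" is meaningless
mean(data)

# The same call on the factor returns NA (with a warning)
suppressWarnings(mean(rdata))

# The underlying integer codes are still available if needed
as.integer(rdata)
```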

Generating random data

R contains many functions that allow you to generate random data, either from a vector of values that you specify (like Heads or Tails from a coin) or from an established probability distribution, like the Normal or Uniform distribution.

sample()

# From the integers 1:10, draw 5 numbers, without replacement
sample(x = 1:10, size  = 5)
## [1] 5 6 8 9 3
# From the integers 1:10, draw 5 numbers, with replacement
sample(x = 1:10, size  = 5, replace = TRUE)
## [1] 1 1 2 7 1

sample(x = c("H", "T"), # The possible values of the coin
       size = 10,  # 10 flips
       replace = TRUE) # Sampling with replacement
##  [1] "T" "T" "H" "H" "T" "T" "H" "H" "H" "H"
sample(x = c("H", "T"),
       prob = c(.8, .2), # Make the coin biased for Heads
       size = 10,
       replace = TRUE)
##  [1] "H" "H" "T" "H" "H" "T" "H" "H" "H" "H"
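
Ten flips are too few to see the bias clearly. As a rough check, the sketch below draws many more flips and tabulates the proportions; the sample size of 10000 and the seed value are arbitrary choices, and set.seed() is used only so the sketch is repeatable.

```r
set.seed(1)  # arbitrary seed, only for repeatability

flips <- sample(x = c("H", "T"),
                prob = c(.8, .2),  # same biased coin as above
                size = 10000,
                replace = TRUE)

# Proportion of each outcome; should be close to 0.8 and 0.2
prop.table(table(flips))
```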

Generate random data from specified probability distribution

?Distributions
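
The help page lists the distributions in the stats package. Each one comes with four functions that share a naming pattern: a prefix d (density), p (cumulative probability), q (quantile) or r (random draws) followed by the distribution's short name. A brief sketch using the Normal distribution:

```r
# Density of the standard Normal at 0
dnorm(0, mean = 0, sd = 1)

# P(X <= 0) for a standard Normal
pnorm(0, mean = 0, sd = 1)

# The value below which 97.5% of a standard Normal lies
qnorm(0.975, mean = 0, sd = 1)

# Three random draws (values differ from run to run)
rnorm(3, mean = 0, sd = 1)
```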

Normal distribution

# 5 samples from a Normal dist with mean = 0, sd = 1
rnorm(n = 5, mean = 0, sd = 1)
## [1]  0.64967554  0.68991786  1.41764235 -0.35829427 -0.06418326
# 3 samples from a Normal dist with mean = -10, sd = 15
rnorm(n = 3, mean = -10, sd = 15)
## [1] -32.119474 -42.201002   0.119997

Because the sampling is done randomly, you will get different values each time you run rnorm().
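
Individual draws vary, but a large sample should still reflect the parameters you asked for. A quick sanity check, with an arbitrary sample size and seed chosen for this sketch:

```r
set.seed(1)  # arbitrary seed, only for repeatability

draws <- rnorm(n = 10000, mean = -10, sd = 15)

# Both should land close to the requested -10 and 15
mean(draws)
sd(draws)
```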

Uniform distribution

# 5 samples from Uniform dist with bounds at 0 and 1
runif(n = 5, min = 0, max = 1)
## [1] 0.43948645 0.00909191 0.55663140 0.53614440 0.54634026
# 10 samples from Uniform dist with bounds at -100 and +100
runif(n = 10, min = -100, max = 100)
##  [1]  -9.921079  64.775792  42.996560  42.510502 -83.201566 -59.280297
##  [7]   7.649564  69.167467  18.237490  98.842050
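
Under a Uniform distribution every value between the bounds is equally likely, so the share of draws in any sub-interval should match its share of the full range. A sketch with an arbitrary sample size and seed:

```r
set.seed(1)  # arbitrary seed, only for repeatability

u <- runif(n = 10000, min = -100, max = 100)

# All draws stay within the bounds
range(u)

# (50, 100) is a quarter of the range, so about 25% of draws fall there
mean(u > 50)
```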

Random samples will always change

Every time you draw a sample from a probability distribution, you will (likely) get a different result.

# Draw a sample of size 5 from a normal distribution with mean 100 and sd 10
rnorm(n = 5, mean = 100, sd = 10)
## [1]  95.81486 104.28819 103.88127  93.95388 102.88365
# Do it again!
rnorm(n = 5, mean = 100, sd = 10)
## [1] 106.10249  94.83656  89.86620 105.02997 101.63315

Use set.seed() to control random samples

set.seed(100)
rnorm(3, mean = 0, sd = 1)
## [1] -0.50219235  0.13153117 -0.07891709
# The result will always be -0.5022, 0.1315, -0.0789
set.seed(100)
rnorm(3, mean = 0, sd = 1)
## [1] -0.50219235  0.13153117 -0.07891709
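
Setting the seed fixes the whole sequence of random numbers that follows, not just the next call, and it applies to sample() as well. A short sketch; the variable names are illustrative.

```r
set.seed(100)
a <- rnorm(3)
b <- rnorm(3)   # continues the same random stream, so it differs from a

set.seed(100)   # reset to the same starting point
a_again <- rnorm(3)

identical(a, a_again)  # same seed, same draws
identical(a, b)        # later draws in the stream are different

# The same idea works for sample()
set.seed(100)
s1 <- sample(x = 1:10, size = 5)
set.seed(100)
s2 <- sample(x = 1:10, size = 5)
identical(s1, s2)
```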