`tidyr::gather()`

E.g. dataset of no. of cases for each country

df

## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

How to make a line plot of no. of cases by year for each country?

Probably want something like

ggplot(df) +
    geom_line(aes(x = year, y = cases, group = country))

Problem: Column names are values of the variable year.

Solution: Reshape dataset:

df %>% gather(`1999`, `2000`, key = "year", value = "cases")

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
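To make the reshape concrete, here is a rough base R sketch that builds the same long table by hand, without tidyr. The data frame is re-typed from the output above, and the names `year_cols` and `long` are only illustrative. Because the year values started out as column names, they arrive as text, so the sketch also converts them to integers, which is usually what you want before plotting against a numeric axis.

```r
# Wide data, re-typed from the tibble printed above
df <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  `1999` = c(745, 37737, 212258),
  `2000` = c(2666, 80488, 213766),
  check.names = FALSE  # keep "1999" and "2000" as column names
)

year_cols <- c("1999", "2000")  # illustrative helper name

# Stack the year columns: one row per (country, year) pair
long <- data.frame(
  country = rep(df$country, times = length(year_cols)),
  year    = rep(year_cols, each = nrow(df)),
  cases   = unlist(df[year_cols], use.names = FALSE)
)

# The year values came from column names, so they are character;
# convert them to integer
long$year <- as.integer(long$year)
long
```

The rows come out in the same order as the gather() output above: all 1999 rows first, then all 2000 rows.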

`tidyr::separate()`

E.g. dataset of rate (cases / population) for each country

df

## # A tibble: 6 x 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

How to get cases and population into columns of their own?

Solution: Use tidyr’s separate()

df %>% separate(rate, into = c("cases", "population"), sep = "/")

## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
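Note that the new cases and population columns are still character (`<chr>`), so they cannot be used in arithmetic yet. The base R sketch below shows the same split done by hand on a few of the rate strings, followed by a conversion to numeric. The variable names are illustrative. tidyr's separate() also has a `convert` argument that can attempt this conversion for you.

```r
# A few rate strings, re-typed from the table above
rate <- c("745/19987071", "2666/20595360", "37737/172006362")

# Split each string at the slash; the result is a list of character pairs
parts <- strsplit(rate, "/", fixed = TRUE)

# Pull out each piece and convert it from character to numeric
cases      <- as.numeric(sapply(parts, `[`, 1))
population <- as.numeric(sapply(parts, `[`, 2))

# With numeric columns, the rate can be computed as an actual number
cases / population
```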

Agenda for today

Factors
Random sampling with R

Factors

A built-in R data type
Useful for working with categorical variables: variables that have a fixed and known set of possible values
Both numeric and character variables can be made into factors

Why use factor variables instead of character variables?

Reason 1: Character variables don’t protect you from typos

x <- c("Dec", "Apr", "Jam", "Mar")

Reason 2: Character variables don’t sort in a useful way

x <- c("Dec", "Apr", "Jan", "Mar")
sort(x)

## [1] "Apr" "Dec" "Jan" "Mar"

Factor variables can fix both of these easily.

How to convert a character variable to a factor variable?

Use factor() (in base R)
Give the function a vector of the valid categories (the levels)

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x <- c("Dec", "Apr", "Jam", "Mar")

y1 <- factor(x, levels = month_levels, ordered = TRUE)
y1

## [1] Dec  Apr  <NA> Mar 
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

sort(y1)

## [1] Mar Apr Dec
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
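In the output above, the typo "Jam" matched none of the levels, so it became `NA`, and sort() then dropped it. A small sketch of how you might list the offending values and inspect the underlying order follows. It uses R's built-in `month.abb` constant in place of the hand-typed `month_levels`, and the name `typos` is illustrative.

```r
# month.abb is a built-in vector: "Jan", "Feb", ..., "Dec"
month_levels <- month.abb
x <- c("Dec", "Apr", "Jam", "Mar")
y1 <- factor(x, levels = month_levels, ordered = TRUE)

# Values that matched no level became NA; list the original strings
typos <- x[is.na(y1)]
typos

# The integer codes give each value's position among the levels,
# which is what sort() uses for an ordered factor
as.integer(y1)
```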

Why use factor variables instead of numerical variables?

You can easily relabel the categories of a factor by changing its “levels” or “labels”.

data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)
fdata <- factor(data)
fdata

##  [1] 1 2 2 3 1 2 3 3 1 2 3 3 1
## Levels: 1 2 3

rdata <- factor(data, labels = c("I", "II", "III"))
rdata

##  [1] I   II  II  III I   II  III III I   II  III III I  
## Levels: I II III

levels(fdata) <- c("I", "II", "III")
fdata

##  [1] I   II  II  III I   II  III III I   II  III III I  
## Levels: I II III
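One way to keep the relabelling explicit is to pass both `levels` and `labels`, so that each numeric value is paired with its label by position. The sketch below reuses the data above. The table() call and the name `counts` are additions for illustration.

```r
data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)

# levels says which values to expect; labels says what to call them,
# matched by position (1 -> "I", 2 -> "II", 3 -> "III")
rdata <- factor(data, levels = c(1, 2, 3), labels = c("I", "II", "III"))

# Count how many observations fall in each category
counts <- table(rdata)
counts
```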

Generating random data

R contains many functions that allow you to generate random data, either from a vector of values that you specify (like Heads or Tails from a coin) or from an established probability distribution, like the Normal or Uniform distribution.

sample()

# From the integers 1:10, draw 5 numbers, without replacement
sample(x = 1:10, size = 5)

## [1] 5 6 8 9 3

# From the integers 1:10, draw 5 numbers, with replacement
sample(x = 1:10, size = 5, replace = TRUE)

## [1] 1 1 2 7 1

sample(x = c("H", "T"), # The possible values of the coin
       size = 10,  # 10 flips
       replace = TRUE) # Sampling with replacement

##  [1] "T" "T" "H" "H" "T" "T" "H" "H" "H" "H"

sample(x = c("H", "T"),
       prob = c(.8, .2), # Make the coin biased for Heads
       size = 10,
       replace = TRUE)

##  [1] "H" "H" "T" "H" "H" "T" "H" "H" "H" "H"
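With only 10 flips it is hard to tell whether the bias is working. A rough check, sketched below, is to flip the same biased coin many more times and look at the proportion of Heads. The seed value and the number of flips are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so reruns give the same flips

flips <- sample(x = c("H", "T"),
                prob = c(0.8, 0.2),  # same bias as above
                size = 10000,        # many more flips
                replace = TRUE)

# Proportion of Heads; it should land close to 0.8
p_heads <- mean(flips == "H")
p_heads

# Proportions of both outcomes
table(flips) / length(flips)
```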

Generate random data from a specified probability distribution

?Distributions

Normal distribution

# 5 samples from a Normal dist with mean = 0, sd = 1
rnorm(n = 5, mean = 0, sd = 1)

## [1]  0.64967554  0.68991786  1.41764235 -0.35829427 -0.06418326

# 3 samples from a Normal dist with mean = -10, sd = 15
rnorm(n = 3, mean = -10, sd = 15)

## [1] -32.119474 -42.201002   0.119997

Because the sampling is done randomly, you will get different values each time you run rnorm().
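Individual draws vary, but a large sample should still reflect the parameters you asked for. As a quick sanity check, the sketch below draws many values and compares the sample mean and standard deviation with the requested ones. The sample size and the name `draws` are arbitrary choices.

```r
# Many draws from a Normal dist with mean = -10, sd = 15
draws <- rnorm(n = 10000, mean = -10, sd = 15)

# These should be close to -10 and 15, but not exactly equal
mean(draws)
sd(draws)
```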

Uniform distribution

# 5 samples from Uniform dist with bounds at 0 and 1
runif(n = 5, min = 0, max = 1)

## [1] 0.43948645 0.00909191 0.55663140 0.53614440 0.54634026

# 10 samples from Uniform dist with bounds at -100 and +100
runif(n = 10, min = -100, max = 100)

##  [1]  -9.921079  64.775792  42.996560  42.510502 -83.201566 -59.280297
##  [7]   7.649564  69.167467  18.237490  98.842050
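"Uniform" means every value between the bounds is equally likely, so about a quarter of the draws from a Uniform(0, 1) should fall below 0.25. The sketch below checks this on a large sample. The sample size and the name `u` are arbitrary.

```r
# Many draws from a Uniform dist with bounds at 0 and 1
u <- runif(n = 10000, min = 0, max = 1)

# Every draw stays inside the bounds
range(u)

# Share of draws below 0.25; it should be roughly one quarter
mean(u < 0.25)
```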

Random samples will always change

Every time you draw a sample from a probability distribution, you will (likely) get a different result.

# Draw a sample of size 5 from a normal distribution with mean 100 and sd 10
rnorm(n = 5, mean = 100, sd = 10)

## [1]  95.81486 104.28819 103.88127  93.95388 102.88365

# Do it again!
rnorm(n = 5, mean = 100, sd = 10)

## [1] 106.10249  94.83656  89.86620 105.02997 101.63315

Use set.seed() to control random samples

set.seed(100)
rnorm(3, mean = 0, sd = 1)

## [1] -0.50219235  0.13153117 -0.07891709

# The result will always be -0.5022, 0.1315, -0.0789
set.seed(100)
rnorm(3, mean = 0, sd = 1)

## [1] -0.50219235  0.13153117 -0.07891709
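You can confirm this without reading the numbers by comparing the draws directly. In the sketch below, the names `first`, `second` and `third` are illustrative. It also shows that the seed has to be reset before each draw if you want a repeat.

```r
set.seed(100)
first <- rnorm(3, mean = 0, sd = 1)

set.seed(100)
second <- rnorm(3, mean = 0, sd = 1)

# Same seed, same draws
identical(first, second)

# Without resetting the seed, the generator carries on from where it
# stopped, so the next draw differs
third <- rnorm(3, mean = 0, sd = 1)
identical(first, third)
```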

STATS 60 Rlab Session 5

Recap of Session 4
