Yuchen Wu
2020/7/22
tidyr: gather and separate
working directory
tidyr::gather()
E.g. dataset of no. of cases for each country
df
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
How to make a line plot of no. of cases by year for each country?
We probably want something like:
# Needs columns named year and cases, which df does not have yet
ggplot(df) +
  geom_line(aes(x = year, y = cases, group = country))
Problem: The column names 1999 and 2000 are values of the variable year, not variables themselves.
Solution: Reshape the dataset so that year and cases become columns of their own:
df %>% gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
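gather() still works, but tidyr 1.0.0 and later document it as superseded by pivot_longer(). The following is a minimal sketch of the same reshape followed by the line plot, assuming tidyr (>= 1.0.0), tibble and ggplot2 are installed. df is rebuilt from the values printed above, and the object names long and p are illustrative only.

```r
library(tidyr)
library(ggplot2)

# Rebuild the example data shown above (wide format: one column per year)
df <- tibble::tibble(
  country = c("Afghanistan", "Brazil", "China"),
  `1999`  = c(745, 37737, 212258),
  `2000`  = c(2666, 80488, 213766)
)

# Same reshape as gather(), written with pivot_longer()
long <- pivot_longer(
  df,
  cols      = c(`1999`, `2000`),
  names_to  = "year",
  values_to = "cases"
)

# year is a character column after reshaping; convert it for a numeric x-axis
long$year <- as.integer(long$year)

# One line per country
p <- ggplot(long) +
  geom_line(aes(x = year, y = cases, group = country, colour = country))
print(p)
```

pivot_longer() returns the rows grouped by country rather than by year, so the row order differs from the gather() output above, but the contents are the same.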
tidyr::separate()
E.g. dataset of rate (cases / population) for each country
df
## # A tibble: 6 x 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
How to get cases and population into columns of their own?
Solution: Use tidyr's separate():
df %>% separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
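In the output above, cases and population are still character columns. A small sketch of one way to get numbers instead, assuming tidyr and tibble are installed: separate() has a convert argument that runs type conversion on the new columns. df is rebuilt from the values printed above, and the name tidy is illustrative.

```r
library(tidyr)

# Rebuild the example data shown above
df <- tibble::tibble(
  country = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "China", "China"),
  year    = c(1999, 2000, 1999, 2000, 1999, 2000),
  rate    = c("745/19987071", "2666/20595360", "37737/172006362",
              "80488/174504898", "212258/1272915272", "213766/1280428583")
)

# convert = TRUE asks separate() to convert the new columns to a suitable type
tidy <- separate(df, rate, into = c("cases", "population"),
                 sep = "/", convert = TRUE)

# With numeric columns the rate can be computed directly
tidy$rate <- tidy$cases / tidy$population
tidy
```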
Factors
Random sampling with R
Reason 1: Character variables don’t protect you from typos
x <- c("Dec", "Apr", "Jam", "Mar")
Reason 2: Character variables don’t sort in a useful way
x <- c("Dec", "Apr", "Jan", "Mar")
sort(x)
## [1] "Apr" "Dec" "Jan" "Mar"
Factor variables can fix both of these easily.
factor() (in base R)
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x <- c("Dec", "Apr", "Jam", "Mar")
y1 <- factor(x, levels = month_levels, ordered = TRUE)
y1
## [1] Dec Apr <NA> Mar
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
sort(y1)
## [1] Mar Apr Dec
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
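The typo "Jam" became NA because it is not one of the declared levels, and sort() then dropped it. A short base R sketch of how such entries could be located, reusing the objects defined above; the names bad_positions and bad_values are illustrative.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x <- c("Dec", "Apr", "Jam", "Mar")
y1 <- factor(x, levels = month_levels, ordered = TRUE)

# Values that are not valid levels became NA; report where they are
bad_positions <- which(is.na(y1))

# Look up the original strings at those positions
bad_values <- x[bad_positions]
bad_values
```

This check also flags entries that were already NA in x, so it is worth inspecting bad_values before correcting anything.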
data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)
fdata <- factor(data)
fdata
## [1] 1 2 2 3 1 2 3 3 1 2 3 3 1
## Levels: 1 2 3
# Attach labels when the factor is created
rdata <- factor(data, labels = c("I", "II", "III"))
rdata
## [1] I II II III I II III III I II III III I
## Levels: I II III
# Or rename the levels of an existing factor
levels(fdata) <- c("I", "II", "III")
fdata
## [1] I II II III I II III III I II III III I
## Levels: I II III
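Both approaches above rely on the labels being matched to the levels in sorted order (1, 2, 3). A minimal sketch that spells the mapping out with an explicit levels argument and then counts each level with table(); the name counts is illustrative.

```r
data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)

# State explicitly which value receives which label
rdata <- factor(data, levels = c(1, 2, 3), labels = c("I", "II", "III"))

# Count how often each level occurs
counts <- table(rdata)
counts
```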
R contains many functions that allow you to generate random data, either from a vector of values that you specify (like Heads or Tails from a coin) or from an established probability distribution, like the Normal or Uniform distribution.
# From the integers 1:10, draw 5 numbers, without replacement
sample(x = 1:10, size = 5)
## [1] 5 6 8 9 3
# From the integers 1:10, draw 5 numbers, with replacement
sample(x = 1:10, size = 5, replace = TRUE)
## [1] 1 1 2 7 1
sample(x = c("H", "T"), # The possible values of the coin
       size = 10,       # 10 flips
       replace = TRUE)  # Sampling with replacement
## [1] "T" "T" "H" "H" "T" "T" "H" "H" "H" "H"
sample(x = c("H", "T"),
       prob = c(.8, .2), # Make the coin biased for Heads
       size = 10,
       replace = TRUE)
## [1] "H" "H" "T" "H" "H" "T" "H" "H" "H" "H"
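Ten flips are too few to see the bias clearly. As a rough check, the sketch below flips the same biased coin 10,000 times and computes the share of Heads, which should land close to 0.8. The seed value, the sample size and the names flips and prop_heads are arbitrary choices for illustration; set.seed() is only there so the run is repeatable.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Flip the biased coin many times
flips <- sample(x = c("H", "T"),
                prob = c(.8, .2),
                size = 10000,
                replace = TRUE)

# Proportion of Heads; expected to be near the specified probability 0.8
prop_heads <- mean(flips == "H")
prop_heads
```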
?Distributions
# 5 samples from a Normal dist with mean = 0, sd = 1
rnorm(n = 5, mean = 0, sd = 1)
## [1] 0.64967554 0.68991786 1.41764235 -0.35829427 -0.06418326
# 3 samples from a Normal dist with mean = -10, sd = 15
rnorm(n = 3, mean = -10, sd = 15)
## [1] -32.119474 -42.201002 0.119997
Because the sampling is done randomly, you will get different values each time you run rnorm().
# 5 samples from Uniform dist with bounds at 0 and 1
runif(n = 5, min = 0, max = 1)
## [1] 0.43948645 0.00909191 0.55663140 0.53614440 0.54634026
# 10 samples from Uniform dist with bounds at -100 and +100
runif(n = 10, min = -100, max = 100)
## [1] -9.921079 64.775792 42.996560 42.510502 -83.201566 -59.280297
## [7] 7.649564 69.167467 18.237490 98.842050
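With larger samples, the draws should reflect the parameters passed to rnorm() and runif(). A minimal sketch under arbitrary choices of seed and sample size; the names z and u are illustrative.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Sample mean and sd should be close to the requested mean = -10, sd = 15
z <- rnorm(n = 10000, mean = -10, sd = 15)
c(mean = mean(z), sd = sd(z))

# All uniform draws should fall between the bounds -100 and 100
u <- runif(n = 10000, min = -100, max = 100)
range(u)
```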
Every time you draw a sample from a probability distribution, you will (likely) get a different result.
# Draw a sample of size 5 from a normal distribution with mean 100 and sd 10
rnorm(n = 5, mean = 100, sd = 10)
## [1] 95.81486 104.28819 103.88127 93.95388 102.88365
# Do it again!
rnorm(n = 5, mean = 100, sd = 10)
## [1] 106.10249 94.83656 89.86620 105.02997 101.63315
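To make random draws reproducible, fix the seed of the random number generator with set.seed() before sampling:
set.seed(100)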
rnorm(3, mean = 0, sd = 1)
## [1] -0.50219235 0.13153117 -0.07891709
# The result will always be -0.5022, 0.1315, -0.0789
set.seed(100)
rnorm(3, mean = 0, sd = 1)
## [1] -0.50219235 0.13153117 -0.07891709
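The same behaviour can be checked programmatically. In this sketch, resetting the seed reproduces the draw exactly, while drawing again without a reset continues the random stream and gives new values. The names first, second and third are illustrative.

```r
set.seed(100)
first <- rnorm(3, mean = 0, sd = 1)

# Resetting the seed restarts the random number stream
set.seed(100)
second <- rnorm(3, mean = 0, sd = 1)
identical(first, second)  # TRUE

# Without a reset, the next draw continues the stream
third <- rnorm(3, mean = 0, sd = 1)
identical(first, third)
```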