STATS 60 Rlab Session 8

Yuchen Wu

2020/8/12

Before Class

Agenda for today

Recall: Lists

cars <- list(make = "Honda", 
             models = c("Fit", "CR-V", "Odyssey"), 
             available = c(TRUE, TRUE, TRUE))

Extracting parts of a list

Use [[ ]] or $ notation to extract the value stored under a specific name (key)

cars$make         # no quotation marks
## [1] "Honda"
cars[["models"]]  # remember quotation marks!
## [1] "Fit"     "CR-V"    "Odyssey"

Recall: Data frames are lists!
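As a quick check of this claim, the sketch below builds a small data frame and extracts columns with the same $ and [[ ]] notation used for lists. The name inventory and its columns are made up for illustration.

```r
# A hypothetical data frame; the column names are illustrative only
inventory <- data.frame(
  model = c("Fit", "CR-V", "Odyssey"),
  seats = c(5, 5, 8),
  stringsAsFactors = FALSE
)

is.list(inventory)      # a data frame is a list of equal-length columns
inventory$model         # extract a column with $
inventory[["seats"]]    # extract a column with [[ ]]
```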

Structure of a hypothesis test

  1. Start with a null hypothesis: An assumption on how the data is generated
  2. Based on this assumption, how likely were we to collect data as extreme as what we have?
    • p-value: probability of collecting data as extreme as ours (if null hypothesis is true)
  3. Is the p-value considered low or not?
    • Threshold should depend on the context
    • Typical thresholds: 0.1, 0.05, 0.01
  4. If p-value is below threshold, 2 possible conclusions:
    • A rare event just happened, or
    • Our assumption in Step 1 was false
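
The four steps can be traced on a small made-up example: 60 heads in 100 coin flips, with the null hypothesis that the coin is fair. The data and the 0.05 threshold are assumptions chosen only for illustration.

```r
# Step 1: null hypothesis, the coin is fair (probability of heads = 0.5)
n_flips <- 100
observed_heads <- 60          # hypothetical data

# Step 2: p-value = probability, under the null, of a head count at least
# as far from the expected 50 as the observed count
k <- 0:n_flips
null_prob <- dbinom(k, size = n_flips, prob = 0.5)
as_extreme <- abs(k - n_flips / 2) >= abs(observed_heads - n_flips / 2)
p_value <- sum(null_prob[as_extreme])
p_value

# Step 3: compare with a threshold chosen in advance
threshold <- 0.05
p_value < threshold

# Step 4: if TRUE, either a rare event happened or the coin is not fair
```

For this symmetric null hypothesis, binom.test(60, 100, p = 0.5) reports the same two-sided p-value.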

T tests

A t-test is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis.

Various forms of t test

Function calls for conducting T tests

# one-sample t-test
t.test(y, mu = 3)                    # H0: mu = 3
# Student's t-test (two samples, equal variances)
t.test(y ~ x, var.equal = TRUE)      # where y is numeric and x is a binary factor
# Welch's t-test (two samples, unequal variances)
t.test(y1, y2, var.equal = FALSE)    # where y1 and y2 are numeric
# paired t-test
t.test(y1, y2, paired = TRUE)        # where y1 and y2 are numeric and of equal length
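
To see these calls run end to end, here is a minimal sketch using R's built-in sleep data set (extra hours of sleep for 10 patients under two drugs). The choice of data set and the names drug1 and drug2 are assumptions made for illustration.

```r
# Built-in data: 'extra' is numeric, 'group' is a factor with two levels
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

one_sample <- t.test(drug1, mu = 0)                      # H0: mean is 0
student    <- t.test(extra ~ group, data = sleep,
                     var.equal = TRUE)                   # equal variances
welch      <- t.test(drug1, drug2, var.equal = FALSE)    # unequal variances
paired     <- t.test(drug1, drug2, paired = TRUE)        # same 10 patients

# Compare the p-values of the three two-sample versions
c(student = student$p.value, welch = welch$p.value, paired = paired$p.value)
```

Because the same patients received both drugs, the paired version is the natural choice for these data, and it gives a smaller p-value than the unpaired versions.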

Function calls for conducting T tests

# store the test result in a list called "test_result"
test_result <- t.test(x, y)

# extract the value of the t statistic
test_result$statistic

# extract the p-value ($ or [[ ]] returns the number; [ ] would return a list)
test_result$p.value

# extract the confidence interval
test_result$conf.int
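
The object returned by t.test can be inspected like any other list. The sketch below uses simulated data (the sample, the seed and the name heights are arbitrary) to check the extracted statistic against the one-sample formula t = (mean - mu0) / (s / sqrt(n)), and to show the difference between [ ] and $.

```r
set.seed(60)                                # arbitrary seed, for reproducibility
heights <- rnorm(25, mean = 170, sd = 8)    # hypothetical sample

test_result <- t.test(heights, mu = 165)

# t statistic by hand: (sample mean - mu0) / (s / sqrt(n))
t_by_hand <- (mean(heights) - 165) / (sd(heights) / sqrt(length(heights)))
all.equal(unname(test_result$statistic), t_by_hand)

names(test_result)                  # all components that can be extracted
is.list(test_result["p.value"])     # single bracket: still a list
is.numeric(test_result$p.value)     # dollar sign: the number itself
```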

Building a confidence interval from a t test

From a sample of 150 students visiting the Health and Wellness center, 83 had obtained a flu shot. Find a 90% confidence interval for the percentage of students who have received a flu shot.

\(H_0\): the proportion of students who have received a flu shot is 0.5.

# Estimate the parameter of a binomial distribution
p.hat <- 83/150
p.hat
## [1] 0.5533333
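# 90% CI based on the normal distribution (qnorm(0.95) is about 1.645);
# the standard error uses n - 1 = 149, as the sample variance of 0/1 data does
p.hat - 1.645*sqrt(p.hat*(1-p.hat)/149)
## [1] 0.4863359
p.hat + 1.645*sqrt(p.hat*(1-p.hat)/149)
## [1] 0.6203307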

# 90% CI based on the t distribution with n - 1 = 149 degrees of freedom
p.hat - qt(0.95, 149)*sqrt(p.hat*(1-p.hat)/149)   # lower bound
## [1] 0.4859228
p.hat - qt(0.05, 149)*sqrt(p.hat*(1-p.hat)/149)   # upper bound: qt(0.05, 149) is negative
## [1] 0.6207439
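
One way to see why this is a t-test interval: code each student as 1 (flu shot) or 0 (no flu shot) and pass the 0/1 vector to t.test. The sketch below assumes this coding (the vector name flu_shot is illustrative) and checks that the reported 90% interval agrees with the manual formula.

```r
# 83 students with a flu shot (1) and 67 without (0)
flu_shot <- c(rep(1, 83), rep(0, 150 - 83))
p.hat <- mean(flu_shot)

# t.test uses the sample standard deviation, and for 0/1 data
# sd(flu_shot)^2 / 150 equals p.hat * (1 - p.hat) / 149
test_result <- t.test(flu_shot, mu = 0.5, conf.level = 0.90)
test_result$conf.int

# manual interval based on the t distribution with 149 degrees of freedom
manual_ci <- p.hat + c(-1, 1) * qt(0.95, 149) * sqrt(p.hat * (1 - p.hat) / 149)
manual_ci
```

This also accounts for the 149 in the denominator: it comes from the n - 1 in the sample variance.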

Chi-square tests

A chi-square test is a statistical hypothesis test that is valid to perform when the test statistic is chi-square distributed under the null hypothesis.

library(MASS)       # load the MASS package (provides the survey data set)
tbl <- table(survey$Smoke, survey$Exer)
tbl                 # the contingency table
##        
##         Freq None Some
##   Heavy    7    1    3
##   Never   87   18   84
##   Occas   12    3    4
##   Regul    9    1    7

Chi-square tests

test_result <- chisq.test(tbl) 
test_result
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 5.4885, df = 6, p-value = 0.4828
test_result$statistic
## X-squared 
##  5.488546
test_result$p.value
## [1] 0.4828422
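
The statistic reported above can be reproduced by hand. This sketch retypes the counts from the contingency table (so it does not need the MASS package), computes the expected counts under independence, and sums (observed - expected)^2 / expected. The object names are illustrative.

```r
# Counts copied from the Smoke x Exer table, one column at a time
observed <- matrix(c(7, 87, 12, 9,     # Freq
                     1, 18,  3, 1,     # None
                     3, 84,  4, 7),    # Some
                   nrow = 4,
                   dimnames = list(Smoke = c("Heavy", "Never", "Occas", "Regul"),
                                   Exer  = c("Freq", "None", "Some")))

# Expected count in each cell if Smoke and Exer were independent
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

chi_sq  <- sum((observed - expected)^2 / expected)
df      <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
c(statistic = chi_sq, df = df, p.value = p_value)

round(expected, 1)   # inspect the expected counts
```

Several expected counts are below 5, which is why chisq.test(tbl) also issues a warning that the chi-squared approximation may be incorrect.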

Other ways of using chi-square tests

data_frame <- read.csv("https://goo.gl/j6lRXD")
table(data_frame$treatment, data_frame$improvement)
##              
##               improved not-improved
##   not-treated       26           29
##   treated           35           15
chisq.test(data_frame$treatment, data_frame$improvement)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data_frame$treatment and data_frame$improvement
## X-squared = 4.6626, df = 1, p-value = 0.03083
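
For a 2 x 2 table, chisq.test applies Yates' continuity correction by default, as the header of the output shows. The sketch below retypes the counts from the table above (so it does not depend on the download) and compares the corrected and uncorrected statistics; correct = FALSE switches the correction off. The object names are illustrative.

```r
treatment_table <- matrix(c(26, 35,    # improved
                            29, 15),   # not-improved
                          nrow = 2,
                          dimnames = list(c("not-treated", "treated"),
                                          c("improved", "not-improved")))

with_yates    <- chisq.test(treatment_table)                   # default
without_yates <- chisq.test(treatment_table, correct = FALSE)

c(with_yates    = unname(with_yates$statistic),
  without_yates = unname(without_yates$statistic))
```

The correction shrinks each |observed - expected| by 0.5 before squaring, so the corrected statistic is the smaller of the two.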

Other ways of using chi-square tests

# the counts for categories A,B and C
x <- c(A = 80, B = 11, C = 9)  

# testing if each category is equally likely
chisq.test(x)
## 
##  Chi-squared test for given probabilities
## 
## data:  x
## X-squared = 98.06, df = 2, p-value < 2.2e-16
# testing if each category occurs with specified probability
chisq.test(x, p = c(0.8, 0.1, 0.1))
## 
##  Chi-squared test for given probabilities
## 
## data:  x
## X-squared = 0.2, df = 2, p-value = 0.9048
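
Both statistics above follow from the same formula, the sum of (observed - expected)^2 / expected over the categories. A minimal sketch of the arithmetic, with illustrative variable names:

```r
x <- c(A = 80, B = 11, C = 9)
n <- sum(x)

# equal probabilities: expected count n / 3 in each category
expected_equal <- n * rep(1 / 3, 3)
stat_equal <- sum((x - expected_equal)^2 / expected_equal)
stat_equal

# specified probabilities 0.8, 0.1, 0.1
expected_given <- n * c(0.8, 0.1, 0.1)
stat_given <- sum((x - expected_given)^2 / expected_given)
stat_given

# p-value from the chi-square distribution with (categories - 1) df
p_given <- pchisq(stat_given, df = length(x) - 1, lower.tail = FALSE)
p_given
```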

Correlation test

\(H_0\): the true correlation coefficient is equal to 0

  1. Permutation test: randomly permute one variable to break the matching pairs in the original data set, compute the new correlation coefficient, and repeat many times to simulate the distribution of the test statistic under the null hypothesis.

  2. T test: for pairs drawn from an uncorrelated bivariate normal distribution, a certain function of Pearson’s correlation coefficient follows Student’s t-distribution with n - 2 degrees of freedom.
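
The permutation approach in item 1 can be sketched in a few lines. The data below are simulated (the seed, sample size and number of shuffles are arbitrary assumptions), and the result is compared with the t-test version from cor.test.

```r
set.seed(8)                               # arbitrary seed
x <- rnorm(30)                            # hypothetical data
y <- 0.8 * x + rnorm(30, sd = 0.5)        # built to be correlated with x

r_observed <- cor(x, y)

# Shuffling y breaks the pairing, which mimics the null hypothesis
r_permuted <- replicate(5000, cor(x, sample(y)))

# two-sided p-value: share of shuffles at least as extreme as the data
p_permutation <- mean(abs(r_permuted) >= abs(r_observed))
p_permutation

# t-test version for comparison
cor.test(x, y)$p.value
```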

Correlation test

# method is one of "pearson" (the default), "kendall" or "spearman"
test_result <- cor.test(x, y, method = "pearson")
test_result$statistic
test_result$p.value
test_result$conf.int   # only reported for method = "pearson"

One sided correlation test

gpa <- c(3.45, 3.03, 2.67, 2.50, 3.16, 2.83)
distance_from_campus <- c(1.3, 0.8, 5.7, 0.5, 2.9, 3.1)

\[H_0: \rho \leq 0 \qquad \text{versus} \qquad H_1: \rho > 0\]

\[ t = r \sqrt{\dfrac{n - 2}{1 - r^2}}, \qquad r = \dfrac{t}{\sqrt{n - 2 + t^2}} \]

plot(gpa, distance_from_campus)
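
The two formulas above are inverses of each other. As a sanity check, this sketch converts r to t and back on the GPA data, and compares the result with the statistic reported by cor.test (Pearson's method); the names r_back and pearson are illustrative.

```r
gpa <- c(3.45, 3.03, 2.67, 2.50, 3.16, 2.83)
distance_from_campus <- c(1.3, 0.8, 5.7, 0.5, 2.9, 3.1)
n <- length(gpa)

r <- cor(gpa, distance_from_campus)
t_stat <- r * sqrt((n - 2) / (1 - r^2))       # r -> t
r_back <- t_stat / sqrt(n - 2 + t_stat^2)     # t -> r

# cor.test (Pearson) reports the same t statistic, with n - 2 df
pearson <- cor.test(gpa, distance_from_campus)
c(by_hand = t_stat, cor_test = unname(pearson$statistic))
all.equal(r, r_back)
```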

One sided correlation test

n <- length(gpa)
r_hat <- cor(gpa, distance_from_campus)
# convert r to a t statistic with n - 2 degrees of freedom
t_stat <- r_hat * sqrt((n - 2) / (1 - r_hat^2))
# shift by the 5% quantile of the t distribution, then convert back to r
t_int_lower <- t_stat + qt(0.05, df = n - 2)
r_int_lower <- t_int_lower / sqrt(n - 2 + t_int_lower^2)
print(r_int_lower)
## [1] -0.7897173
# "greater" corresponds to positive association, "less" to negative association
cortest <- cor.test(gpa, distance_from_campus, alternative = "greater")
cortest$conf.int
## [1] -0.8240341  1.0000000
## attr(,"conf.level")
## [1] 0.95
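
The manual bound (about -0.79) and the bound from cor.test (about -0.82) differ because they are built differently: the manual version works on the t scale, while cor.test builds its interval from Fisher's z transform. The sketch below reproduces the cor.test bound under that assumption and also shows that the two one-sided p-values are complementary.

```r
gpa <- c(3.45, 3.03, 2.67, 2.50, 3.16, 2.83)
distance_from_campus <- c(1.3, 0.8, 5.7, 0.5, 2.9, 3.1)
n <- length(gpa)
r_hat <- cor(gpa, distance_from_campus)

greater <- cor.test(gpa, distance_from_campus, alternative = "greater")
less    <- cor.test(gpa, distance_from_campus, alternative = "less")

# The two one-sided p-values add up to 1
greater$p.value + less$p.value

# Fisher's z transform: atanh(r) is approximately normal
# with standard error 1 / sqrt(n - 3)
z_lower <- atanh(r_hat) - qnorm(0.95) / sqrt(n - 3)
tanh(z_lower)            # lower bound on the correlation scale
greater$conf.int[1]      # the bound reported by cor.test
```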

User-defined functions

One of the great strengths of R is the user’s ability to add functions. In fact, many of the functions in R are actually functions of functions. The structure of a function is given below.

myfunction <- function(arg1, arg2, ...) {
  statements
  return(object)
}

User-defined functions

Let’s start by defining a function fahrenheit_to_celsius that converts temperatures from Fahrenheit to Celsius:

fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}
# freezing point of water
fahrenheit_to_celsius(32)
## [1] 0
# boiling point of water
fahrenheit_to_celsius(212)
## [1] 100
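
Functions can call other functions. As a small sketch, the hypothetical helpers celsius_to_kelvin and fahrenheit_to_kelvin below (not part of the original slides) reuse fahrenheit_to_celsius; the last line shows that the function also works on a whole vector at once.

```r
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}

# hypothetical helper: Celsius to Kelvin
celsius_to_kelvin <- function(temp_C) {
  temp_K <- temp_C + 273.15
  return(temp_K)
}

# compose the two conversions
fahrenheit_to_kelvin <- function(temp_F) {
  temp_C <- fahrenheit_to_celsius(temp_F)
  return(celsius_to_kelvin(temp_C))
}

# freezing point of water in Kelvin
fahrenheit_to_kelvin(32)

# arithmetic in R is vectorized, so a vector of temperatures works too
fahrenheit_to_celsius(c(32, 98.6, 212))
```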

User-defined functions

Functions do not necessarily return a useful value! In the example below, the else branch only prints a message.

rescale <- function(x)
{
  lower <- min(x, na.rm = TRUE)
  upper <- max(x, na.rm = TRUE)
  if(upper > lower)
  {
    # the whole expression must sit inside return()
    return((x - lower) / (upper - lower))
  }
  else
  {
    print("x is a constant vector")
  }
}
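
A short usage sketch, assuming the intended behavior of rescale is to map x onto the interval [0, 1]. It also illustrates a common pitfall with return(): anything written after the closing parenthesis is never evaluated. The function broken is a hypothetical example of that pitfall.

```r
rescale <- function(x) {
  lower <- min(x, na.rm = TRUE)
  upper <- max(x, na.rm = TRUE)
  if (upper > lower) {
    return((x - lower) / (upper - lower))
  } else {
    print("x is a constant vector")
  }
}

rescale(c(2, 4, 6))               # values mapped onto [0, 1]
result <- rescale(c(5, 5, 5))     # prints the message
result                            # print() hands back its argument

# Pitfall: `return (a) / b` is parsed as return(a) followed by a
# division that never runs
broken <- function(a, b) {
  return (a) / b
}
broken(6, 3)                      # returns a unchanged, not a / b
```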