Yuchen Wu
2020/8/12
HW4 extended to Saturday 6PM PT
Fill in the form about the final project if you haven’t done so. Only one team member needs to fill the form on behalf of the group.
A hint was added to HW4 Problem 2(b); the updated homework is posted on the website.
Function calls for hypothesis testing
User-defined functions
cars <- list(make = "Honda",
             models = c("Fit", "CR-V", "Odyssey"),
             available = c(TRUE, TRUE, TRUE))
Use [[ or $ notation to refer to a specific key-value pair.
cars$make # no quotation marks
## [1] "Honda"
cars[["models"]] # remember quotation marks!
## [1] "Fit" "CR-V" "Odyssey"
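One detail worth noting (a small supplementary example, reusing the cars list from above): single brackets return a sub-list, while double brackets or $ return the element itself.

```r
cars <- list(make = "Honda",
             models = c("Fit", "CR-V", "Odyssey"),
             available = c(TRUE, TRUE, TRUE))

class(cars["models"])    # "list" -- a sub-list of length one
class(cars[["models"]])  # "character" -- the vector stored under that key
cars[["models"]][2]      # "CR-V" -- index into the extracted vector
```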
The t-test is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis.
Various forms of t-test
A one-sample location test: tests whether the mean of a population equals a value specified in the null hypothesis.
Student’s t-test: tests whether the means of two populations are equal; the variances of the two populations are assumed to be equal.
Welch’s t-test: tests whether the means of two populations are equal; the variances of the two populations are not assumed to be equal.
Paired two-sample t-test: tests whether the means of two populations are equal when the statistical units are paired.
# one-sample t-test
t.test(y, mu = 3) # H0: mu = 3
# Student's t-test
t.test(y ~ x, var.equal = TRUE) # where y is numeric and x is a binary factor
# Welch's t-test
t.test(y1, y2, var.equal = FALSE) # where y1 and y2 are numeric
# paired t-test
t.test(y1, y2, paired = TRUE) # where y1 and y2 are numeric
# record the test result in a list called "test_result"
test_result <- t.test(x, y)
# extract the value of the t statistics
test_result$statistic
# extract the p-value
test_result$p.value
# extract the confidence interval
test_result$conf.int
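The calls above assume y, y1, and y2 already exist. A self-contained run with simulated data (the data here are made up for illustration, not from the slides) shows what the extracted components look like:

```r
set.seed(1)                       # simulated data, for illustration only
y1 <- rnorm(30, mean = 5.0, sd = 1)
y2 <- rnorm(30, mean = 5.5, sd = 2)

test_result <- t.test(y1, y2)     # Welch's t-test (var.equal = FALSE is the default)
test_result$statistic             # named numeric: the t statistic
test_result$p.value               # the p-value as a plain number
test_result$conf.int              # 95% CI for the difference in means
names(test_result)                # all components stored in the result list
```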
From a sample of 150 students visiting the Health and Wellness center, 83 had obtained a flu shot. Find a 90% confidence interval for the percentage of students who have received a flu shot.
\(H_0\): the percentage of students who have received a flu shot is 0.5.
# Estimate the parameter of a binomial distribution
p.hat <- 83/150
p.hat
## [1] 0.5533333
# CI based on normal distribution
p.hat - 1.645*sqrt(p.hat*(1-p.hat)/149)
## [1] 0.4863359
p.hat + 1.645*sqrt(p.hat*(1-p.hat)/149)
## [1] 0.6203307
# CI based on t distribution
p.hat - qt(0.95, 149)*sqrt(p.hat*(1-p.hat)/149)
## [1] 0.4859228
p.hat - qt(0.05, 149)*sqrt(p.hat*(1-p.hat)/149)
## [1] 0.6207439
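As a cross-check (not in the original slides), base R provides ready-made interval procedures for a binomial proportion; both should land close to the hand-computed bounds above:

```r
# Exact (Clopper-Pearson) 90% interval for 83 successes out of 150 trials
binom.test(83, 150, conf.level = 0.90)$conf.int

# Normal-approximation interval (with Yates' continuity correction by default)
prop.test(83, 150, conf.level = 0.90)$conf.int
```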
A chi-square test is a statistical hypothesis test that is valid to perform when the test statistic is chi-square distributed under the null hypothesis.
library(MASS) # load the MASS package
tbl = table(survey$Smoke, survey$Exer)
tbl # the contingency table
##
## Freq None Some
## Heavy 7 1 3
## Never 87 18 84
## Occas 12 3 4
## Regul 9 1 7
test_result <- chisq.test(tbl)
test_result
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 5.4885, df = 6, p-value = 0.4828
test_result$statistic
## X-squared
## 5.488546
test_result$p.value
## [1] 0.4828422
data_frame <- read.csv("https://goo.gl/j6lRXD")
table(data_frame$treatment, data_frame$improvement)
##
## improved not-improved
## not-treated 26 29
## treated 35 15
chisq.test(data_frame$treatment, data_frame$improvement)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data_frame$treatment and data_frame$improvement
## X-squared = 4.6626, df = 1, p-value = 0.03083
# the counts for categories A,B and C
x <- c(A = 80, B = 11, C = 9)
# testing if each category is equally likely
chisq.test(x)
##
## Chi-squared test for given probabilities
##
## data: x
## X-squared = 98.06, df = 2, p-value < 2.2e-16
# testing if each category occurs with specified probability
chisq.test(x, p = c(0.8, 0.1, 0.1))
##
## Chi-squared test for given probabilities
##
## data: x
## X-squared = 0.2, df = 2, p-value = 0.9048
\(H_0\): the true correlation coefficient is equal to 0
Permutation test: randomly permute the matching pairs in the original data set, compute the new correlation coefficient, repeat the experiments many times to simulate the distribution of the test statistics under the null distribution.
t-test: for pairs from an uncorrelated bivariate normal distribution, the sampling distribution of a certain function of Pearson’s correlation coefficient follows Student’s t-distribution.
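The permutation test described above can be sketched in a few lines. This is a minimal sketch with toy data (the variables x and y here are simulated, not from the slides): permuting y breaks the pairing, so the permuted correlations approximate the null distribution.

```r
set.seed(42)                      # toy data, for illustration only
x <- rnorm(20)
y <- 0.5 * x + rnorm(20)

r_obs <- cor(x, y)                # observed correlation coefficient

# Null distribution: randomly permute the matching pairs and recompute
r_null <- replicate(2000, cor(x, sample(y)))

# Two-sided permutation p-value: fraction of permuted |r| at least as extreme
p_perm <- mean(abs(r_null) >= abs(r_obs))
p_perm
```

A finite number of permutations only approximates the p-value; increasing the number of replicates tightens the approximation.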
test_result <- cor.test(x, y, method = c("pearson", "kendall", "spearman"))
test_result$statistic
test_result$p.value
test_result$conf.int
gpa <- c(3.45, 3.03, 2.67, 2.50, 3.16, 2.83)
distance_from_campus <- c(1.3, 0.8, 5.7, 0.5, 2.9, 3.1)
\[H_0: \rho \leq 0\]
\[ t = r \sqrt{\dfrac{n - 2}{1 - r^2}}, \qquad r = \dfrac{t}{\sqrt{n - 2 + t^2}} \]
plot(gpa, distance_from_campus)
r_hat <- cor(gpa, distance_from_campus)
t_stat <- r_hat * sqrt((6 - 2) / (1 - r_hat^2))
t_int_lower <- t_stat + qt(0.05, df = 6 - 2)
r_int_lower <- t_int_lower / sqrt(6 - 2 + t_int_lower^2)
print(r_int_lower)
## [1] -0.7897173
# "greater" corresponds to positive association, "less" to negative association
cortest <- cor.test(gpa, distance_from_campus, "greater")
cortest$conf.int
## [1] -0.8240341 1.0000000
## attr(,"conf.level")
## [1] 0.95
One of the great strengths of R is the user’s ability to add functions. In fact, many of the functions in R are actually functions of functions. The structure of a function is given below.
myfunction <- function(arg1, arg2, ...) {
  statements
  return(object)
}
Let’s start by defining a function fahrenheit_to_celsius that converts temperatures from Fahrenheit to Celsius:
fahrenheit_to_celsius <- function(temp_F) {
  temp_C <- (temp_F - 32) * 5 / 9
  return(temp_C)
}
# freezing point of water
fahrenheit_to_celsius(32)
## [1] 0
# boiling point of water
fahrenheit_to_celsius(212)
## [1] 100
Functions do not necessarily return a value!
rescale <- function(x)
{
  lower <- min(x, na.rm = TRUE)
  upper <- max(x, na.rm = TRUE)
  if(upper > lower)
  {
    # the parentheses around the whole expression matter:
    # return (x - lower) / (upper - lower) would exit the function
    # with x - lower before the division is ever evaluated
    return((x - lower) / (upper - lower))
  }
  else
  {
    print("x is a constant vector")
  }
}
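A quick check of both branches. The function is repeated here so the chunk runs on its own; note that the division must sit inside return(), since return(x - lower) alone would exit the function before dividing.

```r
rescale <- function(x)
{
  lower <- min(x, na.rm = TRUE)
  upper <- max(x, na.rm = TRUE)
  if(upper > lower)
  {
    return((x - lower) / (upper - lower))  # scale linearly onto [0, 1]
  }
  else
  {
    print("x is a constant vector")        # no value to return here
  }
}

rescale(c(2, 4, 6, 10))  # 0.00 0.25 0.50 1.00
rescale(c(7, 7, 7))      # prints "x is a constant vector"
```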