Introduction

This is an analysis of SAT scores recorded across the nation from 2005 to 2015. Other pieces of data collected include the specific letter grade by subject, the average GPA by subject, family income level, gender, and the number of years a student has studied a specific subject. The dataset is stratified by year and state, where District of Columbia, Puerto Rico, and the Virgin Islands are included as separate regions. Note that this dataset only includes the math and verbal (reading) sections of the SAT, and excludes the writing portion.

For this project, I will focus on analyzing three specific questions based on this dataset:
1. On average, do males generally perform better than females on the math section of the SAT?
2. In 2015, how did the states compare to each other in terms of average total SAT score?
3. Is there a correlation between an arts/music education and a high SAT score?

The dataset used in this analysis can be found here: https://think.cs.vt.edu/corgis/csv/school_scores/school_scores.html

Setup

Library imports:

library(dplyr)
library(ggplot2)
library(knitr)
library(readr)
library(maps)

Data import:

df <- read_csv("school_scores.csv")

Summary Statistics

Check to ensure that we have 53 regions (50 states, Washington D.C., Puerto Rico, and the Virgin Islands):
‘Name’ is the full name of the state for this report.

df %>% summarize(distinct = n_distinct(Name))
## # A tibble: 1 x 1
##   distinct
##      <int>
## 1       53

Check to ensure that we have 11 years represented (from 2005 to 2015):

df %>% summarize(distinct = n_distinct(Year))
## # A tibble: 1 x 1
##   distinct
##      <int>
## 1       11

The total number of observations in this dataset:

nrow(df)
## [1] 577
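
With 53 regions and 11 years, a complete panel would have 53 × 11 = 583 rows, so a count of 577 suggests that six region-year combinations are absent. The following is a minimal sketch of how such gaps could be located. It uses only base R and a small made-up data frame (`toy`, with hypothetical regions "A" to "C") in place of `df`:

```r
# Hypothetical toy data standing in for df: 3 regions x 3 years, one row missing
toy <- data.frame(
  Name = c("A", "A", "A", "B", "B", "C", "C", "C"),
  Year = c(2005, 2006, 2007, 2005, 2007, 2005, 2006, 2007),
  stringsAsFactors = FALSE
)

# Every region-year pair a complete panel would contain
expected <- expand.grid(Name = unique(toy$Name),
                        Year = unique(toy$Year),
                        stringsAsFactors = FALSE)

# Keep the expected pairs that never appear in the data
observed_keys <- paste(toy$Name, toy$Year)
expected_keys <- paste(expected$Name, expected$Year)
missing_pairs <- expected[!(expected_keys %in% observed_keys), ]
missing_pairs
```

Running the same steps on `df` (with its real `Name` and `Year` columns) should list the region-year pairs that account for the shortfall.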

The correlation between the average GPA in math and the average math score:
Note that this is just the GPA within the subject, not across all academic subjects.

df %>% summarize(correlation = cor(`Mathematics.Average GPA`, Math))
## # A tibble: 1 x 1
##   correlation
##         <dbl>
## 1    0.811048

The large, positive correlation shows that higher GPAs in math are associated with higher scores on the math section of the SAT, as is expected.

The average SAT score (out of 1600) across the nation during the 11-year period:
‘Math’ is the average math score of students in this state during this year.
‘Verbal’ is the average verbal (reading, not writing) score of students in this state during this year.

average_total_score <- mean(df$Math) + mean(df$Verbal)
average_total_score
## [1] 1067.017
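
Because each row of `df` is a state-level average, the figure above is an unweighted mean of state-year averages rather than an average over all students. If the number of test-takers behind each row were available, a weighted mean would come closer to a true national average. Here is a small sketch with invented numbers (`takers` is a hypothetical vector, not a column used in this report):

```r
# Invented example: two states with very different numbers of test-takers
scores <- c(1200, 1000)   # state-level average total scores (made up)
takers <- c(1000, 9000)   # hypothetical number of test-takers per state

unweighted <- mean(scores)                   # every state counts equally
weighted   <- weighted.mean(scores, takers)  # every student counts equally

unweighted
weighted
```

In this toy case the weighted mean is pulled toward the larger state, which is the effect to keep in mind when reading the unweighted figure.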

The standard deviation of SAT scores across the nation during the 11-year period:

df1 <- df %>% mutate (Average_Total_Score = Math + Verbal)
standard_deviation <- sd(df1$Average_Total_Score)
standard_deviation
## [1] 90.00557

We can also represent this data in a histogram:

th <- theme(plot.title = element_text(face = "bold", hjust = 0.5), 
             axis.title = element_text(size = rel(1)),
             legend.position = "bottom")
ggplot(data = df1) +
  geom_histogram(mapping = aes(x = Average_Total_Score), bins = 20) + 
  labs(title = "Histogram of Average SAT Scores", x = "Average Total Score", y = "Frequency") + th

Histogram of Average SAT scores stratified by year:

ggplot(data = df1) +
  geom_histogram(mapping = aes(x = Average_Total_Score), bins = 20) + 
  labs(title = "Histogram of Average SAT Scores By Year", x = "Average Total Score", y = "Frequency") + 
  facet_wrap(~Year) + th

Data Analysis

As the summary above shows, the dataset contains a large amount of information. I will focus only on the variables needed to answer the three core questions listed in the “Introduction” section.

Gender difference in the math section of the SAT

A common stereotype in society is that boys are better at math than girls. While this is not necessarily true, the stereotype is often tied to the observation that boys have, on average, continued to score higher than girls on the math section of the SAT. One article from AEIdeas reported that a gap of roughly 30 points between boys’ and girls’ math scores has persisted for over 40 years*. As a first look at our dataset, we display the first 10 rows of the boys’ and girls’ mean math scores. In each of these 10 observations, the boys’ average is higher than the girls’ average.

‘Male.Math’ is the average math score of students in this state during this year who identified as male.
‘Female.Math’ is the average math score of students in this state during this year who identified as female.

*The AEIdeas article can be found here: http://www.aei.org/publication/2015-sat-test-results-confirm-pattern-thats-persisted-for-40-years-high-school-boys-are-better-at-math-than-girls/

df_gendered_math_scores <- df %>% select(Male.Math, Female.Math, Name, Year)
kable(head(df_gendered_math_scores, n=10))
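| Male.Math | Female.Math | Name                 | Year |
|----------:|------------:|:---------------------|-----:|
|       582 |         538 | Alabama              | 2005 |
|       535 |         505 | Alaska               | 2005 |
|       549 |         513 | Arizona              | 2005 |
|       570 |         536 | Arkansas             | 2005 |
|       543 |         504 | California           | 2005 |
|       577 |         546 | Colorado             | 2005 |
|       534 |         502 | Connecticut          | 2005 |
|       521 |         486 | Delaware             | 2005 |
|       509 |         451 | District Of Columbia | 2005 |
|       516 |         484 | Florida              | 2005 |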

We can also compare the average male math score with the average female math score on the SAT (across the nation and over a span of 11 years):

mean(df$Male.Math)
## [1] 553.9116
mean(df$Female.Math)
## [1] 518.4159
difference <- mean(df$Male.Math) - mean(df$Female.Math)
difference
## [1] 35.49567

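From this, we can see that over the period 2005-2015, boys had a higher mean math score than girls: about 35.5 points higher on average. Note that this is an unweighted average of state-level means, so each state-year observation counts equally regardless of how many students took the test.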
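
Whether a gap of this size is distinguishable from noise can be checked with a paired test, since each male average has a matching female average from the same state and year. The sketch below applies base R's `t.test` to the ten 2005 rows printed in the table above. It is illustrative only: ten state averages from one year are a small sample, and state averages are not independent draws of individual students.

```r
# Male and female state averages for the ten 2005 rows shown in the table above
male   <- c(582, 535, 549, 570, 543, 577, 534, 521, 509, 516)
female <- c(538, 505, 513, 536, 504, 546, 502, 486, 451, 484)

# Paired test: each male average is matched with the female average
# from the same state and year
result <- t.test(male, female, paired = TRUE)

result$estimate   # mean of the pairwise differences
result$p.value    # p-value for the null hypothesis of no difference
```

The same call could be run on the full `Male.Math` and `Female.Math` columns of `df`, with the same caveat about what the observations represent.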

The mean difference calculated above covers the whole period from 2005 to 2015. It is more informative to see how the gender difference in math scores has changed from year to year. To do this, we can create a new data frame that includes the mean male score, the mean female score, the difference between the two averages, and the year.

# Compute the male and female means for each year in one grouped summary,
# instead of building a separate data frame for every year.
df_mean_by_year <- df_gendered_math_scores %>%
  group_by(Year) %>%
  summarize(male_mean = mean(Male.Math),
            female_mean = mean(Female.Math)) %>%
  mutate(diff = male_mean - female_mean) %>%
  select(male_mean, female_mean, diff, year = Year)
kable(df_mean_by_year)
| male_mean | female_mean |     diff | year |
|----------:|------------:|---------:|-----:|
|  555.1538 |    518.9615 | 36.19231 | 2005 |
|  555.5769 |    520.2885 | 35.28846 | 2006 |
|  552.7925 |    518.4151 | 34.37736 | 2007 |
|  554.0943 |    518.3962 | 35.69811 | 2008 |
|  559.8235 |    522.1373 | 37.68627 | 2009 |
|  560.6667 |    523.9412 | 36.72549 | 2010 |
|  551.1887 |    515.7925 | 35.39623 | 2011 |
|  552.1509 |    515.6415 | 36.50943 | 2012 |
|  550.0000 |    515.8113 | 34.18868 | 2013 |
|  551.7170 |    517.3396 | 34.37736 | 2014 |
|  550.3962 |    516.2453 | 34.15094 | 2015 |

Between 2005 and 2015, the difference between the mean scores for boys and girls changed very little. While it decreased slightly (from about 36 points in 2005 to about 34 points in 2015), the gap stayed above 30 points in every year. No formal significance test is reported in this section, so these figures describe the size of the gap rather than its statistical significance. Many researchers and scholars have pointed towards differences in problem-solving strategies, spatial skills, and attitudes and values as reasons for the large point difference. More information about this can be found here: http://www.nctm.org/Publications/Teaching-Children-Mathematics/Blog/Current-Research-on-Gender-Differences-in-Math/

Boxplot of Male and Female Mean Math Scores By Year:

th <- theme(plot.title = element_text(face = "bold", hjust = 0.5), 
             axis.title = element_text(size = rel(1)),
             legend.position = "bottom")
df$Year <- factor (df$Year)
ggplot(data=df) +
  geom_boxplot(mapping = aes(x = Year, y = Male.Math), fill = NA, col = "blue") +
  labs(title = "Gendered Math Score Averages By Year", x = "Year", y = "Mean Math Score") + 
  geom_boxplot(mapping = aes(x = Year, y = Female.Math), fill = NA, col = "red") + th

The boxplot provides a visual representation of the conclusions found above. Blue represents male scores, while red represents female scores, and each box summarizes the distribution of state-level averages for one year. Consistent with the table, math scores for boys are higher than math scores for girls in every year between 2005 and 2015, though note that the line inside each box marks the median of the state-level averages rather than the mean. Furthermore, the distance between the male and female boxes does not change substantially over the period. Even the outliers of the female math scores are lower than the outliers of the male math scores.

SAT Scores by State

How do the states compare with one another in terms of SAT performance? To answer this, we will find the average total SAT score for each state in the most recent year in the dataset (2015).

state_scores <- df1 %>% 
  filter(Year == "2015") %>%
  select(Average_Total_Score, Name)
state_scores <- state_scores[-c(9, 40, 48), ] #removes DC, Puerto Rico, and Virgin Islands
state_scores$Name = tolower(state_scores$Name)
colnames(state_scores)[colnames(state_scores) == 'Name'] <- 'region'
kable(state_scores %>% arrange(desc(Average_Total_Score), region))
| Average_Total_Score | region         |
|--------------------:|:---------------|
|                1215 | illinois       |
|                1207 | north dakota   |
|                1204 | michigan       |
|                1203 | minnesota      |
|                1197 | wisconsin      |
|                1195 | missouri       |
|                1192 | iowa           |
|                1192 | south dakota   |
|                1182 | kansas         |
|                1181 | nebraska       |
|                1176 | kentucky       |
|                1175 | wyoming        |
|                1170 | colorado       |
|                1155 | tennessee      |
|                1154 | utah           |
|                1149 | mississippi    |
|                1146 | oklahoma       |
|                1141 | arkansas       |
|                1125 | louisiana      |
|                1121 | ohio           |
|                1119 | montana        |
|                1096 | new mexico     |
|                1088 | alabama        |
|                1056 | new hampshire  |
|                1052 | arizona        |
|                1047 | oregon         |
|                1046 | massachusetts  |
|                1046 | vermont        |
|                1035 | virginia       |
|                1021 | new jersey     |
|                1014 | alaska         |
|                1013 | washington     |
|                1010 | connecticut    |
|                1009 | west virginia  |
|                1004 | pennsylvania   |
|                1003 | california     |
|                1003 | north carolina |
|                 997 | indiana        |
|                 996 | hawaii         |
|                 991 | new york       |
|                 990 | nevada         |
|                 989 | rhode island   |
|                 983 | maryland       |
|                 976 | georgia        |
|                 976 | south carolina |
|                 967 | florida        |
|                 957 | texas          |
|                 940 | maine          |
|                 930 | idaho          |
|                 924 | delaware       |
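
The code above drops the three non-state regions by row position (rows 9, 40, and 48), which only works while the 2015 rows stay in the same alphabetical order with nothing missing. A sturdier alternative is to filter by name. The sketch below uses base R on a made-up data frame. "District Of Columbia" is spelled as it appears in the dataset, while the spellings "Puerto Rico" and "Virgin Islands" and the scores for the three non-state regions are assumptions for illustration:

```r
# Toy stand-in for the 2015 rows; scores for the non-state regions are invented
toy_scores <- data.frame(
  Name = c("Delaware", "District Of Columbia", "Florida",
           "Puerto Rico", "Virgin Islands"),
  Average_Total_Score = c(924, 900, 967, 900, 900),
  stringsAsFactors = FALSE
)

# Regions to exclude, identified by name rather than by row number
# (the last two spellings are assumed, not checked against the dataset)
non_states <- c("District Of Columbia", "Puerto Rico", "Virgin Islands")

states_only <- toy_scores[!(toy_scores$Name %in% non_states), ]
states_only
```

With this approach the result does not change if rows are reordered or if a region is missing from a given year.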

We then combine this data with the map_state data:

map_state <- map_data("state")
combined_data <- map_state %>% left_join(state_scores, by = "region")
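
Before plotting, it can help to check which regions fail to match in the join. To my knowledge, the "state" map in the maps package covers the contiguous states plus the District of Columbia, so Alaska and Hawaii would have scores but no polygons, and the District of Columbia would have a polygon but no score after it was removed above. The sketch below shows the check on two short made-up vectors in place of the real `map_state$region` and `state_scores$region` columns:

```r
# Hypothetical subsets of the region names on each side of the join
map_regions   <- c("alabama", "arizona", "district of columbia")
score_regions <- c("alabama", "alaska", "arizona", "hawaii")

# Regions with a score but no polygon: they will not be drawn at all
no_polygon <- setdiff(score_regions, map_regions)

# Regions with a polygon but no score: their fill value will be NA
no_score <- setdiff(map_regions, score_regions)

no_polygon
no_score
```

On the real data, the equivalent calls would be `setdiff(state_scores$region, unique(map_state$region))` and the reverse.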

This data can be visually displayed on a map of the U.S.

map_theme <- theme(
  axis.title.x = element_blank(), 
  axis.title.y = element_blank(),
  axis.text.x = element_blank(),
  axis.text.y = element_blank(),
  axis.ticks.x = element_blank(),
  axis.ticks.y = element_blank(),
  panel.background = element_rect(fill = "white")
)

ggplot() +
  geom_polygon(data = combined_data, 
               mapping = aes(x = long, y = lat, group = region, fill = Average_Total_Score)) +
  geom_polygon(data = map_state,
               mapping = aes(x = long, y = lat, group = group), fill = NA, col = "black") +
  scale_fill_gradient(low = "red", high = "blue") +
  coord_quickmap() + map_theme +
  labs(title = "Average SAT Scores in 2015") + th

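This matches the table above: Illinois had the highest average SAT score and is therefore the bluest state on the map, while states such as Texas, Idaho, and Florida have some of the lowest averages and appear the reddest. It is interesting to see that the Midwest/Great Lakes region tends to have a higher SAT score average than states on the coasts or in the South. One caveat is that these averages only reflect students who took the SAT, and the share of students who take it differs from state to state, so the comparison says more about each state's test-takers than about its students overall.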

The Relationship Between an Arts Education and SAT Score

Is an education in the arts or music associated with a high SAT score? Many past studies have shown that kids who played a musical instrument tended to score higher on tests. Furthermore, Americans for the Arts reported that data from the College Board showed that students who took four years of arts scored on average 100 points better on the SAT than students who had half a year or less of arts education.

This report can be found here: https://www.americansforthearts.org/sites/default/files/pdf/get_involved/advocacy/research/2013/artsed_sat13.pdf

We can test to see if there is an association between an arts and music education and test scores by calculating the correlation between the two variables.

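# Year in df was already converted to a factor for the boxplot above, so this line has no further effect.
# df1 (used below) was created before that conversion and is unaffected.
df$Year <- factor(df$Year)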
correlation2 <- df1 %>% select(`Arts/Music.Average Years`, Average_Total_Score, Year)
correlation2 %>% summarize(correlation = cor(`Arts/Music.Average Years`, Average_Total_Score))
## # A tibble: 1 x 1
##   correlation
##         <dbl>
## 1   0.7465564

From this, we see that the correlation between years of arts education and SAT score is high and positive. Because each observation is a state-level average for a given year, this means that states and years with more average years of arts or music education tend to have higher average SAT scores, which is consistent with the findings of Americans for the Arts.
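
One way to put the size of this correlation in context is to square it: in a simple linear regression with one predictor, the squared correlation equals the share of variance in the outcome explained by the fitted line. The sketch below checks that identity on made-up values (`arts_years` and `total_score` are hypothetical, not taken from the dataset) and then squares the correlation reported above:

```r
# Hypothetical toy values (not from the dataset)
arts_years  <- c(1.5, 1.8, 2.0, 2.3, 2.6, 3.0)
total_score <- c(980, 1010, 1040, 1035, 1100, 1150)

# For a one-predictor linear model, R-squared equals the squared correlation
r   <- cor(arts_years, total_score)
fit <- lm(total_score ~ arts_years)
all.equal(r^2, summary(fit)$r.squared)

# Applying the same idea to the correlation reported in this section
r_reported <- 0.7465564
r_reported^2
```

For the reported value this comes to roughly 0.56, meaning a straight-line fit on average years of arts education would account for a little over half of the variation in average total scores across state-year observations.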

We can also plot the data in a scatterplot:

ggplot(data = correlation2, mapping = aes(x=`Arts/Music.Average Years`, y=Average_Total_Score, col = Year)) +
  geom_point(alpha = 0.8, position = "jitter") + 
  geom_smooth(method = "lm") +
  labs(title = "Correlation between Arts Education and SAT Score", x = "Number of Years of Arts Education", y = "Total SAT Score") + th

As can be seen from the line of best fit on the graph, there is a strong, positive relationship between the two variables. While we cannot establish that a longer arts education causes a higher SAT score, the data show that the two are positively correlated.

Conclusions

In conclusion, the data analyzed in this project supports much of the research that other scholars in education have done. Here, we have shown three main findings from this data:

  1. Boys tend to score higher than girls on the math section of the SAT. The gap stayed above 30 points in every year from 2005 to 2015, in line with the multi-decade pattern reported by AEIdeas.
  2. In 2015, the Midwest and the Great Lakes regions had higher average SAT scores than states on the coastlines. Illinois was the highest-achieving state, while Delaware had the lowest average.
  3. We also showed that there is indeed a positive relationship between having an arts or music education and scoring better on the SAT. These two factors are highly correlated.

Lastly, this project deviated from the original project proposal in several ways: