SAT Score Analysis || Stats 32 Final Project

Introduction

This is an analysis of SAT scores recorded across the nation from 2005 to 2015. Other pieces of data collected include the specific letter grade by subject, the average GPA by subject, family income level, gender, and the number of years a student has studied a specific subject. The dataset is stratified by year and state, where District of Columbia, Puerto Rico, and the Virgin Islands are included as separate regions. Note that this dataset only includes the math and verbal (reading) sections of the SAT, and excludes the writing portion.

For this project, I will focus on analyzing three specific questions based on this dataset:
1. On average, do males generally perform better than females on the math section of the SAT?
2. In 2015, how did the states compare to each other in terms of average total SAT score?
3. Is there a correlation between an arts/music education and a high SAT score?

The dataset used in this analysis can be found here: https://think.cs.vt.edu/corgis/csv/school_scores/school_scores.html

Setup

Library imports:

library(dplyr)
library(ggplot2)
library(knitr)
library(readr)
library(maps)

Data import:

df <- read_csv("school_scores.csv")

Summary Statistics

Check to ensure that we have 53 regions (50 states, Washington D.C., Puerto Rico, and the Virgin Islands):
‘Name’ is the full name of the state for this report.

df %>% summarize(distinct = n_distinct(Name))

## # A tibble: 1 x 1
##   distinct
##      <int>
## 1       53

Check to ensure that we have 11 years represented (from 2005 to 2015):

df %>% summarize(distinct = n_distinct(Year))

## # A tibble: 1 x 1
##   distinct
##      <int>
## 1       11

The total number of observations in this dataset:

nrow(df)

## [1] 577
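
Note that 53 regions times 11 years would give 583 rows, six more than the 577 observed, so some region-year combinations appear to be absent (assuming there are no duplicate rows). A minimal sketch for locating such gaps is shown below. It uses a small made-up data frame, `toy`, in place of `df`; only the column names `Name` and `Year` are taken from the dataset.

```r
# Toy stand-in for df: three regions, three years, one combination missing.
toy <- data.frame(
  Name = c("A", "A", "A", "B", "B", "C", "C", "C"),
  Year = c(2005, 2006, 2007, 2005, 2007, 2005, 2006, 2007),
  stringsAsFactors = FALSE
)

# Cross-tabulate region by year; a count of zero marks a missing combination.
counts <- table(toy$Name, toy$Year)
missing <- as.data.frame(counts, stringsAsFactors = FALSE)
missing <- missing[missing$Freq == 0, c("Var1", "Var2")]
names(missing) <- c("Name", "Year")
missing
```

Running the same cross-tabulation on `df$Name` and `df$Year` should list which regions lack a row in which years.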

The correlation between the average GPA in math and the average math score:
Note that this is just the GPA within the subject, not across all academic subjects.

df %>% summarize(correlation = cor(`Mathematics.Average GPA`, Math))

## # A tibble: 1 x 1
##   correlation
##         <dbl>
## 1    0.811048

The large, positive correlation shows that higher GPAs in math are associated with higher scores on the math section of the SAT, as is expected.
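
For reference, the value returned by `cor()` is Pearson's r: the covariance of the two variables divided by the product of their standard deviations. The sketch below checks that identity on made-up numbers (the vectors `gpa` and `score` are illustrative and not taken from the dataset).

```r
# Made-up values standing in for the two columns (not taken from the dataset).
gpa   <- c(3.1, 3.3, 3.4, 3.6, 3.8)
score <- c(500, 515, 540, 545, 580)

# Pearson's r is the covariance scaled by both standard deviations.
r_manual  <- cov(gpa, score) / (sd(gpa) * sd(score))
r_builtin <- cor(gpa, score)

all.equal(r_manual, r_builtin)
```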

The average SAT score (out of 1600) across the nation during the 11-year period:
‘Math’ is the average math score of students in this state during this year.
‘Verbal’ is the average verbal (reading, not writing) score of students in this state during this year.

average_total_score <- mean(df$Math) + mean(df$Verbal)
average_total_score

## [1] 1067.017

The standard deviation of SAT scores across the nation during the 11-year period:

df1 <- df %>% mutate (Average_Total_Score = Math + Verbal)
standard_deviation <- sd(df1$Average_Total_Score)
standard_deviation

## [1] 90.00557
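
Because each row is already a state-level average, the mean and standard deviation above treat every state-year equally, regardless of how many students took the test there. If test-taker counts were available, a student-weighted average could be computed instead. The sketch below illustrates the difference on made-up numbers; the column name `test_takers` is hypothetical and is not confirmed to exist in the dataset.

```r
# Made-up state-level averages and test-taker counts (hypothetical column name).
toy <- data.frame(
  total_score = c(1200, 1000),
  test_takers = c(1000, 9000)
)

# Each state counts equally.
unweighted <- mean(toy$total_score)

# Each student counts equally.
weighted <- weighted.mean(toy$total_score, toy$test_takers)

c(unweighted = unweighted, weighted = weighted)
```

In this toy case the small, high-scoring state pulls the unweighted mean well above the weighted one, which is why the two figures should not be read interchangeably.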

We can also represent this data in a histogram:

th <- theme(plot.title = element_text(face = "bold", hjust = 0.5), 
             axis.title = element_text(size = rel(1)),
             legend.position = "bottom")
ggplot(data = df1) +
  geom_histogram(mapping = aes(x = Average_Total_Score), bins = 20) + 
  labs(title = "Histogram of Average SAT Scores", x = "Average Total Score", y = "Frequency") + th

Histogram of Average SAT scores stratified by year:

ggplot(data = df1) +
  geom_histogram(mapping = aes(x = Average_Total_Score), bins = 20) + 
  labs(title = "Histogram of Average SAT Scores By Year", x = "Average Total Score", y = "Frequency") + 
  facet_wrap(~Year) + th

Data Analysis

As can be seen from the above section, the dataset contains a large amount of information. I will focus only on certain factors to try to answer the three core questions listed in the “Introduction” section.

Gender difference in the math section of the SAT

A common stereotype in society is that boys are better at math than girls. While this is not necessarily true, the stereotype stems from the objective truth that boys have continued to score significantly higher than girls in the math section of the SAT. One article from AEIdeas reported that for over 40 years, a 30-point difference between boys’ and girls’ math scores has persisted*. In our dataset, we compare the first 10 rows of the boys’ mean math scores and the girls’ mean math scores. In each of the 10 observations, the boys score higher than the girls on average.

‘Male.Math’ is the average math score of students in this state during this year who identified as male.
‘Female.Math’ is the average math score of students in this state during this year who identified as female.

*The AEIdeas article can be found here: http://www.aei.org/publication/2015-sat-test-results-confirm-pattern-thats-persisted-for-40-years-high-school-boys-are-better-at-math-than-girls/

df_gendered_math_scores <- df %>% select(Male.Math, Female.Math, Name, Year)
kable(head(df_gendered_math_scores, n=10))

Male.Math	Female.Math	Name	Year
582	538	Alabama	2005
535	505	Alaska	2005
549	513	Arizona	2005
570	536	Arkansas	2005
543	504	California	2005
577	546	Colorado	2005
534	502	Connecticut	2005
521	486	Delaware	2005
509	451	District Of Columbia	2005
516	484	Florida	2005

We can also compare the average male math score with the average female math score on the SAT (across the nation and over a span of 11 years):

mean(df$Male.Math)

## [1] 553.9116

mean(df$Female.Math)

## [1] 518.4159

difference = mean(df$Male.Math) - mean(df$Female.Math)
difference

## [1] 35.49567

From this, we can see that for the period 2005-2015, boys had a higher mean math score than girls. Specifically, boys scored about 35.5 points higher than girls on average.

The mean difference calculated above is for a very large period spanning from 2005 to 2015. It would be more beneficial to us to see how the gender difference in math scores has changed throughout the years. To figure this out, we can create a new dataframe that includes the mean male score, mean female score, the difference between the two averages, and the year.

# Group by year, then compute the mean male and female math scores within each year
df_mean_by_year <- df_gendered_math_scores %>%
  group_by(Year) %>%
  summarize(male_mean = mean(Male.Math),
            female_mean = mean(Female.Math)) %>%
  mutate(diff = male_mean - female_mean) %>%
  select(male_mean, female_mean, diff, year = Year)
kable(df_mean_by_year)

male_mean	female_mean	diff	year
555.1538	518.9615	36.19231	2005
555.5769	520.2885	35.28846	2006
552.7925	518.4151	34.37736	2007
554.0943	518.3962	35.69811	2008
559.8235	522.1373	37.68627	2009
560.6667	523.9412	36.72549	2010
551.1887	515.7925	35.39623	2011
552.1509	515.6415	36.50943	2012
550.0000	515.8113	34.18868	2013
551.7170	517.3396	34.37736	2014
550.3962	516.2453	34.15094	2015

Between 2005 and 2015, the point difference between the mean scores for boys and girls barely changed. It narrowed slightly, from about 36 points in 2005 to about 34 points in 2015, but it stayed above 34 points in every year, which is a large and consistent gap. Many researchers and scholars have pointed towards differences in problem-solving strategies, spatial skills, and attitudes and values as reasons for the large point difference. More info about this can be found here: http://www.nctm.org/Publications/Teaching-Children-Mathematics/Blog/Current-Research-on-Gender-Differences-in-Math/
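
Whether a gap like this is statistically distinguishable from zero can be checked with a paired t-test, since each male and female mean comes from the same state and year. A minimal sketch, using only the ten 2005 rows shown in the table above rather than the full dataset:

```r
# The ten 2005 rows shown in the table above (Alabama through Florida).
male   <- c(582, 535, 549, 570, 543, 577, 534, 521, 509, 516)
female <- c(538, 505, 513, 536, 504, 546, 502, 486, 451, 484)

# Paired test: each male/female pair comes from the same state and year.
result <- t.test(male, female, paired = TRUE)
result$estimate  # mean of the pairwise differences
result$p.value   # small values indicate a gap unlikely to be zero
```

The analogous call on the full data would be `t.test(df$Male.Math, df$Female.Math, paired = TRUE)`. Since the same states recur in every year, the rows are not fully independent, so the resulting p-value should be treated as indicative rather than exact.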

Boxplot of Male and Female Mean Math Scores By Year:

th <- theme(plot.title = element_text(face = "bold", hjust = 0.5), 
             axis.title = element_text(size = rel(1)),
             legend.position = "bottom")
df$Year <- factor (df$Year)
ggplot(data=df) +
  geom_boxplot(mapping = aes(x = Year, y = Male.Math), fill = NA, col = "blue") +
  labs(title = "Gendered Math Score Averages By Year", x = "Year", y = "Mean Math Score") + 
  geom_boxplot(mapping = aes(x = Year, y = Female.Math), fill = NA, col = "red") + th

The boxplot provides a visual representation of the conclusions found above. Blue represents male scores, while red represents female scores. Each box summarizes the spread of state-level mean math scores for one year, with the center line marking the median. In every year between 2005 and 2015, the box for boys sits higher than the box for girls. Furthermore, the distance between the two does not change substantially over the 11 years. Even the outliers of the female math scores are lower than the outliers of the male math scores.

SAT Scores by State

How does each state compare to the other states in terms of SAT performance? To find out, we will take the average total SAT score for each state during the most recent year in the dataset (2015).

state_scores <- df1 %>% 
  filter(Year == "2015") %>%
  select(Average_Total_Score, Name)
state_scores <- state_scores[-c(9, 40, 48), ] #removes DC, Puerto Rico, and Virgin Islands
state_scores$Name = tolower(state_scores$Name)
colnames(state_scores)[colnames(state_scores) == 'Name'] <- 'region'
kable(state_scores %>% arrange(desc(Average_Total_Score), region))

Average_Total_Score	region
1215	illinois
1207	north dakota
1204	michigan
1203	minnesota
1197	wisconsin
1195	missouri
1192	iowa
1192	south dakota
1182	kansas
1181	nebraska
1176	kentucky
1175	wyoming
1170	colorado
1155	tennessee
1154	utah
1149	mississippi
1146	oklahoma
1141	arkansas
1125	louisiana
1121	ohio
1119	montana
1096	new mexico
1088	alabama
1056	new hampshire
1052	arizona
1047	oregon
1046	massachusetts
1046	vermont
1035	virginia
1021	new jersey
1014	alaska
1013	washington
1010	connecticut
1009	west virginia
1004	pennsylvania
1003	california
1003	north carolina
997	indiana
996	hawaii
991	new york
990	nevada
989	rhode island
983	maryland
976	georgia
976	south carolina
967	florida
957	texas
940	maine
930	idaho
924	delaware
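
The code above removes the three non-state regions by row position (9, 40, and 48), which only works while the rows stay in the same order. A sketch of a name-based alternative follows. The spelling "District Of Columbia" matches the table shown earlier, but the spellings "Puerto Rico" and "Virgin Islands" are assumptions about how the dataset labels those regions, and the scores given to the three non-state rows are placeholders.

```r
# Toy stand-in for the 2015 rows; scores for the three non-state regions are placeholders.
toy_scores <- data.frame(
  Name = c("Delaware", "District Of Columbia", "Florida",
           "Puerto Rico", "Virgin Islands"),
  Average_Total_Score = c(924, 1000, 967, 900, 800),
  stringsAsFactors = FALSE
)

# Assumed spellings of the regions to exclude.
non_states <- c("District Of Columbia", "Puerto Rico", "Virgin Islands")

# Keep only rows whose Name is not in the exclusion list.
states_only <- toy_scores[!toy_scores$Name %in% non_states, ]
states_only$Name
```

The same condition could be written in the pipeline as `filter(!Name %in% non_states)`.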

We then combine this data with the map_state data:

map_state <- map_data("state")
combined_data <- map_state %>% left_join(state_scores, by = "region")
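
Since the join matches on region name, it is worth checking which names fail to match on either side. The "state" map from the maps package covers the contiguous states plus the District of Columbia, so Alaska and Hawaii have scores but are not drawn, while any map region without a score is filled as missing. A sketch with an illustrative subset of names (the vectors below are stand-ins for `unique(map_state$region)` and `state_scores$region`):

```r
# Region names as they might appear on each side of the join (illustrative subset).
map_regions   <- c("alabama", "arizona", "district of columbia")
score_regions <- c("alabama", "alaska", "arizona", "hawaii")

# Drawn on the map but without a score (these get an NA fill).
no_score <- setdiff(map_regions, score_regions)

# Have a score but are not drawn on the map.
not_drawn <- setdiff(score_regions, map_regions)

no_score
not_drawn
```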

This data can be visually displayed on a map of the U.S.

map_theme <- theme(
  axis.title.x = element_blank(), 
  axis.title.y = element_blank(),
  axis.text.x = element_blank(),
  axis.text.y = element_blank(),
  axis.ticks.x = element_blank(),
  axis.ticks.y = element_blank(),
  panel.background = element_rect(fill = "white")
)

ggplot() +
  geom_polygon(data = combined_data, 
               # group by polygon id, not region, so multi-part states (e.g. Michigan) draw correctly
               mapping = aes(x = long, y = lat, group = group, fill = Average_Total_Score)) +
  geom_polygon(data = map_state,
               mapping = aes(x = long, y = lat, group = group), fill = NA, col = "black") +
  scale_fill_gradient(low = "red", high = "blue") +
  coord_quickmap() + map_theme +
  labs(title = "Average SAT Scores in 2015") + th

This matches the data in the table: Illinois had the highest average SAT score and is thus the most blue on the map. Meanwhile, states like Delaware, Idaho, Texas, and Florida have some of the lowest average SAT scores, and so are the most red in color. It is interesting to see that the Midwest/Great Lakes region tends to have a higher SAT score average than states on the coasts or in the South.

The Relationship Between an Arts Education and SAT Score

Is an education in the arts or music associated with a high SAT score? Many past studies have shown that kids who played a musical instrument tended to score higher on tests. Furthermore, Americans for the Arts reported that data from the College Board showed that students who took four years of arts scored on average 100 points better on the SAT than students who had half a year or less of arts education.

This report can be found here: https://www.americansforthearts.org/sites/default/files/pdf/get_involved/advocacy/research/2013/artsed_sat13.pdf

We can test to see if there is an association between an arts and music education and test scores by calculating the correlation between the two variables.

# df1 was created before df$Year was converted to a factor, so Year is still numeric here
correlation2 <- df1 %>% select(`Arts/Music.Average Years`, Average_Total_Score, Year)
correlation2 %>% summarize(correlation = cor(`Arts/Music.Average Years`, Average_Total_Score))

## # A tibble: 1 x 1
##   correlation
##         <dbl>
## 1   0.7465564

From this, we see that the correlation between arts education and SAT score is high and positive. Because each row is a state-level average, this means that states whose test-takers average more years of arts or music education tend to have higher average SAT scores, which is consistent with the findings of Americans for the Arts.

We can also plot the data in a scatterplot:

ggplot(data = correlation2, mapping = aes(x=`Arts/Music.Average Years`, y=Average_Total_Score, col = Year)) +
  geom_point(alpha = 0.8, position = "jitter") + 
  geom_smooth(method = "lm") +
  labs(title = "Correlation between Arts Education and SAT Score", x = "Number of Years of Arts Education", y = "Total SAT Score") + th

As can be seen from the line of best fit on the graph, there is a strong, positive relationship between the two factors. While we cannot identify causation between having a longer arts education and having a higher SAT score, we can say that the two are correlated with each other.
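
The correlation above pools all 11 years together. One way to check that the relationship is not driven by differences between years is to compute the correlation separately within each year. A minimal sketch on made-up rows (the data frame `toy` and its columns `years` and `score` are illustrative stand-ins for `correlation2`):

```r
# Made-up rows standing in for correlation2 (values are illustrative only).
toy <- data.frame(
  Year  = c(2005, 2005, 2005, 2006, 2006, 2006),
  years = c(1.5, 2.0, 2.5, 1.6, 2.1, 2.4),
  score = c(980, 1050, 1150, 1000, 1040, 1160)
)

# Split the rows by year, then compute the correlation within each piece.
by_year <- sapply(split(toy, toy$Year), function(d) cor(d$years, d$score))
by_year
```

If the within-year correlations on the real data are all close to the pooled value, the pooled figure is a fair summary.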

Conclusions

In conclusion, the data analyzed in this project supports much of the research that other scholars in education have done. Here, we have shown three main findings from this data:

1. Boys tended to score higher than girls on the math section of the SAT in every year from 2005 to 2015, consistent with the decades-long pattern reported by AEIdeas.
2. In 2015, states in the Midwest and Great Lakes regions had higher average SAT scores than states on the coasts. Illinois had the highest average, while Delaware had the lowest.
3. There is a positive relationship between having an arts or music education and scoring better on the SAT. In this state-level data, the two factors are strongly correlated (a correlation of about 0.75).

Lastly, this project deviated from the original project proposal in several ways:

1. As was suggested, I focused on only 3 core questions rather than the entire list of 10 that I had originally submitted. This was to ensure that I was not spreading myself thin and could therefore really focus on developing a strong analysis.
2. Because I limited the number of questions I was asking, the graphs I had originally suggested in my project proposal were no longer applicable. Instead of overlaying two histograms, I overlaid two boxplots to compare boys’ and girls’ math SAT scores. I felt that this provided a better visual of the difference between the two sets of scores.
3. Because I was no longer looking into the relationship between SAT scores and GPA, I used a scatterplot to show the relationship between SAT scores and arts education instead.
4. The map was not originally part of my project proposal, which was submitted before we learned how to create maps in R. I included the map in my final project because I felt that it was an appropriate visual guide to comparing each state’s mean SAT score.