An (Incomprehensive) Analysis of the College Scorecard Data

Introduction

This is an analysis of the relationship between the median earnings of a college’s graduates and other characteristics of the college. The data was obtained from the College Scorecard dataset produced by the U.S. Department of Education and can be found here. The U.S. Department of Education obtains the data through several sources including “federal reporting from institutions, data on federal financial aid, and tax information.”

In this analysis, I use the R programming language to examine, both visually and quantitatively, how several numerical and categorical variables relate to the post-graduation median earnings of college students. My interest lies in identifying trends in the quality of universities in relation to a variety of economic and academic factors. By measuring how strongly certain variables are associated with eventual economic outcomes, I seek to better understand the tradeoffs of different college characteristics.

Data Import and Processing

Library imports:

library(dplyr)
library(ggplot2)
library(scales)
library(readr)
library(knitr)

Data import:

df <- read_csv("Most-Recent-Cohorts-All-Data-Elements.csv")

There are 7593 colleges in this dataset with 1777 variables. For this analysis, we are only interested in several defining characteristics (e.g. institution name), categorical factors (e.g. operational status), and quantitative variables (e.g. admission rate).

The following code selects only the variables relevant to our analysis and only retains colleges that:

are currently operating
are in the 50 states or the District of Columbia (i.e. excluding territories)
are not for-profit institutions
predominantly award bachelor’s degrees

college <- df %>%
  filter(CURROPER == 1,
         ST_FIPS <= 56,
         CONTROL != 3,
         PREDDEG == 3) %>%
  select(name = INSTNM, funding = CONTROL, admit = ADM_RATE, med_earnings = MD_EARN_WNE_P10,
         med_fam_inc = MD_FAMINC, NPT4_PUB, NPT4_PRIV)

In order to create visualizations of the data, we need to convert the variables admit, med_earnings, med_fam_inc, NPT4_PUB, and NPT4_PRIV from the character type to the double type. The following code also merges the NPT4_PUB column and NPT4_PRIV column into a single price column.

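# Convert the character columns to doubles; non-numeric entries become NA
# (R reports this with a coercion warning)
college$admit <- as.double(college$admit)
college$med_earnings <- as.double(college$med_earnings)
college$med_fam_inc <- as.double(college$med_fam_inc)
college$NPT4_PUB <- as.double(college$NPT4_PUB)
college$NPT4_PRIV <- as.double(college$NPT4_PRIV)

# NPT4_PUB is reported for public colleges and NPT4_PRIV for private ones,
# so a row-wise sum that ignores NA merges them into a single price column
college <- college %>%
  rowwise() %>%
  mutate(price = sum(NPT4_PUB, NPT4_PRIV, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-c(NPT4_PUB, NPT4_PRIV))

# sum(..., na.rm = TRUE) returns 0 when both columns are NA; restore those rows to NA
college["price"][college["price"] == 0] <- NA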
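As an aside, the same merge can be sketched with dplyr's coalesce(), which takes the first non-missing value across columns and so avoids the detour through 0. The toy data frame below is hypothetical, and the "NULL" and "PrivacySuppressed" strings are my assumption about how missing values appear in the raw file:

```r
library(dplyr)

# Hypothetical toy rows mimicking the raw character columns (not real Scorecard rows)
toy <- data.frame(
  name      = c("Public U", "Private C", "Unknown I"),
  NPT4_PUB  = c("13435", "NULL", "NULL"),
  NPT4_PRIV = c("NULL", "8862", "PrivacySuppressed"),
  stringsAsFactors = FALSE
)

toy_price <- toy %>%
  # Non-numeric placeholders become NA; the coercion warning is expected here
  mutate(across(c(NPT4_PUB, NPT4_PRIV),
                ~ suppressWarnings(as.double(.x))),
         # Take whichever of the two net price columns is present
         price = coalesce(NPT4_PUB, NPT4_PRIV)) %>%
  select(-c(NPT4_PUB, NPT4_PRIV))

print(toy_price)
```

Unlike the sum-based approach, this keeps a genuine net price of 0 distinct from a missing one, at the cost of taking only the first value if a row ever had both columns filled in.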

Here is a small view of what the data now looks like:

kable(head(college))

name	funding	admit	med_earnings	med_fam_inc	price
Alabama A & M University	1	0.6538	29900	21429.0	13435
University of Alabama at Birmingham	1	0.6043	40200	33731.0	16023
Amridge University	2	NA	40100	14631.0	8862
University of Alabama in Huntsville	1	0.8120	45600	39100.5	18661
Alabama State University	1	0.4639	26700	21704.0	7400
The University of Alabama	1	0.5359	42700	64600.5	20575

The variables selected are:

names(college)

## [1] "name"         "funding"      "admit"        "med_earnings"
## [5] "med_fam_inc"  "price"

funding refers to the institution's control, with 1 coding for a public university and 2 coding for a private nonprofit university
admit is the admission rate of the institution on a scale of 0 to 1
med_earnings represents the median earnings, in 2015 USD, of the institution's former students who are employed 10 years after enrollment
med_fam_inc is the median family income of the institution's current students in 2015 USD
price indicates the average net price of attendance in USD, accounting for the full cost of attendance and awarded financial aid
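Because funding is stored as a numeric code, it can help to see how such codes map onto readable labels. The following is only an illustrative sketch on a made-up vector (funding_codes is hypothetical and is not part of the analysis above):

```r
# Hypothetical vector of funding codes: 1 = public, 2 = private
funding_codes <- c(1, 2, 2, 1)

# Attach human-readable labels to the numeric codes
funding_labeled <- factor(funding_codes,
                          levels = c(1, 2),
                          labels = c("Public", "Private"))

# Count how many colleges fall under each label
print(table(funding_labeled))
```

The plots below take a similar route by wrapping funding in factor() and supplying the labels through the scale functions.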

The following code stores certain purely cosmetic alterations of the visualizations as the variables xdollar, ydollar, and titling to allow for cleaner looking code.

xdollar <- list(scale_x_continuous(labels = dollar,
                                   breaks = seq(0, 130000, 25000),
                                   limits = c(0, NA)))

ydollar <- list(scale_y_continuous(labels = dollar,
                                   breaks = seq(0, 130000, 25000),
                                   limits = c(0, NA)))

titling <- theme(plot.title = element_text(hjust = 0.5,
                                           face = "bold"),
                 axis.title.x = element_text(face = "bold"),
                 axis.title.y = element_text(face = "bold"))

Distribution of Median Earnings

To analyze the relationship between median earnings and other factors, we would first like to get a preliminary understanding of the distribution of the median earnings of colleges’ graduates. We will create a boxplot of med_earnings below:

ggplot(data = college) +
  geom_boxplot(mapping = aes(x = "", y = med_earnings)) +
  labs(title = "Median Earnings of \na College's Graduates",
       x = NULL,
       y = "Median Earnings in USD") +
  ydollar +
  titling

Because the outliers stretch the y-axis, the boxplot compresses the bulk of the data and gives a poor sense of the shape of the distribution of median earnings. Let’s take a look at a histogram instead:
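For reference, a boxplot conventionally draws a point as an outlier when it lies more than 1.5 times the interquartile range beyond the first or third quartile. A minimal sketch of that rule on made-up earnings values (the vector x is hypothetical, not Scorecard data):

```r
# Hypothetical median earnings for eight colleges, one of them extreme
x <- c(30000, 35000, 38000, 40000, 42000, 45000, 48000, 120000)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences of the 1.5 * IQR rule
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Values outside the fences are the ones plotted as individual points
outliers <- x[x < lower | x > upper]
print(outliers)
```

A single extreme value like this is enough to stretch the axis and flatten the box, which is the problem seen in the boxplot above.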

ggplot(data = college) +
  # 30 bins is the ggplot2 default; stating it explicitly avoids the bin-count message
  geom_histogram(mapping = aes(x = med_earnings), bins = 30) +
  labs(title = "Median Earnings of a College's Graduates",
       x = "Median Earnings in USD",
       y = "Frequency of Colleges") +
  xdollar +
  titling

Differences between the Median Earnings of Public Universities versus Private Universities

A histogram of med_earnings gives us a better visualization of the distribution of the median earnings of colleges’ graduates. However, to better understand the effect of different characteristics on economic outcomes, we would like to separate and compare the distributions of med_earnings between public and private institutions.

A violin plot combines the compactness of a boxplot with the view of the distribution's shape that a histogram provides. Additionally, I have overlaid a jittered plot of the data points to further aid with visualizing the distribution. Here is a violin plot of med_earnings separated by the two values of funding:

ggplot(data = college,
       mapping = aes(x = factor(funding),
                     y = med_earnings)) +
  geom_violin() +
  geom_jitter(alpha = 0.15) +
  scale_x_discrete(labels = c("Public", "Private")) +
  labs(title = "Median Earnings of a College's Graduates \nby Source of Funding",
       x = "Funding Source",
       y = "Median Earnings in USD") +
  ydollar +
  titling

Constructing a violin plot does well in illustrating the visible difference in the distributions of med_earnings between different kinds of universities. However, we should also verify the difference in distribution computationally. We will conduct a Kolmogorov-Smirnov test to determine if the difference in the distribution of the median earnings of colleges’ graduates between public and private institutions is statistically significant. For this analysis, we will consider any p-value less than 0.05 to be statistically significant.

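# Median earnings for public (1) and private (2) colleges; ks.test() silently drops NA values
public <- (college %>% filter(funding == 1))$med_earnings
private <- (college %>% filter(funding == 2))$med_earnings
# alternative must be passed by name; a positional third argument is absorbed by ...
ks.test(public, private, alternative = "two.sided")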

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  public and private
## D = 0.085491, p-value = 0.009165
## alternative hypothesis: two-sided

With a p-value of 0.009165, which is below our 0.05 threshold, we reject the null hypothesis that public and private colleges share the same distribution of the median earnings of their graduates in favor of the alternative hypothesis that the two distributions differ. Note that the test statistic D = 0.085491 is the largest gap between the two empirical cumulative distribution functions, so the difference, while statistically significant, is modest in size.
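To make the D statistic less abstract: it is the largest vertical gap between the empirical cumulative distribution functions of the two samples. The sketch below checks this by hand on simulated data; the vectors a and b are made up for illustration and are not the public and private earnings:

```r
set.seed(1)

# Two hypothetical samples standing in for two groups of colleges
a <- rnorm(200, mean = 40000, sd = 10000)
b <- rnorm(300, mean = 43000, sd = 12000)

# Evaluate both empirical CDFs at every observed value
grid <- sort(c(a, b))
d_manual <- max(abs(ecdf(a)(grid) - ecdf(b)(grid)))

# The same statistic as reported by ks.test()
res <- ks.test(a, b, alternative = "two.sided")

print(c(manual = d_manual, ks_test = unname(res$statistic)))
```

Reading D this way, the value reported above says that the two cumulative distributions never differ by more than about 8.5 percentage points.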

Determining a Relationship between Median Earnings and Other Factors

To further examine how various factors relate to the economic outcome of a college’s graduates, we will fit a linear model to the data. Constructing a least squares regression line using med_earnings and another variable will allow us to quantitatively observe the two variables’ relationship and determine the strength of that relationship, which might allow us to establish which variables are better predictors of economic outcomes than others.

We will use scatterplots to observe the relationships between variables. The code below stores reusable plot components as variables to allow for neater code.

point_theme affects nothing more than some simple cosmetics of the created plots
scatter stores the base plot, which declares the variables of interest to be plotted and the type of plot to be constructed, as well as some cosmetic alterations

point_theme <- c(scale_x_continuous(labels = percent),
                 ydollar,
                 scale_color_manual(labels = c("Public","Private"),
                                    values = c("#F8766D", "#00BFC4")))

scatter <- ggplot(data = college,
                  mapping = aes(x = admit,
                                y = med_earnings)) +
  geom_point(mapping = aes(color = factor(funding)),
             size = 2) +
  point_theme +
  titling

Relationship between Median Earnings and Admission Rate

We would first like to observe the effect that a college’s admission rate has on the eventual median earnings of its graduates; we will construct a scatterplot to visualize this relationship:

scatter +
  labs(title = "Median Earnings of Graduates against Admission Rate of Colleges",
       x = "Admission Rate",
       y = "Median Earnings in USD",
       color = "Source of Funding")

By constructing a scatterplot of med_earnings against admit, we can observe the negative relationship between the median earnings of colleges’ graduates and the colleges’ admission rates. A lower admission rate is correlated with higher median earnings. To further verify the negative relationship between med_earnings and admit, we will fit a linear model to the scatterplot:

scatter +
  geom_smooth(method = "lm",
              color = "black") +
  labs(title = "Median Earnings of Graduates against Admission Rate of Colleges",
       x = "Admission Rate",
       y = "Median Earnings in USD",
       color = "Source of Funding")

fit1 <- lm(data = college,
           med_earnings ~ admit)
summary(fit1)

## 
## Call:
## lm(formula = med_earnings ~ admit, data = college)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27624  -6376   -755   5106  79688 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53340.8      952.8   55.98   <2e-16 ***
## admit       -15831.6     1396.5  -11.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10390 on 1455 degrees of freedom
##   (339 observations deleted due to missingness)
## Multiple R-squared:  0.08116,    Adjusted R-squared:  0.08053 
## F-statistic: 128.5 on 1 and 1455 DF,  p-value: < 2.2e-16

The above is a summary of the least squares regression line fit to the med_earnings and admit scatterplot. Most of the information given is not pertinent to the scope of our analysis. One should note the first values given for the (Intercept) and admit coefficients and the Multiple R-squared value.

The coefficients give us the equation for the least squares regression line: $med\_earnings = -15831.6 * admit + 53340.8$ where admit is on a scale of 0 to 1. The term -15831.6 * admit means that for every increase of 1 percentage point (0.01) in the admission rate, the median earnings of a college's graduates 10 years after enrollment are expected to decrease by $158.32. The y-intercept of 53340.8 implies that, for an institution with a 0% admission rate, the expected median earnings would be $53,340.80; since no college admits nobody, this value is an extrapolation that serves mainly to anchor the line.
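To see the fitted line in action, the reported coefficients can be plugged into a small helper. This is only a sketch: predict_earnings is a hypothetical function written for illustration, and the coefficients are copied from the summary output above rather than taken from fit1:

```r
# Coefficients copied from the regression summary above
intercept <- 53340.8
slope <- -15831.6

# Hypothetical helper: expected median earnings at a given admission rate (0 to 1)
predict_earnings <- function(admit) {
  intercept + slope * admit
}

# Expected median earnings at 25%, 50%, and 75% admission rates
preds <- predict_earnings(c(0.25, 0.50, 0.75))
print(preds)

# Change in expected earnings for a 1 percentage point rise in admission rate
step <- predict_earnings(0.51) - predict_earnings(0.50)
print(step)
```

With the fitted model in hand, predict(fit1, newdata = data.frame(admit = c(0.25, 0.50, 0.75))) should give essentially the same values, up to rounding of the printed coefficients.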

The Multiple R-squared value of 0.08116 means that 8.12% of the variation in median earnings can be explained by the linear relationship between median earnings and admission rate. This is a rather low value for R²: admission rate alone leaves most of the variation in median earnings unexplained. It is also possible that a linear model is not the best fit for the data. Let’s plot med_earnings against admit but separate public and private colleges:
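For a regression with a single predictor, R² equals the squared Pearson correlation between the two variables, and also 1 minus the ratio of the residual sum of squares to the total sum of squares. A small sketch on simulated data (the numbers are made up and are not the Scorecard data) checks that the three agree:

```r
set.seed(42)

# Simulated stand-ins for admission rate and median earnings
x <- runif(100)
y <- 50000 - 15000 * x + rnorm(100, sd = 10000)

fit <- lm(y ~ x)

# R-squared as reported by summary()
r2_model <- summary(fit)$r.squared

# Squared Pearson correlation
r2_cor <- cor(x, y)^2

# 1 - (residual sum of squares / total sum of squares)
r2_manual <- 1 - sum(resid(fit)^2) / sum((y - mean(y))^2)

print(c(model = r2_model, correlation = r2_cor, manual = r2_manual))
```

Seen this way, an R² of 0.08116 corresponds to a correlation of roughly -0.28 between admission rate and median earnings.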

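scatter +
  labs(title = "Median Earnings of Graduates against Admission Rate of Colleges \nSeparated by Source of Funding",
       x = "Admission Rate",
       y = "Median Earnings in USD",
       color = "Source of Funding") +
  # Label the facets by name rather than by the raw codes 1 and 2
  facet_wrap(~ funding,
             labeller = as_labeller(c("1" = "Public", "2" = "Private")))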

By separating the two different types of institutions, we can see that private universities with low admission rates do not follow a linear pattern as much as institutions with high admission rates. This could explain a low R² value.
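One way to quantify what the faceted plot suggests is to fit the line separately within each funding group and compare the R² values. The sketch below does this on simulated data; toy and its columns are hypothetical stand-ins for college, so the printed numbers say nothing about the real colleges:

```r
set.seed(7)

# Simulated stand-in for the college data: 50 public (1) and 50 private (2) colleges
toy <- data.frame(
  funding = rep(c(1, 2), each = 50),
  admit   = runif(100, min = 0.1, max = 1)
)
toy$med_earnings <- 50000 - 12000 * toy$admit + rnorm(100, sd = 8000)

# Fit med_earnings ~ admit within each funding group and extract R-squared
r2_by_funding <- sapply(split(toy, toy$funding), function(d) {
  summary(lm(med_earnings ~ admit, data = d))$r.squared
})

print(r2_by_funding)
```

Applied to the real data, a clearly lower R² in one group would support the impression that the linear pattern holds less well there.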

Let’s explore other factors to see which variables might better predict median earnings.

Relationship between Median Earnings and Median Family Income

We would like to observe the relationship between the median family income of current students at these institutions and the eventual median earnings of their graduates. To do so, we will plot med_earnings against med_fam_inc and fit a linear model to the result:

ggplot(data = college,
       mapping = aes(x = med_fam_inc,
                     y = med_earnings)) +
  geom_point(size = 2,
             color = "skyblue2") +
  geom_smooth(method = "lm",
              color = "black") +
  labs(title = "Median Earnings of Graduates against Median Family Income of Current Students",
       x = "Median Family Income in USD",
       y = "Median Earnings in USD") +
  xdollar +
  ydollar +
  titling

fit2 <- lm(data = college,
           med_earnings ~ med_fam_inc)
summary(fit2)

## 
## Call:
## lm(formula = med_earnings ~ med_fam_inc, data = college)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32127  -6140  -2113   3923  88726 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.134e+04  6.767e+02   46.31   <2e-16 ***
## med_fam_inc 2.344e-01  1.257e-02   18.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10780 on 1620 degrees of freedom
##   (174 observations deleted due to missingness)
## Multiple R-squared:  0.1768, Adjusted R-squared:  0.1763 
## F-statistic:   348 on 1 and 1620 DF,  p-value: < 2.2e-16

Again, note the coefficients for (Intercept) and med_fam_inc and the value of Multiple R-squared. The coefficients give us the equation of the least squares regression line for this plot: $med\_earnings = 0.2344 * med\_fam\_inc + 31340$. The term 0.2344 * med_fam_inc indicates that for every increase of $1000 in median family income, median earnings are expected to increase by $234.40. The y-intercept of 31340 indicates that, for a university whose students have a median family income of $0, the expected median earnings would be $31,340.

The Multiple R-squared value of 0.1768 means that 17.68% of the variation in median earnings can be explained by the linear relationship between median earnings and median family income. This is a higher R² value than the one computed for the med_earnings vs. admit plot, indicating that, by this measure, the median family income of a college’s students is a better predictor of the median earnings of that institution’s graduates than the institution's admission rate.
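The two summaries report different numbers of dropped rows (339 for admit, 174 for med_fam_inc), so the two R² values describe slightly different sets of colleges. A minimal sketch of a like-for-like comparison, on simulated data with artificial gaps (toy is hypothetical and not the Scorecard data), refits both models on the rows where every variable is observed:

```r
set.seed(3)
n <- 200

# Simulated stand-in for the college data
toy <- data.frame(
  admit       = runif(n),
  med_fam_inc = runif(n, min = 15000, max = 110000)
)
toy$med_earnings <- 30000 + 0.2 * toy$med_fam_inc - 5000 * toy$admit +
  rnorm(n, sd = 9000)

# Knock out 40 admission rates to mimic uneven missingness
toy$admit[sample(n, 40)] <- NA

# Fit on all available rows: lm() drops incomplete rows per model
n_full_admit  <- nobs(lm(med_earnings ~ admit, data = toy))
n_full_faminc <- nobs(lm(med_earnings ~ med_fam_inc, data = toy))

# Restrict to rows complete in every variable, then refit both models
common <- toy[complete.cases(toy), ]
fit_a <- lm(med_earnings ~ admit, data = common)
fit_b <- lm(med_earnings ~ med_fam_inc, data = common)

print(c(admit_rows = n_full_admit, faminc_rows = n_full_faminc,
        common_rows = nrow(common)))
print(c(admit = summary(fit_a)$r.squared,
        med_fam_inc = summary(fit_b)$r.squared))
```

Whether this changes the ranking for the real data is an open question; the sketch only shows how one might check.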

Next, let’s examine the relationship between median earnings and the net price of attendance.

Relationship between Median Earnings and Net Price of Attendance

We would like to determine if the net price of attending an institution is a better predictor of the median earnings of the graduates of that institution than median family income or admission rate. To visualize the relationship, we will plot med_earnings against price and apply a linear model to the data:

ggplot(data = college,
       mapping = aes(x = price,
                     y = med_earnings)) +
  geom_point(size = 2,
             color = "thistle3") +
  geom_smooth(method = "lm",
              color = "black") +
  labs(title = "Median Earnings of Graduates against Net Price of Attendance",
       x = "Net Price in USD",
       y = "Median Earnings in USD") +
  xdollar +
  ydollar +
  titling

fit3 <- lm(data = college,
           med_earnings ~ price)
summary(fit3)

## 
## Call:
## lm(formula = med_earnings ~ price, data = college)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30182  -5703   -947   4629  74205 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.203e+04  7.599e+02   42.15   <2e-16 ***
## price       5.269e-01  3.634e-02   14.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10330 on 1550 degrees of freedom
##   (244 observations deleted due to missingness)
## Multiple R-squared:  0.1194, Adjusted R-squared:  0.1189 
## F-statistic: 210.2 on 1 and 1550 DF,  p-value: < 2.2e-16

Once more, note the coefficients for (Intercept) and price and the value of Multiple R-squared. The equation for the least squares regression line is: $med\_earnings = 0.5269 * price + 32030$. The term 0.5269 * price suggests that for every increase of $1000 in net price, median earnings are expected to increase by $526.90. The y-intercept of 32030 indicates that, for an institution with a net price of $0, the expected median earnings would be $32,030.

The Multiple R-squared value of 0.1194 means that 11.94% of the variation in median earnings can be explained by the linear relationship between median earnings and net price of attendance. Although this R² value is higher than the one computed for the med_earnings vs. admit plot, it is lower than the R² value computed for med_earnings vs. med_fam_inc. This suggests that the net price of an institution is a better predictor of the median earnings of that institution’s graduates than the admission rate of that college. However, net price is not as good a predictor as the median family income of an institution’s students.
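The three comparisons can be gathered in one place. The short sketch below simply ranks the R² values copied from the three summaries above (the vector r2 is assembled by hand for illustration, not extracted from the fitted models):

```r
# R-squared values copied from the summaries of fit1, fit2, and fit3
r2 <- c(admit = 0.08116, med_fam_inc = 0.1768, price = 0.1194)

# Rank the predictors from most to least variation explained
ranked <- sort(r2, decreasing = TRUE)
print(ranked)

# Name of the single best predictor by this measure
best <- names(which.max(r2))
print(best)
```

With the fitted models available, the same vector could be built with summary(fit1)$r.squared and its counterparts for fit2 and fit3.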

Conclusion

I began this analysis introducing the goals I wanted to achieve through the use of the R programming language. Through the construction of several visualizations, we have examined the effect of several numerical and categorical variables on the economic outcome of college graduates: the median earnings 10 years after enrollment.

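Since the R² value of the relationship between the median earnings of a college’s graduates and the median family income of the college’s students is the highest among the three variables examined (i.e. admit, med_fam_inc, and price), median family income is the best of the three predictors of the potential economic outcome of an institution’s students. Even so, it explains under 18% of the variation in median earnings, and these associations do not by themselves establish cause and effect.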

I look forward to conducting a further analysis of the effect of more variables on the ultimate economic outcome of students in post-secondary education.