Wine Project Report

Introduction

This report analyzes a dataset of wine reviews to evaluate which characteristics of a wine affect its perceived quality, and which characteristics are correlated with each other.

The dataset for my analysis is available on Kaggle at https://www.kaggle.com/zynicide/wine-reviews#winemag-data_first150k.csv.

Setup

Library imports:

library(ggplot2)
library(dplyr)
library(RColorBrewer)
library(knitr)
library(readr)
library(maps)
library(mapproj)

Data import:

df <- read_csv("winemag-data_first150k.csv")

## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   country = col_character(),
##   description = col_character(),
##   designation = col_character(),
##   points = col_integer(),
##   price = col_double(),
##   province = col_character(),
##   region_1 = col_character(),
##   region_2 = col_character(),
##   variety = col_character(),
##   winery = col_character()
## )

Summary

Here are the columns in the dataset that are relevant to our analysis. For the sake of concision, I’ve excluded columns that are neither numerical nor categorical, or that are too specific to be useful (such as the name of the particular winery, the vineyard within the winery, and the paragraph-long description of the wine itself).

df <- df %>% select(country, points, price, province, region_1, region_2, variety)
kable(head(df))

country	points	price	province	region_1	region_2	variety
US	96	235	California	Napa Valley	Napa	Cabernet Sauvignon
Spain	96	110	Northern Spain	Toro	NA	Tinta de Toro
US	96	90	California	Knights Valley	Sonoma	Sauvignon Blanc
US	96	65	Oregon	Willamette Valley	Willamette Valley	Pinot Noir
France	95	66	Provence	Bandol	NA	Provence red blend
Spain	95	73	Northern Spain	Toro	NA	Tinta de Toro

Data Analysis

Price vs. Quality

The plot below shows how wine quality, measured in points, varies with price. Points are the rating out of 100 that WineEnthusiast gave the wine. Please note that WineEnthusiast only posts ratings for wines rated 80 points or above. Also, I’ve restricted the x-axis to the $0 to $1,200 range because a handful of exorbitantly expensive wines made the rest of the data difficult to see. Because xlim() drops out-of-range rows, those wines are also left out of the smoothed trend line.

ggplot(data = df, mapping = aes(x = price, y = points)) +
  geom_point(alpha = 0.05) + geom_smooth(color = "rosybrown1") + xlim(0, 1200)

As this graph shows, point ratings rise steeply with price across the $0 to $250 range, but once the price exceeds around $500 the trend plateaus: we no longer see a meaningful increase in quality. If you’re buying $500+ wine, it’s probably less about the quality of the wine than about proving to your friends that you’re extravagantly rich.
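
One way to put rough numbers on that plateau is to average points within price bands. The following is only a sketch: the helper name points_by_price_band and the band boundaries are my own assumptions for illustration, not something taken from the dataset.

```r
library(dplyr)

# Hypothetical helper: mean points within price bands.
# The default breaks are arbitrary and only meant for illustration.
points_by_price_band <- function(wines,
                                 breaks = c(0, 50, 100, 250, 500, Inf)) {
  wines %>%
    # Drop wines with a missing price or rating
    filter(!is.na(price), !is.na(points)) %>%
    # Assign each wine to a price band such as (100,250]
    mutate(price_band = cut(price, breaks = breaks, include.lowest = TRUE)) %>%
    group_by(price_band) %>%
    summarize(mean_points = mean(points), n_wines = n())
}
```

Calling points_by_price_band(df) should return one row per non-empty band. If the plateau is real, the mean for the highest band should sit only slightly above the one below it.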

Ranking Average Point Scores by Country

The following is a ranked list of the average points per country that aims to show which countries produce the best wine. I also included a column showing the total number of wines reviewed, to ensure that any conclusions drawn are put into context.

avg_points_by_country <- df %>%
  filter(!is.na(country)) %>%
  group_by(country) %>%
  summarize(mean = mean(points, na.rm = TRUE), total_wines = n())
kable(avg_points_by_country %>% arrange(desc(mean)))

country	mean	total_wines
England	92.88889	9
Austria	89.27674	3057
France	88.92587	21098
Germany	88.62643	2452
Italy	88.41366	23478
Canada	88.23980	196
Slovenia	88.23404	94
Morocco	88.16667	12
Turkey	88.09615	52
Portugal	88.05769	5322
Albania	88.00000	2
US-France	88.00000	1
Australia	87.89248	4957
US	87.81879	62397
Serbia	87.71429	14
India	87.62500	8
New Zealand	87.55422	3320
Hungary	87.32900	231
Switzerland	87.25000	4
South Africa	87.22542	2258
Israel	87.17619	630
Luxembourg	87.00000	9
Spain	86.64659	8268
Chile	86.29677	5816
Croatia	86.28090	89
Greece	86.11765	884
Tunisia	86.00000	2
Argentina	85.99609	5631
Cyprus	85.87097	31
Czech Republic	85.83333	6
Lebanon	85.70270	37
Georgia	85.51163	43
Bulgaria	85.46753	77
Japan	85.00000	2
Romania	84.92086	139
Macedonia	84.81250	16
Mexico	84.76190	63
Bosnia and Herzegovina	84.75000	4
Moldova	84.71831	71
Ukraine	84.60000	5
Uruguay	84.47826	92
Lithuania	84.25000	8
Egypt	83.66667	3
Slovakia	83.66667	3
Brazil	83.24000	25
China	82.00000	3
Montenegro	82.00000	2
South Korea	81.50000	4

Surprisingly (to me at least), England topped the list, ranking even higher than France and Italy. However, it’s worth noting that only 9 English wines were sampled. We can’t really assert that English wines are by and large the best wines on the strength of their high mean alone; we’ve only looked at a very small sample of English wines (compared to, say, France, for which we’ve analyzed 21,098 wines).
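
To keep tiny samples from dominating the ranking, one option is to require a minimum number of reviews per country and to report a standard error next to each mean. This is a sketch under my own assumptions: the helper name rank_countries and the cutoff of 100 wines are arbitrary choices for illustration.

```r
library(dplyr)

# Hypothetical helper: rank countries by mean points, keeping only those
# with at least `min_wines` reviews. The default of 100 is an arbitrary cutoff.
rank_countries <- function(wines, min_wines = 100) {
  wines %>%
    filter(!is.na(country)) %>%
    group_by(country) %>%
    summarize(
      mean = mean(points, na.rm = TRUE),
      # Standard error of the mean: sd / sqrt(number of rated wines)
      std_error = sd(points, na.rm = TRUE) / sqrt(sum(!is.na(points))),
      total_wines = n()
    ) %>%
    filter(total_wines >= min_wines) %>%
    arrange(desc(mean))
}
```

With rank_countries(df), countries such as England (9 wines) would drop out of the ranking, and the std_error column gives a rough sense of how much each remaining mean could move.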

I decided that a clearer way to visualize the ranking would be a jittered scatterplot, shown below. Please note that I’ve only used the 5 top-ranked countries, because there are too many countries to plot. I also overlaid a box plot so the medians and distributions of the data can be better understood.

top_countries <- c("England", "Austria", "France", "Germany", "Italy")
top_5 <- df %>%
  filter(country %in% top_countries) %>%
  # Order the factor levels by rank so the x-axis follows the table above
  mutate(country = factor(country, levels = top_countries)) %>%
  arrange(country)
ggplot(data = top_5, mapping = aes(x = country, y = points)) +
  geom_point(alpha = 0.05, position = "jitter") +
  labs(x = "Country", y = "Points") +
  geom_boxplot(color = "rosybrown1", alpha = 0)

Here we have a clearer picture of how points are distributed within each of the top-ranked countries. We can see that while England has the highest mean, it has extremely few data points compared to the other countries in the top 5.

Here’s a world map of the average point values per country. I’ve excluded Antarctica, because as far as I know, penguins aren’t very prolific winemakers.

world <- map_data("world") %>% left_join(avg_points_by_country, by = c("region" = "country"))
ggplot(data = world, 
               mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = mean)) +
  scale_fill_continuous(low="rosybrown1", high="darkred", 
                        na.value="snow2") +
  coord_map(xlim = c(-180,180), ylim = c(-60, 80)) +
  theme(
    axis.title.x = element_blank(),
    axis.text.x  = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y  = element_blank(),
    axis.ticks.y = element_blank(),
    panel.background = element_rect(fill = "white"))
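
One caveat with the join above: it matches on exact names, and as far as I know map_data("world") labels the United States as "USA" and the United Kingdom as "UK", so "US" and "England" would be left unfilled. The sketch below recodes those two names and lists any countries that still fail to match. The helper names are hypothetical, and colouring the whole UK with England's score is a simplification.

```r
library(dplyr)
library(ggplot2)  # map_data() also needs the maps package installed

# Assumption: map_data("world") spells these regions "USA" and "UK".
to_map_region <- function(country) {
  case_when(
    country == "US" ~ "USA",
    country == "England" ~ "UK",
    TRUE ~ country
  )
}

# Hypothetical helper: dataset countries with no matching map region.
unmatched_countries <- function(avg_points, map_df = map_data("world")) {
  avg_points %>%
    mutate(region = to_map_region(country)) %>%
    anti_join(distinct(map_df, region), by = "region") %>%
    pull(country)
}
```

Running unmatched_countries(avg_points_by_country) before plotting would show which names still need attention. The recoded region column can then be used as the join key in place of country.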

Countries with a higher average point score are shown in a darker red (a nice Cabernet Sauvignon, perhaps?), while those with a lower average point score are shown in a paler pink. Countries with no wines in the dataset are shown in grey. Unsurprisingly, Europe seems to have the highest concentration of good-wine-producing countries. Visually, Canada, Australia, India, and South Africa are doing pretty well too. China’s doing its best, but its wines are apparently pretty bad (that said, we’ve only looked at 3).

Conclusion

This analysis suggests that, generally, if you’re looking for a pretty solid wine, your best bet is a European wine (particularly French or Italian) within your price range. That is, unless your price range is $500+, in which case maybe reevaluate your priorities and opt for a wine around $250 instead, because chances are it’ll be of similar quality.

Some deviations from my project proposal: I naively didn’t realize that grapes used in winemaking are not simply “red” or “not red”; the dataset lists 632 different varieties. I wanted to compare white and red wine, but found it difficult to do so given that I’d have to somehow classify each of the 632 varieties as red or white.
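
A partial workaround might be a hand-made lookup table covering only the most common varieties, leaving the rest unclassified. The following is a sketch, not something I ran for this report: the table is deliberately incomplete, the helper name add_wine_color is hypothetical, and apart from the varieties shown in the summary table I have not checked that these names are spelled exactly as in the dataset.

```r
library(dplyr)

# Hypothetical, deliberately incomplete lookup of grape variety to color.
variety_colors <- tibble(
  variety = c("Cabernet Sauvignon", "Pinot Noir", "Merlot",
              "Sauvignon Blanc", "Chardonnay", "Riesling"),
  color   = c("red", "red", "red",
              "white", "white", "white")
)

# Attach a color column; varieties missing from the lookup get NA.
add_wine_color <- function(wines, lookup = variety_colors) {
  left_join(wines, lookup, by = "variety")
}
```

Running add_wine_color(df) %>% count(color) would show how many wines the lookup covers and how many are still NA, which tells you whether extending the table by hand is worth it.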

I also thought I’d be able to do more map-based analysis on increasingly smaller scales (USA, then California, etc.), but that proved difficult too, because the region labels in the wine dataset don’t match well with those in map_data. To be specific, while “province” within “US” does contain some state names, some wines are labelled with a province of “America”, which provides no useful additional information but would be misleading to exclude. Further, the wine regions within California do not align with county lines, which makes them hard to plot. Lastly, I wish I knew how to do text-based analysis/parsing, because I feel like looking at keywords from the descriptions of wines would also be really interesting. Maybe next time!