Introduction

This report analyzes a dataset of wine reviews to evaluate which characteristics of a wine affect its perceived quality, and which characteristics are correlated with each other.

The dataset for my analysis is available on Kaggle at https://www.kaggle.com/zynicide/wine-reviews#winemag-data_first150k.csv.

Setup

Library imports:

library(ggplot2)
library(dplyr)
library(RColorBrewer)
library(knitr)
library(readr)
library(maps)
library(mapproj)

Data import:

df <- read_csv("winemag-data_first150k.csv")
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   country = col_character(),
##   description = col_character(),
##   designation = col_character(),
##   points = col_integer(),
##   price = col_double(),
##   province = col_character(),
##   region_1 = col_character(),
##   region_2 = col_character(),
##   variety = col_character(),
##   winery = col_character()
## )

Summary

Here are the columns in the dataset that are relevant to this analysis. For the sake of concision, I’ve excluded data that is neither numerical nor categorical, or that is too specific to be useful (such as the name of the particular winery, the vineyard within the winery, and the paragraph-long description of the wine itself).

df <- df %>% select(country, points, price, province, region_1, region_2, variety)
kable(head(df))
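| country | points | price | province       | region_1          | region_2          | variety            |
|:--------|-------:|------:|:---------------|:------------------|:------------------|:-------------------|
| US      |     96 |   235 | California     | Napa Valley       | Napa              | Cabernet Sauvignon |
| Spain   |     96 |   110 | Northern Spain | Toro              | NA                | Tinta de Toro      |
| US      |     96 |    90 | California     | Knights Valley    | Sonoma            | Sauvignon Blanc    |
| US      |     96 |    65 | Oregon         | Willamette Valley | Willamette Valley | Pinot Noir         |
| France  |     95 |    66 | Provence       | Bandol            | NA                | Provence red blend |
| Spain   |     95 |    73 | Northern Spain | Toro              | NA                | Tinta de Toro      |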

Data Analysis

Price vs. Quality

The following plot shows how wine quality (as judged by the WineEnthusiast point system) varies with price. The points are the rating out of 100 that WineEnthusiast gave the wine. Please note that WineEnthusiast only posts ratings for wines rated 80 points or above. Also, I’ve restricted the x-axis to $1,200 because a few exorbitantly expensive wines made the rest of the data difficult to see. Note that xlim() drops the wines above that limit rather than just zooming in, so the smoothed trend line is fitted without them.

ggplot(data = df, mapping = aes(x = price, y = points)) +
  geom_point(alpha = 0.05) + geom_smooth(color = "rosybrown1") + xlim(0, 1200)

As this graph shows, point ratings rise steeply with price between $0 and about $250, but once the price exceeds roughly $500 the trend plateaus: paying more no longer comes with a meaningful increase in rating. If you’re buying $500+ wine, it’s probably not so much about quality wine as it is about proving to your friends that you’re extravagantly rich.
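One rough way to check the plateau numerically is to summarize the ratings within price bands. The sketch below is not part of the original analysis: `points_by_price_band` and the band cut-offs are hypothetical choices made for illustration, and it uses base R only so that it runs without extra packages.

```r
# Hypothetical helper (not in the original report): summarize points per price band.
# The default breaks are illustrative assumptions, not values taken from the data.
points_by_price_band <- function(data, breaks = c(0, 25, 50, 100, 250, 500, Inf)) {
  # Wines with a missing price or score cannot be placed in a band.
  keep <- !is.na(data$price) & !is.na(data$points)
  # cut() assigns each price to an interval such as (25,50].
  band <- cut(data$price[keep], breaks = breaks)
  pts <- data$points[keep]
  data.frame(
    band = levels(band),
    n_wines = as.vector(table(band)),                       # wines per band
    median_points = as.vector(tapply(pts, band, median)),   # NA for empty bands
    mean_points = as.vector(tapply(pts, band, mean))
  )
}
```

Calling `points_by_price_band(df)` on the data frame from the Setup section would give one row per band. If the median stops climbing in the top bands while `n_wines` shrinks, that would be consistent with the plateau seen in the plot.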

Ranking Average Point Scores by Country

The following is a ranked list of the average points per country that aims to show which countries produce the best wine. I also included a column showing the total number of wines reviewed, to ensure that any conclusions drawn are put into context.

avg_points_by_country <- df %>%
  filter(!is.na(country)) %>%
  group_by(country) %>%
  summarize(mean = mean(points, na.rm = TRUE), total_wines = n())
kable(avg_points_by_country %>% arrange(desc(mean)))
| country                |     mean | total_wines |
|:-----------------------|---------:|------------:|
| England                | 92.88889 |           9 |
| Austria                | 89.27674 |        3057 |
| France                 | 88.92587 |       21098 |
| Germany                | 88.62643 |        2452 |
| Italy                  | 88.41366 |       23478 |
| Canada                 | 88.23980 |         196 |
| Slovenia               | 88.23404 |          94 |
| Morocco                | 88.16667 |          12 |
| Turkey                 | 88.09615 |          52 |
| Portugal               | 88.05769 |        5322 |
| Albania                | 88.00000 |           2 |
| US-France              | 88.00000 |           1 |
| Australia              | 87.89248 |        4957 |
| US                     | 87.81879 |       62397 |
| Serbia                 | 87.71429 |          14 |
| India                  | 87.62500 |           8 |
| New Zealand            | 87.55422 |        3320 |
| Hungary                | 87.32900 |         231 |
| Switzerland            | 87.25000 |           4 |
| South Africa           | 87.22542 |        2258 |
| Israel                 | 87.17619 |         630 |
| Luxembourg             | 87.00000 |           9 |
| Spain                  | 86.64659 |        8268 |
| Chile                  | 86.29677 |        5816 |
| Croatia                | 86.28090 |          89 |
| Greece                 | 86.11765 |         884 |
| Tunisia                | 86.00000 |           2 |
| Argentina              | 85.99609 |        5631 |
| Cyprus                 | 85.87097 |          31 |
| Czech Republic         | 85.83333 |           6 |
| Lebanon                | 85.70270 |          37 |
| Georgia                | 85.51163 |          43 |
| Bulgaria               | 85.46753 |          77 |
| Japan                  | 85.00000 |           2 |
| Romania                | 84.92086 |         139 |
| Macedonia              | 84.81250 |          16 |
| Mexico                 | 84.76190 |          63 |
| Bosnia and Herzegovina | 84.75000 |           4 |
| Moldova                | 84.71831 |          71 |
| Ukraine                | 84.60000 |           5 |
| Uruguay                | 84.47826 |          92 |
| Lithuania              | 84.25000 |           8 |
| Egypt                  | 83.66667 |           3 |
| Slovakia               | 83.66667 |           3 |
| Brazil                 | 83.24000 |          25 |
| China                  | 82.00000 |           3 |
| Montenegro             | 82.00000 |           2 |
| South Korea            | 81.50000 |           4 |

Surprisingly (to me at least), England topped the list, ranking even higher than France and Italy. However, it’s worth noting that only 9 English wines were sampled. We can’t really assert that English wines are by and large the best wines because of their high mean score alone; we’ve only looked at a very, very small sample of English wines (compared to, say, France, for which we’ve analyzed 21,098 wines).
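To put a number on how much a small sample should be trusted, one option is to report a standard error next to each mean, or to rank only countries with a minimum number of reviews. The following is a minimal sketch under those assumptions: `country_summary` and its `min_wines` threshold are hypothetical and were not used to produce the table above.

```r
# Hypothetical helper: per-country mean, sample size, and standard error of the mean.
country_summary <- function(data, min_wines = 1) {
  data <- data[!is.na(data$country) & !is.na(data$points), ]
  # One vector of scores per country.
  groups <- split(data$points, data$country)
  out <- data.frame(
    country = names(groups),
    mean = vapply(groups, mean, numeric(1)),
    total_wines = vapply(groups, length, integer(1)),
    row.names = NULL
  )
  # Standard error = sd / sqrt(n); it is NA for a country with a single wine.
  out$se <- unname(vapply(groups, sd, numeric(1))) / sqrt(out$total_wines)
  # Optionally drop countries with too few reviews to rank meaningfully.
  out <- out[out$total_wines >= min_wines, ]
  out[order(-out$mean), ]
}
```

For example, `country_summary(df, min_wines = 100)` would rank only countries with at least 100 reviewed wines (100 being an arbitrary cut-off). A large `se` flags a mean that rests on few or widely scattered scores.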

I decided that a more informative way to visualize this would be a jittered scatterplot, shown below. Please note that I’ve only used the 5 top-ranked countries, because there are too many countries to plot. I also overlaid a box plot so the medians and distributions of the data can be better understood.

top_countries <- c("England", "Austria", "France", "Germany", "Italy")
top_5 <- df %>%
  # Keep only the five top-ranked countries; %in% avoids repeating df$country.
  filter(country %in% top_countries) %>%
  # Set the factor levels in rank order so the x-axis follows the ranking.
  mutate(country = factor(country, levels = top_countries))
ggplot(data = top_5, mapping = aes(x = country, y = points)) +
  geom_point(alpha = 0.05, position = "jitter") +
  labs(x = "Country", y = "Points") +
  geom_boxplot(color = "rosybrown1", alpha = 0)

Here we get a fuller picture of how the scores are distributed in each of the top-ranked countries. We can see that while England has the highest mean, it has far fewer data points than the other four countries.

Here’s a world map of the average point values per country. I’ve excluded Antarctica, because as far as I know, penguins aren’t very prolific winemakers.

world <- map_data("world") %>% left_join(avg_points_by_country, by = c("region" = "country"))
ggplot(data = world, 
               mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = mean)) +
  scale_fill_continuous(low="rosybrown1", high="darkred", 
                        na.value="snow2") +
  coord_map(xlim = c(-180,180), ylim = c(-60, 80)) +
  theme(
    axis.title.x = element_blank(),
    axis.text.x  = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y  = element_blank(),
    axis.ticks.y = element_blank(),
    panel.background = element_rect(fill = "white"))

Countries with a higher average point score are shown in a darker red (a nice Cabernet Sauvignon, perhaps?), while those with a lower average point score are shown in a paler pink. Countries with no wines in the dataset are shown in grey. Unsurprisingly, Europe seems to have the highest concentration of good-wine-producing countries. Visually, Canada, Australia, India, and South Africa are doing pretty well too. China’s doing its best, but its wines are apparently pretty bad (that said, we’ve only looked at 3).
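One caveat about the join behind this map: as far as I know, map_data("world") labels the United States as "USA" and the United Kingdom as "UK" in its region column, so the dataset’s "US" and "England" rows would not match and those countries would be drawn in grey. The sketch below shows one way to recode the labels before joining. `map_names` and `to_map_region` are hypothetical names, the lookup is an assumption worth verifying against the installed maps version, and mapping England onto the whole UK is only an approximation.

```r
# Hypothetical lookup from dataset labels to the region names that
# map_data("world") is assumed to use; verify against your maps version.
map_names <- c("US" = "USA", "England" = "UK")

# Replace any label found in the lookup; leave every other label untouched.
to_map_region <- function(country, lookup = map_names) {
  hit <- country %in% names(lookup)
  country[hit] <- unname(lookup[country[hit]])
  country
}
```

With the objects from this report, one could add a column such as `region = to_map_region(country)` to avg_points_by_country and join on it instead. Running `anti_join(avg_points_by_country, map_data("world"), by = c("country" = "region"))` should list any remaining labels that have no counterpart on the map (for instance, "US-France" is not a country and cannot match).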

Conclusion

This analysis suggests that, generally, if you’re looking for a pretty solid wine, your best bet is a European wine (particularly French or Italian) within your price range. If your price range is $500+, though, maybe reevaluate your priorities and opt for a wine around $250 instead, because chances are it’ll be of similar quality.

Some deviations from my project proposal: I naively didn’t realize that grapes used in winemaking are not simply “red” or “not red” but instead come in any of 632 different varieties. I wanted to compare white and red wines, but found it difficult to do so given that I’d have to somehow classify the 632 varieties as red or white.
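A partial workaround would be a hand-written lookup covering only the most common varieties, leaving everything else unclassified. This is a sketch rather than something done in this report: `variety_colour` and `classify_variety` are hypothetical, and the lookup is deliberately incomplete (blends and rosés would need separate decisions).

```r
# Hypothetical, deliberately incomplete lookup: a handful of well-known varieties.
variety_colour <- c(
  "Cabernet Sauvignon" = "red",
  "Pinot Noir"         = "red",
  "Merlot"             = "red",
  "Syrah"              = "red",
  "Chardonnay"         = "white",
  "Sauvignon Blanc"    = "white",
  "Riesling"           = "white"
)

# Returns "red", "white", or NA for any variety missing from the lookup.
classify_variety <- function(variety, lookup = variety_colour) {
  unname(lookup[variety])
}
```

Running `table(classify_variety(df$variety), useNA = "ifany")` would show how many wines the lookup covers and how many remain unclassified, which indicates whether extending it by hand is worthwhile.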

I also thought I’d be able to do more map-based analysis on increasingly smaller scales (USA, then California, etc.) but found that difficult too, because the region labels in the wine dataset don’t match well with those in map_data. To be specific, while “province” within “US” does have certain state values, some wines are labelled with a province of “America”, which provides no additional useful information but would be misleading to exclude. Further, the regions within California do not align with county lines, which makes them hard to plot as well.

Lastly, I wish I knew how to do text-based analysis/parsing, because I feel like looking at keywords from the descriptions of wines would also be really interesting. Maybe next time!
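As a possible starting point for that, a first pass at keyword counts needs little more than splitting the descriptions into words and tabulating them. The sketch below assumes plain English text; `top_words` and its short stop-word list are hypothetical, and splitting on non-letters will break up accented words.

```r
# Hypothetical helper: the n most frequent words across a character vector.
top_words <- function(text, n = 10,
                      stop_words = c("the", "and", "a", "of", "with",
                                     "is", "this", "in", "to", "it")) {
  text <- text[!is.na(text)]
  # Lower-case, then split on anything that is not a letter a-z.
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  # Drop empty strings left by the split, and very common filler words.
  words <- words[nchar(words) > 0 & !(words %in% stop_words)]
  head(sort(table(words), decreasing = TRUE), n)
}
```

Since the description column was dropped in the Summary step, this would need the original CSV re-read (or the column kept) before calling something like `top_words(descriptions, n = 20)`, where `descriptions` stands for that column.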