This report analyzes a dataset of wine reviews to evaluate which characteristics of a wine affect its perceived quality, and which characteristics are correlated with one another.
The dataset for my analysis is available on Kaggle at https://www.kaggle.com/zynicide/wine-reviews#winemag-data_first150k.csv.
Library imports:
library(ggplot2)
library(dplyr)
library(RColorBrewer)
library(knitr)
library(readr)
library(maps)
library(mapproj)
Data import:
df <- read_csv("winemag-data_first150k.csv")
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## country = col_character(),
## description = col_character(),
## designation = col_character(),
## points = col_integer(),
## price = col_double(),
## province = col_character(),
## region_1 = col_character(),
## region_2 = col_character(),
## variety = col_character(),
## winery = col_character()
## )
Here are the columns in the dataset that are relevant to our analysis. For the sake of concision, I’ve excluded columns that are neither numerical nor categorical, or that are too specific to be useful (such as the name of the particular winery, the vineyard within the winery, and the paragraph-long description of the wine itself).
df <- df %>% select(country, points, price, province, region_1, region_2, variety)
kable(head(df))
country | points | price | province | region_1 | region_2 | variety |
---|---|---|---|---|---|---|
US | 96 | 235 | California | Napa Valley | Napa | Cabernet Sauvignon |
Spain | 96 | 110 | Northern Spain | Toro | NA | Tinta de Toro |
US | 96 | 90 | California | Knights Valley | Sonoma | Sauvignon Blanc |
US | 96 | 65 | Oregon | Willamette Valley | Willamette Valley | Pinot Noir |
France | 95 | 66 | Provence | Bandol | NA | Provence red blend |
Spain | 95 | 73 | Northern Spain | Toro | NA | Tinta de Toro |
The plot below shows how wine quality (judged by the WineEnthusiast point system) varies with price. The points are the rating out of 100 that WineEnthusiast gave the wine. Please note that WineEnthusiast only posts ratings for wines rated 80 points or above. Also, I’ve restricted the x-axis to the $0 to $1,200 range, because a few exorbitantly expensive wines made the rest of the data difficult to see.
ggplot(data = df, mapping = aes(x = price, y = points)) +
  geom_point(alpha = 0.05) +
  geom_smooth(color = "rosybrown1") +
  xlim(0, 1200)
As this graph shows, point ratings rise steeply with price between $0 and about $250, but once the price exceeds around $500 the trend plateaus, and a higher price no longer buys much of an increase in quality. If you’re buying $500+ wine, it’s probably not so much about quality wine as it is proving to your friends that you’re extravagantly rich.
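To put rough numbers on the trend, one option is to average the points within price bands and compute a rank correlation. This is only a minimal sketch on made-up toy data: the `toy` data frame, its values, and the band breaks are all assumptions for illustration, not results from the wine dataset.

```r
library(dplyr)

# Toy data standing in for df; these prices and points are invented.
toy <- data.frame(
  price  = c(10, 20, 40, 80, 150, 240, 400, 600, 900, NA),
  points = c(83, 85, 87, 89, 91, 93, 94, 94, 95, 88)
)

# Price bands are an assumption; adjust the breaks as needed.
by_band <- toy %>%
  filter(!is.na(price)) %>%
  mutate(band = cut(price, breaks = c(0, 50, 250, 500, Inf))) %>%
  group_by(band) %>%
  summarize(mean_points = mean(points), n = n())

# Spearman's rank correlation does not assume a straight-line
# relationship, which suits a trend that flattens out.
rho <- cor(toy$price, toy$points,
           method = "spearman", use = "complete.obs")

print(by_band)
print(rho)
```

Applied to the real data, `toy` would be replaced by `df`. If the plateau is real, the mean points of the top two bands should sit close together.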
The following is a ranked list of the average points per country that aims to show which countries produce the best wine. I also included a column showing the total number of wines reviewed, to ensure that any conclusions drawn are put into context.
avg_points_by_country <- df %>%
  filter(!is.na(country)) %>%
  group_by(country) %>%
  summarize(mean = mean(points, na.rm = TRUE), total_wines = n())
kable(avg_points_by_country %>% arrange(desc(mean)))
country | mean | total_wines |
---|---|---|
England | 92.88889 | 9 |
Austria | 89.27674 | 3057 |
France | 88.92587 | 21098 |
Germany | 88.62643 | 2452 |
Italy | 88.41366 | 23478 |
Canada | 88.23980 | 196 |
Slovenia | 88.23404 | 94 |
Morocco | 88.16667 | 12 |
Turkey | 88.09615 | 52 |
Portugal | 88.05769 | 5322 |
Albania | 88.00000 | 2 |
US-France | 88.00000 | 1 |
Australia | 87.89248 | 4957 |
US | 87.81879 | 62397 |
Serbia | 87.71429 | 14 |
India | 87.62500 | 8 |
New Zealand | 87.55422 | 3320 |
Hungary | 87.32900 | 231 |
Switzerland | 87.25000 | 4 |
South Africa | 87.22542 | 2258 |
Israel | 87.17619 | 630 |
Luxembourg | 87.00000 | 9 |
Spain | 86.64659 | 8268 |
Chile | 86.29677 | 5816 |
Croatia | 86.28090 | 89 |
Greece | 86.11765 | 884 |
Tunisia | 86.00000 | 2 |
Argentina | 85.99609 | 5631 |
Cyprus | 85.87097 | 31 |
Czech Republic | 85.83333 | 6 |
Lebanon | 85.70270 | 37 |
Georgia | 85.51163 | 43 |
Bulgaria | 85.46753 | 77 |
Japan | 85.00000 | 2 |
Romania | 84.92086 | 139 |
Macedonia | 84.81250 | 16 |
Mexico | 84.76190 | 63 |
Bosnia and Herzegovina | 84.75000 | 4 |
Moldova | 84.71831 | 71 |
Ukraine | 84.60000 | 5 |
Uruguay | 84.47826 | 92 |
Lithuania | 84.25000 | 8 |
Egypt | 83.66667 | 3 |
Slovakia | 83.66667 | 3 |
Brazil | 83.24000 | 25 |
China | 82.00000 | 3 |
Montenegro | 82.00000 | 2 |
South Korea | 81.50000 | 4 |
Surprisingly (to me at least), England topped the list, ranking even higher than France and Italy. However, it’s worth noting that only 9 English wines were sampled. We can’t assert that English wines are by and large the best wines on the strength of their high mean alone, because we’ve only looked at a very small sample of them (compared to, say, France, for which we’ve analyzed 21,098 wines).
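One way to keep tiny samples from dominating the ranking is to require a minimum number of reviews before a country is ranked. The sketch below uses four rows copied from the table above. The cutoff `min_wines = 100` is an arbitrary assumption, not a statistically derived threshold.

```r
library(dplyr)

# Four rows copied from the ranked table above.
toy_summary <- data.frame(
  country     = c("England", "Austria", "France", "Albania"),
  mean        = c(92.88889, 89.27674, 88.92587, 88.00000),
  total_wines = c(9, 3057, 21098, 2)
)

# Assumed cutoff: rank only countries with at least this many reviews.
min_wines <- 100

ranked <- toy_summary %>%
  filter(total_wines >= min_wines) %>%
  arrange(desc(mean))

print(ranked)
```

With this cutoff, England and Albania drop out and Austria leads the ranking.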
I decided that a more accurate way to visualize this would be a jittered scatterplot, shown below. Please note that I’ve only used the 5 top-ranked countries, because there are too many countries to plot. I also overlaid a box plot so the medians and distributions of the data can be better understood.
top_countries <- c("England", "Austria", "France", "Germany", "Italy")
top_5 <- df %>%
  # keep only the five top-ranked countries
  filter(country %in% top_countries) %>%
  # fix the factor order so the x-axis follows the ranking
  mutate(country = factor(country, levels = top_countries)) %>%
  arrange(country)
ggplot(data = top_5, mapping = aes(x = country, y = points)) +
  geom_point(alpha = 0.05, position = "jitter") +
  labs(x = "Country", y = "Points") +
  geom_boxplot(color = "rosybrown1", alpha = 0)
Here we have a fuller visual representation of how point values are distributed within each country. We can see that while England has the highest mean, it has extremely few data points compared to the other countries in the top 5.
Here’s a world map of the average point values per country. I’ve excluded Antarctica, because as far as I know, penguins aren’t very prolific winemakers.
world <- map_data("world") %>%
  left_join(avg_points_by_country, by = c("region" = "country"))
ggplot(data = world,
       mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = mean)) +
  scale_fill_continuous(low = "rosybrown1", high = "darkred",
                        na.value = "snow2") +
  coord_map(xlim = c(-180, 180), ylim = c(-60, 80)) +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.background = element_rect(fill = "white"))
Countries with a higher average point score are shown in a darker red (a nice Cabernet Sauvignon, perhaps?), while those with a lower average point score are shown in a paler pink. Countries with no wines in the dataset are shown in grey. Unsurprisingly, Europe seems to have the highest concentration of countries producing good wine. Visually, Canada, Australia, India, and South Africa are doing pretty well too. China’s doing its best, but its wines are apparently pretty bad (that said, we’ve only looked at 3).
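A caveat about the join behind this map: a country is only coloured if its name in the wine data exactly matches the region name in the map data. As far as I know, `map_data("world")` labels the United States as "USA" and the United Kingdom as "UK", so "US" and "England" would fall through the join and show up as grey. The sketch below shows one way to check for and patch such mismatches. The `map_regions` data frame is a small hypothetical stand-in for the real region list, and mapping "England" onto "UK" is a simplification.

```r
library(dplyr)

# Hypothetical stand-in for distinct(map_data("world"), region).
map_regions <- data.frame(region = c("USA", "UK", "France", "Italy"))

# A few rows copied from the country table above.
wine_countries <- data.frame(
  country = c("US", "England", "France", "Italy"),
  mean    = c(87.81879, 92.88889, 88.92587, 88.41366)
)

# Countries that would be silently dropped by the join.
unmatched <- anti_join(wine_countries, map_regions,
                       by = c("country" = "region"))

# Assumed name fixes; extend this after inspecting `unmatched`.
fixes <- c("US" = "USA", "England" = "UK")

recoded <- wine_countries %>%
  mutate(country = ifelse(country %in% names(fixes),
                          unname(fixes[country]),
                          country))

still_unmatched <- anti_join(recoded, map_regions,
                             by = c("country" = "region"))

print(unmatched)
print(still_unmatched)
```

Running the same `anti_join` against the real map regions would list every country whose average never reaches the map.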
This analysis suggests that, generally, if you’re looking for a pretty solid wine, your best bet is a European wine (particularly French or Italian) within your price range. If your price range is $500+, though, maybe reevaluate your priorities and opt for a wine around $250 instead, because chances are it’ll be of similar quality.
Some deviations from my project proposal: I naively didn’t realize that grapes used in winemaking are not simply “red” or “not red”; the dataset contains 632 different grape varieties. I wanted to compare white and red wines, but found it difficult to do so given that I’d have to somehow classify all 632 varieties as red or white.
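One possible approach to the red-versus-white question is a hand-built lookup table joined onto the data, starting with the most common varieties. This is a sketch under stated assumptions: `colour_lookup` is a hypothetical table that would have to be filled in by hand, and `toy_wines` contains four rows taken from the preview table earlier in this report.

```r
library(dplyr)

# Hypothetical lookup table; it would need many more rows in practice.
colour_lookup <- data.frame(
  variety = c("Cabernet Sauvignon", "Pinot Noir",
              "Sauvignon Blanc", "Chardonnay"),
  colour  = c("red", "red", "white", "white")
)

# Four rows from the preview table shown earlier.
toy_wines <- data.frame(
  variety = c("Cabernet Sauvignon", "Sauvignon Blanc",
              "Pinot Noir", "Tinta de Toro"),
  points  = c(96, 96, 96, 96)
)

# Varieties missing from the lookup get NA, so gaps are easy to spot.
classified <- toy_wines %>%
  left_join(colour_lookup, by = "variety")

by_colour <- classified %>%
  filter(!is.na(colour)) %>%
  group_by(colour) %>%
  summarize(mean_points = mean(points), n = n())

print(classified)
print(by_colour)
```

"Tinta de Toro" is deliberately left out of the lookup here to show how unclassified varieties surface as `NA`. Blends would still need a judgment call.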
I also thought I’d be able to do more map-based analysis on increasingly smaller scales (USA, then California, etc.) but found that difficult too, because the region labels in the wine dataset don’t match well with those in map_data. To be specific, while “province” within “US” does have certain state values, some wines are labelled with a province of “America”, which provides no additional useful information but would be misleading to exclude. Further, the regions within California do not align with county lines, which makes them hard to plot. Lastly, I wish I knew how to do text-based analysis/parsing, because I feel like looking at keywords from the descriptions of wines would also be really interesting. Maybe next time!
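For what it’s worth, a first pass at keyword analysis doesn’t need much machinery. The base R sketch below lowercases the text, splits it into words, drops a few filler words, and counts what remains. The three descriptions and the tiny stop-word list are invented for illustration; the real input would be the `description` column.

```r
# Invented example descriptions, not taken from the dataset.
descriptions <- c(
  "Ripe cherry and oak with firm tannins.",
  "Bright citrus, crisp and fresh.",
  "Dark cherry, oak, and a long finish."
)

# A deliberately tiny stop-word list (assumption).
stop_words <- c("and", "with", "a", "the")

# Lowercase, then split on anything that is not a letter.
words <- unlist(strsplit(tolower(descriptions), "[^a-z]+"))
words <- words[words != "" & !(words %in% stop_words)]

# Frequency table, most common words first.
counts <- sort(table(words), decreasing = TRUE)
print(head(counts, 5))
```

From there, the counts could be compared between high-scoring and low-scoring wines to see which words go with quality.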