Introduction

This is an analysis of 1795 chocolate bars from around the world. From the data, we will be looking at where the chocolate bars were manufactured, the country of origin of the cacao beans, and the overall rating, along with a few other details. The data was compiled by Brady Brelinski of the Manhattan Chocolate Society, and is accessible through Kaggle.

Data Import and Headings

library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(knitr)
df1 <- read.csv("flavors_of_cacao.csv")

The data set that we have imported contains more information than we need, so we’ll trim it down to the columns we’ll be analyzing:

df1 <- df1 %>% select(Cocoa.Percent, Company.Location, Rating, Broad.Bean.Origin)
kable(head(df1))
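|Cocoa.Percent |Company.Location | Rating|Broad.Bean.Origin |
|:-------------|:----------------|------:|:-----------------|
|63%           |France           |   3.75|Sao Tome          |
|70%           |France           |   2.75|Togo              |
|70%           |France           |   3.00|Togo              |
|70%           |France           |   3.50|Togo              |
|70%           |France           |   3.50|Peru              |
|70%           |France           |   2.75|Venezuela         |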

Data Analysis

Before we start any real analysis, it would be a good idea to first get a feel for our rating system. Rating systems can often seem arbitrary or even plain unfair. If it turns out that our ratings are skewed in a particular direction, we might want to proceed with a bit more caution as we continue with our analysis.

Note: the ratings that we imported were represented as characters, so we’ll convert them to doubles before we plot.

That being said, here’s a distribution of the rating scores our reviewers gave all of the chocolate bars:

df1$Rating <- as.double(df1$Rating)
ggplot(data = df1) + geom_histogram(mapping = aes(x = Rating))
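
Since the conversion above silently turns anything unparseable into NA, it can be worth checking how many values were lost. The snippet below is only a sketch on invented values (the vector `raw_rating` and its "n/a" entry are hypothetical, not taken from the data set); it shows one way to count and list entries that fail to convert.

```r
# Hypothetical sample of ratings stored as text; "n/a" is an invented bad value
raw_rating <- c("3.75", "2.75", "3.00", "n/a", "3.50")

# as.double() returns NA (with a warning) for anything it cannot parse
rating <- suppressWarnings(as.double(raw_rating))

# Flag entries that were present as text but became NA after conversion
failed <- is.na(rating) & !is.na(raw_rating)
n_failed <- sum(failed)
failed_values <- raw_rating[failed]

print(n_failed)
print(failed_values)
```

If the count is zero on the real column, the histogram and summary cover every bar; otherwise the listed values show what needs cleaning first.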

And a quantitative summary:

summary(df1$Rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.875   3.250   3.186   3.500   5.000

From our histogram and a quick summary of our ratings, it appears that the ratings are fairly unbiased: the median (3.25) and the mean (3.19) sit close together, near the middle of the 1 to 5 scale. According to the description of how the rating system operates, a rating of 3 means “Satisfactory(3.0) to praiseworthy(3.75) (well made with special qualities).”

So, we can (probably) trust our reviewers. Though if you happen to like Walmart chocolate, by all means take this with a grain of salt.

Let’s start by tackling the number one thing every so-called “foodie” seems to say about chocolate: imported chocolate is simply “better” than the chocolate we can find here in the United States. Here’s a boxplot comparing where chocolate bars were manufactured vs. their rating.

Because our data set is rather large, we’ll trim down the countries to the 5 that have the greatest number of data points.

head(sort(table(df1$Company.Location), decreasing = TRUE))
## 
##  U.S.A.  France  Canada    U.K.   Italy Ecuador 
##     764     156     125      96      63      54

We see that the top 5 countries represented are: the U.S.A., France, Canada, the U.K., and Italy. From this, we’ll filter our data down to those countries and plot:

df2 <- df1 %>% filter(Company.Location %in% c("U.S.A.", "France", "Canada", "U.K.", "Italy"))
df2$Rating <- as.double(df2$Rating)
ggplot(data = df2, mapping = aes(x = Company.Location, y = Rating)) + geom_boxplot()
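
The five country names above were typed in by hand after reading the table. As a rough sketch, the same selection could be made programmatically; the `location` vector below is an invented stand-in for df1$Company.Location, and the counts are made up.

```r
# Hypothetical stand-in for df1$Company.Location (invented counts)
location <- c(rep("U.S.A.", 6), rep("France", 5), rep("Canada", 4),
              rep("U.K.", 3), rep("Italy", 2), "Ecuador")

# Sort the counts from largest to smallest and keep the first five names
counts <- sort(table(location), decreasing = TRUE)
top5 <- names(counts)[1:5]

# Keep only the entries whose location is one of the top five
kept <- location[location %in% top5]

print(top5)
print(length(kept))
```

With dplyr, the same `top5` vector could be passed to `filter(Company.Location %in% top5)` in place of the hard-coded names.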

So it looks like the ratings are all very close! I guess we can now feel a little bit better knowing that our chocolate can stand up to the likes of our European competitors. It does appear, however, that the top quartile of Italy’s chocolate stands a bit higher than the rest.

Let’s try a different graph. Instead of a boxplot, a violin plot might do better at telling us the distribution of chocolate ratings.

# df2 still holds the top-5 subset built for the boxplot, so we can reuse it
ggplot(data = df2, mapping = aes(x = Company.Location, y = Rating)) + geom_violin()

Now this graph is a bit more telling. From the looks of the graph, it appears that Canada actually has the largest proportion of above-average chocolates. If you were to buy your chocolate from Italy, yes, you might find some incredible chocolates, but you also run the risk of getting some pretty horrid ones as well.
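
Reading proportions off a violin plot is a judgment call, so it may help to compute them directly. The sketch below uses invented ratings for two locations (the `toy` data frame is hypothetical and says nothing about the real numbers); it shows one way to get the share of bars rated above the overall mean for each location.

```r
# Hypothetical ratings for two locations (invented values)
toy <- data.frame(
  Company.Location = c("Canada", "Canada", "Canada", "Canada",
                       "Italy", "Italy", "Italy", "Italy"),
  Rating = c(3.50, 3.50, 3.25, 3.00, 4.00, 2.00, 2.50, 3.75)
)

# Mean rating across every bar in the toy data
overall_mean <- mean(toy$Rating)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of above-average bars per location
share_above <- tapply(toy$Rating > overall_mean, toy$Company.Location, mean)

print(share_above)
```

Running the same `tapply()` call on df2 would put a number on the comparison between Canada and the other countries.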

Maybe a better way to see a significant difference in the quality of the bars would be to compare the countries from which the actual cacao beans were sourced. So we’ll first have to do a little trimming:

head(sort(table(df1$Broad.Bean.Origin), decreasing = TRUE))
## 
##          Venezuela            Ecuador               Peru 
##                214                193                165 
##         Madagascar Dominican Republic                    
##                145                141                 73

Here’s a similar data plot, but instead of the top 5 represented countries of manufacturing, we have the top 5 represented source countries of the beans.

df2 <- df1 %>% filter(Broad.Bean.Origin %in% c("Venezuela", "Ecuador", "Peru", "Madagascar", "Dominican Republic"))
df2$Rating <- as.double(df2$Rating)
ggplot(data = df2, mapping = aes(x = Broad.Bean.Origin, y = Rating)) + geom_boxplot()

From this graph, it looks like there really is not much of a difference in overall rating based on where the beans are sourced. You’re going to get fairly decent chocolate regardless!
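
A boxplot can only suggest that the groups look alike. One possible follow-up, which is not part of the original analysis, is to compare group medians and run a Kruskal-Wallis rank test from base R. The `toy` data frame below is invented for illustration, so its result says nothing about the real data.

```r
# Hypothetical ratings for three bean origins (invented values)
toy <- data.frame(
  Broad.Bean.Origin = rep(c("Venezuela", "Ecuador", "Peru"), each = 5),
  Rating = c(3.25, 3.50, 3.00, 3.75, 2.75,
             3.00, 3.25, 3.50, 2.75, 3.25,
             3.25, 3.00, 3.50, 3.75, 3.00)
)

# Median rating for each origin
medians <- tapply(toy$Rating, toy$Broad.Bean.Origin, median)

# Kruskal-Wallis test: do the rating distributions differ between origins?
result <- kruskal.test(Rating ~ Broad.Bean.Origin, data = toy)

print(medians)
print(result$p.value)
```

A large p-value would be consistent with the "not much of a difference" reading, though it would not prove the origins are identical.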

Here’s something else that might be of interest. Let’s go back to the “foodies”. There’s an almost cultish appeal to dark chocolate. Many people dislike it, but many food enthusiasts claim that dark chocolate simply has a “complexity” that you can’t find in lighter chocolates.

So let’s see how cacao percentage stacks up to overall rating.

df2 <- df1 %>% select(Cocoa.Percent, Rating)
df2$Rating <- as.double(df2$Rating)
df2$Cocoa.Percent <- as.numeric(sub("%", "", df2$Cocoa.Percent))
ggplot(data = df2) + geom_point(mapping = aes(x = Rating, y = Cocoa.Percent))

Once again, no real pattern stands out. It appears that the judges really have no predilection for dark chocolate, for where the beans come from, or for where the bar was made.
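
Because ratings come in fixed steps, many points in a scatterplot overlap, which makes a trend hard to judge by eye. As a hedged sketch, a rank correlation gives a single number to go with the plot. The vectors `cocoa_raw` and `rating` below are invented, so the resulting value is not a finding about the real data.

```r
# Hypothetical cocoa percentages (as text) and ratings (invented values)
cocoa_raw <- c("63%", "70%", "70%", "75%", "85%", "100%")
rating <- c(3.75, 3.50, 3.00, 3.25, 2.75, 2.00)

# Strip the literal percent sign and convert to numbers
cocoa <- as.numeric(sub("%", "", cocoa_raw, fixed = TRUE))

# Spearman's rho works on ranks, which suits ratings given in fixed steps
rho <- cor(cocoa, rating, method = "spearman")

print(cocoa)
print(rho)
```

A value of rho near zero on the real columns would back up the impression from the plot; a clearly negative or positive value would suggest looking again.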

Conclusion

From our analysis, it seems that we have debunked a couple of myths about chocolate. Firstly, it seems that wherever you go, you’re going to find some pretty good chocolates, and some pretty bad ones. This was honestly a bit surprising to me, as I seem to have a slight preference for fancy chocolates found in Europe. Perhaps it’s placebo after all. Also incredibly surprising was that there appears to be no real relationship between the country from which the cacao beans were imported and the quality of the chocolate bar. This goes to show that there’s more to chocolate making than just where the cacao comes from; every step of the process matters!

I must admit, ultimately, that this analysis was a bit useless. I am shamelessly part of the “foodie” crowd I’ve referenced throughout this analysis, and am very happy to be proven wrong. My friend and I have many arguments about food, but what it ultimately comes down to is: do I like this, or do I dislike this? And whatever chocolate you may like, that’s for you to decide!