Have you ever pondered whether a muffin is really a breakfast food and not just an excuse to eat cake first thing in the morning? Well, you’ve come to the right blog post! In a previous post, I explained how I created a dataset of the ingredients of 269 cupcake and muffin recipes. In this installment, I’m going to use that dataset to demonstrate some of the important properties of multivariate statistics, specifically the difference between principal component analysis (PCA) and partial least squares regression (PLS).
The data and code to repeat this analysis is available on GitHub. This is by no-means a complete analysis of this dataset and I encourage others to use it. I think the concept of recipes as observations and ingredients as variables is a helpful metaphor for multivariate statistics in general.
Multivariate data means data with many things measured on the same samples or observations. In this example, recipes are the observations and the variables are the ingredients measured in US cups per serving. One common problem associated with multivariate data is that usually many of the variables are correlated.
For example, baking powder, salt, baking soda, oil, milk, spice, and fruit are all strongly correlated with each other. This is called “multicollinearity”. Multicollinearity causes problems for statistical techniques that assume variables are independent, like multiple regression.
Other common difficulties presented by ecological multivariate data include the “curse of dimensionality” (more variables than observations), and missing values.
Unsupervised and Supervised analyses
Principal component analysis (PCA) is a multivariate technique that aims to explain the variation in the ingredient amounts, but is unsupervised. That is, it’s totally agnostic to whether recipes are muffins or cupcakes. Imagine a cloud of points in 3D space. PCA is aiming to draw a line through the spread of that cloud of points. That line explains most of the variation in the data. That line is then rotated and called “principal component 1”. Perpendicular to that, principal component 2 is drawn to explain the second greatest amount of variation in the points. You could then project your points onto this new coordinate space and do some statistical test to determine if your groups (e.g. cupcake or muffin) are different along one or both of these principal components.
Ecologists use unsupervised analyses like PCA all the time, for example to reduce the complexity or “dimensionality” of multivariate datasets like community composition or traits of organisms. But this strategy does not tell you if cupcakes are different from muffins. It tells you: 1) what ingredients vary the most among all cupcake and muffin recipes, and 2) do cupcakes and muffins differ in the amounts of those ingredients, which isn’t exactly the question we are trying to answer.
Partial least squares regression (PLS) and its discriminant analysis extension (PLS-DA) are supervised multivariate statistical techniques. That is, PLS knows about the Y variable (type of recipe) and instead of making a line through the spread in that cloud of points, PLS draws a line that explains the difference between cupcakes and muffins. This actually answers the question “are muffins and cupcakes different?” and tells you which ingredients are most responsible for that difference.
To date, supervised analyses like PLS are uncommon in ecology, even though this may often be the kind of question ecologists want to answer. Additionally, PLS is built to handle multicollinearity, the curse of dimensionality, and missing values, which makes it an excellent tool for analyzing ecological data!
For this blog post, I’m using a subset of the dataset with all frosting ingredients removed (because obviously cupcakes have frosting and muffins don’t). The reason I’m using a subset of only 30 recipes is to more accurately replicate the “curse of dimensionality” that is common in ecological data.
nofrosting.raw <- read_rds(here("data", "nofrosting_wide.rds")) #can be found at github.com/Aariq/cupcakes-vs-muffins set.seed(888) nofrosting <- nofrosting.raw %>% sample_n(30) %>% #puts factor names in title case for prettier plots mutate(type = fct_relabel(type, tools::toTitleCase)) nofrosting
## # A tibble: 30 x 42 ## type recipe_id agave `baking powder` `baking soda` bran butter ## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Cupc… 145206 0 0.00174 0.000868 0 0 ## 2 Cupc… 240140 0 0.000868 0.000694 0 0.0167 ## 3 Cupc… 161019 0 0.000868 0.00174 0 0 ## 4 Muff… 16945 0 0 0 0 0.0741 ## 5 Muff… 228562 0 0.00116 0.00231 0 0.00694 ## 6 Cupc… 215375 0 0.00130 0.000651 0 0 ## 7 Cupc… 242474 0 0.00208 0 0 0 ## 8 Cupc… 233538 0 0.00312 0.000521 0 0.05 ## 9 Cupc… 155534 0 0.00231 0.000579 0 0 ## 10 Muff… 6753 0 0.00694 0 0 0 ## # … with 20 more rows, and 35 more variables: buttermilk <dbl>, ## # cheese <dbl>, chocolate <dbl>, cornmeal <dbl>, cream <dbl>, `cream ## # cheese` <dbl>, eggs <dbl>, flour <dbl>, frosting <dbl>, fruit <dbl>, ## # `fruit juice` <dbl>, honey <dbl>, `low-cal sweetener` <dbl>, ## # margarine <dbl>, mayonnaise <dbl>, milk <dbl>, molasses <dbl>, ## # nut <dbl>, oats <dbl>, oil <dbl>, other <dbl>, salt <dbl>, ## # shortening <dbl>, `sour cream` <dbl>, spice <dbl>, starch <dbl>, ## # sugar <dbl>, syrup <dbl>, unitless <dbl>, vanilla <dbl>, ## # vegetable <dbl>, vinegar <dbl>, water <dbl>, `wheat germ` <dbl>, ## # yogurt <dbl>
I’ll be using the
ropls package to do both PCA and PLS-DA. See the documentation for that package for more info on how to use it.
PCA: What ingredients vary the most among all recipes combined?
PCA, an unsupervised analysis, answers the question “what ingredients vary among all muffin and cupcake recipes?”
baked.pca <- opls( dplyr::select(nofrosting, -type, -recipe_id), #the data plotL = FALSE #suppresses default plot )
## PCA ## 30 samples x 29 variables ## standard scaling of predictors ## 11 excluded variables (near zero variance) ## R2X(cum) pre ort ## Total 0.513 5 0
A few ingredients get dropped because none of the recipes in my random sample of 30 have those ingredients. Notice that “type” is excluded in the PCA. PCA is totally agnostic to whether a recipe is for muffins or cupcakes.
Principal component 1 (PC1) represents a spectrum of leavening system. PC1 is negatively correlated with baking soda and some acidic ingredients like yogurt, sour cream, and cream cheese. PC1 is positively correlated with baking powder and milk. If you’re a baker, this makes sense because baking powder is just baking soda plus some powdered acid. If you have an acidic batter, then you can use baking soda.
Principal component 2 is a “healthiness” axis going from savory/healthy at the top to sweet/unhealthy at the bottom.
There is no separation between muffins and cupcakes along PC1 (leavening system) even though that’s where the most variation is. There is slight separation along the healthiness axis with muffins tending to be a little more healthy than cupcakes.
BUT this doesn’t answer the question of whether cupcakes and muffins are different. It answers a slightly different question: “Do cupcakes and muffins differ in the ingredients that vary the most among all the recipes combined?”
PLS-DA: Are cupcakes different from muffins?
PLS-DA looks for a combination of ingredients that best explains categorization as cupcake or muffin. For this dataset the
opls() function finds a single significant predictive axis. For the sake of plotting something, I ask it to do orthogonal PLS-DA, which creates a second axis that represents variation not related to the type of baked good.
baked.plsda <- opls( dplyr::select(nofrosting, -type, -recipe_id), #X data nofrosting$type, #Y data plotL = FALSE, #suppresses default plotting predI = 1, #make one predictive axis orthoI = 1, #and one orthogonal axis permI = 200) #use 200 permutations to generate a p-value
## OPLS-DA ## 30 samples x 29 variables and 1 response ## standard scaling of predictors and response(s) ## 11 excluded variables (near zero variance) ## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2 ## Total 0.189 0.832 0.58 0.214 1 1 0.005 0.005
This output gives us some important properties of the model.
R2X(cum) is the proportion of variation in the data explained by the predictive axes.
R2Y(cum), on the other hand, is the proportion of variation in baked good type explained by the model. PLS-DA only explains 9.44% of total variation, but explains 83% of the difference between cupcakes and muffins!
Q2(cum) is calculated through cross-validation and can be thought of as the predictive power of the model.
Q2(cum) is always smaller than
R2Y(cum), but the larger it is, and the closer it is to
R2Y(cum), the better. A large \(Q^2\) value indicates strong predictive power.
RMSEE is the root mean squared error of estimation, a measure of error in the same units as the Y variable, which is not super useful in this case since our Y variable is categorical.
ort are just how many predictive and orthogonal components were used. Finally, the two p-values are generated through permutation—the data labels (muffin or cupcake) are shuffled randomly and the PLS-DA is re-fit. These p-values are the proportion of those 200 random datasets that generate \(R^2_Y\) and \(Q^2\) values as good or better than the real data.
So, we can conclude that cupcakes are different than muffins (p < 0.005)!
Let’s see what ingredients contribute most to this difference.
Clearly, the more vanilla there is in a recipe, the more likely it is to be a cupcake. Conversely, the more fruit, flour and salt there is in a recipe, the more likely it is to be a muffin.
Use the right tools for the job!
PCA and PLS-DA give different results because they are answering different questions. In this case, the ingredients that vary the most among baked goods are not the same variables that best distinguish muffins from cupcakes. If you want to know what ingredients vary the most among all the recipes, use an unsupervised analysis like PCA. If you want to know what makes cupcakes different from muffins, use a supervised analysis like PLS-DA
In ecology, we often measure multiple traits of organisms and expect high levels of variation among individuals in a population. The most highly variable traits are not necessarily ones that correlate with some Y variable such as elevation, genotype, or some experimental treatment imposed by researchers. Therefore, it doesn’t make sense to expect PCA to find relationships with that Y variable. If you’re asking a question about multivariate relationships to some Y variable (e.g. how plant metabolites change with elevation), it makes sense to use PLS.
Thanks to Elizabeth Crone for comments on a draft of this post and for encouraging me to do serious science using muffin and cupcake recipes!