Using the right tool for the job: the difference between unsupervised and supervised analyses of multivariate ecological data


Ecologists often collect data with the aim of determining which of many variables are associated with a particular cause or consequence. Unsupervised analyses (e.g., principal components analysis, PCA) summarize variation in the data without regard to the response. Supervised analyses (e.g., partial least squares, PLS) evaluate the variables to find the combination that best explains a causal relationship. These approaches are not interchangeable, especially when the variables most responsible for a causal relationship are not the greatest source of overall variation in the data, a situation that ecologists are likely to encounter. To illustrate the differences between unsupervised and supervised techniques, we analyze a published dataset using both PCA and PLS and compare the questions and answers associated with each method. We also use simulated datasets to represent situations that further illustrate these differences. For simulated data containing many correlated variables that were unrelated to the response, PLS was better than PCA at identifying which variables were associated with the response. There are many applications for both unsupervised and supervised approaches in ecology. However, PCA is currently overused, at least in part because supervised approaches such as PLS are less familiar.
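The contrast described for the simulated case can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the simulation design used in the paper: the sample size, the number of nuisance variables, the variances, and all variable names are hypothetical. Eight strongly correlated variables that are unrelated to the response dominate the total variance, while a single low-variance variable drives the response. The first PCA loading vector is obtained by power iteration on the covariance matrix, and the first PLS weight vector is taken as the normalized covariance between each centered predictor and the centered response (the first step of NIPALS-style PLS with one response).

```python
import math
import random

random.seed(0)
n, n_block = 500, 8  # hypothetical: samples, correlated nuisance variables

rows, y = [], []
for _ in range(n):
    latent = random.gauss(0, 2)  # shared factor behind the nuisance block
    block = [latent + random.gauss(0, 0.5) for _ in range(n_block)]
    signal = random.gauss(0, 1)  # low-variance predictor that drives y
    rows.append(block + [signal])
    y.append(signal + random.gauss(0, 0.2))  # response ignores the block

p = n_block + 1  # the signal variable sits at index n_block


def center(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


cols = [center([row[j] for row in rows]) for j in range(p)]
yc = center(y)

# Sample covariance matrix of the centered (unscaled) predictors.
cov = [[sum(a * b for a, b in zip(cols[i], cols[j])) / (n - 1)
        for j in range(p)] for i in range(p)]

# Unsupervised: first PCA loading vector via power iteration.
pca = normalize([1.0] * p)
for _ in range(100):
    pca = normalize([sum(cov[i][j] * pca[j] for j in range(p))
                     for i in range(p)])

# Supervised: first PLS weight vector, proportional to X^T y.
pls = normalize([sum(a * b for a, b in zip(cols[j], yc)) for j in range(p)])

top_pca = max(range(p), key=lambda j: abs(pca[j]))
top_pls = max(range(p), key=lambda j: abs(pls[j]))
print("variable with largest |PCA loading|:", top_pca)
print("variable with largest |PLS weight|: ", top_pls)
```

In this setup the largest PCA loading falls on one of the nuisance variables (indices 0 to 7), because they carry most of the variance, whereas the largest PLS weight falls on the signal variable (index 8), because it is the only one that covaries appreciably with the response. The predictors are centered but not scaled here; standardizing them first is a common alternative and would be a separate design choice.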

