Three ways to plot logistic regressions
If your data is just 1’s and 0’s, it can be difficult to visualize alongside a best-fit line from a logistic regression.
Even with transparency, the overplotted data points just turn into a smear on the top and bottom of your plot, adding little information. Here are three ways to get more information out of those points and produce more informative plots. But first, a quick introduction to the data.
I simulated some data on survival as a function of size. Survival is binary (1 = survived, 0 = died).
## # A tibble: 6 x 2 ## size surv ## <dbl> <int> ## 1 4.78 0 ## 2 4.40 1 ## 3 5.02 1 ## 4 5.32 1 ## 5 4.61 0 ## 6 4.81 1
##  1000
We can fit a logistic regression…
m <- glm(surv ~ size, family = binomial, data = df)
…and extract fitted values using
plot_df <- augment(m, type.predict = "response") head(plot_df)
## # A tibble: 6 x 8 ## surv size .fitted .resid .std.resid .hat .sigma .cooksd ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0 4.78 0.746 -1.66 -1.66 0.00125 0.970 0.00184 ## 2 1 4.40 0.601 1.01 1.01 0.00287 0.971 0.000958 ## 3 1 5.02 0.819 0.633 0.633 0.00122 0.972 0.000136 ## 4 1 5.32 0.884 0.496 0.496 0.00160 0.972 0.000105 ## 5 0 4.61 0.685 -1.52 -1.52 0.00169 0.971 0.00184 ## 6 1 4.81 0.756 0.748 0.748 0.00122 0.971 0.000197
These are the data I used for the plot above with the points corresponding to
surv and the best-fit line corresponding to
base <- ggplot(plotdf, aes(x = size)) + geom_line(aes(y = .fitted), color = "blue") + labs(x = "Size", y = "Survival") base + geom_point(aes(y = surv), alpha = 0.2)
1. Rug plot
Turning those points into a “rug” is a common way of dealing with overplotting in logistic regression plots.
geom_rug(), but getting that rug to correspond to dead plants on the bottom and live plants on the top requires a little data manipulation. First, we’ll create separate columns for dead and alive plants where the values of size only if the plant is dead or alive, respectively, and otherwise
plot_df <- plot_df %>% mutate(survived = ifelse(surv == 1, size, NA), died = ifelse(surv == 0, size, NA))
Then, we can plot these as separate layers.
base <- ggplot(plot_df, aes(x = size)) + geom_line(aes(y = .fitted), color = "blue") + labs(x = "Size", y = "Survival") base + geom_rug(aes(x = died), sides = "b", alpha = 0.2) + geom_rug(aes(x = survived), sides = "t", alpha = 0.2)
Honestly, this is not a huge improvment. The overplotting is less of an issue and you can start to see the density of points a bit better, but it’s still not great.
2. Binned points
I discovered this plot in Data-driven Modeling of Structured Populations by Ellner, Childs, and Rees. Their plot used base R graphics, but I’ll use
stat_summary_bin() to get a mean survival value for binned size classes and plot those as points.
base + stat_summary_bin(geom = "point", fun = mean, aes(y = surv))
I think this is fabulous! It definitely needs an explanation in a figure caption though, because what those points represent is not immediately obvious. Also, how close the points fit to the line has more to do with bin size than with model fit, so this one might be better for inspecting patterns than for evaluating fit.
base + stat_summary_bin(geom = "point", fun = mean, aes(y = surv)) + labs(title = "bins = 30") | base + stat_summary_bin(geom = "point", fun = mean, aes(y = surv), bins = 60) + labs(title = "bins = 60")
This option takes the ideas of binning values from #2 and showing distributions in the margins from #1 and combines them. I discovered this in a paper from my postdoc adviser, Emilio Bruna.
A function to make this third type of plot with base R graphics is available in the
library(popbio) logi.hist.plot(size, surv, boxp = FALSE, type = "hist", col = "gray")
Re-creating this with ggplot2 requires some hacks, and I’m still not all the way there.
base + geom_histogram(aes(x = died, y = stat(count)/1000), bins = 30, na.rm = TRUE) + geom_histogram(aes(x = survived, y = -1*stat(count/1000)), bins = 30, na.rm = TRUE, position = position_nudge(y = 1))
There are at least two “hacks” going on here. First, I’m using
stat() to extract the bar heights automatically calculated by
geom_histogram() to scale the histogram down. Second, to get the histogram for survivors to be at the top I need to flip it upside down (by multiplying by -1) and move it to the top of the plot with
position_nudge(). The downside to this plot is that there are technically three y-axes—the survival probability and the number or proportion in each size class for dead and alive individuals (with 0 at the bottom and top, respectively). You can add a second y-axis to a ggplot, but I’m not sure about a third y-axis.
If you know of another cool way to visualize logistic regressions, or know of some package that does all this for you, please let me know in the comments!