DLNMs: building and visualizing crossbasis functions
The dlnm package
dlnm package offers two ways of fitting crossbasis functions: an “internal” and an “external” method. The “external” method involves fitting the crossbasis function outside of a model, using some functions in the
dlnm package, then including the results as a predictor in a model such as a generalized linear model (GLM). I’m going to focus entirely on the “internal” method that fits the crossbasis function in the context of a generalied additive model (GAM) to take advantage of the penalization and other stuff the
mgcv package offers.
Throughout this series, I’m going to use a subset of data from my postdoc project on Heliconia acuminata. In this subset, 100 plants were tracked over a decade. Every year in February their size was recorded as height and number of shoots, and it was recorded whether or not they flowered. Any dead plants were marked as such. The goal is to determine how drought impacted growth, survival, and flowering probability with a potentially delayed and/or non-linear relationship. To that end, I’ve calculated SPEI, a measure of drought, where more negative numbers represent more severe drought. SPEI is monthly while the demography data is yearly. For every observation of a plant, there is an entire history of SPEI for the past 36 months from that observation.
## # A tibble: 6 x 11 ## plot ha_id_number year size size_next log_size log_size_next flwr surv ## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2107 107 1998 66 112 4.19 4.72 0 1 ## 2 2107 107 1999 112 102 4.72 4.62 0 1 ## 3 2107 107 2000 102 68 4.62 4.22 0 1 ## 4 2107 107 2001 68 96 4.22 4.56 0 1 ## 5 2107 107 2002 96 164 4.56 5.10 0 1 ## 6 2107 107 2003 164 114 5.10 4.74 0 1 ## # … with 74 more variables: spei_history[,1] <dbl>, [,2] <dbl>, [,3] <dbl>, ## # [,4] <dbl>, [,5] <dbl>, [,6] <dbl>, [,7] <dbl>, [,8] <dbl>, [,9] <dbl>, ## # [,10] <dbl>, [,11] <dbl>, [,12] <dbl>, [,13] <dbl>, [,14] <dbl>, ## # [,15] <dbl>, [,16] <dbl>, [,17] <dbl>, [,18] <dbl>, [,19] <dbl>, ## # [,20] <dbl>, [,21] <dbl>, [,22] <dbl>, [,23] <dbl>, [,24] <dbl>, ## # [,25] <dbl>, [,26] <dbl>, [,27] <dbl>, [,28] <dbl>, [,29] <dbl>, ## # [,30] <dbl>, [,31] <dbl>, [,32] <dbl>, [,33] <dbl>, [,34] <dbl>, ## # [,35] <dbl>, [,36] <dbl>, [,37] <dbl>, L[,1] <int>, [,2] <int>, [,3] <int>, ## # [,4] <int>, [,5] <int>, [,6] <int>, [,7] <int>, [,8] <int>, [,9] <int>, ## # [,10] <int>, [,11] <int>, [,12] <int>, [,13] <int>, [,14] <int>, ## # [,15] <int>, [,16] <int>, [,17] <int>, [,18] <int>, [,19] <int>, ## # [,20] <int>, [,21] <int>, [,22] <int>, [,23] <int>, [,24] <int>, ## # [,25] <int>, [,26] <int>, [,27] <int>, [,28] <int>, [,29] <int>, ## # [,30] <int>, [,31] <int>, [,32] <int>, [,33] <int>, [,34] <int>, ## # [,35] <int>, [,36] <int>, [,37] <int>
plot(factor): A plot ID
ha_id_number(factor): A unique plant ID
year(numeric): year of census
size(numeric): number of shoots x height in cm
size_next(numeric): size in the next year
log_size(numeric): log transformed size
log_size_next(numeric): log transformed size next year
flwr(numeric): Did a plant flower? 1 = yes, 0 = no
surv(numeric): Did a plant survive? 1 = yes, 0 = no
spei_history(c(“matrix”, “array”)): A matrix column of the drought history starting in the current month (
spei_history[,1]= February) and going back 24 months (
spei_history[,25]= February 2 years ago)
L(c(“matrix”, “array”)): A matrix column describing the lag structure of
spei_history. Literally just
0:24for every row.
Fit a DLNM
library(mgcv) #for gam() library(dlnm) #for the "cb" basis
growth <- gam(log_size_next ~ log_size + s(spei_history, L, # <- the two dimensions bs = "cb", # <- fit as crossbasis k = c(3, 24), # <- knots for each dimension xt = list(bs = "cr")), # <- what basis to use for each dimension family = gaussian(link = "identity"), method = "REML", data = ha)
Above is a simple DLNM with survival modeled as a function of number of shoots and the crossbasis function of SPEI over the past 36 months.
shts is a fixed effect (i.e. not a smooth, but to be fit as a straight line), and the crossbasis is defined in
s(spei_history, L, …).
L are the two dimensions of the crossbasis function,
bs = "cb" tells
gam() that this is a crossbasis function from the
dlnm package (calls
dlnm::smooth.construct.cb.smooth.spec behind the scenes).
xt = list(bs = "cr") tells it to use a cubic regression spline as the basis for both dimensions of the crossbasis function (but you can also mix and match marginal basis functions by providing a length 2 vector here).
Problem 1: visualizing the results
plot.gam() does not work with these crossbasis functions.
## Error in plot.gam(growth): No terms to plot - nothing for plot.gam() to do.
dlnm package provides some functions for visualizing the results of a DLNM, though I don’t like them much.
First you use
crosspred() to get predicted values for the DLNM.
pred_dat <- crosspred("spei_history", growth)
## centering value unspecified. Automatically set to 0
Then you plot those with
plot.crosspred(). The default is a 3D plot.
I prefer a heatmap, although the one produced here has some issues.
plot(pred_dat, ptype = "contour", xlab = "SPEI", ylab = "lag(months)")
First obvious problem is the colors. The range is the same for red and blue, despite different number of breaks. Second, the units are not what I’d expect. For a marginal effects plot these should be the size of an average plant in year t+1, all else being equal. This is plotting the size relative to the size at an average value of SPEI, which is a weird thing to think about. That’s because the package was built with epidemiology and relative risk in mind. Here is the plot relative to SPEI = 1.5
pred_dat <- crosspred("spei_history", growth, cen = 1.5) plot(pred_dat, ptype = "contour", xlab = "SPEI", ylab = "lag(months)")
## Solution (?)
So, I spent a lot of time writing a complicated function,
cb_margeff(), to create data for a marginal effects plot. It creates a
newdata data frame to be passed to
predict() and loops across different matrixes with all columns of
spei_history set to average except for one, representing a range of possible SPEI values.
plotdata <- cb_margeff(growth, spei_history, L) ggplot(plotdata, aes(x = x, y = lag, fill = fitted)) + geom_raster() + scale_fill_viridis_c("size in year t+1", option = "A") + scale_x_continuous("SPEI", expand = c(0,0)) + scale_y_continuous("lag (months)", expand = c(0,0))
Yeah, this is looking better.
The interpretation of this type of plot (which I would describe as a marginal effects plot, but correct me if I’m wrong) makes more sense to me. If there was drought (low SPEI) about 8 months prior to the census, that’s bad for growth. Drought 20 months prior is good for growth though.
I poked around in
debug() and it turns out the reason the plotting doesn’t work is only because the author of
dlnm, Gasparrini, didn’t want it to work.
I can change a simple flag inside the
growth model, and then it produces something very similar (identical?) to what I have above:
##  FALSE
growth$smooth[]$plot.me <- TRUE plot.gam(growth, scheme = 2)
Why is this default plot not available? It’s literally exactly what I wanted, and I’m pretty sure there’s nothing incorrect about it, but it worries me that the author of
dlnm didn’t want me to make it.
I think I better understand what is going on here now.
plot.gam(), and the
ggplot2 implementation of it,
gratia::draw(), plot the smooth itself, not the predicted values. By “the smooth itself”, I mean the function that is acting sort of like one of the coefficients in a GLM. Instead of \(y = \beta_0 + \beta_1 x_1\), we have \(y = \beta_0 + f_1(x_1)\). To further clarify, look at the options for
predict.gam(). To get predicted \(y\) values, you can use
type = "link" or
type = "response". But if you just want the values for \(f_1(x_1)\), then you can use
type = "terms". The plot above looks like the one I want, but the scale is actually not in units of plant size. See the
gratia version, which includes a scale bar:
gratia::draw(growth, select = 1)
## Warning: Removed 423 rows containing non-finite values (stat_contour).
So my efforts in creating
cb_margeff() weren’t for nothing, afterall, and are not in conflict with the views of the
dlnm package authors. Some day I should probably figure out how to “manually” calculate values of \(y\) from the GAM coefficients, but today is not that day.