Taking a potentially continuous treatment, binning it into categories, and doing ANOVA results in reduced statistical power and complicated interpretation. Yet, as a graduate student, I was advised to bin continuous treatment variables into categories multiple times by different people. Why? I suspect it’s because ANOVA is what ecologists are most familiar with, and the alternative, a quadratic regression, sounds complicated. A regression also doesn’t let you draw a bar plot with letters over the bars that ecologists really seem to love. Hopefully this example with simulated data will convince you to consider regression over ANOVA1 when your treatment can be considered continuous (and you suspect a continuous relationship with the response).

I’ll work through an example with simulated data to show you what I mean. Let’s say you’ve applied fertilizer at 3 different levels to 15 replicate corn fields (5 fields per fertilizer treatment). The treatments are 100, 200, and 300 kg N / ha. We measure yield and standardize it to percent of maximum yield.

I’m going to analyze this both as an ANOVA type design, treating fertilizer as categorical, and as a regression. For the sake of demonstration, I’ll use post-hoc power analysis to get statistical power for each test (something you probably shouldn’t do in practice because post-hoc power is fixed once you compute a p-value).

ANOVA

Here’s the ANOVA model in R:

m <- aov(yield ~ fert_factor, data = df)
## Analysis of Variance Table
##
## Response: yield
##             Df Sum Sq Mean Sq F value  Pr(>F)
## fert_factor  2 333.19 166.595  5.0086 0.02621 *
## Residuals   12 399.14  33.262
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the ANOVA, there is a significant effect of fertilizer on yield (p = 0.026)

Our results look like this:

## summarise() ungrouping output (override with .groups argument)

Our sample size, $$n$$, is 5. Statistical power, $$\beta$$, is 0.39.2

Regression

But why not treat those concentrations as a continuous variable and instead fit a quadratic regression? A quadratic regression fits a line described by a quadratic function (a curve) through the relationship between fertilizer concentration and growth. Here’s what this model looks like:

$yield = \beta_0 + \beta_1fert + \beta_2fert^2$

This method is flexible. The relationship could be concave, convex, or increasing with a varying slope. If the true relationship is linear, then $$\beta_2$$ will be zero, and we’ll be left with the equation for a line.

There are two ways to write this model as R code. This first form is useful because the default behavior of anova() gives a single p-value for the effect of fertilizer.

m1a <- lm(yield ~ poly(fert, 2, raw = TRUE), data = df)
anova(m1a)
## Analysis of Variance Table
##
## Response: yield
##                           Df Sum Sq Mean Sq F value  Pr(>F)
## poly(fert, 2, raw = TRUE)  2 333.19 166.595  5.0086 0.02621 *
## Residuals                 12 399.14  33.262
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This second form is useful because it tells you if the quadratic term is significant (if it’s not, you might try just fitting a straight line). I() means “literally multiply, don’t fit an interaction term”.

m1b <- lm(yield ~ fert + I(fert * fert), data = df)
anova(m1b)
## Analysis of Variance Table
##
## Response: yield
##                Df Sum Sq Mean Sq F value  Pr(>F)
## fert            1 141.80 141.805  4.2633 0.06125 .
## I(fert * fert)  1 191.39 191.385  5.7539 0.03360 *
## Residuals      12 399.14  33.262
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Either way, there is still a significant effect of fertilizer on yield.

Now $$n = 15$$, and our power, $$\beta$$, has gone up to 0.8. The power is doubled compared to the ANOVA design because of the greater effective sample size in the regression model.

Better yet, measure the treatment

But wait, there’s more! I haven’t told you something about the data I simulated. I generated data so that growth has a quadratic response to nitrogen concentration in the soil, but soil nitrogen isn’t perfectly correlated with the nitrogen applied. Your intended treatment is rarely what a plant is actually experiencing. So let’s say we can do even better than including the intended treatment as a continuous variable—let’s get the soil tested for nitrogen content and use that as an independent variable.

m2 <- lm(yield ~ true + I(true * true), data = df)
anova(m2)
## Analysis of Variance Table
##
## Response: yield
##                Df Sum Sq Mean Sq F value   Pr(>F)
## true            1 219.48 219.479  10.186 0.007754 **
## I(true * true)  1 254.28 254.278  11.801 0.004938 **
## Residuals      12 258.57  21.548
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now our power, $$\beta$$, is 0.99. It’s probably very often worth it to try to measure whatever latent variable that mediates the effect of your treatment. In fact, that increased spread of your data is a good thing if you want to better describe the shape of the relationship between treatment and response.

1. Well, technically ANOVA is a regression↩︎

2. Statistical power was estimated using the pwr package. Effect size was calculated using Cohen’s suggestions.↩︎

Postdoctoral Researcher

I’m a postdoctoral researcher in Emilio Bruna’s lab at University of Florida working on the effects of drought and habitat fragmentation on a tropical plant. I’m interested in the mechanisms of plant responses to stress and their consequences for natural and agricultural ecosystems.