Linear multilevel models

Linear multilevel (or mixed) models are useful when there are dependencies (hierarchies) in the data:

Example 1: Measuring the impact of a new treatment on patient outcomes, where care is provided by different doctors/clinics. Patient-level observations are not independent: outcomes are more similar within a given doctor than across doctors. There is also a treatment effect of interest that operates across doctors.

Example 2: Determining what features predict admission into college. Predictors include a student’s high school GPA, extracurricular activities, and SAT scores. Predictors also include whether the school attended was public or private, current student-to-teacher ratios, and the school’s rank against other schools. So we have student-level and school-level predictors, and students are nested within schools.

Example 3: A psychologist measures changes in cognitive performance over time (i.e., learning) among individuals, and is interested in whether there are differences among individuals (or groups of individuals). There are multiple test occasions (i = 1…I) for each individual, multiple individuals (j = 1…J), and multiple groups (k = 1…K). The number of test occasions I can vary between individuals, and the number of individuals J can vary between groups.


Other methods

  • Ordinary Least Squares (OLS) linear regression. This ignores dependencies within the data
  • Summarise each cluster and perform regression on the averages. This underrepresents the variability in the data (since the averages have variability around them)
  • OLS regression with clustered standard errors. These can adjust for non-independence, but do not handle multiple sources of clustering well, and missing data is handled by listwise deletion (causing bias and reducing power)
  • Fixed effect models (with clustered standard errors). These control for the dependency but don’t let you explore the random effects (i.e., conditional means evaluated at the estimated parameters) or their variance
  • Repeated measures ANOVA (RM ANOVA). These handle the dependency (e.g., over time), but time is treated categorically (so the spacing of time intervals is ignored), and all observations must have the same number of repeats. Violations of sphericity (e.g., differences in change over time) are handled by the Greenhouse-Geisser correction, which is suboptimal.
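As a point of comparison, two of these alternatives can be sketched in R, applied to the sim_data dataset simulated later in this document (the sandwich and lmtest packages are an assumption here; install them if needed):

```r
# Plain OLS, ignoring the clustering by major
fit <- lm(salary ~ 1 + grade, data = sim_data)

# OLS with standard errors clustered by major
lmtest::coeftest(fit, vcov = sandwich::vcovCL(fit, cluster = ~ major))

# Fixed-effects (dummy-variable) model: one intercept per major
lm(salary ~ 1 + grade + major, data = sim_data)
```

Note that the clustered-SE model adjusts the inference but not the point estimates, while the fixed-effects model changes the estimates but treats the majors as fixed, not as draws from a population.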



Overview

Prof. Matthew Clapman from UC Santa Cruz provides a brief and approachable introduction to linear multilevel models:



Theory

The UCLA Office of Advanced Research Computing provides expert resources on statistical methods and data analytics. Below is a link to their introduction to linear multilevel models:



Simulation

To understand the model, we simulated an effect with known (i.e., true) parameters and compared the model results to the true parameters.

The effect of student grades on their subsequent starting salary after graduation was simulated. Students were clustered by degree/major (e.g., architecture, biology, computer science, dentistry, engineering).

We chose simulation parameters such that majors with a higher average grade had lower salaries (so easier majors paid worse). However, the true effect of grade on salary within each major was 2.


  • Intercepts for each major \(\sim N(mean = -10 \times \overline{grade}_j, \ sd = 6)\), where \(\overline{grade}_j\) is the average grade in major \(j\)
  • Slopes for each major \(\sim N(mean = 2, sd = 0.1)\)
  • Easier majors get paid worse
  • True effect of grade on salary is 2
sim <- function(N = 100, g = 5, mu = c(0, 2), sigma = c(1, 1, 1)) {
  
  require(purrr)
  require(dplyr)
  
  n = round(N / g)  # students per major
  
  N = n * g         # total N, rounded to a multiple of g
  
  purrr::cross_df(
    list(ids = 1:n, major = 1:g)
  ) %>%
    dplyr::mutate(
      grade = rnorm(N),
      major = as.factor(letters[major])
    ) %>%
    dplyr::group_by(major) %>%
    dplyr::mutate(
      average_grade = mean(grade),
      intercept = rnorm(1, mu[1]*average_grade, sigma[1]),  # true intercept per major
      slope = rnorm(1, mu[2], sigma[2]),                    # true slope per major
      salary = rnorm(n, intercept + slope*grade, sigma[3])  # observed salary
    ) %>%
    ungroup()
}

set.seed(5)
sim_data <- sim(N = 25, mu = c(-10, 2), sigma = c(6, 0.1, 3))
contrasts(sim_data$major) <- contr.sum


Description of the data

We have unique IDs for each person (ids), as well as the major, grade and salary for each person. We also have the average grade for each major, and the true intercept and true slope for each major.

head(sim_data, 20)
## # A tibble: 20 × 7
##      ids major   grade average_grade intercept slope  salary
##    <int> <fct>   <dbl>         <dbl>     <dbl> <dbl>   <dbl>
##  1     1 a     -0.841          0.214     -3.90  2.03  -2.75 
##  2     2 a      1.38           0.214     -3.90  2.03  -4.12 
##  3     3 a     -1.26           0.214     -3.90  2.03 -12.5  
##  4     4 a      0.0701         0.214     -3.90  2.03  -9.04 
##  5     5 a      1.71           0.214     -3.90  2.03  -0.851
##  6     1 b     -0.603         -0.372     12.2   2.11  15.6  
##  7     2 b     -0.472         -0.372     12.2   2.11   8.82 
##  8     3 b     -0.635         -0.372     12.2   2.11  10.7  
##  9     4 b     -0.286         -0.372     12.2   2.11  17.3  
## 10     5 b      0.138         -0.372     12.2   2.11  11.1  
## 11     1 c      1.23          -0.377     12.8   2.22  17.2  
## 12     2 c     -0.802         -0.377     12.8   2.22   8.32 
## 13     3 c     -1.08          -0.377     12.8   2.22   8.98 
## 14     4 c     -0.158         -0.377     12.8   2.22  10.2  
## 15     5 c     -1.07          -0.377     12.8   2.22  10.2  
## 16     1 d     -0.139         -0.588      1.94  2.12   6.03 
## 17     2 d     -0.597         -0.588      1.94  2.12   1.23 
## 18     3 d     -2.18          -0.588      1.94  2.12   0.367
## 19     4 d      0.241         -0.588      1.94  2.12   0.671
## 20     5 d     -0.259         -0.588      1.94  2.12   1.05



Complete pooling model: lm()

Ignores major.

\[ \begin{align} salary_i &= \beta_0 + \beta_1 grade_i + \epsilon_i \\ \\ \epsilon_i &\sim N(0, \sigma) \end{align} \]

For \(i = 1...25\)

Model has 3 parameters:

  • \(\beta_0\) (intercept)
  • \(\beta_1\) (slope)
  • \(\sigma\) (standard deviation of the error)

fit <- lm(formula = salary ~ 1 + grade, 
          data = sim_data)


summary(fit)
## 
## Call:
## lm(formula = salary ~ 1 + grade, data = sim_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.3119  -8.3384   0.2771   5.8256  20.3858 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    1.273      1.997   0.638   0.5300  
## grade         -3.653      2.081  -1.755   0.0925 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.979 on 23 degrees of freedom
## Multiple R-squared:  0.1181, Adjusted R-squared:  0.07978 
## F-statistic: 3.081 on 1 and 23 DF,  p-value: 0.09254
  • The estimated overall effect of grade on salary is negative (opposite in sign to the true within-major effect of 2)


Unpooled model: lm() for each major

Major-specific models

   \(salary_i = \beta_0 + \beta_1 grade_i + \epsilon_i\)

For \(i = 1...5\)

lm(formula = salary ~ 1 + grade, 
   data = sim_data[sim_data$major=="a", ])  # for "a", "b", "c", "d", "e"
model/major   intercept   slope   sigma   p.value
a                 -6.35    2.38    4.18      0.23
b                 12.58   -0.35    4.13      0.96
c                 12.26    3.41    1.51      0.02
d                  2.37    0.85    2.55      0.58
e                -14.59    1.87    2.11      0.64
average            1.25    1.63    2.90      0.49


  • Treats majors as completely independent
  • Residual standard errors (\(\sigma\)) are much lower
  • On average, grade has a positive effect on salary (but no valid inferential estimate)
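The five separate fits summarised above can be produced in one pass (a sketch; the columns of the table were assembled from each model's summary):

```r
# Fit salary ~ grade separately within each major
fits <- lapply(split(sim_data, sim_data$major), function(d) {
  lm(salary ~ 1 + grade, data = d)
})
sapply(fits, coef)  # intercept and slope for each major
```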



Interaction model


\[ salary_i = \beta_0 + \beta_1 grade_i + \beta_{2[j]} major_j + \beta_{3[j]} grade_i \times major_j + \epsilon_i \]

contrasts(sim_data$major) <- contr.sum

lm.full <- lm(salary ~ 1 + grade * major, data = sim_data)
term          estimate   std.error   statistic   p.value
(Intercept)       1.25        1.27        0.99      0.34
grade             1.63        1.52        1.07      0.30
sigma             3.09
  • the intercept and slope match the averages from the unpooled models (in a balanced dataset)
  • inference is possible



Linear model vs Linear multilevel model

Linear model
  • all effects (intercepts & slopes) are fixed and non-random
  • all, that is, except the error term \(\epsilon_i\), which is the model’s one random component


\[ \begin{align} \text{Level 1}: y_i &= \beta_0 + \beta_1 x_i + \epsilon_i \\ \\ \\ \text{Level 2}: \epsilon_i &\sim N(0, \sigma) \end{align} \]


Linear multilevel model
  • observations [i] belong to subgroups [j]
  • subgroups [j] are explicitly treated as random draws from a larger population


\[ \begin{align} \text{Level 1}: y_{ij} &= \alpha_{j} + \beta_1 x_{ij} + \epsilon_{ij} \\ \alpha_{j} &= \beta_0 + \beta_{0j} \\ \\ \text{Level 2}: \beta_{0j} &\sim N(0, \ \tau) \\ \epsilon_{ij} &\sim N(0, \ \sigma) \end{align} \]

lme4::lmer()



Loading the packages

First we will need to install and load the two packages we will use to run LMMs in R.

If you do not have a package installed, run: install.packages("packagename")

install.packages("lme4")
install.packages("lmerTest")

We now load these packages into the working environment

library(lme4)
library(lmerTest)

lme4 stands for linear mixed-effects models (developed by Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker). It is the most common package used for running LMMs in R. The main functions are lmer for models with continuous outcome variables and glmer for generalized models with other kinds of outcome variables (e.g., Poisson regression for counts, logistic regression for binary outcomes).

Because the lme4 package does not produce p-values, it is common for researchers more accustomed to working with p-values to use the lmerTest package. lmerTest doesn’t have any functions that we will use but, when loaded, it overrides the standard output of lme4 in order to provide p-values, approximated by the Satterthwaite approach. Be aware that while these p-values are provided automagically, they are approximations and may be very poor in some scenarios (e.g., unbalanced or crossed designs).

Confidence intervals for a fitted model are provided very easily by the lme4 package using confint().
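For example (a sketch; lmm.ri is the random-intercept model fitted later in this document):

```r
# Confidence intervals for a fitted lmer model
confint(lmm.ri)                   # profile-likelihood intervals (the default)
confint(lmm.ri, method = "Wald")  # faster Wald approximation (fixed effects only)
```

The profile method also gives intervals for the variance components (labelled .sig01, .sigma, etc.), which the Wald method does not.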



Random intercept models

\[ \begin{align} \text{Level 1}: y_{ij} &= \alpha_{j} + \beta_1 x_{ij} + \epsilon_{ij} \\ \alpha_{j} &= \beta_0 + \beta_{0j} \\ \\ \text{Level 2}: \beta_{0j} &\sim N(0, \ \tau) \\ \epsilon_{ij} &\sim N(0, \ \sigma) \end{align} \]


  • \(\alpha_{j}\) indicates that each cluster \(j\) (e.g., major) is given a unique intercept: the grand intercept \(\beta_0\) plus a cluster deviation \(\beta_{0j}\) drawn from a normal distribution centered on zero (equivalently, \(\alpha_j \sim N(\beta_0, \ \tau)\)), meaning there might be different mean scores for each cluster \(j\)
  • In addition to the residual standard deviation \(\sigma\), we are now estimating one more variance component \(\tau\), the standard deviation of the varying intercepts
  • \(\text{ICC} = \tau^2 / (\tau^2 + \sigma^2)\) goes to zero if there is no systematic variation explained by clustering (so a linear model will do just as well), and goes to 1 if clusters perfectly explain the residual variance (i.e., there is no variation within each cluster \(j\))
lmm.ri <- lmer(
  formula = salary ~ 1 + grade + (1|major),
  data = sim_data
  )
  • In lmer the model is specified by the formula argument. As in most R model-fitting functions, this is the first argument
  • The model formula consists of two expressions separated by the ~ symbol
  • The expression on the left, typically the name of a variable, is evaluated as the response
  • The right-hand side consists of one or more terms separated by + symbols
  • A random-effects term consists of two expressions separated by the vertical bar, |, a symbol read as “given” or “by”
  • Typically, such terms are enclosed in parentheses
  • The expression on the right of the | is evaluated as a factor, which we call the grouping factor for that term


summary(lmm.ri)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ 1 + grade + (1 | major)
##    Data: sim_data
## 
## REML criterion at convergence: 137.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.2577 -0.5724 -0.1475  0.6408  1.8026 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  major    (Intercept) 143.03   11.960  
##  Residual               8.33    2.886  
## Number of obs: 25, groups:  major, 5
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)  
## (Intercept)   1.4512     5.3796  3.9624   0.270   0.8008  
## grade         2.0877     0.7435 19.2071   2.808   0.0112 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## grade 0.004

The estimated effect of grade on salary is:

  • \(\hat \beta_1\) = 2.088

(much closer to the “true” effect!)


We can quantify the benefit of including the random intercept by the ICC:

  • \(\tau^2\) = 143.031
  • \(\sigma^2\) = 8.33
  • \(ICC\) = 0.945
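The variance components can be pulled from the fitted model to compute the ICC directly (assuming lmm.ri has been fitted as above):

```r
# Extract the variance components and compute the ICC
vc <- as.data.frame(VarCorr(lmm.ri))
tau2   <- vc$vcov[vc$grp == "major"]     # between-major (intercept) variance
sigma2 <- vc$vcov[vc$grp == "Residual"]  # residual variance
tau2 / (tau2 + sigma2)                   # ICC; ~0.945 for the data above
```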



Random intercept + slope model

\[ \begin{align} \text{Level 1}: y_{ij} &= \alpha_{j} + \beta_{1j} x_{ij} + \epsilon_{ij} \\ \alpha_{j} &= \beta_0 + \beta_{0[j]} \\ \beta_{1j} &= \beta_1 + \beta_{1[j]} \\ \\ \text{Level 2}: \beta_{0[j]} &\sim N(0, \ \tau) \\ \beta_{1[j]} &\sim N(0, \ \gamma) \\ \epsilon_{ij} &\sim N(0, \ \sigma) \end{align} \]


  • In addition to \(\alpha_{j}\), \(\beta_{1j}\) indicates that each cluster \(j\) (e.g., major) is given a unique slope: the grand slope \(\beta_1\) plus a cluster deviation \(\beta_{1[j]}\) drawn from a normal distribution centered on zero (equivalently, \(\beta_{1j} \sim N(\beta_1, \ \gamma)\)), meaning the change can differ for each cluster \(j\)
  • In addition to the residual standard deviation \(\sigma\) and the intercept standard deviation \(\tau\), we are now estimating one more variance component \(\gamma\), which is the standard deviation of the varying slopes
lmm.full <- lmer(salary ~ 1 + grade + (1 + grade|major), 
            data = sim_data)

summary(lmm.full)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: salary ~ 1 + grade + (1 + grade | major)
##    Data: sim_data
## 
## REML criterion at convergence: 137.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.3735 -0.5334 -0.1515  0.7178  1.7200 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  major    (Intercept) 140.8177 11.8667      
##           grade         0.1208  0.3475  1.00
##  Residual               8.3145  2.8835      
## Number of obs: 25, groups:  major, 5
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)  
## (Intercept)   1.6042     5.3383  3.9464   0.301   0.7789  
## grade         2.1043     0.7572 12.0758   2.779   0.0166 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## grade 0.210 
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

The estimated effect of grade on salary is:

  • \(\hat \beta_1\) = 2.104

(close to the “true” effect!)


We can quantify the (small) benefit of including the random slope by the analogous variance ratio:

  • \(\gamma^2\) = 0.121
  • \(\sigma^2\) = 8.314
  • \(\gamma^2 / (\gamma^2 + \sigma^2)\) = 0.014



Example of APA format results summary

A linear mixed-effects model was fitted to predict salary based on grade, with random intercepts and slopes for grade across major. The model included 25 observations across 5 majors. The model converged successfully (REML = 137.7), though a boundary (singular) fit was detected, suggesting potential overparameterization or perfect correlation among random effects.

The fixed effect of grade was statistically significant, indicating that higher grades were associated with higher salaries, b = 2.10, SE = 0.76, t(12.08) = 2.78, p = .017. The intercept was not significant, b = 1.60, SE = 5.34, t(3.95) = 0.30, p = .779.

Random effects indicated substantial variability in intercepts across majors (SD = 11.87), and a smaller but perfectly correlated variability in slopes for grade (SD = 0.35). The residual standard deviation was 2.88.



Linear model vs Linear multilevel model

  • The interaction linear model pools the variation in grade within each major separately (similar to the unpooled models)
  • After fitting the interaction, the average (and fixed) effect of grade does not explain sufficient variation to reach significance
  • The linear multilevel model shares, or partially pools, variation between fixed (average) and random (major) effects
  • Variation is attributed to each parameter according to how well it explains each effect (i.e., its own error relative to residual error)
  • After partial pooling, the random intercept of major soaks up sufficient variation to leave the fixed (average) effect of grade significant
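Partial pooling can be seen directly in the fitted model (assuming lmm.full from above):

```r
# Per-major estimates after shrinkage toward the fixed (average) effect
coef(lmm.full)$major   # shrunken intercept and slope for each major
fixef(lmm.full)        # the grand (fixed) intercept and slope
```

Comparing these against the unpooled per-major fits shows each major's estimates pulled toward the grand intercept and slope, with the amount of shrinkage depending on how precisely each major is estimated.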





Model diagnostics

Just like a linear model (OLS), LMMs assume the residuals are normal with constant variance, and collinearity between independent variables will reduce the power/precision of estimates.

plot(lmm.full)

  • The residuals-vs-fitted plot should show residuals that are normally distributed, with no trends and equal (homogeneous) variance


An additional assumption of LMMs is that there is no heteroscedasticity across the levels of the random effects. Again, we can check this using residual plots, simply split by the levels of the random effect of interest (e.g., major)

plot(lmm.full, 
     resid(., scaled=TRUE) ~ fitted(.) | major, 
     abline=0, pch=16, 
     xlab="Fitted values", 
     ylab="Standardised residuals")

  • Plots of the residuals against each independent variable should likewise be normal and homogeneous, with no trends
  • This can be hard to judge when the sample size in each level is low (hence rules of thumb about minimum sample sizes per level, e.g., > 5)


plot(lmm.full, as.factor(major) ~ resid(., scaled=TRUE), 
     abline=0, pch=16,
     xlab="Standardised residuals", 
     ylab="Major")

  • You can also compare the difference in residuals between random effect levels


lattice::dotplot(ranef(lmm.full, condVar=TRUE))
## $major

  • Most of the variability is accounted for by the random intercept
  • Little variability in random slope (grade)
  • Maybe some positive relationship between intercept and slope



Trouble-shooting

Eventually you will receive a warning about model convergence. This can have many causes, which may need to be investigated and tested.

  • Try fitting a simpler model, remove parameters or terms from the model, or simplify the random-effects structure
  • Centre and standardize your independent variables (e.g., z-scores). Also helps compare magnitude of effects
  • Try a different convergence algorithm control = lmerControl(optimizer = "bobyqa")
  • Try a Bayesian model brms::brm(salary ~ 1 + grade + (1 + grade | major), data = sim_data)



Missing data

A major advantage of multilevel models is how they handle missing data. Other methods can give the same results, but only when the data are balanced: the design must be completely balanced and the resulting data must also be completely balanced. Balance is fragile. Even if the design is balanced, a single missing or questionable observation destroys the balance, and observational studies (as opposed to, say, laboratory experiments) cannot be expected to yield balanced data sets.
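As a quick sketch (using sim_data and lme4 from above), deleting a few observations at random breaks the balance, but the multilevel model still fits without resorting to listwise deletion of whole cases:

```r
# Drop three observations at random and refit -- lmer() copes with the
# resulting unbalanced data, whereas RM ANOVA would require complete,
# balanced cases.
set.seed(1)
unbalanced <- sim_data[-sample(nrow(sim_data), 3), ]
lmer(salary ~ 1 + grade + (1 | major), data = unbalanced)
```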



Multilevel experimental design

Random effects can be nested (e.g., see above) or crossed.

For instance, in a clinical research design we could have repeated measures on individuals in different treatment groups. Each treatment is provided by different therapists, so subjects are nested within therapists (and therapists are nested within treatments).

lmer(y ~ time * tx + (time | therapist/subjects))

To fit a nested effect we use therapist/subjects, which specifies nesting.
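In lme4, the / shorthand expands into one random-effects term per level of nesting:

```r
# (time | therapist/subjects) is shorthand for:
#   (time | therapist) + (time | therapist:subjects)
```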


Alternatively, we could have a design with subject-level randomization to treatment; so each therapist delivers both treatment arms. Therapist now represents a crossed effect:


lmer(y ~ time * tx + (time | therapist:subjects) + (time * tx | therapist))

The crossed specification therapist:subjects is equivalent to subjects:therapist (order does not matter). Also note that “crossed” simply means “not nested”. Experimental situations with crossed random factors, such as “subject” and “stimulus”, are common.

Important note: Nesting or crossing is a property of the data, or rather the experimental design, not the model. However correct specification is necessary when the nested variable (e.g., subject) is not coded uniquely across the higher factor (e.g., therapists). You can avoid confusion about nested and crossed factors by following one simple rule: ensure that different levels of a factor in the experiment correspond to different labels of the factor in the data.
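For example, in sim_data the ids 1–5 repeat within every major, so ids alone does not uniquely label a student. A sketch of making the labels unique (uid is a new, hypothetical column):

```r
# Combine major and ids into globally unique student labels, so the
# grouping structure is unambiguous whichever formula syntax is used
sim_data$uid <- interaction(sim_data$major, sim_data$ids, drop = TRUE)
head(levels(sim_data$uid))
```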


Dr Kristoffer Magnusson has a helpful page on different multilevel designs and how to estimate them using lme4:



Summary

Multilevel models are a method for:

  • estimating variation among groups
  • a compromise between complete pooling (no variance between groups) and no pooling (fixed effects, with all variance between groups)
  • handling groups selected at random from a larger population
  • sharing information among groups (shrinkage estimation)
  • allowing predictions for unobserved instances (e.g., missing data)


Advantages are:

  • accounting for non-independence because your observations are grouped
  • improving predictions from linear models by taking advantage of shrinkage to get better estimates

But:

  • estimates are biased (shrunk towards the fixed effect or zero), which is good for prediction but bad for estimating the true causal effect
  • get complicated quickly


Random effect formula:

  • simplest: (1 | g), a single intercept per level of g
  • nested: (1 | g1/g2)
  • random slopes: (x | g), correlated random intercept and slope
  • independent terms: (1 | g) + (0 + x | g), or equivalently (x || g), uncorrelated intercept and slope



Other Resources