Historically (at least in psychology), the justification for using mixed-level models and partial pooling was to deal with dependencies among the \(y_i\)s. Partial pooling also produces more precise fixed-effect estimates, but this can create problems for correctly controlling the Type I error rate at \(\alpha = .05\) in the frequentist setting. Moreover, the resulting random-effect estimates (conditional effects) are biased towards the fixed effect (i.e., the group deviations are shrunk towards zero), and so traditionally have not been considered suitable to represent the “true” causal effect, which is often the aim of an experimental science such as psychology. For these reasons, common advice is to assign a variable as a random effect only when you are not interested in its estimates (i.e., when it serves as a control or potential confounding variable).

There are good reasons to be interested in the conditional estimates produced by partial pooling, beginning with the James–Stein result that a shrunken estimate (\(\hat \theta\)) can be a better estimate of the population mean (\(\theta\)) than the sample mean. Part of the intuition behind this paradox is that the sample mean is overly vulnerable to influence or error from outliers in the sample, and a shrunken \(\hat \theta\) reduces that influence. The sample mean is essentially overfit to the sample (but unbiased), while \(\hat \theta\) is underfit (but biased).

The intuition extends to mixed-level (linear) models when we consider that \(\hat \beta\) is an estimate of a latent variable \(\beta\). When the target is a realised latent effect such as \(\beta\), the shrunken prediction is the best linear unbiased predictor (BLUP) of that latent variable, in contrast to the best linear unbiased estimator (BLUE) of a fixed parameter (see the work of Douglas Bates and the wider literature on BLUP vs BLUE, e.g., https://pubmed.ncbi.nlm.nih.gov/10523752/).

Michael Clark has an excellent blog post on shrinkage in random-effects models (random intercept + random slope).

Another good resource is Solomon Kurz’s brms and tidyverse adaptation of Richard McElreath’s Statistical Rethinking.


library(tidyverse)
library(brms)
library(broom)

The amount of shrinkage is a function of the relative variance/precision of the group (j) estimates:

\[ \begin{align} \hat U_j &= \rho_j \bar y_j + (1 - \rho_j) \bar y \\ \text{where} \quad \rho_j &= \frac{\tau^2}{\tau^2 + \sigma^2 / n_j} \end{align} \]
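To make the formula concrete, here is a small sketch computing the pooling factor \(\rho_j\) for a grid of between-group SDs, residual SD, and group sizes (the specific values are illustrative, chosen to mirror the simulation settings used below):

# pooling factor: rho_j = tau^2 / (tau^2 + sigma^2 / n_j)
# illustrative values matching tau_lo/tau_hi, sigma, and the group sizes defined below
expand.grid(tau = c(1.5, 6), sigma = 3, n_j = c(5, 10, 30)) %>%
  mutate(rho_j = tau^2 / (tau^2 + sigma^2 / n_j)) %>%
  arrange(tau, n_j)

When \(\rho_j\) is near 1 (large \(\tau\) and/or large \(n_j\)) the conditional estimate stays close to the group mean \(\bar y_j\); when it is near 0 it is pulled towards the grand mean \(\bar y\).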

Gaussian example

In this example, we are predicting salary among graduates from different majors.


\[ \begin{align} \text{Level 1:} \quad y_{i} &= \alpha_{j[i]} + \epsilon_i \\ \alpha_{j[i]} &= \beta_0 + \beta_{0j} \\ \text{Level 2:} \quad \beta_{0j} &\sim N(0, \ \tau) \\ \epsilon_i &\sim N(0, \ \sigma) \end{align} \]


The parameters governing the amount of shrinkage are tau (\(\tau\)), \(n_j\), and sigma (\(\sigma\)). We will vary tau (tau_lo, tau_hi) as well as \(n\) (n1, n2, n3) to observe the effect on shrinkage below.

average_salary = 5
tau_lo = 1.5
tau_hi = 6
sigma = 3
n1 = 5
n2 = 10
n3 = 30
seed = 6 # for reproducibility (other seeds tried: 9, 1, 5)
set.seed(seed) 
sim_lo <- tibble(
  major = c(rep(letters[1:8], each = n1),
            rep(letters[9:16], each = n2),
            rep(letters[17:24], each = n3))
) %>%
  rownames_to_column(var= "id") %>% 
  group_by(major) %>% 
  mutate(
    n = n(),
    tau = tau_lo,
    alpha_j = rnorm(1, average_salary, tau_lo),
    salary = rnorm(n(), alpha_j, sigma)
  ) %>%
  ungroup()

set.seed(seed) 
sim_hi <- tibble(
  major = c(rep(letters[1:8], each = n1),
            rep(letters[9:16], each = n2),
            rep(letters[17:24], each = n3))
) %>%
  rownames_to_column(var= "id") %>% 
  group_by(major) %>% 
  mutate(
    n = n(),
    tau = tau_hi,
    alpha_j = rnorm(1, average_salary, tau_hi),
    salary = rnorm(n(), alpha_j, sigma)
  ) %>%
  ungroup()

sim_data <- bind_rows(sim_lo, sim_hi)

sim_data %>%
  ggplot(aes(x = major, y = salary)) +
    geom_point() +
    facet_grid(tau~n, scales = "free_x") +
    theme(axis.ticks = element_blank(),
          panel.grid = element_blank(),
          legend.position = "none")

Partial pooling estimates

Partially-pooled estimates can be obtained from brms or lme4. The formula syntax is very similar between the two packages, with the major difference being the definition of priors. The default priors in brms are usually sufficient, or at least a good starting point; here we define priors explicitly to illustrate how it can be done and to make it easy to test different priors (e.g., “exponential(1)” vs “exponential(10)”).

# Partial pooling
lmm_lo <- brm(salary ~ 1 + (1|major),
              data = filter(sim_data, tau == tau_lo),
              prior = c(prior(normal(0, 3), class = Intercept),
                        prior(exponential(10), class = sd)))

lmm_hi <- brm(salary ~ 1 + (1|major),
              data = filter(sim_data, tau == tau_hi),
              prior = c(prior(normal(0, 3), class = Intercept),
                        prior(exponential(10), class = sd)))
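
For comparison, a frequentist fit with lme4 uses essentially the same formula syntax. This is a sketch only (it assumes lme4 is installed) and is not used in the rest of the post:

# equivalent lme4 fit (no priors); shown for comparison only
library(lme4)
lmer_lo <- lmer(salary ~ 1 + (1 | major), data = filter(sim_data, tau == tau_lo))
coef(lmer_lo)$major  # conditional (partially-pooled) intercepts per major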


After fitting, we extract the conditional (partially-pooled) estimates for each major using coef(). The conditional estimates are simply the group-level (random) deviations added to the population-level (fixed) estimate.
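
As a quick check of that relationship, the posterior-mean conditional estimates from coef() should match the population-level intercept from fixef() plus the group-level deviations from ranef() (the robust/median summaries used further below need not add up exactly):

# conditional estimates = fixed intercept + random deviations (posterior means)
head(coef(lmm_lo)$major[, "Estimate", "Intercept"])
head(fixef(lmm_lo)["Intercept", "Estimate"] +
       ranef(lmm_lo)$major[, "Estimate", "Intercept"])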

fixed_effects <- data.frame(
  tau = c(tau_lo, tau_hi),
  fixed = c(fixef(lmm_lo)[1], fixef(lmm_hi)[1])
)
            

cond_effects <- bind_rows(
  # partially pooled estimates:
  coef(lmm_lo, robust = TRUE)$major[, , ] %>%
    as_tibble(rownames = "major") %>%
    transmute(
      major,
      est.brms = Estimate # brms conditional effects
    ) %>%
    # unpooled estimates from lm():
    left_join(
      lm(salary ~ 0 + major, data = filter(sim_data, tau == tau_lo)) %>%
        broom::tidy() %>%
        transmute(
          major = str_remove(term, "major"),
          est.lm = estimate
        ),
      by = "major") %>%
    mutate(tau = tau_lo),
  # partially pooled estimates:
  coef(lmm_hi, robust = TRUE)$major[, , ] %>%
    as_tibble(rownames = "major") %>%
    transmute(
      major,
      est.brms = Estimate # brms conditional effects
    ) %>%
    # unpooled estimates from lm():
    left_join(
      lm(salary ~ 0 + major, data = filter(sim_data, tau == tau_hi)) %>%
        broom::tidy() %>%
        transmute(
          major = str_remove(term, "major"),
          est.lm = estimate
        ),
      by = "major") %>%
    mutate(tau = tau_hi)
)


Conditional estimates are compared to unpooled estimates from a linear model (lm()).

# summarise data (group means equal the unpooled lm() estimates)
sum_data <- sim_data %>%
  group_by(tau, major) %>%
  summarise(
    n = n(),
    salary_m = mean(salary),
    salary_sd = sd(salary),
    .groups = "drop"
  )

sum_data %>%
  left_join(cond_effects, by = c("major", "tau")) %>%
  left_join(fixed_effects, by = "tau") %>%
  mutate(tau = paste("\u03C4 =", tau),
         n = paste("n =", n),
         n = fct_relevel(n, "n = 5")) %>%
  ggplot(aes(x = major)) +
    geom_point(aes(y = est.lm), shape = 1) +
    geom_hline(aes(yintercept = fixed), color = "grey65", size = 1) +
    geom_point(aes(y = est.brms), color = "grey65") +
    facet_grid(tau~n, scales = "free_x") +
    labs(title = "Effect of partial pooling: Conditional estimates (filled) vs unpooled (unfilled)",
         subtitle = "Horizontal line is the fixed estimate (population-level intercept)",
         y = "", caption = prior_summary(lmm_hi)[2, 1]) +
    theme(panel.grid = element_blank(),
          axis.ticks = element_blank())


key points

  • The conditional estimates are generally shifted towards the fixed effect relative to the unpooled estimates (shrinkage)
  • More shrinkage occurs as the estimate is further from the fixed effect
  • Less shrinkage occurs with larger \(n_j\)
  • Less shrinkage occurs as the variance between groups increases (\(\tau\))
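
These patterns can be checked numerically with the objects created above (a rough sketch; the exact numbers depend on the seed and priors):

# distance moved towards the fixed effect, by tau and group size
cond_effects %>%
  left_join(fixed_effects, by = "tau") %>%
  left_join(distinct(sim_data, major, tau, n), by = c("major", "tau")) %>%
  mutate(shrinkage = abs(est.lm - est.brms),  # how far the conditional estimate moved
         distance  = abs(est.lm - fixed)) %>% # how far the unpooled estimate is from the fixed effect
  group_by(tau, n) %>%
  summarise(mean_shrinkage = mean(shrinkage),
            cor_distance_shrinkage = cor(distance, shrinkage),
            .groups = "drop")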

“Tau represents variation around the fixed effect, so when tau is large we shouldn’t trust the fixed effect and we shouldn’t shrink to it”



Comparing estimates

Calculating the squared error between each estimate and the true average salary for each major (\(\alpha_j\)) tells us which type of estimate is best.

select(sim_data, major, n, tau, alpha_j) %>%
  distinct() %>%
  left_join(cond_effects, by = c("major", "tau")) %>%
  mutate(mse.brms = (est.brms - alpha_j)^2,
         mse.lm = (est.lm - alpha_j)^2) %>%
  select(major, n, tau, mse.brms, mse.lm) %>%
  gather("mse", "val", mse.brms, mse.lm) %>%
  group_by(n, tau, mse) %>%
  mutate(average = mean(val),
         tau = paste("\u03C4 =", tau),
         n = paste("n =", n)) %>%
  ungroup() %>%
  mutate(n = fct_relevel(n, "n = 5")) %>%
  ggplot(aes(x = major, y = val, color = tau)) +
    geom_point(aes(shape = mse)) +
    geom_hline(aes(yintercept = average, color = tau, linetype = mse)) +
    facet_grid(tau~n, scales = "free_x") +
    labs(title = "Mean square error from unpooled (unfilled) and partially-pooled (filled)",
         subtitle = "Horizontal line: average mse",
         y = "") + 
    scale_shape_manual(values = c(19, 21)) +
    theme(legend.position = "none",
          panel.grid = element_blank())


key points

  • Partial pooling produces less error, i.e., better estimates of the true group average (\(\alpha_j\))
  • Error decreases as group size (\(n\)) increases
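
As a final numeric check of these points (a sketch using the same objects as the plot above), the mean squared error can be summarised by group size:

# average squared error by estimate type and group size
select(sim_data, major, n, tau, alpha_j) %>%
  distinct() %>%
  left_join(cond_effects, by = c("major", "tau")) %>%
  group_by(n) %>%
  summarise(mse.brms = mean((est.brms - alpha_j)^2),
            mse.lm   = mean((est.lm - alpha_j)^2),
            .groups = "drop")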