datarich(ard)

a personal journey into data science

View My GitHub Profile

18 May 2021

Aggregated regression in R

by datarich(ard)

Aggregated regression

Aggregated regression performs the regression on aggregated (i.e., summary) data rather than individual data. For binomial and quasibinomial families, the glm function allows you to specify the response variable as a two-column matrix with the columns giving the numbers of successess and failures.

e.g., formula = cbind(successes, failures) ~ x1 + x2 + 1

The weights argument in glm can be used to indicate the precision of each observation. Equivalently, when the elements of weights are positive integers, each element indicates the number of observations in each unit. The weights can also be used to give the number of trials when the response is the proportion of successes in a binomial GLM.

Examples

Normally we perform regression on individual data. For example, using the “mtcars” dataset which contains the records of 30-odd different cars:

mtcars %>%
  head(10) %>%
  knitr::kable()
  mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4


A logistic regression to predict the transmission type (“am ~ cyl + gear”) would be performed like this:

summary(glm(am ~ as.factor(cyl) + carb, 
            data = mtcars, 
            family = binomial(link = "logit")))
## 
## Call:
## glm(formula = am ~ as.factor(cyl) + carb, family = binomial(link = "logit"), 
##     data = mtcars)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8699  -0.5506  -0.1869   0.6185   1.9806  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      -0.6718     1.0924  -0.615  0.53854   
## as.factor(cyl)6  -3.7609     1.9072  -1.972  0.04862 * 
## as.factor(cyl)8  -5.5958     1.9381  -2.887  0.00389 **
## carb              1.1144     0.5918   1.883  0.05967 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 26.287  on 28  degrees of freedom
## AIC: 34.287
## 
## Number of Fisher Scoring iterations: 5


Note the degrees of freedom indicate the number of observations in the dataset.

On the other hand, aggregated or summary data might look like this:

Aggregated counts
mtcars_cnt <- count(mtcars, cyl, carb, am)

knitr::kable(mtcars_cnt)
cyl carb am n
4 1 0 1
4 1 1 4
4 2 0 2
4 2 1 4
6 1 0 2
6 4 0 2
6 4 1 2
6 6 1 1
8 2 0 4
8 3 0 3
8 4 0 5
8 4 1 1
8 8 1 1


An aggregated regression is performed using the weights argument to indicate the units per response:

summary(glm(am ~ as.factor(cyl) + carb, 
            data = mtcars_cnt, 
            family = binomial(link = "logit"),
            weights = n
            ))
## 
## Call:
## glm(formula = am ~ as.factor(cyl) + carb, family = binomial(link = "logit"), 
##     data = mtcars_cnt, weights = n)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6445  -1.2312  -0.3738   1.2369   1.9923  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      -0.6718     1.0925  -0.615  0.53858   
## as.factor(cyl)6  -3.7609     1.9074  -1.972  0.04865 * 
## as.factor(cyl)8  -5.5958     1.9383  -2.887  0.00389 **
## carb              1.1144     0.5919   1.883  0.05971 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 12  degrees of freedom
## Residual deviance: 26.287  on  9  degrees of freedom
## AIC: 34.287
## 
## Number of Fisher Scoring iterations: 5


Note the degrees of freedom are incorrect, but the estimates, standard errors and p-values are very similar (although not identical).


Aggregated data might also look like this:

Aggregated trials (successes, failures)
mtcars_wcnt <- count(mtcars, cyl, carb, am) %>%
  spread(am, n, fill = 0) %>%
  rename(auto = `1`, man = `0`) 

knitr::kable(mtcars_wcnt)
cyl carb man auto
4 1 1 4
4 2 2 4
6 1 2 0
6 4 2 2
6 6 0 1
8 2 4 0
8 3 3 0
8 4 5 1
8 8 0 1


The formula and weights arguments need updating:

summary(glm(cbind(auto, man) ~ as.factor(cyl) + carb, 
            data = mtcars_wcnt, 
            family = binomial(link = "logit")
            ))
## 
## Call:
## glm(formula = cbind(auto, man) ~ as.factor(cyl) + carb, family = binomial(link = "logit"), 
##     data = mtcars_wcnt)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.9179  -0.9407  -0.3772  -0.0251   0.4468  -0.3738  -0.5602   0.1789  
##       9  
##  0.3699  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      -0.6718     1.0925  -0.615  0.53858   
## as.factor(cyl)6  -3.7609     1.9074  -1.972  0.04865 * 
## as.factor(cyl)8  -5.5958     1.9383  -2.887  0.00389 **
## carb              1.1144     0.5919   1.883  0.05971 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19.6356  on 8  degrees of freedom
## Residual deviance:  2.6925  on 5  degrees of freedom
## AIC: 18.485
## 
## Number of Fisher Scoring iterations: 5


Aggregated proportions
mtcars_percent <- mtcars_p <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise(
    n = n(),
    am  = sum(am)/n
  )
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
knitr::kable(mtcars_percent)
cyl carb n am
4 1 5 0.8000000
4 2 6 0.6666667
6 1 2 0.0000000
6 4 4 0.5000000
6 6 1 1.0000000
8 2 4 0.0000000
8 3 3 0.0000000
8 4 6 0.1666667
8 8 1 1.0000000


The formula and weights arguments:

summary(glm(am ~ as.factor(cyl) + carb, 
            data = mtcars_percent, 
            family = binomial(link = "logit"),
            weights = n
            ))
## 
## Call:
## glm(formula = am ~ as.factor(cyl) + carb, family = binomial(link = "logit"), 
##     data = mtcars_percent, weights = n)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.9179  -0.9407  -0.3772  -0.0251   0.4468  -0.3738  -0.5602   0.1789  
##       9  
##  0.3699  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      -0.6718     1.0925  -0.615  0.53858   
## as.factor(cyl)6  -3.7609     1.9074  -1.972  0.04865 * 
## as.factor(cyl)8  -5.5958     1.9383  -2.887  0.00389 **
## carb              1.1144     0.5919   1.883  0.05971 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19.6356  on 8  degrees of freedom
## Residual deviance:  2.6925  on 5  degrees of freedom
## AIC: 18.485
## 
## Number of Fisher Scoring iterations: 5


So far very similar results across aggregates and individual datasets, with the only different the degrees of freedom (which will need correcting). However as we start to add independent variables (especially continuous variables), results begin to diverge. Below we add mpg as a predictor:

Individual
summary(glm(formula = am ~ as.factor(cyl) + carb + mpg,
      family = binomial,
      data = mtcars))
## 
## Call:
## glm(formula = am ~ as.factor(cyl) + carb + mpg, family = binomial, 
##     data = mtcars)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8933  -0.4595  -0.1293   0.1475   1.6969  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     -18.3024     9.3442  -1.959   0.0501 .
## as.factor(cyl)6  -1.8594     2.5963  -0.716   0.4739  
## as.factor(cyl)8  -0.3029     2.8828  -0.105   0.9163  
## carb              1.6959     0.9918   1.710   0.0873 .
## mpg               0.6771     0.3645   1.858   0.0632 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 18.467  on 27  degrees of freedom
## AIC: 28.467
## 
## Number of Fisher Scoring iterations: 6


Aggregated proportions
mtcars_percent <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise(
    n = n(),
    am  = sum(am)/n,
    mpg = mean(mpg)
  )
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
knitr::kable(mtcars_percent)
cyl carb n am mpg
4 1 5 0.8000000 27.58
4 2 6 0.6666667 25.90
6 1 2 0.0000000 19.75
6 4 4 0.5000000 19.75
6 6 1 1.0000000 19.70
8 2 4 0.0000000 17.15
8 3 3 0.0000000 16.30
8 4 6 0.1666667 13.15
8 8 1 1.0000000 15.00


glm(formula = am ~ as.factor(cyl) + carb + mpg,
    family = binomial,
    data = mtcars_percent,
    weights = n
  ) %>%
  summary()
## 
## Call:
## glm(formula = am ~ as.factor(cyl) + carb + mpg, family = binomial, 
##     data = mtcars_percent, weights = n)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6         7         8  
##  0.75845  -0.73755  -0.24505  -0.02649   0.34041  -0.50528  -0.74002   0.46178  
##        9  
##  0.17387  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -11.3593    19.9611  -0.569    0.569
## as.factor(cyl)6  -1.7932     3.7491  -0.478    0.632
## as.factor(cyl)8  -1.4419     7.3124  -0.197    0.844
## carb              1.4059     1.0718   1.312    0.190
## mpg               0.3825     0.7014   0.545    0.585
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19.6356  on 8  degrees of freedom
## Residual deviance:  2.3423  on 4  degrees of freedom
## AIC: 20.134
## 
## Number of Fisher Scoring iterations: 6


The standard errors are now larger, and the loss of efficiency is affecting inference.



Also, beware of trying to aggregate like so:

Aggregated counts
mtcars_cnt <- mtcars %>%
  group_by(cyl, carb, am) %>%
  summarise(
    mpg = mean(mpg),
    n = n()
    )
## `summarise()` has grouped output by 'cyl', 'carb'. You can override using the `.groups` argument.
knitr::kable(mtcars_cnt)
cyl carb am mpg n
4 1 0 21.50 1
4 1 1 29.10 4
4 2 0 23.60 2
4 2 1 27.05 4
6 1 0 19.75 2
6 4 0 18.50 2
6 4 1 21.00 2
6 6 1 19.70 1
8 2 0 17.15 4
8 3 0 16.30 3
8 4 0 12.62 5
8 4 1 15.80 1
8 8 1 15.00 1


glm(formula = am ~ as.factor(cyl) + carb + mpg,
    family = binomial,
    data = mtcars_cnt,
    weights = n
  ) %>%
  summary()
## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## 
## Call:
## glm(formula = am ~ as.factor(cyl) + carb + mpg, family = binomial, 
##     data = mtcars_cnt, weights = n)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -1.885e-05  -4.710e-08  -2.110e-08   4.210e-08   1.445e-05  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)        -605.51 1070920.77  -0.001    1.000
## as.factor(cyl)6     -10.83  280328.45   0.000    1.000
## as.factor(cyl)8      90.23  371331.49   0.000    1.000
## carb                 56.42   79751.45   0.001    0.999
## mpg                  19.78   38329.66   0.001    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.3230e+01  on 12  degrees of freedom
## Residual deviance: 7.5814e-10  on  8  degrees of freedom
## AIC: 10
## 
## Number of Fisher Scoring iterations: 25
tags: