Here is a video [4 mins] to introduce the chapter.
6.1 Overview
In this chapter, we deal with data sets in which the response variable \(y\) is a count and the explanatory variables are all factors – i.e. qualitative variables. Initially we assume \(y\) has a Poisson distribution.
See Chapter 9 of Dobson and Barnett (2008). See also Agresti (1996) An introduction to categorical data analysis.
We assume a generalized linear model with
responses (counts) having independent Poisson distributions;
a logarithmic link function (hence the name log-linear model).
Consider, for example, the two-way contingency table in Table 6.1:
Table 6.1: A two-way contingency table with \(k_1\) rows and \(k_2\) columns, where each entry \(y_{ij}\) is a count.

|            | 1            | 2            | \(\dots\) | \(j\)         | \(\dots\) | \(k_2\)         | Total         |
|------------|--------------|--------------|-----------|---------------|-----------|-----------------|---------------|
| 1          | \(y_{11}\)   | \(y_{12}\)   | \(\dots\) | \(y_{1j}\)    | \(\dots\) | \(y_{1k_2}\)    | \(y_{1+}\)    |
| \(\vdots\) | \(\vdots\)   | \(\vdots\)   |           | \(\vdots\)    |           | \(\vdots\)      | \(\vdots\)    |
| \(i\)      | \(y_{i1}\)   | \(y_{i2}\)   | \(\dots\) | \(y_{ij}\)    | \(\dots\) | \(y_{ik_2}\)    | \(y_{i+}\)    |
| \(\vdots\) | \(\vdots\)   | \(\vdots\)   |           | \(\vdots\)    |           | \(\vdots\)      | \(\vdots\)    |
| \(k_1\)    | \(y_{k_11}\) | \(y_{k_12}\) | \(\dots\) | \(y_{k_1j}\)  | \(\dots\) | \(y_{k_1k_2}\)  | \(y_{k_1+}\)  |
| Total      | \(y_{+1}\)   | \(y_{+2}\)   | \(\dots\) | \(y_{+j}\)    | \(\dots\) | \(y_{+k_2}\)    | \(y_{++}\)    |
In row \(i\) and column \(j\) of Table 6.1 we assume \[
Y_{ij} \sim \text{Po}(\lambda_{ij})
\] where \[
\log \lambda_{ij} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij}.
\tag{6.1}\] Here \(\mu\) is called the main effect; \(\alpha_i\) is a row effect; \(\beta_j\) is a column effect; and \((\alpha \beta)_{ij}\) denotes an interaction effect parameter. Some or all of these effects may be included in the model, with constraints to ensure that the model is identifiable – this is sometimes referred to as the identifiability or aliasing problem. By default, R's model-fitting functions set the first level of each factor to zero to achieve identifiability. Thus, for the two-way model, \[
\alpha_1 = 0, \quad \beta_1 = 0, \quad
(\alpha \beta)_{1j} = (\alpha \beta)_{i1} = 0 \quad \text{for all } i, j.
\]
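To see how R imposes these constraints, we can inspect the model matrix for two small factors directly. This is a minimal sketch; the factors \(\texttt{A}\) and \(\texttt{B}\) below are hypothetical.

Code
# R's default treatment contrasts drop the first level of each factor,
# which corresponds to setting alpha_1, beta_1 and all interaction terms
# involving level 1 to zero.
A <- factor(c(1, 1, 2, 2))   # hypothetical row factor with k1 = 2 levels
B <- factor(c(1, 2, 1, 2))   # hypothetical column factor with k2 = 2 levels
model.matrix(~ A * B)        # columns: (Intercept), A2, B2, A2:B2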
6.2 Motivating examples
6.2.1 Malignant melanoma
The data in Table 6.2 are from a study of 400 patients with malignant melanoma, a particular form of skin cancer (see Dobson and Barnett, p.172). For each tumour, its type and site were recorded. The data in Table 6.2 comprise the numbers of tumours \(y\) in each combination of site and tumour-type. This is a two-way contingency table. We want to know how melanoma frequency depends on site and type.
Table 6.2: Melanoma counts by type and site.

| Type | Head | Trunk | Extremities | Total |
|------|------|-------|-------------|-------|
| Hutchinson’s melanotic freckle | 22 | 2 | 10 | 34 |
| Superficial spreading melanoma | 16 | 54 | 115 | 185 |
| Nodular | 19 | 33 | 73 | 125 |
| Indeterminate | 11 | 17 | 28 | 56 |
| Total | 68 | 106 | 226 | 400 |
Table 6.3: Percentages across columns within rows for melanoma data.

| Type | Head | Trunk | Extremities | Total |
|------|------|-------|-------------|-------|
| Hutchinson’s melanotic freckle | 64.7 | 5.9 | 29.4 | 100 |
| Superficial spreading melanoma | 8.6 | 29.2 | 62.2 | 100 |
| Nodular | 15.2 | 26.4 | 58.4 | 100 |
| Indeterminate | 19.6 | 30.4 | 50.0 | 100 |
| Total | 17.0 | 26.5 | 56.5 | 100 |
Table 6.4: Percentages across rows within columns for melanoma data.

| Type | Head | Trunk | Extremities | Total |
|------|------|-------|-------------|-------|
| Hutchinson’s melanotic freckle | 32.4 | 1.9 | 4.4 | 8.5 |
| Superficial spreading melanoma | 23.5 | 50.9 | 50.9 | 46.3 |
| Nodular | 27.9 | 31.1 | 32.3 | 31.3 |
| Indeterminate | 16.2 | 16.0 | 12.4 | 14.0 |
| Total | 100.0 | 99.9 | 100.0 | 100.0 |
Although we have two factors, \(\texttt{Type}\) and \(\texttt{Site}\), that we may use as predictors, standard ANOVA regression methods are inappropriate here because the dependent variable is not continuous but is instead a count. We will use log-linear regression, a type of generalized linear model, to analyse these data.
Table 6.3 shows row and Table 6.4 column percentages for these data. For example, 15.2% of nodular melanomas occurred in the head and neck, 26.4% in the trunk, and 58.4% in the extremities. Compare this to the equivalent figures for Hutchinson’s melanotic freckles: 64.7%, 5.9%, and 29.4% - strikingly different. So different types of melanomas are more likely to occur in different locations.
6.2.2 Flu vaccine
The data in Table 6.5 are from a randomized controlled trial in which 73 patients were randomized into two groups (see Dobson and Barnett, 2008, p.173). The treatment group was given a flu vaccine, while the control group was given a placebo. Levels of an antibody (HIA) were measured after 6 weeks and classified into three groups: Low, Moderate, and High.
Table 6.5: Antibody responses to flu vaccine from a randomized controlled trial.

| Group | Low | Moderate | High |
|-------|-----|----------|------|
| Placebo | 25 | 8 | 5 |
| Vaccine | 6 | 18 | 11 |
| Total | 31 | 26 | 16 |
Is the pattern of response the same for each treatment group? The percentages in Table 6.6 suggest not - row percentages indicate lower responses in the placebo group.
Table 6.6: Percentages across columns within rows for antibody responses to flu vaccine.

| Group | Low | Moderate | High | Total |
|-------|-----|----------|------|-------|
| Placebo | 65.8 | 21.1 | 13.2 | 100 |
| Vaccine | 17.1 | 51.4 | 31.4 | 100 |
| Total | 42.4 | 35.6 | 22.0 | 100 |
Table 6.7: Percentages across rows within columns for antibody responses to flu vaccine.

| Group | Low | Moderate | High |
|-------|-----|----------|------|
| Placebo | 80.6 | 30.8 | 31.2 |
| Vaccine | 19.4 | 69.2 | 68.8 |
| Total | 100 | 100 | 100 |
6.3 Maximum likelihood estimation
Recall that for each cell \(Y_{ij} \sim \text{Po}(\lambda_{ij})\), so \(\mbox{E}[Y_{ij}]= \lambda_{ij}\). However, the estimate \(\hat{\lambda}_{ij} = y_{ij}\) holds only for the saturated model given by Equation 6.1; in general, for non-saturated models, \(\hat\lambda_{ij} \ne y_{ij}\).
Consider the independence model for a 2-way table, that is where \((\alpha \beta)_{ij} = 0\) for all \(i,j\). Here, \[
\log \lambda_{ij} = \mu +\alpha_i + \beta_j \tag{6.2}\] and \(y_{ij}\) is the observed count for cell \((i,j)\), so the likelihood is given by \[
L(\lambda; y)
= \prod_{i,j} \frac{e^{-\lambda_{ij}} \lambda_{ij}^{y_{ij}}}{y_{ij}!}
\] and the log-likelihood is \[
l(\lambda; y)
= \sum_{i,j} \left\{ y_{ij} \log \lambda_{ij} - \lambda_{ij} -
\log y_{ij}!\right\}.
\] Next, using Equation 6.2, gives \[
l(\lambda; y) = \sum_{i,j} \left\{ y_{ij} (\mu + \alpha_i + \beta_j)
- \exp(\mu + \alpha_i + \beta_j) - \log y_{ij}! \right\}
\] which can be re-written as \[
l(\lambda; y) = \mu y_{++} + \sum_i \alpha_i y_{i+} + \sum_j \beta_j y_{+ j}
- e^{\mu} \left(\sum_i e^{\alpha_i}\right) \left(\sum_j e^{\beta_j}
\right) - \sum_{ij} \log y_{ij}!.
\tag{6.3}\]
The maximum likelihood estimates of the model parameters are obtained in the usual way. Differentiating Equation 6.3 with respect to \(\mu\) and setting the result to zero gives, at the MLE, \[
y_{++} = e^{\hat{\mu}} \left(\sum_i e^{\hat{\alpha}_i}\right)
\left(\sum_j e^{\hat{\beta}_j}\right).
\tag{6.4}\]
Differentiating Equation 6.3 with respect to \(\alpha_i\) (where \(i \ne 1\) because \(\alpha_1=0\) to avoid an identifiability problem) and setting the result to zero gives, at the MLE, \[
y_{i+} = e^{\widehat{\mu}} e^{\widehat{\alpha}_i} \left(\sum_j e^{\widehat{\beta}_j}\right).
\tag{6.5}\] Differentiating Equation 6.3 with respect to \(\beta_j\) (where \(j \not= 1\) because \(\beta_1=0\)) and setting the result to zero gives, at the MLE, \[
y_{+j} = e^{\widehat{\mu}} e^{\widehat{\beta}_j} \left(\sum_i e^{\widehat{\alpha}_i}\right).
\tag{6.6}\] Multiplying Equation 6.5 by Equation 6.6 and dividing by Equation 6.4 then gives \[
\frac{y_{i+}\ y_{+j}}{y_{++}} = e^{\widehat{\mu}+\widehat{\alpha}_i+\widehat{\beta}_j}
= \widehat{\lambda}_{ij}.
\tag{6.7}\] It follows that \[
\widehat{\lambda}_{i+} = y_{i+} \quad
\widehat{\lambda}_{+j} = y_{+j} \quad
\widehat{\lambda}_{++} = y_{++}.
\]
Thus, the total fitted count in row \(i\) is identical to the total observed count in row \(i\). Further, the total fitted count in column \(j\) is equal to the total observed count in column \(j\) and the total fitted count equals the total observed count.
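These closed-form fitted values can be checked numerically against \(\texttt{glm}\). The following is a minimal sketch using a small hypothetical 2 by 3 table of counts; the data values are arbitrary.

Code
# Hypothetical 2 x 3 table of counts, entered row by row
y <- c(10, 20, 30,
       15, 25, 35)
row.F <- factor(rep(1:2, each = 3))
col.F <- factor(rep(1:3, times = 2))
# Fitted values under the independence model ...
fit <- glm(y ~ row.F + col.F, family = poisson)
fitted(fit)
# ... agree with y_{i+} y_{+j} / y_{++}
ytab <- matrix(y, nrow = 2, byrow = TRUE)
outer(rowSums(ytab), colSums(ytab)) / sum(ytab)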
6.4 Model fitting in R
6.4.1 Malignant melanoma
Consider an analysis of Melanoma data introduced in Section 6.2.1.
To test the independence of \(\texttt{Type}\) and \(\texttt{Site}\), we fit the model \[
\texttt{Count} \sim \texttt{Type} + \texttt{Site}
\] assuming Poisson counts and a logarithmic link function, using the commands:
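The commands are not reproduced in the saved output, but they would look something like the sketch below. The variable names \(\texttt{mcount}\), \(\texttt{type.F}\) and \(\texttt{site.F}\) match the call shown in the output; the data-entry order (sites within tumour types, following Table 6.2) and the fit name \(\texttt{glm.mel}\) are assumptions.

Code
# Counts from Table 6.2, entered site within tumour type
mcount <- c(22,  2,  10,    # Hutchinson's melanotic freckle: head, trunk, extremities
            16, 54, 115,    # superficial spreading melanoma
            19, 33,  73,    # nodular
            11, 17,  28)    # indeterminate
type.F <- factor(rep(1:4, each = 3))    # tumour type
site.F <- factor(rep(1:3, times = 4))   # site
glm.mel <- glm(mcount ~ type.F + site.F, family = "poisson")
summary(glm.mel)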
Call:
glm(formula = mcount ~ type.F + site.F, family = "poisson")
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0453 -1.0741 0.1297 0.5857 5.1354
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.7544 0.2040 8.600 < 2e-16 ***
type.F2 1.6940 0.1866 9.079 < 2e-16 ***
type.F3 1.3020 0.1934 6.731 1.68e-11 ***
type.F4 0.4990 0.2174 2.295 0.02173 *
site.F2 0.4439 0.1554 2.857 0.00427 **
site.F3 1.2010 0.1383 8.683 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 295.203 on 11 degrees of freedom
Residual deviance: 51.795 on 6 degrees of freedom
AIC: 122.91
Number of Fisher Scoring iterations: 5
We see that the residual deviance for this model is \(51.795\) on \(6\) degrees of freedom. If the model is true, the residual deviance will have an approximate \(\chi^2\) distribution on \(6\) degrees of freedom. If the model is not true, the residual deviance will probably be too large to correspond to this distribution. Thus we calculate the \(p\)-value in the upper tail of the \(\chi^2_6\) distribution, which can be computed using the command:
Code
pchisq(51.795,6,lower.tail=F)
[1] 2.050465e-09
This strongly indicates that the independence model is inadequate. Therefore, unless we can spot alternative simplifications, we will have to use the saturated model.
We can look at residuals from the model to see where the departures from independence occur. The largest residual is for Hutchinson’s freckle on the head and neck, which occurs more often than would be expected under independence.
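As a sketch (assuming the fit object \(\texttt{glm.mel}\) defined above), the Pearson residuals can be displayed cell by cell with:

Code
# Pearson residuals arranged as a 4 x 3 (type by site) table
round(matrix(residuals(glm.mel, type = "pearson"), nrow = 4, byrow = TRUE), 2)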
For the saturated model, \(\widehat{\lambda}_{ij} = y_{ij}\). In this example, \(\sum_{ij} \widehat{\lambda}_{ij} = \sum_{ij} y_{ij} = 400\), so the probability of a tumour being in category \((i,j)\) is \(y_{ij}/400\) — just the observed proportions.
Note that in Table 6.2 we have a total of \(y_{++} = 400\) observations. If the data were truly Poisson, this would be a suspiciously round number. In reality, this total was fixed by design, so we should take into account the fact that \(y_{++} = 400\) and fit a more suitable model, such as the multinomial model which we will meet in the next chapter.
Overdispersion can occur for the Poisson model, just as in the binomial case – see Section 5.3. This is called extra-Poisson variation. The \(\texttt{glm}\) function in R can take this into account by including an extra parameter \(\tau\) in the model using \(\texttt{family=quasipoisson()}\).
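A minimal sketch of allowing for extra-Poisson variation in the melanoma fit (same formula as before; the fit name \(\texttt{glm.quasi}\) is an assumption):

Code
# Same linear predictor, but the dispersion parameter is now estimated from the data
glm.quasi <- glm(mcount ~ type.F + site.F, family = quasipoisson())
summary(glm.quasi)$dispersion   # estimated dispersion parameter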
6.4.2 Flu vaccine
There is no saved data file for this example – recall it is very small – so we define R variables for the \(\texttt{count}\), the \(\texttt{response}\) (perhaps not an ideal name, as it may be confusing, but it is the correct description in this context) and the \(\texttt{drug}\) used. Don’t forget to declare the explanatory variables as factors.
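One possible way to enter the data is sketched below; the names \(\texttt{count}\), \(\texttt{response.F}\) and \(\texttt{drug.F}\) match the model calls that follow, and the counts are entered row by row from Table 6.5.

Code
# Counts from Table 6.5: placebo row first, then vaccine row
count <- c(25,  8,  5,    # placebo: low, moderate, high
            6, 18, 11)    # vaccine: low, moderate, high
response <- c("L", "M", "H", "L", "M", "H")   # antibody (HIA) response
drug     <- c("P", "P", "P", "V", "V", "V")   # P = placebo, V = vaccine
response.F <- as.factor(response)             # baseline level is "H" (alphabetical)
drug.F     <- as.factor(drug)                 # baseline level is "P"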
Now fit the first model; here we consider the saturated model, with main effects and their interaction.
Code
###########################################
# Fit a log-linear model - saturated model
glm.fit0 = glm(count ~ response.F * drug.F, family = poisson)
summary(glm.fit0)
Call:
glm(formula = count ~ response.F * drug.F, family = poisson)
Deviance Residuals:
[1] 0 0 0 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.60944 0.44721 3.599 0.00032 ***
response.FL 1.60944 0.48990 3.285 0.00102 **
response.FM 0.47000 0.57009 0.824 0.40969
drug.FV 0.78846 0.53936 1.462 0.14379
response.FL:drug.FV -2.21557 0.70539 -3.141 0.00168 **
response.FM:drug.FV 0.02247 0.68663 0.033 0.97389
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2.3807e+01 on 5 degrees of freedom
Residual deviance: -2.6645e-15 on 0 degrees of freedom
AIC: 37.128
Number of Fisher Scoring iterations: 3
Even though the model is a perfect fit – the fitted values are exactly the data values (checked below) – we see that \(\texttt{response.F="M"}\) is not significantly different from the baseline \(\texttt{response.F="H"}\), and also that \(\texttt{drug.F="V"}\) is not significantly different from \(\texttt{drug.F="P"}\).
Let’s check the fitted values:
Code
predict(glm.fit0, type="response")
1 2 3 4 5 6
25 8 5 6 18 11
where we see that, indeed, they are the same as the data.
The Null deviance is \(23.807\) and, under the null hypothesis, follows an approximate \(\chi^2\) distribution on \(5\) degrees of freedom. The corresponding p-value is \(p<0.001\), which is highly significant, and hence the null model is not adequate.
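This can be checked in the same way as before:

Code
pchisq(23.807, 5, lower.tail = FALSE)   # well below 0.001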
We cannot test the saturated model as it fits perfectly.
Next let’s try a reduced model with the interaction removed:
Code
###########################################
# Fit model without interaction
glm.fit1 = glm(count ~ response.F + drug.F, family = poisson)
summary(glm.fit1)
Call:
glm(formula = count ~ response.F + drug.F, family = poisson)
Deviance Residuals:
1 2 3 4 5 6
2.040 -1.630 -1.247 -2.615 1.469 1.128
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.11972 0.27408 7.734 1.04e-14 ***
response.FL 0.66140 0.30783 2.149 0.0317 *
response.FM 0.48551 0.31774 1.528 0.1265
drug.FV -0.08224 0.23428 -0.351 0.7256
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 23.807 on 5 degrees of freedom
Residual deviance: 18.643 on 2 degrees of freedom
AIC: 51.771
Number of Fisher Scoring iterations: 5
Again, \(\texttt{response.F="M"}\) is not significantly different from the baseline \(\texttt{response.F="H"}\), and \(\texttt{drug.F="V"}\) is not significantly different from \(\texttt{drug.F="P"}\).
The model deviance is \(18.643\), which follows a \(\chi^2_2\) distribution, with corresponding p-value \(p <0.001\). This is highly significant, suggesting that this model – and any model without an interaction term – is unlikely to be adequate.
Let us return to a model with interaction, and try merging the response levels \(\texttt{"M"}\) and \(\texttt{"H"}\) – calling the merged level \(\texttt{"HM"}\) (I chose this name so that, alphabetically, it comes first, just as \(\texttt{"H"}\) did):
Code
# Medium response not significant
# Merge response M with response H -- to response HM
response2 = response
response2[response == "H"] = "HM"
response2[response == "M"] = "HM"
response2.F = as.factor(response2)
glm.fit2 = glm(count ~ response2.F * drug.F, family = poisson)
summary(glm.fit2)
Call:
glm(formula = count ~ response2.F * drug.F, family = poisson)
Deviance Residuals:
1 2 3 4 5 6
0.0000 0.5676 -0.6135 0.0000 0.8855 -0.9604
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.8718 0.2774 6.749 1.49e-11 ***
response2.FL 1.3471 0.3419 3.940 8.17e-05 ***
drug.FV 0.8023 0.3338 2.404 0.0162 *
response2.FL:drug.FV -2.2295 0.5640 -3.953 7.71e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 23.807 on 5 degrees of freedom
Residual deviance: 2.405 on 2 degrees of freedom
AIC: 35.533
Number of Fisher Scoring iterations: 4
Interestingly, all terms in this model are significant, but let’s check the overall goodness of fit.
The model deviance is \(2.405\) following a \(\chi^2_2\) distribution with corresponding p-value of \(p=0.3\). This is non-significant suggesting that this model is a good fit to the data.
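This p-value can be confirmed with:

Code
pchisq(2.405, 2, lower.tail = FALSE)   # approximately 0.30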
The fitted values are:
Code
predict(glm.fit2, type="response")
1 2 3 4 5 6
25.0 6.5 6.5 6.0 14.5 14.5
and the residuals (which I have rounded for display purposes):
Code
round(residuals(glm.fit2, type="response"),2)
1 2 3 4 5 6
0.0 1.5 -1.5 0.0 3.5 -3.5
The fitted values are close to the observed data and hence, of course, the residuals are not too large.
The overall conclusion is that the model with response and drug, and their interaction, but with the merging of response levels H and M is an appropriate model. This indicates that the vaccine does have an effect on the antibody response but that it does not influence whether that is a medium or high response.
6.5 Multi-way contingency tables
We can generalize the model notation introduced in Equation 6.1 to accommodate multi-way contingency tables; that is tables of counts indexed by multiple factors. For example, for a 3-way table of cell counts \(y_{ijk}\), the saturated model would be written: \[
Y_{ijk} \sim \text{Po}(\lambda_{ijk})
\] where \[
\log \lambda_{ijk} =
\mu + \alpha_i + \beta_j +\gamma_k
+ (\alpha \beta)_{ij}
+ (\alpha\gamma)_{ik}
+ (\beta \gamma)_{jk}
+ (\alpha \beta \gamma)_{ijk}.
\] Here, \(\alpha_i\), \(\beta_j\) and \(\gamma_k\) are main effects; \((\alpha \beta)_{ij}\), \((\alpha\gamma)_{ik}\), and \((\beta \gamma)_{jk}\) are two-way interaction effects; and \((\alpha \beta \gamma)_{ijk}\) is a three-way interaction.
This approach is easily extended to tables with any number of factors. Note, however, that not all terms need be present in a model, though 3-way or higher-order interactions might sometimes be required.
A hierarchical model is one in which each term in the model is accompanied by all lower-order terms. For example, including the term \((\alpha\beta\gamma)_{ijk}\) would require inclusion of each of the terms \(\alpha_i, \beta_j, \gamma_k, (\alpha\beta)_{ij}, (\alpha\gamma)_{ik}, (\beta\gamma)_{jk}\). We will be concerned only with hierarchical contingency table models. Non-hierarchical models are sometimes appropriate, depending on the nature of the factors involved, but hierarchical models are more generally applicable and easier to interpret.
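In R, hierarchical log-linear models are conveniently specified through the model formula, since operators such as \(\texttt{*}\) automatically include all lower-order terms. The following sketch assumes a vector of counts \(\texttt{y}\) and three factors \(\texttt{A}\), \(\texttt{B}\) and \(\texttt{C}\) (hypothetical names):

Code
# Some hierarchical log-linear models for a three-way table of counts y
fit.main <- glm(y ~ A + B + C,     family = poisson)  # mutual independence
fit.AB   <- glm(y ~ A * B + C,     family = poisson)  # A-B interaction only
fit.all2 <- glm(y ~ (A + B + C)^2, family = poisson)  # all two-way interactions
fit.sat  <- glm(y ~ A * B * C,     family = poisson)  # saturated model
anova(fit.main, fit.AB, fit.all2, fit.sat, test = "Chisq")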
6.6 Exercises
6.1 Overdispersion is an occasional problem when fitting generalized linear models with a known scale parameter in situations where there is unexplained variation. In this exercise we illustrate the problem in Poisson regression. The starting point is the observation that if \(Y \sim P(\lambda)\), then \(\mbox{E}[Y] = \lambda\) and \(\mbox{Var}[Y] = \lambda\).
Consider joint random variables \((X,Y)\) where \(X\) takes two possible values with equal probabilities, \(Pr(X=1) = Pr(X=2) = 1/2\).
Suppose the conditional distribution of \(Y\) given \(X\) is Poisson, \[\begin{equation*} \label{eq:pmix}
Y|(X=1) \sim P(\lambda_1), \quad Y|(X=2) \sim P(\lambda_2), \quad \quad \quad (*)
\end{equation*}\] where \(\lambda_1 < \lambda_2\). Let \(\lambda = (\lambda_1 + \lambda_2)/2\) denote the average value. Thus the marginal distribution of \(Y\) is a mixture of two Poisson distributions. Show that \[\begin{equation*}
\mbox{E}[Y] = \lambda, \quad \mbox{Var}[Y] = \lambda + (\lambda_1 - \lambda_2)^2/4,
\end{equation*}\] that is, although the mean of \(Y\) is the same under the mixture (or conditional Poisson) model as under the Poisson model, the variance is larger.
Start with the definition of expectation as a weighted average of the conditional expectations. Then, for the variance, apply a similar result to find the expectation of the square.
This phenomenon might be observed in data as follows. Let \(n=60\) and let an explanatory variable \(x_i\) take the value \(x_i=1\) for \(i=1, \ldots, 30\) and \(x_i=2\) for \(i=31, \ldots, 60\). Suppose that the observations \(y_i | x_i\) come from the above conditional Poisson model \((*)\).
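A small simulation (a sketch; the values of \(\lambda_1\) and \(\lambda_2\) are arbitrary) illustrates the resulting overdispersion: the pooled sample variance is noticeably larger than the sample mean.

Code
set.seed(1)                   # for reproducibility
lambda1 <- 2; lambda2 <- 8    # arbitrary illustrative values
x <- rep(c(1, 2), each = 30)  # the (possibly unobserved) explanatory variable
y <- rpois(60, lambda = ifelse(x == 1, lambda1, lambda2))
mean(y)                       # close to (lambda1 + lambda2)/2 = 5
var(y)                        # noticeably larger than mean(y)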
Consider fitting the following two models in R with Poisson errors and a log link function: \[\begin{equation*}
\text{(i)} ~~ y \sim 1, \quad \quad \text{(ii)} ~~ y \sim x.
\end{equation*}\]
Since model (ii) is the correct model, it should yield a good fit to the data. But if the experimenter does not know about the variable \(x\), it will only be feasible to fit model (i). Let \(\overline{Y}\) and \(S^2\) denote the sample mean and variance of the \(Y_i, \ i=1,\ldots,60\). Show that \[\begin{equation*}
\mbox{E}[\overline{Y}] = \lambda, \quad
\mbox{E}[S^2] = \lambda +
\frac{60}{59}(\lambda_1 - \lambda_2)^2/4.
\end{equation*}\]
Hence show that the Pearson \(\chi^2\) goodness of fit statistic for model (i) will indicate a poorly fitting model if \(\lambda_1\) and \(\lambda_2\) are far apart.
For the expectation of the sample variance, first find the expectation of the square of the sample mean. Next, note that the Pearson goodness of fit statistic depends on the ratio of the sample variance to the sample mean. Under the conditional Poisson model the variance would be larger but the mean unchanged, hence the statistic gets bigger and so will be further into the upper tail.
This example is very simple, but overdispersion can occur much more widely. Why is overdispersion not a problem for generalized linear models in which the response distribution includes a scale parameter?
In the Poisson, and the Binomial, the problem is caused by the mean and variance being defined by the same model parameter.
6.2 (From Dobson & Barnett, p 163) This question should be done in a computer package such as R. You should think carefully about which variables, if any, to condition on in your analysis: HOME = 1,2,3, CONTACT = 1,2, or SATISFACTION = 1,2,3.
The data relate to an investigation into satisfaction with housing conditions in Copenhagen. Residents of selected areas living in rented houses built between 1960 and 1968 were questioned about their satisfaction and their degree of contact with other residents. The data were tabulated by type of housing. Investigate the associations between satisfaction, contact with other residents and type of housing.
Low Contact:

| Satisfaction: | Low | Medium | High |
|---------------|-----|--------|------|
| Tower blocks | 65 | 54 | 100 |
| Apartments | 130 | 76 | 111 |
| Houses | 67 | 48 | 62 |

High Contact:

| Satisfaction: | Low | Medium | High |
|---------------|-----|--------|------|
| Tower blocks | 34 | 47 | 100 |
| Apartments | 141 | 116 | 191 |
| Houses | 130 | 105 | 104 |
Produce appropriate tables of percentages to gain initial insights into the data; for example, percentages in each contact level by type of housing and level of satisfaction, or percentages in each level of satisfaction by contact and type of housing.
Consider separate two-way tables for each level of the third variable, each time re-scaling so that rows or columns, as appropriate, sum to 100%.
Using e.g. R, fit various log-linear models to investigate interactions between the variables.
Define your own variables and make sure the explanatory variables are converted to factors. Including the three-way interaction would lead to the saturated model – why? – and hence consider only the pairwise two-way interactions (as well as the variables separately).
For some model that fits (at least moderately) well, calculate the Pearson residuals and use them to find where the largest discrepancies are between the observed and expected values.