4  GLM Estimation

Throughout this module we use the principle of maximum likelihood estimation (MLE) to estimate model parameters and will consider two cases.

4.1 The identically distributed case

4.1.1 Maximum likelihood estimation

Suppose we have \(n\) independent and identically distributed (i.i.d.) observations \(\mathbf{y} = \{y_i,\ i=1,\dots,n\}\), where each \(y_i\) is sampled from the same exponential family density \[ f(y_i; \theta,\phi) = \exp \left\{ \frac{y_i\theta - b(\theta)}{\phi} + c(y_i, \phi) \right\}, \tag{4.1}\] for \(i=1,\dots,n.\) In this case, the canonical parameter \(\theta\) does not depend on \(i\) as the observations are identically distributed and hence must have the same parameter.

By independence, the joint distribution of all the observations \(\mathbf{y}\) is: \[ f(\mathbf{y}; \theta,\phi) = \prod_{i=1}^n f(y_i; \theta, \phi). \] So, taking logs and then substituting for the probability function using the exponential family form, Equation 3.3, gives \[ \log f(\mathbf{y}; \theta,\phi) = \sum_{i=1}^n \log f(y_i; \theta, \phi) = \sum_{i=1}^n \left[\frac{y_i\theta - b(\theta)}{\phi} + c(y_i, \phi)\right]. \] Regarding the observations \(\mathbf{y}\) as constants (which they are, once we have data) and the scale parameter \(\phi\) as a fixed nuisance parameter (whose value we may not know), the log-likelihood as a function of the parameter of interest \(\theta\) is: \[ l(\theta; \mathbf{y},\phi) = n\left( \frac{\bar{y}\, \theta - b(\theta)}{\phi}\right) + \mbox{constant}, \tag{4.2}\] where \(\bar{y}= \sum y_i/n.\)

We estimate \(\theta\) by maximizing the log likelihood – i.e. given the data \(\mathbf{y}\), we estimate the value of \(\theta\) to be that value for which the likelihood, and hence the log-likelihood, is greatest.

We maximize the log-likelihood by differentiating it with respect to \(\theta\) and setting the derivative to zero: \[ \frac{d l(\theta; \mathbf{y},\phi)}{d \theta} = n \left(\frac{\bar{y} - b'(\theta)}{\phi}\right), \] and hence the MLE for \(\theta\), which we denote \(\hat\theta\), satisfies \[ b'(\hat\theta) = \bar{y} \tag{4.3}\] and therefore \[ \hat\theta = (b')^{-1}(\bar{y}). \tag{4.4}\]

Further, we showed in Proposition 3.1 that \(\mbox{E}[Y] = \mu =b'(\theta)\) and if we let \(\hat\mu\) denote the MLE of \(\mu\), then \(\hat\mu = b'(\hat{\theta})\)1, hence we have \(\hat\mu = \bar{y}\). So we find that \(\hat\theta\) is the value of \(\theta\) for which the theoretical mean \(\hat\mu = b'(\hat\theta)\) matches the sample mean \(\bar{y}\).

Example: MLE of the Poisson distribution

For the Poisson distribution, \(\text{Po}(\lambda)\), we have found that \(b(\theta) = e^\theta\) and therefore \(b'(\theta) = e^\theta\). Hence, the MLE of the natural parameter \(\theta\) is found as the solution of \(b'(\hat\theta) = e^{\hat\theta} = \bar{y}\), that is, \(\hat\theta = \log(\bar{y})\).
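
As a quick numerical check, here is a minimal R sketch (the data vector is made up purely for illustration) computing this MLE directly from the sample mean:

Code
# Illustrative i.i.d. Poisson counts
y <- c(3, 1, 4, 2, 0, 5, 2, 3)
lambda.hat <- mean(y)        # MLE of the mean, mu.hat = y.bar
theta.hat  <- log(mean(y))   # MLE of the natural parameter, theta = log(lambda)
c(lambda.hat, theta.hat)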

4.1.2 Estimation accuracy

For our i.i.d. sample \(\mathbf{y} = \{y_i,\ i=1,\dots,n\}\), we have \(b'(\hat\theta) = \hat\mu = \bar y\). Let \(\theta_0\) be the true value of \(\theta\) with corresponding mean \(\mu_0\), i.e. \[ b'(\theta_0) = \mu_0. \tag{4.5}\] How accurate is \(\hat\theta\)? We know that \[ \mbox{E}[\bar Y] = \mbox{E}\left[\frac{1}{n}\sum_{i=1}^n Y_i\right] = \frac{1}{n} \sum_{i=1}^n \mbox{E}[Y_i] = \mu_0 = b'(\theta_0), \tag{4.6}\] using Equation 4.5. Also, \[ \mbox{Var}[\bar Y] = \mbox{Var}\left[\frac{1}{n} \sum_{i=1}^n Y_i\right] = \frac{1}{n^2} \sum_{i=1}^n \mbox{Var}[Y_i] \] because the observations are independent. Then, using the result Equation 3.8 gives \[ \mbox{Var}[\bar Y]= \frac{1}{n} \ b''(\theta_0) \phi. \tag{4.7}\]

We can use Taylor’s theorem to expand \(b'(\hat\theta)\) about \(\theta_0\): \[ \bar Y = b'(\hat\theta) \approx b'(\theta_0) + (\hat\theta - \theta_0) b''(\theta_0), \] which implies that \[ (\hat\theta - \theta_0) \approx b''(\theta_0)^{-1}\{b'(\hat\theta) - b'(\theta_0)\} = b''(\theta_0)^{-1}(\bar Y - \mu_0), \tag{4.8}\] using Equation 4.3 and Equation 4.5. We can use Equation 4.8 to get approximations to the mean and variance of \(\hat\theta\): \[ \mbox{E}[\hat\theta - \theta_0] \approx b''(\theta_0)^{-1} \mbox{E}[\bar Y - \mu_0] = 0, \] using Equation 4.6, so \[ \mbox{E}[\hat\theta] \approx \theta_0, \tag{4.9}\] and \[ \mbox{Var}(\hat\theta) \approx \mbox{E}\left[(\hat\theta - \theta_0)^2\right] \] using Equation 4.9, \[ \mbox{Var}(\hat\theta) \approx \mbox{E}\left[\left(b''(\theta_0)^{-1}(\bar Y - \mu_0)\right)^2\right] \] using Equation 4.8, \[ \mbox{Var}(\hat\theta) \approx \left(b''(\theta_0)\right)^{-2} \mbox{Var}[\bar Y] \] using Equation 4.6, \[ \mbox{Var}(\hat\theta) \approx \frac{\phi}{n \ b''(\theta_0)} \tag{4.10}\] using Equation 4.7.

Thus we see that the first two derivatives of \(b(\theta)\) play a key role in inference.

Example: Accuracy for the Poisson distribution

For the Poisson distribution, \(\text{Po}(\lambda)\), we have found that \(\hat\theta = \log(\bar{Y})\). From Equation 4.9 we know that \(\hat\theta\) is at least approximately unbiased. Then, using \(\phi=1\) (Table 3.2) and \(b'(\theta)=e^{\theta}\) (Table 3.6), so that \(b''(\theta)=e^{\theta}\), Equation 4.10 gives \[ \mbox{Var}(\hat\theta) \approx \frac{\phi}{n \ b''(\theta_0)} = \frac{1}{n e^{\theta_0}}. \]
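
A small simulation sketch in R (the values of \(n\) and \(\lambda_0\) are arbitrary choices for illustration) can be used to check this approximation:

Code
# Check Var(theta.hat) against the approximation 1 / (n * exp(theta0))
set.seed(1)
n <- 50; lambda0 <- 4; theta0 <- log(lambda0)
theta.hat <- replicate(10000, log(mean(rpois(n, lambda0))))
var(theta.hat)           # empirical variance over repeated samples
1 / (n * exp(theta0))    # asymptotic approximation from Equation 4.10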

Focus on maximum likelihood quiz

Test your recall and application of the basic ideas, and prepare for similar concepts later in the chapter.

  1. Which of the following statements best describes the purpose of maximum likelihood estimation?

  2. Which of the following mathematical ideas is NOT used in the above maximum likelihood estimation?

  3. Which of the following terms is used to refer to the fixed parameter φ?

  4. Which of the following is NOT a good property of a parameter estimator?

  5. Which of the following does NOT influence the variance of the estimator of θ?

4.2 The general case

4.2.1 Maximum likelihood estimation

Suppose that now the \(n\) independent observations \(\mathbf{y} = \{y_i,\ i=1,\dots,n\}\) are not identically distributed. They are, however, sampled from the same exponential family density but with differing parameters, that is \[ f(y_i; \theta_i,\phi) = \exp \left\{ \frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i, \phi) \right\}, \] for \(i=1,\dots,n.\) In this case, the canonical parameter does depend on \(i\) – but we assume that the scale parameter \(\phi\) does not – and let \(\boldsymbol{\theta} = \{\theta_i,\ i=1,\dots,n\}\).

In most applications, we are not interested in estimating \(\boldsymbol{\theta}\) itself but rather the linear predictor parameters \(\boldsymbol{\beta} =\{\beta_1,\dots,\beta_p\}\). Note, however, that each \(\theta_i\) will depend on all of \(\beta_1,\dots,\beta_p\). This is most obvious when the canonical link is used, so that \(\boldsymbol{\theta} = \eta = X\boldsymbol{\beta}\) and hence \(\theta_i= \mathbf{x}_i^T \boldsymbol{\beta} =\sum_{j=1}^p x_{ij}\beta_{j}\), where each \(\theta_i\) clearly depends on all of \(\beta_1,\dots,\beta_p\).

The principle of maximum likelihood will be used to estimate the model parameters \(\boldsymbol{\beta}\). Using the independence of \(\mathbf{y} = \{y_i,\ i=1,\dots,n\}\), given the parameters \(\boldsymbol{\beta}\), the likelihood function is: \[ L(\boldsymbol{\beta}; \mathbf{y}, \phi) = f(\mathbf{y}; X, \boldsymbol{\beta},\phi) = \prod_{i=1}^n f(y_i; \mathbf{x}_i, \boldsymbol{\beta}, \phi) \tag{4.11}\] and the log-likelihood is given by \[ l (\boldsymbol{\beta}; \mathbf{y}, \phi) = \log L(\boldsymbol{\beta}; \mathbf{y}, \phi) = \sum_{i=1}^n \log f(y_i; \mathbf{x}_i, \boldsymbol{\beta}, \phi). \tag{4.12}\] Then we wish to find the value of \(\boldsymbol{\beta}\) which maximizes the log-likelihood function, \[ \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} l (\boldsymbol{\beta}; \mathbf{y}, \phi). \] For generalized linear models there is usually no closed-form expression for the MLE \(\hat{\boldsymbol{\beta}}.\) Instead, an iterative approach based on Newton’s Method is often used.

For the normal linear regression model, however – that is, where the exponential family is Gaussian and the link function \(g(\mu)\) is the identity function – we have the familiar closed-form expression \[ \hat{\boldsymbol{\beta}} = \left(X^T X\right)^{-1} X^T \mathbf{y}. \tag{4.13}\]
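
As an illustration, here is a minimal R sketch (with made-up toy data) that evaluates Equation 4.13 directly and compares the result with lm():

Code
# Compare the closed-form estimate (X'X)^{-1} X'y with lm() on toy data
set.seed(2)
n <- 20
X <- cbind(1, runif(n))                  # design matrix including an intercept
y <- drop(X %*% c(2, 3) + rnorm(n, sd = 0.5))
beta.hat <- solve(t(X) %*% X, t(X) %*% y)
beta.hat
coef(lm(y ~ X[, 2]))                     # the same estimates from lm()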

4.2.2 The score function and Fisher information

We define the score function: \[ U(\boldsymbol{\beta}) = \frac{\partial l(\boldsymbol{\beta}; \mathbf{y},\phi)} {\partial \boldsymbol{\beta}}, \tag{4.14}\] which is a \(p \times 1\) vector. We define the observed Fisher information: \[ {\cal I}(\boldsymbol{\beta}) = - \frac{\partial U(\boldsymbol{\beta})} {\partial \boldsymbol{\beta}^T} = -\frac{\partial^2 l(\boldsymbol{\beta}; \mathbf{y},\phi)} {\partial \boldsymbol{\beta} \, \partial \boldsymbol{\beta}^T}, \tag{4.15}\] which is a \(p \times p\) matrix whose \((j,k)\)th element is: \(-\frac{\partial^2 l(\boldsymbol{\beta}; \mathbf{y},\phi)} {\partial \beta_j\partial \beta_k}\). We also define the expected Fisher information: \[ {\cal J}(\boldsymbol{\beta}) = \mbox{E}\left[ -\frac{\partial^2 l(\boldsymbol{\beta}; \mathbf{y},\phi)} {\partial \boldsymbol{\beta} \, \partial \boldsymbol{\beta}^T} \right], \tag{4.16}\] which is also a \(p \times p\) matrix.

Newton’s Method then involves choosing an initial value for the parameter vector, \(\hat{\boldsymbol{\beta}}_0\) say, and then repeatedly updating it, until convergence, using \(\hat{\boldsymbol{\beta}}_t = \hat{\boldsymbol{\beta}}_{t-1} + {\cal I}^{-1}(\hat{\boldsymbol{\beta}}_{t-1})\; U(\hat{\boldsymbol{\beta}}_{t-1})\); the plus sign arises because \({\cal I}\) is defined as the negative Hessian of the log-likelihood. A related method, called Fisher Scoring, replaces \({\cal I}\) in the iterative step by \({\cal J}\) – calculating the expected Fisher information is more work but does lead to better numerical properties.
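
Below is a minimal R sketch of Fisher scoring for a Poisson GLM with the log (canonical) link, for which the observed and expected information coincide; the data are simulated purely for illustration:

Code
# Fisher scoring for a Poisson GLM with log link (toy simulated data)
set.seed(3)
n <- 100
x <- runif(n)
X <- cbind(1, x)
y <- rpois(n, exp(1 + 2 * x))

beta <- c(0, 0)                          # starting value
for (t in 1:25) {
  mu   <- as.vector(exp(X %*% beta))     # fitted means
  U    <- t(X) %*% (y - mu)              # score vector
  J    <- t(X) %*% (X * mu)              # expected information X^T W X, W = diag(mu)
  step <- solve(J, U)                    # Fisher scoring / Newton step
  beta <- beta + as.vector(step)
  if (max(abs(step)) < 1e-8) break       # stop when the update is negligible
}
beta
coef(glm(y ~ x, family = poisson))       # agrees with glm()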

Proposition 4.1 With the definitions in Equation 4.14, Equation 4.15 and Equation 4.16 above,

  • \(\mbox{E}\left[U(\boldsymbol{\beta})\right] = 0\),
  • \({\cal J}(\boldsymbol{\beta}) = \mbox{E}\left[U(\boldsymbol{\beta}) U^T(\boldsymbol{\beta})\right]\).

Proof We give the proof for continuous random variables. For the discrete case, replace integration by sums – see Exercises.

To start the proof, notice that we can re-write the joint density of the data given the parameters as \[ f(\mathbf{y}; X, \boldsymbol{\beta},\phi) = L(\boldsymbol{\beta}; \mathbf{y}, \phi) =\exp\left\{ l(\boldsymbol{\beta}; \mathbf{y},\phi)\right\} \] and then \[ 1 = \int f(\mathbf{y}; X, \boldsymbol{\beta},\phi) d\mathbf{y} = \int \exp\{ l(\boldsymbol{\beta}; \mathbf{y},\phi)\} d\mathbf{y}, \] where \(d\mathbf{y}=dy_1\cdots dy_n\). Differentiating this with respect to \(\boldsymbol{\beta}=(\beta_1,\dots,\beta_p)^T\) gives: \[\begin{align*} 0 & = \int \frac{\partial l(\boldsymbol{\beta}; \mathbf{y},\phi)} {\partial \boldsymbol{\beta}} \exp\{ l(\boldsymbol{\beta}; \mathbf{y},\phi)\} d\mathbf{y} \\ & = \int U(\boldsymbol{\beta}) f(\mathbf{y}; X, \boldsymbol{\beta},\phi) d\mathbf{y} \tag{$\star$} \\ &= \mbox{E}\left[U(\boldsymbol{\beta})\right]. \end{align*}\] This proves the first part.

Next, differentiating \((\star)\) with respect to \(\boldsymbol{\beta}^T\) and using the product rule, \[\begin{align*} 0 & = \int \left[ \frac{\partial U(\boldsymbol{\beta})} {\partial \boldsymbol{\beta}^T} f(\mathbf{y}; X, \boldsymbol{\beta},\phi) + \frac{\partial l(\boldsymbol{\beta}; \mathbf{y},\phi)} {\partial \boldsymbol{\beta}} \frac{\partial l(\boldsymbol{\beta}; \mathbf{y},\phi)} {\partial \boldsymbol{\beta}^T} f(\mathbf{y}; X, \boldsymbol{\beta},\phi) \right] d\mathbf{y} \\[2mm] & = \int - {\cal I}(\boldsymbol{\beta}) f(\mathbf{y}; X, \boldsymbol{\beta},\phi) d\mathbf{y} + \int U(\boldsymbol{\beta}) U^T(\boldsymbol{\beta}) f(\mathbf{y}; X, \boldsymbol{\beta},\phi) d\mathbf{y} \\[2mm] & = - {\cal J}(\boldsymbol{\beta}) + \mbox{E}\left[U(\boldsymbol{\beta}) U^T(\boldsymbol{\beta})\right], \end{align*}\] which proves the second part.

Proposition 4.2 Under some regularity conditions, the MLE \(\hat{\boldsymbol{\beta}}\) of \(\boldsymbol{\beta}\) has the following asymptotic properties:

  • \(\mbox{E}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}\); i.e. \(\hat{\boldsymbol{\beta}}\) is asymptotically unbiased for \(\boldsymbol{\beta}\).
  • \(\mbox{Var}(\hat{\boldsymbol{\beta}}) = {\cal J}^{-1}(\boldsymbol{\beta})\).
  • \(\hat{\boldsymbol{\beta}}\) follows a \(p\)-dimensional Normal distribution.

Combining these three statements we have that asymptotically \[ \hat{\boldsymbol{\beta}} \sim N_p (\boldsymbol{\beta}, {\cal J}^{-1}(\boldsymbol{\beta})). \tag{4.17}\]

Proof: Omitted. Similar to the proof using Taylor’s theorem in Section 4.1.

From Equation 4.17, variances \(\text{Var}(\hat{\beta}_{k})\), standard errors \(\text{se}(\hat{\beta}_{k})\) and correlations between parameter estimates \(\text{Corr}(\hat{\beta}_{k},\hat{\beta}_{h})\) can be estimated.
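
In R (model fitting itself is covered in Section 4.5), these quantities can be extracted from a fitted model object; a brief sketch using a toy Poisson fit:

Code
# Extract the quantities in Equation 4.17 from a fitted GLM (toy data)
set.seed(4)
x <- runif(30)
y <- rpois(30, exp(1 + x))
fit <- glm(y ~ x, family = poisson)
V <- vcov(fit)            # estimated variance matrix, J^{-1}(beta.hat)
sqrt(diag(V))             # standard errors se(beta.hat_k)
cov2cor(V)                # correlations Corr(beta.hat_k, beta.hat_h)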

4.2.3 The saturated case

Again we assume the observations \(y_i,\ i=1,\dots,n\) are independent but now we assume that \(y_i\) is sampled from an exponential family probability function with canonical parameter \(\theta_i\), \[ f(y_i; \theta_i,\phi) = \exp \left\{ \frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i, \phi) \right\}, \] for \(i=1,\dots,n.\) We can form the log-likelihood in the usual way to give \[ l (\boldsymbol{\theta}; \mathbf{y}, \phi) = \sum_{i=1}^n \log f(y_i; \theta_i, \phi) = \sum_{i=1}^n \left[\frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i, \phi)\right] \] and so, maximizing over each \(\theta_i\) separately, we find the MLE of \(\boldsymbol{\theta}\) from \[ \hat \theta_i = (b')^{-1}(y_i), \quad i=1,\dots,n. \] Note that each parameter estimate is a function only of the corresponding observation.

Further, from Proposition 3.1, \(\mbox{E}[Y_i] = \mu_i =b'(\theta_i)\) and if we again let \(\hat\mu_i\) denote the MLE of \(\mu_i\), then \(\hat\mu_i = b'(\hat{\theta_i})\), hence we have \(\hat\mu_i = y_i\). Thus we see that, under the saturated model, the mean of the distribution of \(y_i\) is estimated to be equal to \(y_i\) itself. That is, the data are fitted exactly by the model. Of course this model is quite useless for explanation or prediction, since it misinterprets random variation as systematic variation. Nevertheless, the saturated model is useful as a benchmark for comparing models, as we will see later.
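
A quick R illustration of this, using toy counts and one factor level per observation so that the model is saturated:

Code
# A saturated Poisson model reproduces the data exactly
y <- c(2, 5, 1, 7, 3)
sat <- glm(y ~ factor(seq_along(y)), family = poisson)
fitted(sat)      # equal to y
deviance(sat)    # zero, up to numerical rounding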

It is worth noting that the same situation can occur even when modelling in terms of the regression parameters \(\beta_j,\ j=1,\dots,p\). When \(p\geq n\) and the design matrix \(X\) has rank \(n\), each \(\theta_i\) can take any value independently of the others, so estimating the \(\beta_j\)’s is equivalent to estimating the \(\theta_i\)’s. That is, we are again fitting the saturated, or full, model. This highlights the danger of putting too many covariates into the model. There is a large literature on how to deal with more parameters than data using techniques of regularized regression.

Focus on general maximum likelihood quiz

Test your recall and application of the basic ideas, and prepare for similar concepts later in the chapter.

  1. Which of the following statements best describes the purpose of maximum likelihood estimation?

  2. Which of the following mathematical ideas is NOT used in the general case of maximum likelihood estimation?

  3. Which of the following terms is NOT used in maximum likelihood theory?

  4. Which of the following is a true statement about the saturated case?

  5. Which of the following is NOT a true statement about the asymptotic properties of the maximum likelihood estimator of β?

4.3 Model deviance

The deviance is a quantity we use to assess the fit of a model to the data. Let \(M\) be a model of interest with fitted parameters \(\hat{\boldsymbol\theta}\) and corresponding fitted values \(\hat{\boldsymbol\mu}\). Also consider the saturated model with fitted parameters \(\tilde{\boldsymbol\theta}\) and fitted values \(\tilde{\boldsymbol\mu}\).

The deviance of model \(M\) is defined as twice the difference between the log-likelihood of the saturated model, \(l(\tilde{\boldsymbol\theta};\mathbf{y},\phi)\), and the log-likelihood of model \(M\), \(l(\hat{\boldsymbol\theta}; \mathbf{y},\phi)\), multiplied by \(\phi\), \[\begin{eqnarray} D & = & 2 \phi \left\{ l(\tilde{\boldsymbol\theta};\mathbf{y},\phi) - l(\hat{\boldsymbol\theta}; \mathbf{y},\phi) \right\} \label{eq:deviance}\\[4mm] & = & \left\{ \begin{array}{ll} \sum_{i=1}^n (y_i - \hat\mu_i)^2 = \mbox{Residual sum of squares} & \mbox{ Normal} \\[3mm] 2 \sum_{i=1}^n \left\{ y_i \log \left( \frac{y_i}{\hat\mu_i} \right) + (m_i - y_i) \log \left( \frac{m_i - y_i}{m_i - \hat\mu_i} \right) \right\} & \mbox{ Binomial} \\[3mm] 2 \sum_{i=1}^n \left\{ y_i \log \left( \frac{y_i}{\hat\mu_i} \right) - y_i + \hat\mu_i \right\} & \mbox{ Poisson} \\ \end{array} \right. \notag \end{eqnarray}\]

Note that Dobson, and others, call \(D^*= D/\phi=2 \{ l(\tilde{\theta};y,\phi) - l(\hat{\theta}; y,\phi)\}\) the scaled deviance.
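
As a check on the Poisson formula above, here is a minimal R sketch that computes the deviance by hand and compares it with deviance() (toy data with no zero counts, so that \(y_i\log(y_i/\hat\mu_i)\) is defined):

Code
# Poisson deviance computed by hand versus deviance()
y <- c(2, 5, 1, 7, 3, 4)
x <- 1:6
fit <- glm(y ~ x, family = poisson)
mu <- fitted(fit)
D <- 2 * sum(y * log(y / mu) - y + mu)
c(by.hand = D, from.R = deviance(fit))     # the two values agree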

We now consider two situations:

Scale parameter \(\phi\) known. For some data-types (e.g. Poisson, Binomial), we know \(\phi=1\). Consider two nested models \(M_1\) and \(M_2\) with \(r_1\) and \(r_2\) parameters respectively where the parameters in \(M_1\) are a subset of those in \(M_2\) and hence \(r_1 < r_2\). Further, let \(D_1\) and \(D_2\) be the deviances of model \(M_1\) and \(M_2\) respectively.

Then, asymptotically,

  • the log likelihood-ratio statistic \(D_1 - D_2 \sim \chi^2_{r_2 - r_1}\) can be used to test the importance of the extra parameters in \(M_2\) not included in \(M_1;\)

  • a goodness-of-fit test for \(M_2\) can be done based on \(D_2 \sim \chi^2_{n - r_2}\).

The quality of the approximations involved depends on there being a large amount of information, for example, large counts for Binomial and Poisson data, or a large sample size for Normal data.
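
Here is a minimal R sketch of both tests for nested Poisson models (so \(\phi=1\)), using simulated toy data; anova() with test = "Chisq" carries out the deviance comparison:

Code
# Nested-model and goodness-of-fit tests when phi = 1 (toy Poisson data)
set.seed(5)
x1 <- runif(40); x2 <- runif(40)
y  <- rpois(40, exp(0.5 + x1))
M1 <- glm(y ~ x1, family = poisson)
M2 <- glm(y ~ x1 + x2, family = poisson)
anova(M1, M2, test = "Chisq")                               # D1 - D2 on r2 - r1 df
pchisq(deviance(M2), df.residual(M2), lower.tail = FALSE)   # goodness of fit of M2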

Scale parameter \(\phi\) unknown. For some data-types (e.g. Normal, Gamma), \(\phi\) is not known (typically \(\phi = \sigma^2\)). We must find a model \(M_3\) big enough to be believed, then estimate \(\phi\) by the residual mean square: \[ \hat{\phi} = \frac{D_3}{n - r_3}. \tag{4.18}\] Then test \(M_1\) against \(M_2\) using \[ \mbox{F}=\frac{(D_1 - D_2) / (r_2 - r_1)}{\hat{\phi}} = \frac{(D_1 - D_2) / (r_2 - r_1)}{D_3/(n-r_3)} \tag{4.19}\] with \[ \mbox{F} \sim F_{r_2 - r_1, n - r_3}. \tag{4.20}\] So if the observed value of the statistic Equation 4.19 was within the upper (say 5%) tail of the \(F\)-distribution Equation 4.20, we would infer that Model \(M_2\) is better than Model \(M_1\).
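
When \(\phi\) is unknown, anova() with test = "F" gives the corresponding F test; below is a sketch with simulated toy Gamma data. Note that the default estimate of \(\phi\) in R is based on the Pearson statistic rather than the residual deviance of Equation 4.18, so the numbers may differ slightly from a hand calculation.

Code
# F test for nested models with unknown dispersion (toy Gamma data)
set.seed(6)
x1 <- runif(40); x2 <- runif(40)
y  <- rgamma(40, shape = 5, rate = 5 / exp(1 + x1))   # mean exp(1 + x1)
M1 <- glm(y ~ x1, family = Gamma(link = "log"))
M2 <- glm(y ~ x1 + x2, family = Gamma(link = "log"))
anova(M1, M2, test = "F")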

4.4 Model residuals

Consider a generalized linear model with observed values \(y_i, i=1,\dots,n\) and fitted values \(\hat\mu_i\). Then the raw or response residuals are defined by \[ e_i^\text{raw} = y_i - \hat\mu_i. \]

More useful are the standardized or Pearson residuals defined by \[ e_i^\text{std} = e_i^\text{P} = \frac{y_i - \hat\mu_i}{\sqrt{ b''(\hat\theta_i)}}. \] Recall from Equation 3.8 that \(\text{Var}(Y_i) = \phi \, b''(\theta_i)\), so the denominator estimates the standard deviation of \(Y_i\) up to the factor \(\sqrt{\phi}\).

Deviance residuals are defined so that the sum of squared deviance residuals equals the total deviance. Thus we set \[ e_i^\text{dev} = \text{sign}(y_i - \hat\mu_i) \sqrt{d_i}, \] where \(d_i\) is the contribution of observation \(i\) to the deviance, \(D\). For example, when \(y_i\) has a Poisson distribution with estimated mean \(\hat{\mu}_i\), we have \[ e_i^\text{dev} = \text{sign}(y_i - \hat\mu_i) \sqrt{2\left[y_i \log\left(\frac{y_i}{\hat{\mu}_i}\right) - y_i + \hat{\mu}_i\right]}. \]

Residuals are useful for assessing the overall fit of a model to the data, and for identifying where the model might need to be improved.
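
In R, each type of residual can be obtained from a fitted model with residuals(); a brief sketch with toy data:

Code
# The three residual types for a toy Poisson fit
y <- c(2, 5, 1, 7, 3, 4)
x <- 1:6
fit <- glm(y ~ x, family = poisson)
residuals(fit, type = "response")          # raw residuals y - mu.hat
residuals(fit, type = "pearson")           # Pearson residuals
residuals(fit, type = "deviance")          # deviance residuals
sum(residuals(fit, type = "deviance")^2)   # equals deviance(fit)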

4.5 Fitting generalized linear models in R

4.5.2 Example of fitting Poisson GLM in R

Here is a toy example of R commands for modelling a response in terms of two qualitative explanatory variables (that is, factors). The model assumes the data are Poisson-distributed and uses the logarithmic link function.

Code
y = c(1, 2, 4, 7, 8, 10, 10, 7, 10, 2, 8, 16)
a = rep(1:4,times=3)
b = rep(1:3,each=4)
a = as.factor(a)
b = as.factor(b)

y.glm = glm(y ~ a+b, poisson)

summary(y.glm, correlation=T)

Call:
glm(formula = y ~ a + b, family = poisson)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.1408     0.3351   3.404 0.000663 ***
a2           -0.3054     0.3522  -0.867 0.385934    
a3            0.1466     0.3132   0.468 0.639712    
a4            0.4568     0.2932   1.558 0.119269    
b2            0.9163     0.3162   2.898 0.003761 ** 
b3            0.9445     0.3150   2.999 0.002712 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 31.725  on 11  degrees of freedom
Residual deviance: 13.150  on  6  degrees of freedom
AIC: 68.227

Number of Fisher Scoring iterations: 5

Correlation of Coefficients:
   (Intercept) a2    a3    a4    b2   
a2 -0.45                              
a3 -0.50        0.48                  
a4 -0.54        0.51  0.57            
b2 -0.67        0.00  0.00  0.00      
b3 -0.68        0.00  0.00  0.00  0.72
Code
anova(y.glm)
Analysis of Deviance Table

Model: poisson, link: log

Response: y

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev
NULL                    11     31.725
a     3   6.2793         8     25.445
b     2  12.2947         6     13.150
Code
deviance(y.glm)
[1] 13.15047
Code
df.residual(y.glm)
[1] 6
Code
fitted.values(y.glm)
        1         2         3         4         5         6         7         8 
 3.129412  2.305882  3.623529  4.941176  7.823529  5.764706  9.058824 12.352941 
        9        10        11        12 
 8.047059  5.929412  9.317647 12.705882 

Focus on GLM fitting in R quiz

Test your recall and application of the basic ideas, and prepare for similar concepts later in the chapter.

  1. Which of the following is the main R command used to fit generalised linear models?

  2. Which of the following is NOT an option for the model family?

  3. Which of the following is NOT a link option in a GLM?

  4. Which of the following is NOT a link option with the binomial family?

  5. Which of the following is NOT an R command used to check model fit?

4.6 Exercises

4.1 Use Equation 4.4 to obtain estimation equations for the natural parameter \(\theta\), based on a sample \(\mathbf{y}=\{y_1,\dots, y_n\}\), for each of the following situations:

  1. the binomial, \(Y\sim\mbox{Bin}(m,p)\),
  2. the geometric, \(Y\sim\mbox{Ge}(p)\),
  3. the exponential, \(Y\sim\mbox{Exp}(\lambda)\).
Assuming that \(n\) is very large, use Equation 4.9 to comment on possible bias of the estimator for large samples. What is the corresponding variance of the estimator?

For each situation, equate the theoretical expectation, in terms of the derivative of \(b\), to the sample mean – that is, start from Equation 4.3 and solve. For bias and variance you should consider the asymptotic results.

4.2 For a sample of size \(n\) from the normal distribution, \(Y\sim N(\mu, \sigma^2),\) how do the results produced using Equation 4.9 and Equation 4.10 compare with the familiar results \(\hat\mu =\bar Y\), \(\text{E}[\hat\mu]=\mu\), and \(\text{Var}[\hat\mu] =\sigma^2/n?\)

Identify \(\theta\), \(\phi\) and \(b''(\theta)\), then substitute into the equations.

4.3 For the normal linear regression model, \(\mathbf{Y}=X\boldsymbol{\beta}+\boldsymbol{\epsilon}\) where \(\boldsymbol{\epsilon}\sim N_n(0,\sigma^2 I_n)\) use the principle of maximum likelihood to show that the MLE has the closed form given in Equation 4.13.

Apply the rules for matrix differentiation of a quadratic form. If needed, refer to Appendix C in the Basic Pre-requisite Material folder in Minerva.

4.4 Check that you can derive the formulas given for deviance of normal, binomial, and Poisson models at the start of Section 4.3.

Form the likelihoods of the saturated model, recalling that the fitted values are exactly equal to the observed \(y\) values, and substitute them into the general deviance equation. Then, simplify.

4.5 How do the deviance tests defined in Section 4.3 relate to the tests used to find the best fit model for the birthweight example in Section 2.1?

Start by comparing the notation. Then, replace the R and r in the F-statistic with D and adjust the notation for degrees of freedom. You should then get very similar expressions, the only difference being the model from which \(\sigma^2\) is estimated.

4.6 Consider a clinical trial into the effectiveness of Amoxicillin (a widely used antibiotic) against bacterial chest infections in children. A sample of 200 children with chest infections were randomly assigned to one of five groups with Amoxicillin dose levels of 20-100 mg/kg per day. Four weeks later it was recorded whether the child had required a re-treatment with a further dose of antibiotic, or not, with results summarized in Table 4.1.

Table 4.1: Effectiveness of Amoxicillin against bacterial chest infections in children

                20mg   40mg   60mg   80mg   100mg
Group size        23     18     14     23      22
Re-treatment      20     14      5      6       3

First fit a normal linear model and then fit a generalized linear model assuming binomial errors and a logistic link. Superimpose both fitted models onto a scatter plot of the data. Which do you think is the better model? Justify your answer.

The linear model part should be straightforward; look at earlier examples if not. For the glm case, check the R code block for Figure 1.1 (b). Simply compare the fits by eye.



  1. Using the result that the MLE of any function of a parameter is given by the same function applied to the MLE of the parameter.