3  GLM Theory


3.1 Motivating examples

We cannot always assume that the dependent variable \(Y\) is normally distributed. For example, for the beetle mortality data in Table 1.1, suppose each beetle subjected to a dose \(x_i\) has a probability \(p_i\) of being killed. Then, the number of beetles killed \(Y_i\) out of a total number \(m_i\) at dose-level \(x_i\) will have a \(\text{Bin}(m_i,p_i)\) distribution with probability mass function

\[ \text{Pr}(y_i ;~ p_i,m_i) = \left(\begin{array}{c} m_i\\ y_i \end{array} \right) p_i^{y_i} (1-p_i)^{m_i-y_i} \tag{3.1}\] where \(y_i\) takes values in \(\{0,1,\dots,m_i\}\).

Table 3.1 contains seasonal data on tropical cyclones for 13 seasons. Suppose that, within season \(i\), there is a constant probability \(\lambda_i dt\) of a cyclone occurring in any short time-interval \(dt\). Then the total number of cyclones \(Y_i\) during season \(i\) will have a Poisson distribution with mean \(\lambda_i\), that is \(Y_i\sim \text{Po}(\lambda_i)\), with probability mass function \[ \text{Pr}(y_i ;~ \lambda_i) = \frac{\lambda_i^{y_i} e^{-\lambda_i} }{y_i!} \tag{3.2}\] where \(y_i\) takes values in \(\{0,1,2,\dots\}\).

Table 3.1: Numbers of tropical cyclones in \(n = 13\) successive seasons [1]

| Season          | 1 | 2 | 3 | 4 | 5 | 6 | 7  | 8 | 9 | 10 | 11 | 12 | 13 |
|-----------------|---|---|---|---|---|---|----|---|---|----|----|----|----|
| No. of cyclones | 6 | 5 | 4 | 6 | 6 | 3 | 12 | 7 | 4 | 2  | 6  | 7  | 4  |

In these two examples, we have non-normal data and would like to know whether and how the dependent variable \(Y_i\) depends on the covariate dose or season.
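
As a quick numerical check of Equation 3.1 and Equation 3.2, the built-in R functions dbinom and dpois evaluate these probability mass functions directly. The parameter values below are purely illustrative and are not taken from either data set.

Code
# Binomial pmf, Equation 3.1: Pr(Y = 3) when m = 10 and p = 0.4 (illustrative values)
choose(10, 3) * 0.4^3 * (1 - 0.4)^(10 - 3)
dbinom(3, size = 10, prob = 0.4)     # same value from the built-in function

# Poisson pmf, Equation 3.2: Pr(Y = 6) when lambda = 5 (illustrative value)
5^6 * exp(-5) / factorial(6)
dpois(6, lambda = 5)                 # same value from the built-in function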

Code
par(mar=c(4,4,0,1), mgp=c(2,1,0))

# Read the beetle mortality data and compute observed mortality proportions
beetle = read.table("https://rgaykroyd.github.io/MATH3823/Datasets/beetle.txt", header=TRUE)

dose = beetle$dose
mortality = beetle$died/beetle$total

# Panel (a): observed mortality proportion against dose
plot(dose, mortality, pch=16,
     xlim=c(1.65, 1.90), xlab="Dose",
     ylim=c(-0.1, 1.1),  ylab="Mortality")
abline(h=c(0,1), lty=2)

# Panel (b): number of cyclones in each of the 13 seasons
season = 1:13
cyclones = c(6,5,4,6,6,3,12,7,4,2,6,7,4)

plot(season, cyclones, pch=16,
     ylim=c(0,12),
     xlab="Season", ylab="No. of cyclones")

Figure 3.1: Examples of non-normally distributed data. (a) Binomial model; (b) Poisson model.

Generalized linear models provide a modelling framework for data analysis in the non-normal setting. We will revisit the beetle mortality and cyclone data sets after describing the structure of a generalized linear model.

Focus on distributions quiz

Test your knowledge recall and application to reinforce basic ideas and prepare for similar concepts later in the Chapter.

  1. Which of the following best describes the sample space of a binomial random variable?

  2. Which of the following best describes the sample space of a Poisson random variable?

  3. Which of the following best describes the sample space of a normal random variable?

  4. For the binomial distribution, which of the following best describes the range of values possible for log(p/(1-p))?

  5. For the Poisson distribution, which of the following best describes the range of values possible for log(λ)?

  6. For the normal distribution, which of the following best describes the range of values possible for µ?

3.2 The GLM structure

A generalized linear model relates a continuous or discrete response variable \(Y\) to a set of explanatory variables \(\mathbf{x}=(x_1, \ldots, x_p)\). The model contains three parts:

Random part: The probability (mass or density) function of \(Y\) is assumed to belong to the two-parameter exponential family of distributions with parameters \(\theta\) and \(\phi:\)

\[ f(y; \theta, \phi) = \exp \left\{ \frac{y \theta - b(\theta)} {\phi} + c(y, \phi) \right\}, \tag{3.3}\] where \(\phi>0\). Here, \(\theta\) is called the canonical or natural parameter of the distribution and \(\phi\) is called the scale parameter. We show below that the mean \(\mbox{E}[Y]\) depends only on \(\theta\), and \(\mbox{Var}[Y]\) depends on \(\phi\) and possibly also \(\theta\). Various choices for functions \(b(\cdot)\) and \(c(\cdot)\) produce a wide variety of familiar distributions (see below). Sometimes we may set \(\phi=1\); then Equation 3.3 is called the one-parameter exponential family.

Many common statistical distributions, including the binomial, Poisson, geometric, gamma and normal, are members of the regular exponential family; the Student's t distribution and the uniform distribution with unknown bounds are not.

Further, note that in some texts on generalized linear models (such as Dobson and Barnett, 3rd edn.), \(\phi\) does not appear at all in the exponential family formula Equation 3.3; instead it is absorbed into \(\theta\) and \(b(\theta)\).

In this module, we will generally assume that each observation \(Y_i\), \(i=1,\dots,n,\) is independently drawn from an exponential family where \(\theta\) depends on the covariates. Thus we write \[ f(y_i; \theta_i, \phi) = \exp \left\{ \frac{y_i \theta_i - b(\theta_i)} {\phi} + c(y_i, \phi) \right\}. \] Note the subscripts on both \(y\) and \(\theta\): since \(\theta_i\) may vary with \(i\), the observations need not be identically distributed.

Systematic part: This is a linear predictor: \[ \eta = \sum_{j=1}^p \beta_j x_j. \tag{3.4}\] Note that the symbol \(\eta\) is pronounced eta.

Link function: This is a one-to-one function providing the link between the linear predictor \(\eta\) and the mean \(\mu = \mbox{E}[Y]\):

\[ \eta = g(\mu), \quad \mbox{and} \quad \mu = g^{-1}(\eta) = h(\eta). \tag{3.5}\]

Here, \(g(\mu)\) is called the link function, and \(h(\eta)\) is called the inverse link function.
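
For example, R's family objects store the link function \(g\) and its inverse \(h\) as the components linkfun and linkinv. The following sketch shows them for the logit link; the mean value 0.2 is chosen purely for illustration.

Code
fam = binomial(link = "logit")
mu  = 0.2                 # an illustrative mean (a probability)
eta = fam$linkfun(mu)     # g(mu) = log(mu/(1-mu)), on the linear predictor scale
fam$linkinv(eta)          # h(eta) recovers the original mean, 0.2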

We will now discuss each of these parts in more detail.

Focus on GLM structure quiz

Test your knowledge recall and application to reinforce basic ideas and prepare for similar concepts later in the Chapter.

  1. What symbols are usually used for the parameters in the definition of the two-parameter exponential family?

  2. Which of the following is NOT a member of the two-parameter exponential family?

  3. Which of the following is NOT part of the usual definition of a generalised linear model?

  4. In the definition of a GLM, which of the following best describes an assumption about the distribution of the response variable?

  5. Which part of the GLM definition involves the mean of the response variable and the explanatory variables?

3.3 The random part of a GLM

We begin with some examples of exponential family members.

Example: Poisson distribution

If \(Y\) has a Poisson distribution with parameter \(\lambda\), that is \(Y \sim \text{Po}(\lambda)\), then \(Y\) takes values in \(\{0,1,2,\dots\}\) and has probability mass function: \[ f(y) = \frac{e^{-\lambda} \lambda^y} {y!} = \exp \left\{y \log \lambda - \lambda - \log y! \right\}, \tag{3.6}\] which has the form of Equation 3.3 with components as in Table 3.2.

Table 3.2: Exponential model components for the Poisson

| \(\theta\)      | \(\phi\) | \(b(\theta)\)        | \(c(y,\phi)\) |
|-----------------|----------|----------------------|---------------|
| \(\log\lambda\) | \(1\)    | \(\lambda=e^\theta\) | \(-\log y!\)  |

For example, to model the cyclone data in Table 3.1, we might simply assume that the number of cyclones in each season has a Poisson distribution with a constant rate \(\lambda\) across all seasons, that is \(Y_i \sim \text{Po}(\lambda)\). The parameter is then estimated by the sample mean, \(\hat\lambda=\bar y=\) 5.5384615.

Code
par(mar=c(4,4,0,1), mgp=c(2,1,0))

season = 1:13
cyclones = c(6,5,4,6,6,3,12,7,4,2,6,7,4)
n = length(cyclones)

# Panel (a): cyclone counts by season with the sample mean superimposed
plot(season, cyclones, pch=16,
     ylim=c(0,12),
     xlab="Season", ylab="No. of cyclones")
abline(h=mean(cyclones), lty=2)

# Panel (b): histogram of the counts with fitted Poisson frequencies
hist(cyclones, breaks=1:13-0.5, main="")
fitted = n*dpois(0:12, mean(cyclones))
lines(0:12, fitted, type='h', lwd=3)

Figure 3.2: Poisson model fitted to cyclone data. (a) No. of cyclones against season with the mean; (b) fitted Poisson model assuming a constant rate.
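
As a sketch of how the same estimate arises from the GLM machinery described in this chapter, an intercept-only Poisson fit with the glm function (using the log link) recovers \(\hat\lambda=\bar y\) after exponentiating the intercept; the object name const.fit is just illustrative.

Code
cyclones = c(6,5,4,6,6,3,12,7,4,2,6,7,4)

# Intercept-only Poisson GLM: exp(intercept) equals the sample mean
const.fit = glm(cyclones ~ 1, family = poisson(link = "log"))
exp(coef(const.fit))
mean(cyclones)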

Example: Binomial distribution

Let \(Y\) have a Binomial distribution, that is \(Y \sim \text{Bin}(m, p)\), with \(m\) fixed. Then \(Y\) is discrete, taking values in \(\{0,1,\dots,m\}\), and has probability mass function: \[ f(y) = {m \choose y} p^y (1 - p)^{m - y} = {m \choose y} \left(\frac{p}{1-p} \right)^y (1 - p)^m \] which can be re-written as \[ f(y) = \exp \left\{ y\ \mbox{logit } p + m \log (1-p) + \log {m \choose y}\right\}, \tag{3.7}\] which has the form of Equation 3.3 with, \[ \theta=\text{logit} \ p = \log \left( \frac{p}{1-p}\right), \] and with components as in Table 3.3.

Table 3.3: Exponential model components for the Binomial

| \(\theta\)         | \(\phi\) | \(b(\theta)\)         | \(c(y,\phi)\)        |
|--------------------|----------|-----------------------|----------------------|
| \(\mbox{logit }p\) | \(1\)    | \(m\log(1+e^\theta)\) | \(\log{m\choose y}\) |

Note that it can be shown that \(-m\log(1-p)=m\log(1+e^\theta)\) – see Exercises.

Code
# Read the beetle mortality data and compute observed mortality proportions
beetle = read.table("https://rgaykroyd.github.io/MATH3823/Datasets/beetle.txt", header=TRUE)

dose = beetle$dose
mortality = beetle$died/beetle$total

plot(dose, mortality, pch=16,
     xlim=c(1.65, 1.90), xlab="Dose",
     ylim=c(-0.1, 1.1),  ylab="Mortality")
abline(h=c(0,1), lty=2)

# Response matrix of (successes, failures) and logistic regression fit
y = cbind(beetle$died, beetle$total-beetle$died)
glm.fit = glm(y ~ dose, family=binomial(link='logit'))

# Superimpose the fitted mortality curve over a fine grid of doses
output.dose = seq(1.6, 1.95, 0.001)
fitted = predict(glm.fit, data.frame(dose=output.dose), type="response")
lines(output.dose, fitted)

Figure 3.3: Binomial model fitted to beetle data.
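
Assuming the glm.fit object from the chunk above is available, the fitted coefficients and the fitted mortality at an intermediate dose can be inspected as follows; the dose value 1.8 is chosen purely for illustration.

Code
coef(glm.fit)     # intercept and slope on the logit scale
predict(glm.fit, data.frame(dose = 1.8), type = "response")     # fitted mortality at dose 1.8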

Example: Normal distribution

Let \(Y\) have a Normal distribution with mean \(\mu\) and variance \(\sigma^2\), that is \(Y\sim N(\mu, \sigma^2)\). Then \(Y\) takes values on the whole real line and has probability density function

\[\begin{align*} f(y; \mu, \sigma^2) &= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left\{ \frac{-1}{2\sigma^2} (y - \mu)^2 \right\}, \notag \\ &= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left\{-\frac{y^2}{2 \sigma^2} + \frac{y\mu}{\sigma^2} - \frac{\mu^2}{2 \sigma^2}\right\}\\ &= \exp \left\{ \frac{y \mu - \mu^2/2}{\sigma^2} + \left[\frac{-y^2}{2\sigma^2} - \frac{1}{2} \log (2 \pi \sigma^2) \right]\right\}, \end{align*}\] which has the form of Equation 3.3 with components as in Table 3.4.

Table 3.4: Exponential model components for the Gaussian

| \(\theta\) | \(\phi\)     | \(b(\theta)\)  | \(c(y,\phi)\)                                          |
|------------|--------------|----------------|--------------------------------------------------------|
| \(\mu\)    | \(\sigma^2\) | \(\theta^2/2\) | \(-\frac{y^2}{2\phi} - \frac{1}{2} \log (2 \pi \phi)\) |

From the usual regression point of view, we write \(y = \alpha + \beta x + \epsilon\), with \(\epsilon \sim N(0, \sigma^2)\). From the point of view of a generalized linear model, we write \(Y \sim N(\mu, \sigma^2)\) where \(\mu(x) = \alpha + \beta x\).
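
The equivalence of the two viewpoints can be checked in R: with the identity link, glm with the gaussian family reproduces the least-squares fit from lm. The simulated data below are purely illustrative.

Code
set.seed(1)
x = seq(0, 10, length = 50)
y = 2 + 0.5 * x + rnorm(50, sd = 1)    # illustrative data: y = alpha + beta*x + noise

coef(lm(y ~ x))                                         # ordinary linear regression
coef(glm(y ~ x, family = gaussian(link = "identity")))  # identical estimates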

3.4 Moments of exponential-family distributions

It is straightforward to find the mean and variance of \(Y\) in terms of \(b(\theta)\) and \(\phi\). Since we want to explore the dependence of \(\mbox{E}[Y]\) on explanatory variables, this property makes the exponential family very convenient.

Proposition 3.1 For random variables in the exponential family: \[ \mbox{E}[Y] = b'(\theta), \quad \mbox{and } \quad \mbox{Var}[Y] = b''(\theta)\phi. \tag{3.8}\]

Proof We give the proof for a continuous random variable. For the discrete case, replace all integrals by sums – see Exercises.

Starting with the simple property that all probability density functions integrate to 1, we have \[ 1 = \int \exp \left\{ \frac{y \theta - b(\theta)}{\phi} + c(y,\phi)\right\} dy \] and then differentiating both sides with respect to \(\theta\) gives \[ 0 = \int \left[\frac{ y - b'(\theta)}{\phi} \right]\exp \left\{ \frac{y \theta - b(\theta)}{\phi} + c(y,\phi)\right\}\ dy. \tag{3.9}\] Next, using the definition of the exponential family to simplify the equation gives \[ 0 = \int \left[\frac{ y - b'(\theta)}{\phi} \right] f(y; \theta)\ dy \] and expanding the brackets leads to \[ 0 = \frac{1}{\phi} \left(\int y f(y; \theta) dy - b'(\theta) \int f(y;\theta)\ dy \right). \] The first integral is simply the expectation of \(Y\) and the second is the integral of the probability density function of \(Y\), and hence \[ 0 = \frac{1}{\phi} \left(\mbox{E}[Y] - b'(\theta)\right) \] which implies that \[ \mbox{E}[Y] = b'(\theta), \tag{3.10}\] which proves the first part of the proposition.

Differentiating Equation 3.9 again with respect to \(\theta\), using the product rule, and simplifying via the definition of the exponential family yields \[ 0 = \int \left\{ -\frac{b''(\theta)}{\phi} + \left[\frac{ y - b'(\theta)}{\phi} \right]^2 \right\} f(y; \theta)\ dy. \] Using Equation 3.10, this becomes \[ 0 = -\frac{b''(\theta)}{\phi} +\int \left[\frac{ y - \mbox{E}[Y]}{\phi} \right]^2 f(y; \theta)\ dy = -\frac{b''(\theta)}{\phi} + \frac{\mbox{Var}[Y]}{\phi^2}, \] which implies that \[ \mbox{Var}[Y] = \phi \ b''(\theta), \] proving the second part of the proposition.

Together, these two results allow us to write down the expectation and variance for any random variable once we have shown that it is a member of the exponential family.

Example: Poisson and normal distribution moments

Table 3.5: Summary of moment calculations via exponential family properties

|                             | \(\theta\)       | \(b(\theta)\)  | \(\phi\)     | \(\mbox{E}[Y]=b'(\theta)\) | \(b''(\theta)\) | \(\mbox{Var}[Y]=b''(\theta)\phi\) |
|-----------------------------|------------------|----------------|--------------|----------------------------|-----------------|-----------------------------------|
| Poisson, \(Po(\lambda)\)    | \(\log \lambda\) | \(e^\theta\)   | \(1\)        | \(e^\theta=\lambda\)       | \(e^\theta\)    | \(e^\theta\times 1=\lambda\)      |
| Normal, \(N(\mu,\sigma^2)\) | \(\mu\)          | \(\theta^2/2\) | \(\sigma^2\) | \(\theta=\mu\)             | \(1\)           | \(1\times \sigma^2=\sigma^2\)     |
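
The derivatives in Table 3.5 can also be checked mechanically. As a small sketch, R's D function performs symbolic differentiation of \(b(\theta)\) with respect to theta (the printed expressions may not be fully simplified).

Code
# Poisson: b(theta) = exp(theta)
D(expression(exp(theta)), "theta")               # b'(theta) = exp(theta), i.e. lambda
D(D(expression(exp(theta)), "theta"), "theta")   # b''(theta) = exp(theta)

# Normal: b(theta) = theta^2/2
D(expression(theta^2/2), "theta")                # b'(theta), simplifies to theta, i.e. mu
D(D(expression(theta^2/2), "theta"), "theta")    # b''(theta), simplifies to 1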

3.5 The systematic part of the model

The second part of the generalized linear model, the linear predictor, is given in Equation 3.4 as \(\eta = \sum_{j=1}^p \beta_j x_j\), where \(x_j\) is the \(j\)th explanatory variable (with \(x_1=1\) for the intercept). Now, for each observation \(y_i,\ i=1,\dots,n\), the explanatory variables may differ. To make this dependence on \(i\) explicit, we write: \[ \eta_i = \sum_{j=1}^p \beta_j x_{ij}, \tag{3.11}\] where \(x_{ij}\) is the value of the \(j\)th explanatory variable for individual \(i\) (with \(x_{i1}=1\)). Rewriting this in matrix notation: \[ \boldsymbol{\eta} = X \boldsymbol{\beta}, \tag{3.12}\] where \(\boldsymbol{\eta} = (\eta_1,\dots,\eta_n)\) is the vector of linear predictor values, \(\boldsymbol{\beta} = (\beta_1,\dots,\beta_p)\) is the vector of regression parameters, and \(X\) is the \(n\times p\) design matrix.

Recall from Section 1.4 and Section 2.3 that we are concerned with two kinds of explanatory variable:

  • Quantitative — for example, \(x \in (-\infty, \infty)\) etc.
  • Qualitative — for example, \(x \in \{A, B, C\}\) etc.

As discussed in Section 2.4, each quantitative variable is represented in \(X\) by an \(n \times 1\) column vector. Each qualitative variable with \(k+1\) levels, say, is represented by a dummy \(n \times k\) matrix of 0s and 1s (one column, usually the first, being dropped to avoid identifiability problems), as illustrated in the sketch below.
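
A brief sketch of how R constructs such a design matrix with model.matrix, using a small made-up data frame with one quantitative variable and one qualitative variable (all names and values below are purely illustrative):

Code
# x is quantitative; group is qualitative with three levels A, B, C
example.df = data.frame(x = c(1.2, 3.4, 2.1, 5.6),
                        group = factor(c("A", "B", "C", "B")))

# Design matrix X: intercept column, x as one column, and the factor coded
# by two dummy columns of 0s and 1s (level A acts as the baseline)
model.matrix(~ x + group, data = example.df)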

3.7 Exercises

3.1 Consider the beetle data again, see Table 1.1, but suppose that we had only been given the \(y\) values, that is the number killed, and misinformed that each came from a sample of size \(62\). Further, suppose that we did not know that different doses had been used. That is, we were given data: \(\mathbf{y}=\{6, 13, 18, 28, 52, 53, 61, 60\}\) and led to believe that the model \(Y\sim \mbox{Bin}(62,p)\) was appropriate. Use the given data to estimate \(p\). Then, calculate the fitted probabilities and superimpose them on a histogram of the data.

See the code chunk used to create Figure 3.2.

3.2 Verify that, in general, if \(q = 1/(1+e^{-x})\) then \(x = \log(q/(1-q))\) and then for the binomial distribution, \(Y\sim \mbox{Bin}(m,p)\), show that \[ -m\log(1-p)=m\log(1+e^\theta) \] where \(\theta=\mbox{logit }p = \log(p/(1-p))\).

Each derivation requires algebraic rearrangement and simplification only.

3.3 Suppose that \(Y\) has a gamma distribution with parameters \(\alpha\) and \(\lambda\), that is \(Y\sim \mbox{Gamma}(\alpha, \lambda)\), with probability density function \[ f(y;\alpha, \lambda) = \frac{\lambda^\alpha y^{\alpha-1} e^{-\lambda y}}{\Gamma(\alpha)} \quad \qquad y > 0; \alpha,\lambda>0. \] Write this in the form of the exponential family and clearly identify \(\theta\), \(\phi\), \(b(\theta)\) and \(c(y,\phi)\) – as was done for the Poisson and binomial in Table 3.2 and Table 3.3.

This is a difficult one and you might need to try a few rearrangements. Consider treating \(1/\alpha\) as a scale parameter and let \(\theta\) be a suitable function of \(\lambda\) and \(\alpha\).

3.4 Express the geometric distribution, \(Y\sim\mbox{Geom}(p), 0 < p <1\), with probability mass function \[ f(y) = (1 - p)^{y-1} p; \hspace{1cm} y = 1, 2, 3 \ldots \] as an exponential family distribution.

Follow the same process as usual and you might get an unexpected \(c(y,\phi)\)!

3.5 Prove Proposition 3.1 assuming that \(Y\) follows a discrete distribution. Use the proposition, starting from the given \(b(\theta)\), to verify the results for expectation and variance of the Poisson in Table 3.5 and then derive similar results for the binomial, \(Y\sim \mbox{Bin}(m,p)\).

Replace the integration in the continuous case by summation for the discrete – we know that the sum of the probabilities for a discrete random variable equals 1. Remember, however, that \(\theta\) is still a real number and so differentiating with respect to \(\theta\) still makes sense. For the binomial, go through the steps of finding \(b(\theta)\) and then its first and second derivatives.

3.6 Use the properties of exponential families to find the expectation and variance of each of the geometric and gamma distributions considered above.

Use the results from question 3.3 and 3.4, and apply Proposition 3.1.

3.7 Using Equation 3.20, verify the canonical link function, \(g(\mu)\), for the binomial and gamma distributions shown in Table 3.6.

Use the appropriate \(b'(\theta)\) and, using \(\mu=b'(\theta)\) and Equation 3.20, rearrange to find \(\theta\) and hence \(g(\mu)\).


1. Dobson and Barnett, 3rd edn, Table 1.2