We have that the estimate of the natural parameter is given by \(\hat \theta = (b')^{-1}(\bar y)\); hence:
For the binomial, \(b(\theta)=m\, \log(1+e^\theta)\) and so \(b'(\theta)= m\, e^\theta/(1+e^\theta)\). The estimate \(\hat \theta\) is given by the solution of \[ b'(\hat\theta)= m \frac{e^{\hat\theta}}{1+e^{\hat\theta}} = \bar y \] and so \(\theta\) can be estimated as \[\hat \theta = \log \left\{ \frac{\bar y /m}{1-\bar y/m} \right\} = \log \left\{ \frac{\bar y }{m-\bar y} \right\}. \] We know that the estimator \(\hat \theta = \log \left\{ {\bar Y }/{(m-\bar Y)} \right\}\) is, at least, asymptotically unbiased. Its variance is given by \[ \mbox{Var}(\hat\theta) = \frac{\phi}{n\, b''(\theta_0)} \] where \(\theta_0\) is the true value. Here \[ b''(\theta)= m \frac{e^{\theta}}{(1+e^{\theta})^2} \] and hence, since \(\phi=1\) for the binomial, \[ \mbox{Var}(\hat\theta) = \frac{1}{nm}\frac{(1+e^{\theta_0})^2}{e^{\theta_0}}. \]
Notice that this depends on the unknown true value \(\theta_0\).
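As a quick numerical check of these formulas (a sketch only, using arbitrarily chosen values \(p_0=0.3\), \(m=10\) and \(n=50\)), we can simulate repeated binomial samples and compare the empirical variance of \(\hat\theta\) with the asymptotic formula:

```{r}
# Simulation check: empirical variance of theta-hat versus the asymptotic formula
# (illustrative values only)
set.seed(1)
p0 <- 0.3; m <- 10; n <- 50
theta0 <- log(p0 / (1 - p0))
theta.hat <- replicate(5000, {
  y <- rbinom(n, size = m, prob = p0)
  ybar <- mean(y)
  log(ybar / (m - ybar))
})
var(theta.hat)                               # empirical variance
(1 + exp(theta0))^2 / (n * m * exp(theta0))  # asymptotic formula above
```

The two values should agree closely, and the mean of the simulated \(\hat\theta\) values is close to \(\theta_0\), consistent with asymptotic unbiasedness.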
For the geometric, \(b(\theta)= \theta-\log(1-e^\theta)\) and so \(b'(\theta)= 1+ e^\theta/(1-e^\theta)\). The estimate \(\hat \theta\) is given by the solution of \[ b'(\hat\theta)= 1+ e^{\hat\theta}/(1-e^{\hat\theta}) = \bar y \] and so \[ \hat \theta = \log \left\{ 1-{1}\big/{\bar y} \right\}. \]
We know that the estimator \(\hat \theta = \log \left\{1-1/\bar Y \right\}\) is, at least, asymptotically unbiased. To find the variance, first \[ b''(\theta)= \frac{e^{\theta}}{(1-e^{\theta})^2} \] and hence, since \(\phi=1\), \[ \mbox{Var}(\hat\theta) = \frac{1}{n}\frac{(1-e^{\theta_0})^2}{e^{\theta_0}}. \]
For the exponential, \(\theta=-\lambda\) and \(b(\theta)=-\log (-\theta)\) (these are not in our notes, so we first need to write the exponential distribution in exponential family form). Hence, \(b'(\theta)= -{1}/{\theta}\). The estimate \(\hat \theta\) is given by the solution of \[ b'(\hat\theta)= -\frac{1}{\hat \theta} = \bar y \] and so \[\hat \theta = -{1}\big/{\bar y}. \] For the variance, \[ b''(\theta)= \frac{1}{\theta^2} \] and hence \[ \mbox{Var}(\hat\theta) = \frac{\theta_0^2}{n}. \]
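The same kind of check (a sketch only, with an arbitrary true rate \(\lambda_0=2\) and \(n=40\)) confirms that the variance of \(\hat\theta=-1/\bar Y\) is close to \(\theta_0^2/n\):

```{r}
# Simulation check for the exponential case (illustrative values only)
set.seed(1)
lambda0 <- 2; n <- 40
theta0 <- -lambda0
theta.hat <- replicate(5000, -1 / mean(rexp(n, rate = lambda0)))
var(theta.hat)   # empirical variance
theta0^2 / n     # asymptotic formula above
```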
For the exponential family form of the normal distribution we have \(\theta=\mu\), \(\phi=\sigma^2\), \(b(\theta)=\theta^2/2\), hence \(\hat\theta=\hat\mu\). Then, @eq-ehattheta gives \[ \mbox{E}[\hat\mu]=\mbox{E}[\hat\theta]\approx \theta_0 = \mu. \] Further, since \(b''(\theta)=1\), then \[ \mbox{Var}[\hat\theta]= \frac{\sigma^2}{n}. \] Hence, the approach based on the exponential family leads to the familiar results, as we would expect.
Following the principle of maximum likelihood estimation we maximize the log-likelihood function, that is \[ \hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} l (\boldsymbol{\beta}; \mathbf{y}, \phi), \] where \(l (\boldsymbol{\beta}; \mathbf{y}, \phi)\) is the log-likelihood of the parameters \(\boldsymbol{\beta}\) given data \(\mathbf{y}\), assuming that the nuisance scale parameter \(\phi\) is fixed.
For the normal linear regression model the log-likelihood is given by \[ l = -\frac{n}{2}\log (2\pi \sigma^2) -\frac{1}{2\sigma^2}(\mathbf{y}-X\boldsymbol{\beta})^T(\mathbf{y}-X\boldsymbol{\beta}). \] Differentiating with respect to the vector \(\boldsymbol{\beta}\) gives \[ \frac{d\, l}{d\, \boldsymbol{\beta}} = \frac{1}{\sigma^2} X^T(\mathbf{y}-X\boldsymbol{\beta}), \] which represents a system of \(p\) equations in \(p\) unknowns. Setting this to zero gives the required result of @eq-linearregressionmle, \[ \hat{\boldsymbol{\beta}} = \left(X^T X\right)^{-1} X^T \mathbf{y}. \]
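As a numerical illustration (a sketch with simulated data; the variable names are arbitrary), the explicit formula reproduces the coefficients returned by lm():

```{r}
# Check that (X^T X)^{-1} X^T y matches the coefficients from lm()
set.seed(1)
n <- 30
x <- runif(n)
X <- cbind(1, x)                    # design matrix including the intercept
y <- 2 + 3 * x + rnorm(n)
solve(t(X) %*% X, t(X) %*% y)       # explicit least-squares solution
coef(lm(y ~ x))                     # the same, via lm()
```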
Under the normal model, with \(\mu=\theta\), \(b(\theta) = \theta^2/2\) and \(\phi=\sigma^2\), the log-likelihood takes the form \[ l = \sum_{i=1}^n \{[\mu_i y_i - \mu_i^2/2]/\sigma^2 + c(y_i,\phi)\}. \] Let \(\hat{l}\) denote the maximized log-likelihood under the generalized linear model of interest, with \(\mu_i\) replaced by \(\hat{\mu}_i\).
Under the saturated model, \(\mu_i\) is estimated by maximizing the \(i\)th term of the likelihood, yielding \(\tilde{\mu}_i = y_i\). Substituting these values and noting that \(y_i y_i - y_i^2/2 = y_i^2/2\) yields the maximized saturated log-likelihood \[ \tilde{l} = \sum_{i=1}^n \{[y_i^2/2]/\sigma^2 + c(y_i,\phi)\}. \] Hence the deviance reduces to the residual sum of squares \[ D = 2 \phi (\tilde{l} - \hat{l}) = \sum_{i=1}^n (y_i - \hat{\mu}_i)^2. \]
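This can be verified directly in R, where for a Gaussian GLM the reported deviance is exactly the residual sum of squares (a sketch with simulated data):

```{r}
# For a Gaussian GLM the deviance equals the residual sum of squares
set.seed(1)
x <- runif(30)
y <- 1 + 2 * x + rnorm(30)
fit <- glm(y ~ x, family = gaussian)
deviance(fit)
sum((y - fitted(fit))^2)
```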
Binomial data consist of the number of successes \(y_i\) out of \(m_i\) trials, \(i=1, \ldots, n\). The log-likelihood takes a form that is slightly modified from the general case, \[\begin{align*} l &= \sum_{i=1}^n \{\theta_i y_i - b_i(\theta_i) + c(y_i,m_i)\}\\ &= \sum_{i=1}^n \{y_i \log(p_i) + (m_i-y_i) \log(1-p_i) + c(y_i,m_i)\}, \end{align*}\] where \(\theta_i = \operatorname{logit}(p_i)\) and \[ p_i = \mu_i/m_i = E(Y_i)/m_i = b'_i(\theta_i)/m_i \] is the probability of success in the \(i\)th set of trials, and \(b_i(\theta_i) = m_i \log(1+e^{\theta_i})\). Also, \[ \exp\{c(y_i,m_i)\} = \binom{m_i}{y_i}. \]
Under the saturated model \(p_i\) is estimated by \(\tilde{p}_i = y_i/m_i\). Hence, with \(\hat{\mu}_i = m_i \hat{p}_i\), the deviance takes the form \[ D = 2 (\tilde{l} - \hat{l}) = 2 \sum_{i=1}^n \left\{y_i \log \left(\frac{y_i}{\hat{\mu}_i} \right) + (m_i - y_i) \log \left( \frac{m_i-y_i}{m_i-\hat{\mu}_i} \right) \right\}. \]
In general \(\hat{p}_i = \operatorname{logit}^{-1}(\hat{\beta}^T x_i)\) will always lie strictly between \(0\) and \(1\). However, if \(y_i = 0\) or \(m_i\), then the saturated estimate \(\tilde{p}_i\) will equal 0 or 1. Suppose that \(y_i=0\) for some \(i\). Then the \(i\)th term of the saturated log-likelihood becomes \(m_i \log(1) + c(0,m_i) = c(0,m_i)\). You should convince yourself that the formula for the deviance remains valid if we interpret \(0 \log 0 = 0\). Similarly if \(y_i = m_i\).
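As a check, the deviance formula above, applied with the convention \(0 \log 0 = 0\), reproduces the deviance reported by glm(); the data below are simulated purely for illustration:

```{r}
# Binomial deviance from the formula, using the 0*log(0) = 0 convention
set.seed(1)
n <- 8; m <- rep(10, n)
x <- seq(0, 1, length.out = n)
y <- rbinom(n, size = m, prob = plogis(-2 + 4 * x))
fit <- glm(cbind(y, m - y) ~ x, family = binomial)
mu.hat <- m * fitted(fit)                    # fitted means m_i * p-hat_i
xlogx <- function(a, b) ifelse(a == 0, 0, a * log(a / b))
2 * sum(xlogx(y, mu.hat) + xlogx(m - y, m - mu.hat))
deviance(fit)
```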
In the Poisson case the log-likelihood becomes \[\begin{align*} l &= \sum_{i=1}^n \{\theta_i y_i - b(\theta_i) + c(y_i)\}\\ &= \sum_{i=1}^n \{y_i \log \lambda_i - \lambda_i + c(y_i)\}, \end{align*}\] where \(\mu_i = \lambda_i = \exp(\theta_i)\) and \(b(\theta_i) = \exp(\theta_i)\). Under the saturated model \(\tilde{\lambda}_i = y_i\). For most models of interest, it can be shown that \(\sum \hat{\lambda}_i = \sum y_i\). Hence the deviance reduces to \[ D = 2 (\tilde{l} - \hat{l}) = 2 \sum_{i=1}^n \left\{ y_i \log \left( y_i/\hat{\mu}_i \right) \right\}. \]
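Again this can be checked numerically (simulated data, for illustration only); the reduction works because \(\sum \hat{\lambda}_i = \sum y_i\) for a log-linear Poisson model containing an intercept:

```{r}
# Poisson deviance: reduced formula versus deviance() from glm
set.seed(1)
x <- runif(40)
y <- rpois(40, lambda = exp(1 + x))
fit <- glm(y ~ x, family = poisson)
mu.hat <- fitted(fit)
2 * sum(ifelse(y == 0, 0, y * log(y / mu.hat)))  # reduced formula
deviance(fit)                                    # agrees, since sum(y - mu.hat) is zero
sum(y - mu.hat)                                  # (numerically) zero
```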
In the notes, we analysed the birthweight data using \(F\)-tests based on the residual sum of squares \(R\) and residual degrees of freedom \(r'\) (note the change of notation: \(r\) was previously used for degrees of freedom, but later we have used \(r\) for the number of parameters). In other places, we described tests using the deviance \(D\) and number of parameters \(r\).
Since \(D = R\) for a normal GLM and \(r' = n - r\), where \(n\) is the number of observations, the \(F\)-statistic comparing Model 0 and Model 1 can be written \[\begin{equation*} F_{01} = \frac{(R_0 - R_1) / (r_0' - r_1')} {R_1 / r_1'} = \frac{(D_0 - D_1) / (r_1 - r_0)} {D_1 / (n-r_1)}, \end{equation*}\] which we compared to an \(F\)-distribution on \(r_1 - r_0\) and \(n - r_1\) degrees of freedom. This is almost equivalent to the test statistic: \[\begin{equation*} \frac{(D_1 - D_2) / (r_2 - r_1)}{D_3/(n-r_3)}, \end{equation*}\] which we compared to an \(F_{r_2 - r_1, n - r_3}\) distribution.
The only differences are (i) the arbitrary indexing of models (0, 1) or (1, 2) and (ii) that in the second form we use a "large" Model 3 to estimate \(\sigma^2\) by \(\hat{\phi} = D_3/(n-r_3)\). In practice the differences will usually be small and the two approaches to variable selection are essentially equivalent.
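The following sketch (using simulated data, since the birthweight data are not reproduced here) shows that the \(F\)-statistic built from Gaussian deviances agrees with the value reported by anova(), which estimates the dispersion from the larger of the two models:

```{r}
# F-statistic from Gaussian deviances versus anova(..., test = "F")
set.seed(1)
n <- 40
x1 <- runif(n); x2 <- runif(n)
y <- 1 + 2 * x1 + rnorm(n)
fit0 <- glm(y ~ x1, family = gaussian)        # Model 0, r0 = 2 parameters
fit1 <- glm(y ~ x1 + x2, family = gaussian)   # Model 1, r1 = 3 parameters
D0 <- deviance(fit0); D1 <- deviance(fit1)
r0 <- 2; r1 <- 3
((D0 - D1) / (r1 - r0)) / (D1 / (n - r1))     # F-statistic from the formula
anova(fit0, fit1, test = "F")
```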
Start this question by creating an R script file and entering appropriately chosen R commands, such as those below.
```{r}
# Define variables, your choice of names, for the data
dose = seq(20,100,by=20)
m = c(23,18,14,23,22) # This is the Group size
y = c(20,14,5,6,1)    # This is the number re-treated
# Now plot the data as a scatter plot
par(mfrow=c(1,2))
plot(dose, y/m, pch=16, ylim=c(0,1), xlim=c(0,120),
     ylab="Proportion re-treated", xlab="Dose, mg/kg")
abline(h=c(0,1), lty=2)
# Create a data frame to hold the data for the lm
data = data.frame(dose=dose, prop=y/m)
my.lm = lm(prop~dose, data=data)
abline(my.lm)
# Now get ready for the glm
plot(dose, y/m, pch=16, ylim=c(0,1), xlim=c(0,120),
     ylab="Proportion re-treated", xlab="Dose, mg/kg")
abline(h=c(0,1), lty=2)
# Now re-define the response as "successes" and "failures"
response = cbind(y, m-y)
my.glm = glm(response~dose, family = "binomial")
# For the fitted curve, evaluate the fitted function at many doses
newval = data.frame(dose=seq(0,125,length.out=100))
predicted = predict(my.glm, newdata = newval, type="response")
lines(newval$dose, predicted)
```
Both models fit the data well, in that the data points are never far from the fitted functions. However, since proportions can never lie outside the interval zero to one, the binomial logistic model is preferable.
To emphasize the similarity of the fitted models within the range of the data, consider the model residuals as shown in the following graphs.
```{r}
par(mfrow=c(1,2))
plot(my.lm$fitted, my.lm$residuals, pch=16, ylim=0.2*c(-1,1),
     xlab="Fitted values", ylab="Residuals")
abline(h=0, lty=2)
glm.resids = residuals(my.glm, type="response")
plot(my.glm$fitted, glm.resids, pch=16, ylim=0.2*c(-1,1),
     xlab="Fitted values", ylab="Residuals")
abline(h=0, lty=2)
```
Although these are not identical, neither model is clearly better. Further, the residual sum of squares for the linear model is 0.022 whereas for the logistic model it is 0.0161. From the data alone it is impossible to choose between them. Knowing the background, however, the better model is clearly the logistic.
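For reference, the residual sums of squares quoted above can be obtained from the two fitted objects, working on the proportion scale in both cases:

```{r}
# Residual sums of squares on the proportion scale
sum(my.lm$residuals^2)
sum(residuals(my.glm, type = "response")^2)
```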