Please note that the below calculations use the \(\texttt{lm}\) command in R, with a combination of \(\texttt{summary}\), \(\texttt{anova}\) and \(\texttt{coefficients}\) to see the required numerical output. If it, however, the case that difference commands could have been used and, in particular, the more general \(\texttt{glm}\) command could have been used instead. The actual steps required would be a little different but the results would be the same.
There is no fully right or wrong answers to these questions, but we rarely start an explanatory analysis without any information. In fact, if you ever start knowing nothing then make finding the background and context your first task. We will often come across unexpected features and our prior thoughts will sometimes be demonstrated incorrect, but be a detective and question every step.
We did not attempt to fit the model \(\texttt{birthweight} \sim \texttt{sex}\), that is \(\texttt{birthweight} = \alpha + \beta \ \texttt{sex}\) where \(\texttt{sex}\) is coded \(0\) for girls and \(1\) for boys. This would be represented as two horizontal lines on the data plot, which does not look like a sensible explanation for the data and so we should not expect a good fit.
Let us fit the model and plot the resulting equations and the data:
birthweight = read.table("https://rgaykroyd.github.io/MATH3823/Datasets/birthwt-numeric.txt", header=T)
summary(birthweight)
## sex age weight
## Min. :0.0 Min. :35.00 Min. :2412
## 1st Qu.:0.0 1st Qu.:37.00 1st Qu.:2785
## Median :0.5 Median :38.50 Median :2952
## Mean :0.5 Mean :38.54 Mean :2968
## 3rd Qu.:1.0 3rd Qu.:40.00 3rd Qu.:3184
## Max. :1.0 Max. :42.00 Max. :3473
attach(birthweight)
M1.fit = lm(weight~sex)
plot(age, weight, col=2-sex, pch=15+sex,
xlab = "Gestational age (weeks)",
ylab = "Birth weight (grams)")
legend(41,2800, c("Girl","Boy"), col = c(2,1), pch=c(15,16))
abline(h=M1.fit$coefficients[1] + c(0,1)*M1.fit$coefficients[2], col=c(2,1), lwd=1.5)
The model fitting summary output is:
summary(M1.fit)
##
## Call:
## lm(formula = weight ~ sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.33 -178.33 -42.67 228.00 449.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2911.3 81.5 35.720 <2e-16 ***
## sex 112.7 115.3 0.977 0.339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 282.3 on 22 degrees of freedom
## Multiple R-squared: 0.04162, Adjusted R-squared: -0.001941
## F-statistic: 0.9554 on 1 and 22 DF, p-value: 0.339
and we have fitted equations \(\texttt{birthweight} = 2911.3\) when \(\texttt{sex}=0\) and \(\texttt{birthweight} = 2911.3+112.7=3024\) when \(\texttt{sex}=1\).
Further, the overall birth weight mean and the mean birth weights of the girls and boys can be found as:
mean(weight)
## [1] 2967.667
weighted.mean(weight,sex==0)
## [1] 2911.333
weighted.mean(weight,sex==1)
## [1] 3024
We see that the mean of the girls is given by the intercept parameter and the mean of the boys by the sum of the two parameters. The overall mean birthweight is given by the simple average of these two group means (since there are equal numbers of girls and boys).
We did not expect a good fit and this is clear from the graph. To test the null hypothesis that including \(\texttt{sex}\) is important, or not, we consider the F-statistic, \(0.9554\), which follows an \(F_{1,22}\) distribution. The p=value is \(0.339\), not significant, and hence we conclude that \(\texttt{sex}\) is not important in this case.
schoolstudy = read.csv("https://rgaykroyd.github.io/MATH3823/Datasets/schoolstudy.csv")
plot(schoolstudy$calorie, schoolstudy$BMI)
lin.model = lm(schoolstudy$BMI~schoolstudy$calorie)
abline(lin.model)
cor(schoolstudy$calorie, schoolstudy$BMI)
## [1] 0.5944361
A linear model appears suitable, though points well scattered and so unlikely to be very reliable.
iron = read.csv("https://rgaykroyd.github.io/MATH3823/Datasets/iron.csv")
plot(iron$Year, iron$iron)
abline(h=30, lty=2)
myfit = lm(iron~Year, data=iron)
abline(myfit)
cor(iron$Year, iron$iron)
## [1] -0.8795975
A clear linear relationship with a correlation of -0.88 showing a high correlation. A linear model is suitable and likely to be reasonably reliable. Here, recalling the early information, the downward trend in iron purity reached the critical 30% around 2000 – with an even lower figures in 1990. It appears as if the quarry will have become uneconomical.
menapause = read.csv("https://rgaykroyd.github.io/MATH3823/Datasets/bmd.csv")
plot(menapause$age, menapause$Tscore)
themodel = lm(Tscore~age, data = menapause)
abline(themodel)
cor(menapause$age, menapause$Tscore)
## [1] -0.7713094
There is a definitely downward trend in Tscore with age, but it does not look as it a linear model fits very well. The relationship is non-linear with little change at lower ages and then a strong decline. Further there is very little variability in Tscore for younger ages but high variability for older ages. A non-linear model is required to get a good fit.
skills = read.csv("https://rgaykroyd.github.io/MATH3823/Datasets/skills.csv")
plot(skills$Age, skills$Skills, ylim=c(0,20))
fitted.model = lm(Skills~Age, data=skills)
abline(fitted.model)
cor(skills$Age, skills$Skills, use="complete.obs")
## [1] 0.8840017
There appears to be a clear link between the social skills of children and the age of their eldest sibling – the older the more skilled with a high correlation. The miss-recorded age, 25 instead of 15, should be corrected and the fitting process re-done. This shuld then provide a reliable linear model.
Notice that V=IR is a linear equation with intercept fixed at zero, \(y=\beta\, x\) say. To start the investigation, first plot the data:
Volts = c(4,8,10,12,14,18,20,24)
mAmps = c(11,24,30,36,40,53,58.5,70)
plot(mAmps, Volts,pch=16, ylim=c(0,24), xlim=c(0,70),
ylab = "Voltage (Volts)",
xlab = "Current (mAmps)")
From the graph, a linear model through the origin should fit well, but first let’s fit a regular linear model:
M1.fit = lm(Volts ~ mAmps)
summary(M1.fit)
##
## Call:
## lm(formula = Volts ~ mAmps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27447 -0.18187 -0.03197 0.13910 0.35692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.043001 0.215944 -0.199 0.849
## mAmps 0.342152 0.004886 70.022 5.71e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2503 on 6 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9986
## F-statistic: 4903 on 1 and 6 DF, p-value: 5.708e-10
Note that this command fits a model which includes the intercept but we see that the result of the t-test on the intercept parameter has a p-value of \(0.849\) saying that it is not significantly different from zero.
Let’s finish by fitting the suggested model
M2.fit = lm(Volts ~ mAmps-1)
summary(M2.fit)
##
## Call:
## lm(formula = Volts ~ mAmps - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28552 -0.20224 -0.02549 0.14514 0.34942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## mAmps 0.34126 0.00186 183.5 3.77e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2325 on 7 degrees of freedom
## Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
## F-statistic: 3.367e+04 on 1 and 7 DF, p-value: 3.769e-14
Here, we see that the fitted model is \(\texttt{Volts}=0.34126\, \texttt{mAmps}\) with \(\texttt{F-statistic}=3.367e+04\), which follows an \(F_{1,7}\) distribution, with a very small p-value and hence the slope parameter is highly significant. Seeing the fitted line along with the data reinforces that this is an almost perfect fit.
This does support Ohm’s Law, with a resistance of \(0.34 \Omega\) – given by the slope parameter estimate.
For information, this data set was collected by my father, Peter John Aykroyd, in 1956, while he was training as an electrical engineer.
Let \(\texttt{pulse}\) denote the rest pulse rate, \(\texttt{sex}\) is recorded as \(0\) for Men and \(1\) for Women, and \(\texttt{time}\) as \(0\) for Before and \(1\) for After. Then, we suggest a model, \(\texttt{pulse} = \alpha + \beta\, \texttt{sex} + \gamma\, \texttt{time} + \delta\, (\texttt{sex:time})\) to take into account that Men and Women might have different rest pulse rates, that there might be a change after a meal, and that Men and Women might be effected by a meal differently.
Suppose that we enter data row-by-row starting from the top-left. Then, \(\texttt{pulse}=(105, 79, 79, 103, 87, 97, 109, 87,\dots, 93,81)\) and our design matrix, without selected columns removed to avoid aliasing (that is to remove identifiability issues), is given by \[ X= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ \end{bmatrix} \] where the first column represents the model intercept, the second indicates that the first 12 recorded values correspond to Men and the second 12 to Women; further the third column has blocks of 6 for the before and after for Men, then before and after for Women; finally the fourth column has the product of the second and third columns for the interaction term.
Defining the data vector and the design matrix in R,
pulse = c(105,79,79,103,87,97,109,87,86,109,100,101,74,73,82,78,86,77,81,80,90,90,93,81)
X=matrix(c(
1, 0, 0, 0,
1, 0, 0, 0,
1, 0, 0, 0,
1, 0, 0, 0,
1, 0, 0, 0,
1, 0, 0, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 0, 1, 0,
1, 1, 0, 0,
1, 1, 0, 0,
1, 1, 0, 0,
1, 1, 0, 0,
1, 1, 0, 0,
1, 1, 0, 0,
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1
), ncol=4, byrow=T)
and then the matrix regression equation
beta = solve(t(X) %*% X) %*% t(X) %*% pulse
beta
## [,1]
## [1,] 91.66667
## [2,] -13.33333
## [3,] 7.00000
## [4,] 0.50000
Using the \(\texttt{lm}\) command in R gives:
sex = X[,2] # this is easier than re-typing the 0 and 1's
time = X[,3]
my.fit.1 = lm(pulse ~ sex*time)
coefficients(my.fit.1)
## (Intercept) sex time sex:time
## 91.66667 -13.33333 7.00000 0.50000
Notice that the coefficients, as expected, are identical to those obtained using the matrix regression equation above.
To check if the change in pulse rate due to a meal is the same for men and women we test the interaction term. From the anova table, we see that the \(\texttt{sex:time}\) interaction is not significant, with a p-value of \(0.9440\). Hence, there is no significant interaction and there is no evidence that the change is different.
anova(my.fit.1)
Finally, we fit the reduced model without the interaction term:
my.fit.2 = lm(pulse ~ sex+time)
summary(my.fit.2)
##
## Call:
## lm(formula = pulse ~ sex + time)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.792 -4.896 0.375 5.917 13.458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.542 2.971 30.809 < 2e-16 ***
## sex -13.083 3.431 -3.813 0.00101 **
## time 7.250 3.431 2.113 0.04673 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.404 on 21 degrees of freedom
## Multiple R-squared: 0.4751, Adjusted R-squared: 0.4251
## F-statistic: 9.503 on 2 and 21 DF, p-value: 0.001151
This gives final parameter estimates, all are significant, and shows that the model fits the data well, with an chi-squared statistic of \(8.404\) which follows a \(\chi^2_{21}\) distribution and has a p-value of:
pchisq(8.404, 21, lower.tail = F)
## [1] 0.9931796
In conclusion, the best fitted model is: \(\texttt{pulse}=91.5-13.1\, (\texttt{sex}) +7.25\, (\texttt{time})\) and hence the average pulse rate is about \(91\) beats/minute, the rate is \(13\) beats/minute lower for women than men, and there is a \(7\) beats/minute increase after a meal and this change in pulse rate is the same for men and women.
A suitable model might say that seedling height depends on seed type, watering conditions and a possible interaction – that is some types of barley are more, or less, sensitive to water conditions. From the description, it seems that the incubator shelf level is not of interest.
barley = read.csv("https://rgaykroyd.github.io/MATH3823/Datasets/barley.csv")
attach(barley)
The response variable in the data set is \(\texttt{Height}\) with explanatory variables \(\texttt{Variety}\) with levels G, M and Ig, \(\texttt{Watering}\) with levels N and W, and \(\texttt{Position}\) with levels Top, Second, Third and Bottom.
Following the guidance in the question, the most complicated model that we might consider is \[ (\texttt{Height})_{ijk} = \texttt{mean} + (\texttt{Variety})_i + (\texttt{Watering})_j + (\texttt{Position})_k + (\texttt{Variety}\texttt{Watering})_{ij} \] where \(i=1,2,3\) for G, M and Ig, \(j=1,2\) for N and W, and \(k=1,2,3,4\) for Top, Second, Third and Bottom.
Variety = as.factor(Variety)
Watering = as.factor(Watering)
Position = as.factor(Position)
M1.fit = lm(Height ~ Variety + Watering + Position + Variety:Watering)
anova(M1.fit)
The \(\texttt{Variety:Watering}\) interaction terms has an F-statistics of \(1.0055\) which follows an \(F_{2,15}\) distribution and has a p-value of \(0.3892\) – which is non-significant. Hence there is no evidence that difference varieties respond differently to water levels.
M2.fit = lm(Height ~ Variety + Watering )
anova(M2.fit)
From this anova table it appears that none of the variables are significant. This might suggest that the natural variability is large and that a bigger sample would be needed to identify any pattern.