Here's a quick derivation, in vectorized form, of the normal equations for multiple linear regression under ordinary least squares.
First, we have our model
$$\begin{equation}y = \boldsymbol{\beta}^T\boldsymbol{x} + \varepsilon\end{equation}$$
where $y$ is a scalar, $\boldsymbol{\beta}$ is the column vector of coefficients, $\boldsymbol{x}$ is the column vector of regressors, and $\varepsilon$ is a random variable that we will assume has mean zero. The sum of squared errors is then
$$\begin{equation}SSE = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (\hat{y}_i - y_i)^2\end{equation}$$
where $\hat{y}_i$ is the predicted value (i.e. just the predicted mean given $\boldsymbol{x}$) and $y_i$ is the observed value of the output for the $i$-th of our $n$ data points. This can be written more succinctly as
$$\begin{equation}SSE = \boldsymbol{e}^T\boldsymbol{e} = (\hat{\boldsymbol{y}}-\boldsymbol{y})^T(\hat{\boldsymbol{y}}-\boldsymbol{y}) = (\boldsymbol{X}\boldsymbol{b}-\boldsymbol{y})^T(\boldsymbol{X}\boldsymbol{b}-\boldsymbol{y})\end{equation}$$
where $\boldsymbol{e}$ is the column vector whose elements are the errors for each data point, and similarly for $\hat{\boldsymbol{y}}$ and $\boldsymbol{y}$. The matrix $\boldsymbol{X}$ contains all the data, with each row representing a data point and each column a regressor; we prepend a column of ones to account for the intercept term. Finally, $\boldsymbol{b}$ is the vector of estimates for the true $\boldsymbol{\beta}$, which we are trying to solve for. We can do so by minimizing the cost function; in other words, we want the sum of squares to be as small as possible. The SSE is a convex function of $\boldsymbol{b}$, so setting its derivatives with respect to $\boldsymbol{b}$ to zero and solving for $\boldsymbol{b}$ will give us the minimum. The resulting system of equations is called the normal equations.
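As a quick aside before we rearrange anything, here's a minimal NumPy sketch of this setup, just to make the shapes concrete; the data below are random placeholders rather than anything from the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                      # n data points, p regressors
x_data = rng.normal(size=(n, p))  # raw regressors, one row per data point
y = rng.normal(size=n)            # observed outputs (placeholder values)

# Design matrix: prepend a column of ones for the intercept term.
X = np.column_stack([np.ones(n), x_data])

b = rng.normal(size=p + 1)        # some candidate coefficient vector

# SSE written two equivalent ways: elementwise sum and e^T e.
e = X @ b - y
sse_sum = np.sum(e ** 2)
sse_vec = e @ e
print(np.isclose(sse_sum, sse_vec))  # True
```

With the notation settled, let's first rearrange the SSE a bit more before differentiating.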
$$\begin{eqnarray}SSE &=& (\boldsymbol{X}\boldsymbol{b}-\boldsymbol{y})^T(\boldsymbol{X}\boldsymbol{b}-\boldsymbol{y}) \nonumber \\ &=& (\boldsymbol{b}^T\boldsymbol{X}^T-\boldsymbol{y}^T)(\boldsymbol{X}\boldsymbol{b}-\boldsymbol{y}) \nonumber \\ &=& \boldsymbol{b}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{b} - \boldsymbol{b}^T\boldsymbol{X}^T\boldsymbol{y} - \boldsymbol{y}^T\boldsymbol{X}\boldsymbol{b} + \boldsymbol{y}^T\boldsymbol{y}\end{eqnarray}$$
The SSE is a scalar, so every term in the above equation is a scalar, and the transpose of a scalar is just itself. Hence we can combine the two middle terms on the right hand side by transposing one of them to get
$$\begin{equation}SSE = \boldsymbol{b}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{b} - 2\boldsymbol{y}^T\boldsymbol{X}\boldsymbol{b} + \boldsymbol{y}^T\boldsymbol{y}\label{eq:SSE_0}\end{equation}$$
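If you want to convince yourself of this expansion numerically, a throwaway check with random placeholder data (the same kind of setup as the sketch above) does the trick.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = rng.normal(size=50)
b = rng.normal(size=4)

# Left: compact quadratic form.  Right: the expanded version above.
lhs = (X @ b - y) @ (X @ b - y)
rhs = b @ X.T @ X @ b - 2 * y @ X @ b + y @ y
print(np.isclose(lhs, rhs))  # True
```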
We may differentiate each term in turn. If you're not familiar with differentiating with respect to a vector, it's actually quite straightforward. Let $Q(\boldsymbol{b})$ be a scalar function of a vector and $\boldsymbol{h}$ an arbitrary vector of the same size as $\boldsymbol{b}$. Then, writing $\frac{d Q(\boldsymbol{b})}{d \boldsymbol{b}}$ for the derivative of $Q$ at $\boldsymbol{b}$ in the direction $\boldsymbol{h}$,
$$\begin{equation}\frac{d Q(\boldsymbol{b})}{d \boldsymbol{b}} = \lim_{t\to 0} \frac{Q(\boldsymbol{b}+\boldsymbol{h}t)-Q(\boldsymbol{b})}{t}\end{equation}$$
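To see this definition in action, here's a finite-difference check on the simple toy function $Q(\boldsymbol{b}) = \boldsymbol{b}^T\boldsymbol{b}$ (not one of the terms above), whose derivative in the direction $\boldsymbol{h}$ works out to $2\boldsymbol{b}^T\boldsymbol{h}$.

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.normal(size=4)
h = rng.normal(size=4)        # an arbitrary direction

Q = lambda v: v @ v           # the toy function Q(b) = b^T b

t = 1e-7
numeric = (Q(b + t * h) - Q(b)) / t
analytic = 2 * b @ h
print(numeric, analytic)      # the two agree to several decimal places
```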
Now let us differentiate each term in $\eqref{eq:SSE_0}$. We can write the first term as
$$\begin{equation}Q_1(\boldsymbol{b}) = \boldsymbol{b}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{b} \end{equation}$$
and so, expanding $(\boldsymbol{b}+t\boldsymbol{h})^T\boldsymbol{X}^T\boldsymbol{X}(\boldsymbol{b}+t\boldsymbol{h})$ and noting that the two cross terms are transposes of one another (and hence equal scalars), we can write
$$\begin{equation}Q_1(\boldsymbol{b}+\boldsymbol{h}t) = \boldsymbol{b}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{b} + 2t\boldsymbol{b}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{h} + t^2 \boldsymbol{h}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{h}\end{equation}$$
Thus,
$$\begin{equation}\lim_{t\to 0} \frac{Q_1(\boldsymbol{b}+\boldsymbol{h}t)-Q_1(\boldsymbol{b})}{t} = \lim_{t\to 0} \frac{ 2t\boldsymbol{b}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{h} + t^2 \boldsymbol{h}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{h}}{t} = 2\boldsymbol{b}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{h}\end{equation}$$
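Here's the same kind of finite-difference sanity check applied to $Q_1$, again with random placeholder values for $\boldsymbol{X}$, $\boldsymbol{b}$, and $\boldsymbol{h}$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
b = rng.normal(size=4)
h = rng.normal(size=4)

Q1 = lambda v: v @ X.T @ X @ v

t = 1e-7
numeric = (Q1(b + t * h) - Q1(b)) / t
analytic = 2 * b @ X.T @ X @ h
print(numeric, analytic)      # agree up to finite-difference error
```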
Similarly, the second term in $\eqref{eq:SSE_0}$ can be written as,
$$\begin{equation}Q_2(\boldsymbol{b}) = 2\boldsymbol{y}^T\boldsymbol{X}\boldsymbol{b} \end{equation}$$
and so we get
$$\begin{equation}\frac{d Q_2}{d \boldsymbol{b} } = 2\boldsymbol{y}^T\boldsymbol{X}\boldsymbol{h} \end{equation}$$
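Since $Q_2$ is linear in $\boldsymbol{b}$, the same finite-difference check (placeholder data as before) is exact up to rounding.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = rng.normal(size=50)
b = rng.normal(size=4)
h = rng.normal(size=4)

Q2 = lambda v: 2 * y @ X @ v

t = 1e-7
numeric = (Q2(b + t * h) - Q2(b)) / t
analytic = 2 * y @ X @ h
print(numeric, analytic)      # agree up to rounding, since Q2 is linear
```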
Finally, the derivative of the third term is simply zero, since $\boldsymbol{y}^T\boldsymbol{y}$ does not depend on $\boldsymbol{b}$. Putting these together and setting the derivative of the SSE to zero, we get
$$\begin{equation}0 = 2\boldsymbol{b}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{h} - 2\boldsymbol{y}^T\boldsymbol{X}\boldsymbol{h} = 2(\boldsymbol{b}^T\boldsymbol{X}^T\boldsymbol{X}- \boldsymbol{y}^T\boldsymbol{X})\boldsymbol{h}\end{equation}$$
The vector $\boldsymbol{h}$ is arbitrary, so we may conclude that the term in the parentheses is identically zero. Therefore,
$$\begin{equation}\boldsymbol{b}^T\boldsymbol{X}^T\boldsymbol{X} = \boldsymbol{y}^T\boldsymbol{X}\end{equation}$$
Or, transposing both sides, we arrive at the normal equations
$$\begin{equation}\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{b} = \boldsymbol{X}^T\boldsymbol{y}\end{equation}$$
If $\boldsymbol{X}^T\boldsymbol{X}$ is non-singular, i.e. its determinant is not equal to zero, then we can invert it to solve for the coefficients as
$$\begin{equation}\boldsymbol{b} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}\end{equation}$$
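In practice you wouldn't form the inverse explicitly; solving the linear system directly is cheaper and more stable. A small sketch with simulated data (the particular coefficients and noise level are just assumptions for the example), cross-checked against NumPy's built-in least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
beta_true = np.array([1.0, 2.0, -0.5, 0.3])   # assumed true coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations (X^T X) b = X^T y directly.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b_normal, b_lstsq))  # True
```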
Notice that the input data $\boldsymbol{X}$ are just fixed real numbers, while the $y_i$ can be treated as random variables since each contains the noise term $\varepsilon$. Therefore $\boldsymbol{b}$, being a linear combination of random variables, is a random variable itself. Happily, because $\varepsilon$ has mean zero, $\mathbb{E}(\boldsymbol{b}) = \boldsymbol{\beta}$; our estimate is unbiased.
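A rough Monte Carlo illustration of that last claim, again with made-up coefficients and noise: keep the design fixed, redraw the noise many times, refit, and average the estimates.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
beta_true = np.array([1.0, 2.0, -0.5])        # assumed true coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # fixed design

estimates = []
for _ in range(2000):
    y = X @ beta_true + rng.normal(scale=1.0, size=n)  # mean-zero noise
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(estimates, axis=0))  # close to beta_true = [1.0, 2.0, -0.5]
```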