Below are a few proofs regarding the least squares derivation associated with multiple linear regression (MLR). These proofs are useful for understanding where the MLR algorithm comes from. In particular, if you aim to write your own implementation, these proofs provide a means to understand:

  • What logic is being used?
  • How does the logic apply in a procedural form?
  • Why is this logic present?

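Multiple Linear Regression (MLR) Definition

Scalar form:

$$y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i, \quad i = 1, \ldots, n$$

Matrix form:

$$y_{n \times 1} = X_{n \times p}\,\beta_{p \times 1} + \varepsilon_{n \times 1}$$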

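Design Matrix:

$$X = \begin{bmatrix}
1 & x_{1,1} & \cdots & x_{1,p-1} \\
1 & x_{2,1} & \cdots & x_{2,p-1} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n,1} & \cdots & x_{n,p-1}
\end{bmatrix}$$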


  • Symbols used above have the following meaning:
    • n: number of observations
    • p: number of variables
    • X: design matrix
    • y: response vector
    • $\beta$: parameter or coefficient vector
    • $\varepsilon$: random error vector
  • Within the scalar parameterization (given by $y_i$), the design matrix terms $x_{i,j}$ only go up to column $p - 1$, since one column of the design matrix is reserved for the 1’s that multiply the intercept $\beta_0$.
  • Some versions instead give the design matrix $p + 1$ columns, which means there are $p$ variables plus an intercept. The rows of the design matrix would then be structured as:

$$x_i^T = \begin{bmatrix} 1 & x_{i,1} & x_{i,2} & \cdots & x_{i,p} \end{bmatrix}$$
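For anyone writing their own implementation, the intercept column is easy to forget. Below is a minimal sketch, assuming NumPy is available, of how a design matrix with a leading column of 1’s might be assembled. The helper name `make_design_matrix` and the small `features` array are hypothetical and exist only for illustration.

```python
import numpy as np

def make_design_matrix(features):
    """Prepend a column of ones (the intercept column) to an (n, k) feature array."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        features = features[:, np.newaxis]  # treat a 1-D input as a single variable
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

# Hypothetical data: n = 3 observations of 2 variables.
features = np.array([[2.0, 5.0],
                     [3.0, 1.0],
                     [4.0, 7.0]])
X = make_design_matrix(features)
print(X.shape)  # (3, 3): one intercept column plus two variable columns
```

Under the first convention above, this matrix has $p = 3$ columns; under the second convention, the same matrix would be described as $p = 2$ variables plus an intercept.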

Refresh of Matrix Derivatives

Before beginning, it is helpful to know about matrix differentiation. So, let’s quickly go over two differentiation rules for matrices that will be employed next.

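Consider vectors $\mathbf{a}_{p \times 1}$ and $\mathbf{b}_{p \times 1}$, then the derivative with respect to $\mathbf{b}$ of the product ${\mathbf{a}^T}\mathbf{b}$ is given as:

$$\frac{\partial\, {\mathbf{a}^T}\mathbf{b}}{\partial \mathbf{b}} = \frac{\partial\, {\mathbf{b}^T}\mathbf{a}}{\partial \mathbf{b}} = \mathbf{a}$$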

Now, consider the quadratic form ${\mathbf{b}^T}A\mathbf{b}$ with symmetric matrix $A_{p \times p}$, then we have:

$$\frac{\partial\, {\mathbf{b}^T}A\mathbf{b}}{\partial \mathbf{b}} = 2A\mathbf{b}$$

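Note, if $A$ is not symmetric, then we can use:

$$\frac{\partial\, {\mathbf{b}^T}A\mathbf{b}}{\partial \mathbf{b}} = \left( A + A^T \right)\mathbf{b}$$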
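The differentiation rules for ${\mathbf{a}^T}\mathbf{b}$ and ${\mathbf{b}^T}A\mathbf{b}$ can be sanity-checked numerically. The sketch below, assuming NumPy, compares each closed-form gradient ($\mathbf{a}$, $2A\mathbf{b}$ for symmetric $A$, and $(A + A^T)\mathbf{b}$ in general) against a central finite-difference approximation. The helper `numerical_gradient` and the random test inputs are illustrative choices, not part of the derivation.

```python
import numpy as np

def numerical_gradient(f, b, h=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at b."""
    grad = np.zeros_like(b)
    for j in range(b.size):
        step = np.zeros_like(b)
        step[j] = h
        grad[j] = (f(b + step) - f(b - step)) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
p = 4
a = rng.normal(size=p)
b = rng.normal(size=p)
A = rng.normal(size=(p, p))   # a generic matrix, not symmetric in general
S = A + A.T                   # symmetric by construction

grad_linear = numerical_gradient(lambda v: a @ v, b)        # d(a^T b)/db
grad_symmetric = numerical_gradient(lambda v: v @ S @ v, b)  # d(b^T S b)/db
grad_general = numerical_gradient(lambda v: v @ A @ v, b)    # d(b^T A b)/db

print(np.allclose(grad_linear, a, atol=1e-5))
print(np.allclose(grad_symmetric, 2 * S @ b, atol=1e-5))
print(np.allclose(grad_general, (A + A.T) @ b, atol=1e-5))
```

Because the functions involved are linear or quadratic, the central difference is exact up to floating-point rounding, so all three comparisons should agree within the stated tolerance.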

Least Squares with Multiple Linear Regression (MLR)

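Goal: Find the coefficient vector that minimizes the residual sum of squares (RSS).

$$\hat{\beta} = \mathop{\arg\min}\limits_{\beta} \; RSS\left( \beta \right)$$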


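RSS Definition:

$$RSS = {e^T}e = \sum\limits_{i = 1}^n {e_i^2}, \quad \text{where } e = y - X\beta$$

Note: $e \neq \varepsilon$. Here $e$ is the vector of residuals produced by the regression procedure, which serves as an estimate of the unobserved errors $\varepsilon$.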

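Expand RSS:

$$\begin{aligned}
RSS &= {\left( {y - X\beta } \right)^T}\left( {y - X\beta } \right) \\
&= {y^T}y - {y^T}X\beta - {\beta ^T}{X^T}y + {\beta ^T}{X^T}X\beta \\
&= {y^T}y - 2{\beta ^T}{X^T}y + {\beta ^T}{X^T}X\beta
\end{aligned}$$

The last step combines the two cross terms: since ${y^T}X\beta$ is a scalar, it equals its own transpose, ${y^T}X\beta = {\left( {{y^T}X\beta } \right)^T} = {\beta ^T}{X^T}y$.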
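The scalar-transpose step is a common place to slip when translating the algebra into code. As a small check, assuming NumPy, the sketch below confirms on random inputs that ${y^T}X\beta$ and ${\beta ^T}{X^T}y$ coincide and that the expanded form ${y^T}y - 2{\beta ^T}{X^T}y + {\beta ^T}{X^T}X\beta$ matches ${\left( {y - X\beta } \right)^T}\left( {y - X\beta } \right)$. The function names `rss_direct` and `rss_expanded` are hypothetical.

```python
import numpy as np

def rss_direct(beta, X, y):
    """RSS computed as e^T e with e = y - X beta."""
    e = y - X @ beta
    return e @ e

def rss_expanded(beta, X, y):
    """RSS computed from the expanded form y^T y - 2 beta^T X^T y + beta^T X^T X beta."""
    return y @ y - 2.0 * beta @ X.T @ y + beta @ X.T @ X @ beta

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = rng.normal(size=n)
beta = rng.normal(size=p)    # an arbitrary candidate coefficient vector

cross_1 = y @ X @ beta       # y^T X beta
cross_2 = beta @ X.T @ y     # beta^T X^T y

print(np.isclose(cross_1, cross_2))
print(np.isclose(rss_direct(beta, X, y), rss_expanded(beta, X, y)))
```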

Take the derivative with respect to $\beta$:

$$\frac{\partial RSS}{\partial \beta} = -2{X^T}y + 2{X^T}X\beta$$

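Set equal to zero and solve:

$$\begin{aligned}
-2{X^T}y + 2{X^T}X\hat{\beta} &= 0 \\
{X^T}X\hat{\beta} &= {X^T}y \\
\hat{\beta} &= {\left( {{X^T}X} \right)^{-1}}{X^T}y
\end{aligned}$$

The last step requires ${X^T}X$ to be invertible, which holds when $X$ has full column rank.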
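In procedural form, the closed-form solution $\hat{\beta} = {\left( {{X^T}X} \right)^{-1}}{X^T}y$ amounts to solving the linear system ${X^T}X\hat{\beta} = {X^T}y$. Here is a minimal sketch, assuming NumPy; `fit_ols` is a hypothetical helper name and the simulated data are made up for illustration. It solves the system directly rather than forming the inverse, and cross-checks the result against NumPy’s own least squares routine.

```python
import numpy as np

def fit_ols(X, y):
    """Solve the normal equations X^T X beta = X^T y for beta."""
    XtX = X.T @ X
    Xty = X.T @ y
    # Solving the system avoids explicitly inverting X^T X.
    return np.linalg.solve(XtX, Xty)

# Simulated data with known (made-up) coefficients.
rng = np.random.default_rng(42)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat = fit_ols(X, y)

# Cross-check against NumPy's least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The RSS gradient -2 X^T y + 2 X^T X beta should vanish at beta_hat.
gradient_at_solution = -2.0 * X.T @ y + 2.0 * X.T @ X @ beta_hat

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

Solving the normal equations mirrors the derivation, but it can be numerically fragile when the columns of $X$ are nearly collinear. Routines such as `np.linalg.lstsq` use a more stable factorization, which is one reason to compare against them.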

Mean of LS Estimator for MLR

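Next up, let’s take the mean of the estimator!

$$\begin{aligned}
E\left[ {\hat{\beta}} \right] &= E\left[ {{{\left( {{X^T}X} \right)}^{-1}}{X^T}y} \right] \\
&= E\left[ {{{\left( {{X^T}X} \right)}^{-1}}{X^T}\left( {X\beta + \varepsilon } \right)} \right] \\
&= E\left[ {\beta + {{\left( {{X^T}X} \right)}^{-1}}{X^T}\varepsilon } \right] \\
&= \beta + E\left[ {{{\left( {{X^T}X} \right)}^{-1}}{X^T}\varepsilon } \right]
\end{aligned}$$

  • We substituted in the definition $y = X\beta + \varepsilon$ and then simplified the matrix product using ${\left( {{X^T}X} \right)^{-1}}{X^T}X = I$.
  • $\beta$ is a constant within the expectation and, thus, we pulled it out.

$$\begin{aligned}
E\left[ {\hat{\beta}} \right] &= \beta + E\left[ {E\left[ {{{\left( {{X^T}X} \right)}^{-1}}{X^T}\varepsilon \mid X} \right]} \right] \\
&= \beta + E\left[ {{{\left( {{X^T}X} \right)}^{-1}}{X^T}E\left[ {\varepsilon \mid X} \right]} \right] \\
&= \beta
\end{aligned}$$

  • Used the law of total expectation, $E\left[ X \right] = E\left[ {E\left[ {X \mid Y} \right]} \right]$ (stated here for generic random variables $X$ and $Y$).
  • Showed that the estimator is unbiased under the exogeneity assumption that the conditional mean of the errors is zero, $E\left[ {\varepsilon \mid X} \right] = 0$.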
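Unbiasedness can also be illustrated by simulation. The sketch below, assuming NumPy, holds one design matrix fixed (so it illustrates the conditional statement $E\left[ {\hat{\beta} \mid X} \right] = \beta$), repeatedly draws mean-zero errors, refits, and averages the estimates. The coefficient values, sample size, and number of replications are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)
n, n_sims = 100, 5000
beta_true = np.array([1.0, 2.0, -0.5])   # made-up "true" coefficients
sigma = 1.0

# One design matrix, held fixed across all replications.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)   # (X^T X)^{-1} X^T

estimates = np.empty((n_sims, beta_true.size))
for s in range(n_sims):
    eps = rng.normal(scale=sigma, size=n)    # mean-zero errors drawn independently of X
    y = X @ beta_true + eps
    estimates[s] = XtX_inv_Xt @ y            # least squares estimate for this replication

mean_estimate = estimates.mean(axis=0)
print(mean_estimate)   # should be close to beta_true
```

Any single estimate will miss `beta_true` by some amount. It is the average over replications that settles near the true value.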

Covariance of the LS Estimator for MLR

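To perform inference, we’ll need to know the covariance matrix of $\hat{\beta}$. From the previous section, $\hat{\beta} - \beta = {\left( {{X^T}X} \right)^{-1}}{X^T}\varepsilon$, so conditional on $X$:

$$\begin{aligned}
Cov\left( {\hat{\beta} \mid X} \right) &= E\left[ {\left( {\hat{\beta} - \beta } \right){{\left( {\hat{\beta} - \beta } \right)}^T} \mid X} \right] \\
&= E\left[ {{{\left( {{X^T}X} \right)}^{-1}}{X^T}\varepsilon {\varepsilon ^T}X{{\left( {{X^T}X} \right)}^{-1}} \mid X} \right] \\
&= {\left( {{X^T}X} \right)^{-1}}{X^T}E\left[ {\varepsilon {\varepsilon ^T} \mid X} \right]X{\left( {{X^T}X} \right)^{-1}}
\end{aligned}$$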

Note: The calculations above carry over to many other regression paradigms with minimal modification.

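Note: Under the homoscedasticity assumption, the variance of the error term is constant (and the errors are uncorrelated), so we assume that

$$Var\left( {\varepsilon \mid X} \right) = E\left[ {\varepsilon {\varepsilon ^T} \mid X} \right] = {\sigma ^2}{I_n}$$

Substituting this into the covariance of $\hat{\beta}$ gives

$$Cov\left( {\hat{\beta} \mid X} \right) = {\left( {{X^T}X} \right)^{-1}}{X^T}\left( {{\sigma ^2}{I_n}} \right)X{\left( {{X^T}X} \right)^{-1}} = {\sigma ^2}{\left( {{X^T}X} \right)^{-1}}$$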
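The homoscedastic covariance formula ${\sigma ^2}{\left( {{X^T}X} \right)^{-1}}$ can be compared against the empirical covariance of simulated estimates. The following is a rough sketch, assuming NumPy, a fixed design matrix, and independent normal errors with a made-up $\sigma = 2$. All numeric settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_sims = 100, 20000
beta_true = np.array([1.0, 2.0, -0.5])   # made-up "true" coefficients
sigma = 2.0

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
XtX_inv = np.linalg.inv(X.T @ X)

# Theoretical covariance under homoscedastic, uncorrelated errors.
theoretical_cov = sigma**2 * XtX_inv

# Simulate all replications at once: each row of Y is one response vector.
eps = rng.normal(scale=sigma, size=(n_sims, n))
Y = X @ beta_true + eps

# Each row is beta_hat^T = y^T X (X^T X)^{-1}, using the symmetry of (X^T X)^{-1}.
estimates = Y @ X @ XtX_inv

# Columns are the coefficients, rows are replications.
empirical_cov = np.cov(estimates, rowvar=False)

print(np.round(theoretical_cov, 4))
print(np.round(empirical_cov, 4))
```

In practice $\sigma^2$ is unknown and has to be estimated from the residuals. This sketch sidesteps that by using the true value from the simulation.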


Based on the above work, we have the following results…