Linear regression is possibly the most well-known machine learning algorithm. It tries to find a linear relationship between the inputs and outputs in a given set of input-output pairs. One notable aspect is that linear regression, unlike most of its peers, has a closed-form solution.
The mathematics involved in the derivation of this solution (also known as the Normal equation) is pretty basic. However, to understand the equation in its commonly-used form, we need to appreciate some matrix calculus. In this post, I will attempt to explain, from the ground up, the linear regression formula along with the necessary matrix calculus. I do assume that you are familiar with matrix operations (like transposes and matrix multiplication) and basic calculus.
The Basics
Given a set of
A linear function on
Using matrices, we can write
Our function
If there are
Note that this column matrix can be decomposed into the following product
It is important to keep in mind that even though the last product looks like a
scalar multiplication,
Understandably, we call the first matrix in the decomposition
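To make the setup concrete, here is a minimal numpy sketch of this decomposition. The names `X`, `w`, and `predictions` are my own choices for illustration, not necessarily the notation used above: the inputs are stacked row-wise, a constant 1 is appended to each row so the bias folds into the weight matrix, and one prediction per example then falls out of a single matrix product.

```python
import numpy as np

# Three 2-dimensional inputs, stacked row-wise.
inputs = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# Append a column of ones so the bias is absorbed into the weight matrix.
X = np.hstack([inputs, np.ones((inputs.shape[0], 1))])

# One weight per feature, plus the bias as the last entry.
w = np.array([0.5, -1.0, 2.0])

# One matrix product yields the prediction for every example at once.
predictions = X @ w
print(predictions)  # [0.5, -0.5, -1.5]
```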
The Loss Function
To recapitulate, the linear function we want to learn is represented by the
weight matrix
Thus
The aim of linear regression is to minimize the errors as much as possible. But rather than try to minimize each error separately---which would be a hard task, as decreasing one error might cause another error to shoot up---we try to minimize the sum of squares of the individual errors.
If
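In case it helps to see the loss written out, here is the matrix form of the sum of squared errors. I am using `X` for the matrix of stacked inputs, `w` for the weight matrix, and `y` for the column of target outputs; adapt the symbols to whichever notation you prefer:

```latex
L(w) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
     = (y - Xw)^\top (y - Xw)
```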
Minimizing the Loss Function
The solution to the linear regression problem is the point
In general, functions may have multiple minima and/or maxima. Some functions may
not even have a minimum¹. But here, we don’t have to worry about those
cases. The sum of squares of errors, our loss function, is a quadratic function.
It turns out a quadratic function (think about
Partial Derivatives
Our loss function depends on not one, but
First, let us use a simple convention. To differentiate a scalar with respect to a column matrix of variables, we will differentiate the scalar using each variable in the column matrix, and collect the outputs in a column matrix. The output will thus have the same shape as the denominator.
If
As an example, if
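For instance (my own illustrative example, chosen only to show the convention): if $f(w) = w_1^2 + w_1 w_2$ where $w = (w_1, w_2)^\top$, then

```latex
\frac{\partial f}{\partial w}
= \begin{pmatrix} \partial f / \partial w_1 \\ \partial f / \partial w_2 \end{pmatrix}
= \begin{pmatrix} 2 w_1 + w_2 \\ w_1 \end{pmatrix}
```

which, as promised, has the same shape as the denominator $w$.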
The Chain Rule
A relatively easy method for computing the derivative of the loss function is using the chain rule.
In the chain rule of calculus, if
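For reference, the scalar chain rule being alluded to is the familiar one: if $h(x) = f(g(x))$, then

```latex
h'(x) = f'\bigl(g(x)\bigr) \cdot g'(x)
```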
Does the same rule work for matrix differentiation also? Let us find out.
We define
Then what we wish to compute is
But what about
We will follow the convention of populating the output matrix in the
column-first manner. In the first case, we will get the following
In the second case, our output would be a
There is a bigger problem: no matter which among the two we choose, the chain
rule equation would not work. Remember that on the RHS of the scalar chain-rule
equation was the product
One thing we should realize at this point is that matrix calculus, unlike calculus proper, is not a fundamental branch of mathematics. It is just a shorthand for doing multiple calculus operations in one shot. That is why there are differing conventions for representing the results of its operations.
With that in mind, note that even though we can’t multiply the matrices in the
given order, they can be multiplied in the reverse order for one case. The
product
But we cannot create rules out of thin air just because the dimensions match. Maths does not work that way. So let us do a proper check.
Our hypothesis is that
LHS is a column matrix consisting of
If we stack the above formula for
It is easy to see that the above matrix is the same as the following product:
This is exactly what we expected! So we can happily conclude that our hypothesis is valid.
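As an extra sanity check, the hypothesis can also be verified numerically. Below is a small numpy sketch (the symbols `X`, `w`, `u`, and `f` are my own): with the intermediate vector $u = Xw$ and the scalar $f = u^\top u$, the chain rule with the product taken in reverse order predicts $\partial f / \partial w = X^\top (\partial f / \partial u) = X^\top (2u)$, which we compare against an independent finite-difference gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
w = rng.normal(size=3)

def f(w):
    u = X @ w     # intermediate vector u = Xw
    return u @ u  # scalar f = u^T u

# Chain rule, with the matrices multiplied in reverse order:
# df/dw = (du/dw)^T (df/du) = X^T (2u)
u = X @ w
analytic = X.T @ (2 * u)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([
    (f(w + eps * e) - f(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```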
Finding the Derivatives
Let us go on to compute the derivative of the loss function.
The second term in the chain rule expansion of
The first term is
In the last step,
We now know how to compute this derivative. As per our convention, we can take
each element in
Multiplying the two derivatives using our very own chain rule, we find the derivative of the loss function to be
Getting the Minimum
Equating the last equation to zero, we finally get the normal equation:
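To close the loop, here is a short numpy sketch of the normal equation in action (again with my own symbols: design matrix `X`, targets `y`). Rather than forming an explicit inverse, it is better practice to solve the linear system $(X^\top X)\, w = X^\top y$ directly; the result is cross-checked against numpy's built-in least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w  # noiseless targets, so the fit should recover true_w exactly

# Normal equation: solve (X^T X) w = X^T y instead of inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against numpy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_hat, true_w))   # True
print(np.allclose(w_hat, w_lstsq))  # True
```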
Appendix
Just for completeness, I will now outline a Normal equation derivation that does not require the chain rule. Feel free to skip this section if you have already understood the method given above---unless you don’t want to miss out on some more Matrix calculus.
OK, let us start by expanding out the loss function.
To compute the partial derivatives of the second and third terms, let us first
observe that
If
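The observation being used for these linear (cross) terms is an instance of a standard identity, stated here in my own notation: for any constant column matrix $a$,

```latex
\frac{\partial}{\partial w} \left( a^\top w \right)
= \frac{\partial}{\partial w} \left( w^\top a \right)
= a
```

since $a^\top w = \sum_i a_i w_i$, and differentiating with respect to each $w_i$ picks out $a_i$.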
The First Term
Computing the derivative of the first term is a bit more involved. Observe that
If
And hence
We will use
We can see that for
But for the
Adding together, we get
This can be decomposed into a matrix product as below:
In other words,
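What is being derived here is an instance of the standard quadratic-form identity, again stated in my own notation: for a square matrix $A$,

```latex
\frac{\partial}{\partial w} \left( w^\top A w \right) = \left( A + A^\top \right) w
```

which reduces to $2Aw$ when $A$ is symmetric, as is the case for $A = X^\top X$.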
Now we can substitute back
Adding Them Up
Now that we have the derivatives of all the terms, we can just combine them and get the full derivative.
This checks out with the derivative we got using the chain rule. So we will stop here.