Linear regression is possibly the most well-known machine learning algorithm. It tries to find a linear relationship between the inputs and outputs in a given set of input-output pairs. One notable aspect is that linear regression, unlike most of its peers, has a closed-form solution.
The mathematics involved in the derivation of this solution (also known as the Normal equation) is pretty basic. However, to understand the equation in its commonly-used form, we need to appreciate some matrix calculus. In this post, I will attempt to explain, from the ground up, the linear regression formula along with the necessary matrix calculus. I do assume that you are familiar with matrix operations (like transposes and matrix multiplication) and basic calculus.
The Basics
Given a set of
A linear function on
Using matrices, we can write
Our function
If there are
Note that this column matrix can be decomposed into the following product
It is important to keep in mind that even though the last product looks like a
scalar multiplication,
Understandably, we call the first matrix in the decomposition
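To make the setup concrete, here is a minimal numpy sketch of this decomposition. The names `X`, `w`, and `predictions` are my own choices for illustration, not necessarily the notation used above: the inputs are stacked row-wise, a constant 1 is appended to each row so the bias folds into the weight matrix, and one prediction per example then falls out of a single matrix product.

```python
import numpy as np

# Three 2-dimensional inputs, stacked row-wise.
inputs = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# Append a column of ones so the bias is absorbed into the weight matrix.
X = np.hstack([inputs, np.ones((inputs.shape[0], 1))])

# One weight per feature, plus the bias as the last entry.
w = np.array([0.5, -1.0, 2.0])

# One matrix product yields the prediction for every example at once.
predictions = X @ w
print(predictions)  # [0.5, -0.5, -1.5]
```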
The Loss Function
To recapitulate, the linear function we want to learn is represented by the
weight matrix
Thus
The aim of linear regression is to minimize the errors as much as possible. But rather than try to minimize each error separately---which would be a hard task, as decreasing one error might cause another error to shoot up---we try to minimize the sum of squares of the individual errors.
If
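In case it helps to see the loss written out, here is the matrix form of the sum of squared errors. I am using `X` for the matrix of stacked inputs, `w` for the weight matrix, and `y` for the column of target outputs; adapt the symbols to whichever notation you prefer:

```latex
L(w) = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
     = (y - Xw)^\top (y - Xw)
```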
Minimizing the Loss Function
The solution to the linear regression problem is the point
In general, functions may have multiple minima and/or maxima. Some functions may
not even have a minimum¹. But here, we don’t have to worry about those
cases. The sum of squares of errors, our loss function, is a quadratic function.
It turns out a quadratic function (think about
Partial Derivatives
Our loss function depends on not one, but
First, let us use a simple convention. To differentiate a scalar with respect to a column matrix of variables, we will differentiate the scalar using each variable in the column matrix, and collect the outputs in a column matrix. The output will thus have the same shape as the denominator.
If
As an example, if
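For instance (my own illustrative example, chosen only to show the convention): if $f(w) = w_1^2 + w_1 w_2$ where $w = (w_1, w_2)^\top$, then

```latex
\frac{\partial f}{\partial w}
= \begin{pmatrix} \partial f / \partial w_1 \\ \partial f / \partial w_2 \end{pmatrix}
= \begin{pmatrix} 2 w_1 + w_2 \\ w_1 \end{pmatrix}
```

which, as promised, has the same shape as the denominator $w$.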
The Chain Rule
A relatively easy method for computing the derivative of the loss function is using the chain rule.
In the chain rule of calculus, if
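For reference, the scalar chain rule being alluded to is the familiar one: if $h(x) = f(g(x))$, then

```latex
h'(x) = f'\bigl(g(x)\bigr) \cdot g'(x)
```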
Does the same rule work for matrix differentiation also? Let us find out.
We define
Then what we wish to compute is
But what about
We will follow the convention of populating the output matrix in the
column-first manner. In the first case, we will get the following
In the second case, our output would be a
There is a bigger problem: no matter which among the two we choose, the chain
rule equation would not work. Remember that on the RHS of the scalar chain-rule
equation was the product
One thing we should realize at this point is that matrix calculus, unlike calculus proper, is not a fundamental branch of mathematics. It is just a shorthand for doing multiple calculus operations in one shot. That is why there are differing conventions for representing the results of its operations.
With that in mind, note that even though we can’t multiply the matrices in the
given order, they can be multiplied in the reverse order for one case. The
product
But we cannot create rules out of thin air just because the dimensions match. Maths does not work that way. So let us do a proper check.
Our hypothesis is that
LHS is a column matrix consisting of
If we stack the above formula for
It is easy to see that the above matrix is the same as the following product:
This is exactly what we expected! So we can happily conclude that our hypothesis is valid.
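As an extra sanity check, the hypothesis can also be verified numerically. Below is a small numpy sketch (the symbols `X`, `w`, `u`, and `f` are my own): with the intermediate vector $u = Xw$ and the scalar $f = u^\top u$, the chain rule with the product taken in reverse order predicts $\partial f / \partial w = X^\top (\partial f / \partial u) = X^\top (2u)$, which we compare against an independent finite-difference gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
w = rng.normal(size=3)

def f(w):
    u = X @ w     # intermediate vector u = Xw
    return u @ u  # scalar f = u^T u

# Chain rule, with the matrices multiplied in reverse order:
# df/dw = (du/dw)^T (df/du) = X^T (2u)
u = X @ w
analytic = X.T @ (2 * u)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([
    (f(w + eps * e) - f(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(analytic, numeric, atol=1e-4))  # True
```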
Finding the Derivatives
Let us go on to compute the derivative of the loss function.
The second term in the chain rule expansion of
The first term is
In the last step,
We now know how to compute this derivative. As per our convention, we can take
each element in
Multiplying the two derivatives using our very own chain rule, we find the derivative of the loss function to be
Getting the Minimum
Equating the last equation to zero, we finally get the normal equation:
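To close the loop, here is a short numpy sketch of the normal equation in action (again with my own symbols: design matrix `X`, targets `y`). Rather than forming an explicit inverse, it is better practice to solve the linear system $(X^\top X)\, w = X^\top y$ directly; the result is cross-checked against numpy's built-in least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w  # noiseless targets, so the fit should recover true_w exactly

# Normal equation: solve (X^T X) w = X^T y instead of inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against numpy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_hat, true_w))   # True
print(np.allclose(w_hat, w_lstsq))  # True
```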
Appendix
Just for completeness, I will now outline a Normal equation derivation that does not require the chain rule. Feel free to skip this section if you have already understood the method given above---unless you don’t want to miss out on some more Matrix calculus.
OK, let us start by expanding out the loss function.
To compute the partial derivatives of the second and third terms, let us first
observe that
If
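The observation being used for these linear (cross) terms is an instance of a standard identity, stated here in my own notation: for any constant column matrix $a$,

```latex
\frac{\partial}{\partial w} \left( a^\top w \right)
= \frac{\partial}{\partial w} \left( w^\top a \right)
= a
```

since $a^\top w = \sum_i a_i w_i$, and differentiating with respect to each $w_i$ picks out $a_i$.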
The First Term
Computing the derivative of the first term is a bit more involved. Observe that
If
And hence
We will use
We can see that for
But for the
Adding together, we get
This can be decomposed into a matrix product as below:
In other words,
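What is being derived here is an instance of the standard quadratic-form identity, again stated in my own notation: for a square matrix $A$,

```latex
\frac{\partial}{\partial w} \left( w^\top A w \right) = \left( A + A^\top \right) w
```

which reduces to $2Aw$ when $A$ is symmetric, as is the case for $A = X^\top X$.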
Now we can substitute back
Adding Them Up
Now that we have the derivatives of all the terms, we can just combine them and get the full derivative.
This checks out with the derivative we got using the chain rule. So we will stop here.