Linear Regression and Gradient Descent

Posted by George Yu on April 26, 2018

Introduction

Linear regression is the most fundamental of all regression models: it is simple yet powerful. In a nutshell, the goal of a linear regression model is to use a line to “fit” a dataset. Let’s explore how a linear regression model works.


Table of Contents

  1. Univariate Linear Regression
    1. MSE
    2. Gradient Descent
  2. Bivariate Linear Regression
  3. Multivariate Linear Regression
  4. Polynomial Fitting

Univariate Linear Regression

We know that any line in the Cartesian plane can be expressed by the equation, \[ y = mx + b \] where \(m\) is the slope of the line and \(b\) is the \(y\)-intercept of the line. Since we are dealing with univariate (concerning one variable) linear regression, which, as the name suggests, can only fit a single parameter, let us for now assume that \(b = 0\) for every line; in other words, we will only tweak the slope of the line to fit the dataset. Also, let’s change the symbol for the slope of the line to \(\theta\) instead of \(m\) (just for conventional purposes). So the equation defining a line now becomes, \[ y = \theta x \]

First, we need to come up with a method to quantitatively determine how well a model (i.e. a line) has fitted the dataset, or alternatively, a measure of how badly it has fitted the dataset.

Consider the following toy dataset and two lines attempting to fit it.

Figure 1: A toy dataset

Although the red line did not fit the dataset perfectly, it was certainly more accurate than the black line. So we are now going to use a measure called MSE (Mean Squared Error) to quantify how badly a model (in this case, a line) fits the dataset.

MSE

For a data point (also known as an example) with coordinates \((x, y)\), our model will predict that it has the coordinates \((x, \theta x)\). The value \(\theta x\) is the prediction (output) of our model given an input \(x\); let’s define the hypothesis function for a specific \(\theta\) as, \[ h_\theta(x) = \theta x \] Thus, the error of the prediction can be quantified by the squared distance between the prediction and the real value (a.k.a. the ground truth). Specifically, the error of a data point is defined as, \[ (h_\theta(x) - y)^2 \]

Now, the error of the model with respect to the whole dataset can be defined as, \[ \frac{1}{2m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \] where \(m\) denotes the number of examples in the dataset (the size of the dataset), and \(x^{(i)}\) denotes the \(i\)-th example in the dataset. In machine learning, we call the error of a model the cost function, and it is, by convention, denoted as \(J\). In our case, \(J\) takes in a parameter \(\theta\) and outputs the cost, or the “unfitness”, of this \(\theta\) on the dataset. So, in our case of univariate linear regression, the cost function is defined as, \[ J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \]
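To make these definitions concrete, here is a minimal NumPy sketch of the hypothesis and the cost function for the univariate model (the function names and the toy data are my own, not part of any library):

```python
import numpy as np

def h(theta, x):
    """Univariate hypothesis: h_theta(x) = theta * x."""
    return theta * x

def cost(theta, x, y):
    """MSE cost J(theta) = (1 / (2m)) * sum((h_theta(x) - y)^2)."""
    m = len(y)
    return np.sum((h(theta, x) - y) ** 2) / (2 * m)

# A toy dataset lying roughly on the line y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

print(cost(2.0, x, y))  # 0.0125 -> theta = 2 fits well
print(cost(0.5, x, y))  # 8.3375 -> theta = 0.5 fits badly
```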

Gradient Descent

Now that we can quantify how well a model performs, our goal can be represented as, \[ \min_\theta J(\theta) \] We can imagine that there is one optimal value of \(\theta\) that has the smallest \(J(\theta)\), and either decreasing or increasing this \(\theta\) will result in an increase in \(J(\theta)\). So if we were to plot \(J(\theta)\) against \(\theta\), it would look something like,

Figure 2: Cost of different \(\theta\) values

Next, we are going to use gradient descent to get to the optimal solution. We observe that the derivative of \(J\) at the optimal solution is 0, all \(\theta\) values larger than the optimal one have a positive derivative, and all values smaller have a negative derivative. Thus, for any random \(\theta\) value we choose, going in the negative direction of the derivative of \(J\) at that value will result in a model closer to the optimal solution.

Concretely, for every step of gradient descent, we tweak the \(\theta\) value by subtracting from it \(\frac{\mathrm{d}}{\mathrm{d}\theta}J(\theta)\), the derivative of \(J\) at the current \(\theta\). We should repeat this process until the gradient converges, that is, until \(\frac{\mathrm{d}}{\mathrm{d}\theta}J(\theta)\) is very close to 0. We can write this as,

\[ \begin{align} &\textrm{repeat until convergence \{} \\
&\qquad \theta := \theta - \frac{\mathrm{d}}{\mathrm{d}\theta}J(\theta) \\
&\textrm{\}} \end{align} \]

And the derivative of \(J\) with respect to \(\theta\) is, \[ \frac{\mathrm{d}}{\mathrm{d}\theta}J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}\right] \]
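In code, this derivative is a one-liner; here is a sketch with hypothetical names of my own:

```python
import numpy as np

def d_cost(theta, x, y):
    """Derivative of J(theta): (1 / m) * sum((theta * x - y) * x)."""
    m = len(y)
    return np.sum((theta * x - y) * x) / m
```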

A problem quickly arises when the gradient of \(J\) at our current value is much larger than the distance between our \(\theta\) value and the optimal value. In such cases, each step of gradient descent overshoots and takes us further away from the optimal solution, and the error of our model escalates rapidly; this is called gradient explosion. To address this issue, we introduce a new variable \(\alpha\), which is the learning rate (or step size) of gradient descent. Now, our gradient descent process becomes,

\[ \begin{align} &\textrm{repeat until convergence \{} \\
&\qquad \theta := \theta - \alpha\frac{\mathrm{d}}{\mathrm{d}\theta}J(\theta) \\
&\textrm{\}} \end{align} \]

Notice that by setting \(\alpha\) to some value between 0 and 1, we solve the problem caused by overly large gradients. Conversely, setting \(\alpha\) to a value greater than 1 increases the step size, which can speed up gradient descent in cases where the gradients are too small (though too large an \(\alpha\) reintroduces the overshooting problem).
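Putting the pieces together, here is a minimal sketch of the whole univariate training loop; the starting value of 0, the tolerance `tol`, and the iteration cap are arbitrary choices of mine rather than part of the algorithm:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, tol=1e-6, max_iters=10_000):
    """Fit y = theta * x by gradient descent on J(theta)."""
    theta = 0.0                                    # arbitrary starting value
    for _ in range(max_iters):
        grad = np.sum((theta * x - y) * x) / len(y)
        if abs(grad) < tol:                        # "repeat until convergence"
            break
        theta -= alpha * grad                      # step against the gradient
    return theta

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(gradient_descent(x, y))                      # approximately 1.99
```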

Bivariate Linear Regression

Now that we can fit one parameter, let’s add in another parameter (namely the \(y\)-intercept), so that we can model any line on the Cartesian plane. We first change the definition of the hypothesis function to, \[ h_{\theta_0, \theta_1}(x) = \theta_0 + \theta_1 x \] where \(\theta_1\) is the slope of the line, and \(\theta_0\) is the \(y\)-intercept. In machine learning, we often call the constant term \(\theta_0\) the bias.

The cost function is only slightly changed in that there are now two parameters, specifically, \[ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \left(h_{\theta_0, \theta_1}(x^{(i)}) - y^{(i)}\right)^2 \] Plotting the cost over \(\theta_0\) and \(\theta_1\), we will get something similar to this,

Figure 3: Cost of different \(\theta_0\) and \(\theta_1\) values

The biggest difference is in how gradient descent works. Since we now have two parameters to deal with, for each step of gradient descent we need to update both values simultaneously. This is because if we first updated one value, then when we calculate the gradient for the other value, it would be affected by the already-updated value (since both parameters are involved in the hypothesis function), so the other value could not step down along the most optimal route, resulting in slower or even erroneous gradient descent. The gradient descent process can be expressed as, \[ \begin{align} &\textrm{repeat until convergence \{} \\
&\qquad \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1),\ j\in\{0, 1\}\ \textrm{simultaneously} \\
&\textrm{\}} \end{align} \] And the partial derivatives are, \[ \frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) = \begin{cases} \displaystyle \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right),\ j=0 \\[0.3em] \displaystyle \frac{1}{m} \sum_{i=1}^m \left[\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}\right],\ j=1 \end{cases} \]
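Here is a sketch of one such step in code (the names are my own); note that both gradients are computed from the old parameter values before either parameter is changed:

```python
import numpy as np

def bivariate_step(theta0, theta1, x, y, alpha):
    """One simultaneous gradient descent step for h(x) = theta0 + theta1 * x."""
    m = len(y)
    error = (theta0 + theta1 * x) - y    # h(x^(i)) - y^(i) for every example
    grad0 = np.sum(error) / m            # partial derivative w.r.t. theta0
    grad1 = np.sum(error * x) / m        # partial derivative w.r.t. theta1
    # Both updates use the gradients computed above, i.e. the old theta values
    return theta0 - alpha * grad0, theta1 - alpha * grad1
```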

Multivariate Linear Regression

Let’s now take a step further and generalize the linear regression model to fit any number of parameters! First, assume there are \(n\) features; each example \(x^{(i)}\) is now a vector of length \(n\), or, \[ x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\[0.2em] x^{(i)}_2 \\ \vdots \\ x^{(i)}_n \end{bmatrix} \] Similarly, \(\theta\) is now also a vector, of length \(n + 1\), with an extra bias term \(\theta_0\), \[ \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix} \]

Now we can define the hypothesis function as, \[ \begin{align} h_\theta(x) &= \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n \\
&= \theta_0 + \sum_{j=1}^n \theta_j x_j \end{align} \] For convenience, let’s add a term \(x_0\) in front of every \(x\) and set all \(x^{(i)}_0 = 1\); now we can further simplify our hypothesis function using vector multiplication into, \[ \begin{align} h_\theta(x) &= \sum_{j=0}^n \theta_j x_j \\
&= \theta^\top x \end{align} \]

Next, the cost function looks identical to the cost function of univariate linear regression, except that \(h_\theta\) is now defined differently, \[ J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \]

Finally, the gradient descent process is similar, \[ \begin{align} &\textrm{repeat until convergence \{} \\
&\qquad \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j} J(\theta),\ j\in\{0, 1, \dots, n\}\ \textrm{simultaneously} \\
&\textrm{\}} \end{align} \] where \[ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}_j\right],\ j\in\{0, 1, \dots, n\} \]
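Since \(h_\theta(x) = \theta^\top x\), the whole procedure vectorizes nicely. Below is a minimal sketch, assuming the examples are stacked into a design matrix `X` of shape \((m, n + 1)\) whose first column holds the \(x_0 = 1\) terms (the names and default values are my own):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Multivariate gradient descent; X has shape (m, n + 1) with X[:, 0] == 1."""
    m, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(num_iters):
        error = X @ theta - y            # h_theta(x^(i)) - y^(i) for all i
        grad = X.T @ error / m           # all n + 1 partial derivatives at once
        theta -= alpha * grad            # simultaneous update of every theta_j
    return theta
```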

Polynomial Fitting

As a matter of fact, we can fit a curve to a dataset using linear regression! The trick is that we create polynomial terms and treat them as linear terms. Concretely, let’s look at the following dataset.

Figure 4: Yet another toy dataset

In this dataset, there is only one feature \(x_1\), and we are predicting the \(y\) value. We see that a simple line cannot fit the dataset well; we would rather need a quadratic curve to fit it. So, what we do is create a new feature whose value depends on an existing feature, namely, the new feature will be \(x_2 = x_1^2\). Now, each example has two features, \(x_1\) and \(x_2\). Then we can use these two features to perform gradient descent.

Now we are able to fit a polynomial to the dataset. Let’s say our gradient descent yielded the parameters \(\theta_0\), \(\theta_1\), and \(\theta_2\); our model becomes, \[ y = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 \]
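As an illustration with made-up data (the true curve below is my own choice, not the dataset from the figure), building the \(x_1^2\) column and running plain gradient descent on the resulting design matrix recovers the quadratic:

```python
import numpy as np

# Made-up data that roughly follows y = 1 + 2*x1 + 0.5*x1^2
rng = np.random.default_rng(0)
x1 = np.linspace(-3, 3, 50)
y = 1 + 2 * x1 + 0.5 * x1 ** 2 + rng.normal(0, 0.2, x1.size)

# Design matrix: bias column, x1, and the engineered feature x2 = x1^2
X = np.column_stack([np.ones_like(x1), x1, x1 ** 2])

# Plain multivariate gradient descent on the engineered features
theta = np.zeros(3)
alpha, m = 0.05, len(y)
for _ in range(20_000):
    theta -= alpha * (X.T @ (X @ theta - y)) / m

print(theta)  # roughly [1, 2, 0.5]
```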