Linear Regression and Gradient Descent

Posted by George Yu on April 26, 2018

Introduction

Linear regression is the most fundamental of all regression models: it is simple yet powerful. In a nutshell, the goal of a linear regression model is to use a line to “fit” a dataset. Let’s explore how a linear regression model works.


Table of Contents

  1. Univariate Linear Regression
    1. MSE
    2. Gradient Descent
  2. Bivariate Linear Regression
  3. Multivariate Linear Regression
  4. Polynomial Fitting

Univariate Linear Regression

We know that any line in the Cartesian plane can be expressed by the equation, \[ y = mx + b \] where \(m\) is the slope of the line and \(b\) is the \(y\)-intercept of the line. Since we are dealing with univariate (concerning one variable) linear regression, which, as the name suggests, can only fit a single parameter, let us for now assume that \(b = 0\) for every line; in other words, we will only tweak the slope of the line to fit the dataset. Also, let’s change the symbol for the slope of the line to \(\theta\) instead of \(m\) (just for conventional purposes). So the equation defining a line now becomes, \[ y = \theta x \]

First, we need to come up with a method to quantitatively determine how well a model (i.e. a line) has fitted the dataset, or alternatively, a measure of how badly it has fitted the dataset.

Consider the following toy dataset and two lines attempting to fit it.

Figure 1: A toy dataset

Although the red line did not fit the dataset perfectly, it was certainly more accurate than the black line. So we are now going to use a measure called MSE (Mean Squared Error) to quantify how badly a model (in this case, a line) fits the dataset.

MSE

For a data point (also known as an example) with coordinates \((x, y)\), our model will predict that it has the coordinates \((x, \theta x)\). The value \(\theta x\) is the prediction (output) of our model given an input \(x\); let’s define the hypothesis function for a specific \(\theta\) as, \[ h_\theta(x) = \theta x \] Thus, the error of the prediction can be quantified by the squared distance between the prediction and the real value (a.k.a. the ground truth). Specifically, the error of a data point is defined as, \[ (h_\theta(x) - y)^2 \]

Now, the error of the model with respect to the whole dataset can be defined as, \[ \frac{1}{2m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \] where \(m\) denotes the number of examples in the dataset (the size of the dataset), and \(x^{(i)}\) denotes the \(i\)-th example in the dataset. In machine learning, we call the error of a model the cost function, and it is, by convention, denoted as \(J\). In our case, \(J\) takes in a parameter \(\theta\) and outputs the cost, or the “unfitness”, of this \(\theta\) on the dataset. So, in our case of univariate linear regression, the cost function is defined as, \[ J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \]
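To make these definitions concrete, here is a minimal NumPy sketch of the hypothesis and the cost function for the univariate model (the function names and the toy data are my own, not part of any library):

```python
import numpy as np

def h(theta, x):
    """Univariate hypothesis: h_theta(x) = theta * x."""
    return theta * x

def cost(theta, x, y):
    """MSE cost J(theta) = (1 / (2m)) * sum((h_theta(x) - y)^2)."""
    m = len(y)
    return np.sum((h(theta, x) - y) ** 2) / (2 * m)

# A toy dataset lying roughly on the line y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

print(cost(2.0, x, y))  # 0.0125 -> theta = 2 fits well
print(cost(0.5, x, y))  # 8.3375 -> theta = 0.5 fits badly
```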

Gradient Descent

Now that we can quantify how well a model performs, our goal can be represented as, \[ \min_\theta J(\theta) \] We can imagine that there is one optimal value of \(\theta\) that has the smallest \(J(\theta)\), and either decreasing or increasing this \(\theta\) will result in an increase in \(J(\theta)\). So if we were to plot \(J(\theta)\) against \(\theta\), it would look something like,

Figure 2: Cost of different \(\theta\) values

Next, we are going to use gradient descent to get to the optimal solution. We observe that the derivative of \(J\) at the optimal solution is 0, all \(\theta\) values larger than the optimal one have a positive derivative, and all values smaller have a negative derivative. Thus, for any random \(\theta\) value we choose, going in the negative direction of the derivative of \(J\) at that value will result in a model closer to the optimal solution.

Concretely, for every step of gradient descent, we tweak the \(\theta\) value by subtracting from it \(\frac{\mathrm{d}}{\mathrm{d}\theta}J(\theta)\), the derivative of \(J\) at the current \(\theta\). We should repeat this process until the gradient converges, that is, until \(\frac{\mathrm{d}}{\mathrm{d}\theta}J(\theta)\) is very close to 0. We can write this as,

\[ \begin{align} &\textrm{repeat until convergence \{} \\
&\qquad \theta := \theta - \frac{\mathrm{d}}{\mathrm{d}\theta}J(\theta) \\
&\textrm{\}} \end{align} \]

And the derivative of \(J\) with respect to \(\theta\) is, \[ \frac{\mathrm{d}}{\mathrm{d}\theta}J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}\right] \]
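In code, this derivative is a one-liner; here is a sketch with hypothetical names of my own:

```python
import numpy as np

def d_cost(theta, x, y):
    """Derivative of J(theta): (1 / m) * sum((theta * x - y) * x)."""
    m = len(y)
    return np.sum((theta * x - y) * x) / m
```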

A problem quickly arises when the gradient of \(J\) at our current value is much larger than the distance between our \(\theta\) value and the optimal value. In such cases, each step of gradient descent overshoots and takes us further away from the optimal solution, and the error of our model escalates rapidly; this is called gradient explosion. To address this issue, we introduce a new variable \(\alpha\), which is the learning rate (or step size) of gradient descent. Now, our gradient descent process becomes,

\[ \begin{align} &\textrm{repeat until convergence \{} \\
&\qquad \theta := \theta - \alpha\frac{\mathrm{d}}{\mathrm{d}\theta}J(\theta) \\
&\textrm{\}} \end{align} \]

Notice that by setting \(\alpha\) to some value between 0 and 1, we solve the problem caused by overly large gradients. Conversely, setting \(\alpha\) to a value greater than 1 increases the step size, which can speed up gradient descent in cases where the gradients are too small (though too large an \(\alpha\) reintroduces the overshooting problem).
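Putting the pieces together, here is a minimal sketch of the whole univariate training loop; the starting value of 0, the tolerance `tol`, and the iteration cap are arbitrary choices of mine rather than part of the algorithm:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, tol=1e-6, max_iters=10_000):
    """Fit y = theta * x by gradient descent on J(theta)."""
    theta = 0.0                                    # arbitrary starting value
    for _ in range(max_iters):
        grad = np.sum((theta * x - y) * x) / len(y)
        if abs(grad) < tol:                        # "repeat until convergence"
            break
        theta -= alpha * grad                      # step against the gradient
    return theta

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(gradient_descent(x, y))                      # approximately 1.99
```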

Bivariate Linear Regression

Now that we can fit one parameter, let’s add in another parameter (namely the \(y\)-intercept), so that we can model any line on the Cartesian plane. We first change the definition of the hypothesis function to, \[ h_{\theta_0, \theta_1}(x) = \theta_0 + \theta_1 x \] where \(\theta_1\) is the slope of the line, and \(\theta_0\) is the \(y\)-intercept. In machine learning, we often call the constant term \(\theta_0\) the bias.

The cost function is only slightly changed in that there are now two parameters, specifically, \[ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m \left(h_{\theta_0, \theta_1}(x^{(i)}) - y^{(i)}\right)^2 \] Plotting the cost over \(\theta_0\) and \(\theta_1\), we will get something similar to this,

Figure 3: Cost of different \(\theta_0\) and \(\theta_1\) values

The biggest difference is in how gradient descent works. Since we now have two parameters to deal with, for each step of gradient descent we need to update both values simultaneously. This is because if we first updated one value, then when we calculate the gradient for the other value, it would be affected by the already-updated value (since both parameters are involved in the hypothesis function), so the other value could not step down along the most optimal route, resulting in slower or even erroneous gradient descent. The gradient descent process can be expressed as, \[ \begin{align} &\textrm{repeat until convergence \{} \\
&\qquad \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j} J(\theta_0, \theta_1),\ j\in\{0, 1\}\ \textrm{simultaneously} \\
&\textrm{\}} \end{align} \] And the partial derivatives are, \[ \frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) = \begin{cases} \displaystyle \frac{1}{m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right),\ j=0 \\[0.3em] \displaystyle \frac{1}{m} \sum_{i=1}^m \left[\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}\right],\ j=1 \end{cases} \]
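Here is a sketch of one such step in code (the names are my own); note that both gradients are computed from the old parameter values before either parameter is changed:

```python
import numpy as np

def bivariate_step(theta0, theta1, x, y, alpha):
    """One simultaneous gradient descent step for h(x) = theta0 + theta1 * x."""
    m = len(y)
    error = (theta0 + theta1 * x) - y    # h(x^(i)) - y^(i) for every example
    grad0 = np.sum(error) / m            # partial derivative w.r.t. theta0
    grad1 = np.sum(error * x) / m        # partial derivative w.r.t. theta1
    # Both updates use the gradients computed above, i.e. the old theta values
    return theta0 - alpha * grad0, theta1 - alpha * grad1
```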

Multivariate Linear Regression

Let’s now take a step further and generalize the linear regression model to fit any number of parameters! First, assume there are \(n\) features; each example \(x^{(i)}\) is now a vector of length \(n\), or, \[ x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\[0.2em] x^{(i)}_2 \\ \vdots \\ x^{(i)}_n \end{bmatrix} \] Similarly, \(\theta\) is now also a vector, of length \(n + 1\), with an extra bias term \(\theta_0\), \[ \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix} \]

Now we can define the hypothesis function as, \[ \begin{align} h_\theta(x) &= \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n \\
&= \theta_0 + \sum_{j=1}^n \theta_j x_j \end{align} \] For convenience, let’s add a term \(x_0\) in front of every \(x\) and set all \(x^{(i)}_0 = 1\); now we can further simplify our hypothesis function using vector multiplication into, \[ \begin{align} h_\theta(x) &= \sum_{j=0}^n \theta_j x_j \\
&= \theta^\top x \end{align} \]

Next, the cost function looks identical to the cost function of univariate linear regression, except that \(h_\theta\) is now defined differently, \[ J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \]

Finally, the gradient descent process is similar, \[ \begin{align} &\textrm{repeat until convergence \{} \\
&\qquad \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j} J(\theta),\ j\in\{0, 1, \dots, n\}\ \textrm{simultaneously} \\
&\textrm{\}} \end{align} \] where \[ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}_j\right],\ j\in\{0, 1, \dots, n\} \]
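Since \(h_\theta(x) = \theta^\top x\), the whole procedure vectorizes nicely. Below is a minimal sketch, assuming the examples are stacked into a design matrix `X` of shape \((m, n + 1)\) whose first column holds the \(x_0 = 1\) terms (the names and default values are my own):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Multivariate gradient descent; X has shape (m, n + 1) with X[:, 0] == 1."""
    m, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(num_iters):
        error = X @ theta - y            # h_theta(x^(i)) - y^(i) for all i
        grad = X.T @ error / m           # all n + 1 partial derivatives at once
        theta -= alpha * grad            # simultaneous update of every theta_j
    return theta
```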

Polynomial Fitting

As a matter of fact, we can fit a curve to a dataset using linear regression! The trick is that we create polynomial terms and treat them as linear terms. Concretely, let’s look at the following dataset.

Figure 4: Yet another toy dataset

In this dataset, there is only one feature \(x_1\), and we are predicting the \(y\) value. We see that a simple line cannot fit the dataset well; we would rather need a quadratic curve to fit it. So, what we do is create a new feature whose value depends on an existing feature, namely, the new feature will be \(x_2 = x_1^2\). Now, each example has two features, \(x_1\) and \(x_2\). Then we can use these two features to perform gradient descent.

Now we are able to fit a polynomial to the dataset. Let’s say our gradient descent yielded the parameters \(\theta_0\), \(\theta_1\), and \(\theta_2\); our model becomes, \[ y = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 \]
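As an illustration with made-up data (the true curve below is my own choice, not the dataset from the figure), building the \(x_1^2\) column and running plain gradient descent on the resulting design matrix recovers the quadratic:

```python
import numpy as np

# Made-up data that roughly follows y = 1 + 2*x1 + 0.5*x1^2
rng = np.random.default_rng(0)
x1 = np.linspace(-3, 3, 50)
y = 1 + 2 * x1 + 0.5 * x1 ** 2 + rng.normal(0, 0.2, x1.size)

# Design matrix: bias column, x1, and the engineered feature x2 = x1^2
X = np.column_stack([np.ones_like(x1), x1, x1 ** 2])

# Plain multivariate gradient descent on the engineered features
theta = np.zeros(3)
alpha, m = 0.05, len(y)
for _ in range(20_000):
    theta -= alpha * (X.T @ (X @ theta - y)) / m

print(theta)  # roughly [1, 2, 0.5]
```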