Linear Regression: A Tale of a Transform

What is linear regression?

In machine learning, we learn (fit) a model to data using only a small sample of that data called the training set. The model is then used to make predictions about similar but unseen data. Linear regression is a model that represents data using a straight line. The name "linear regression" is historical; in plain language it simply means line fitting.

But why do we need to learn or fit a model, in our case a line? Because a model lets us approximate the data using only a few quantities (parameters) and draw conclusions that might not be evident from looking at the bulk of the data.

In the rest of the post I will show you why linear regression counts as machine learning, when it is OK to use linear regression and how it actually works.

Most introductions to linear regression use matrix algebra. In this post, however, I will use simple algebra and calculus to derive all the equations. Conveniently, Python has a very powerful symbolic math library called SymPy that lets us do these calculations just the way we would with pen and paper.

The data we will use here comes from CMU's Body Fat dataset, stored in a CSV file. To work with this data inside the notebook I use the Pandas library.

First I import the Pandas library and then use the read_csv() method to load our data. After that I print the first few rows from the data.

In [1]:
import pandas as pn
In [104]:
# Read data and print some rows
dataframe = pn.read_csv('bodyfat.csv', sep=r"\s+")
dataframe.head()
Out[104]:
Density Body fat (%) Age (years) Weight (lbs) Height (inches) Neck (cm) Chest (cm) Abdomen (cm) Hip (cm) Thigh (cm) Knee (cm) Ankle (cm) Biceps (cm) Forearm (cm) Wrist (cm)
0 1.0708 12.3 23 154.25 67.75 36.2 93.1 85.2 94.5 59.0 37.3 21.9 32.0 27.4 17.1
1 1.0853 6.1 22 173.25 72.25 38.5 93.6 83.0 98.7 58.7 37.3 23.4 30.5 28.9 18.2
2 1.0414 25.3 22 154.00 66.25 34.0 95.8 87.9 99.2 59.6 38.9 24.0 28.8 25.2 16.6
3 1.0751 10.4 26 184.75 72.25 37.4 101.8 86.4 101.2 60.1 37.3 22.8 32.4 29.4 18.2
4 1.0340 28.7 24 184.25 71.25 34.4 97.3 100.0 101.9 63.2 42.2 24.0 32.2 27.7 17.7

As we can see from the first five rows, there are a lot of columns. It would be interesting to see how the column values relate to each other. A good approach is a scatter plot matrix, which visualizes the dependency between every pair of columns. Data visualization is a crucial step when working with machine learning algorithms. The code below extracts a subset of the data, the columns from Weight through Abdomen, and visualizes the relationship between every pair of columns in this subset using the Plotly library.

In [3]:
import numpy as np
df = pn.DataFrame(dataframe, columns=list(dataframe.columns.values)[3:8])
In [4]:
import plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
py.offline.init_notebook_mode()

fig = ff.create_scatterplotmatrix(df, diag='histogram', height=800, width=800)
py.offline.iplot(fig)

The plots on the main diagonal are actually histograms rather than scatter plots, since it doesn't make sense to plot a column against itself. The histograms are interesting in their own right, but we want to focus on the relationships between pairs of columns, which all lie off the diagonal. Some of these relationships are almost linear. To me, the "Weight vs Abdomen" relationship appears to be a good choice, so let's focus on it and visualize it in a single zoomed-in plot.

This relationship is ideal for fitting a line because it is 2-dimensional and all the data points seem to lie along a line. If the data points were not distributed like this, trying to fit a line would not make much sense. The goal now is to find a line that is close enough to most of the points in the data set.
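A minimal sketch of the code behind this zoomed-in plot might look like the following; it assumes the column names from the table printed earlier and defines weight_abdomen_data, which the split code further down relies on (the trace and layout names are just for this sketch).

In [ ]:
# Keep only the two columns of interest: weight and abdomen circumference
weight_abdomen_data = dataframe[['Weight (lbs)', 'Abdomen (cm)']]

# Zoomed-in scatter plot of weight against abdomen circumference
trace_all = go.Scatter(x=weight_abdomen_data['Abdomen (cm)'],
                       y=weight_abdomen_data['Weight (lbs)'],
                       mode='markers', name='all data')
zoom_layout = go.Layout(title='Weight vs Abdomen',
                        width=500, height=500,
                        xaxis=dict(title='Abdomen (cm)'),
                        yaxis=dict(title='Weight (lbs)'))
py.offline.iplot(go.Figure(data=[trace_all], layout=zoom_layout))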

Before finding the best line we need to 'hide' some of our data points. Once the best line is found, we will use these hidden points to measure the accuracy of our predictions.

I'm going to split the data set into two parts: the first part gets 60% of the data and the second the remaining 40%. In the ML domain these parts have special names: the training set and the test set. The training data is used to find the line, while the test data is later used to check whether this line provides a good approximation for new, unseen values.

One important thing to notice while splitting the data is that the training and test sets should have roughly the same distribution. The usual way to ensure this is to split the data at random; here I simply take the first 60% of the rows and treat the stored order as effectively random.

In [6]:
# Split data into train and test sets (first 60% train, remaining 40% test)
nTrain = int(0.6 * weight_abdomen_data.shape[0])

train = weight_abdomen_data.iloc[0:nTrain]
test = weight_abdomen_data.iloc[nTrain:]

# Column 0 holds weight, column 1 holds abdomen circumference
abdomen_train = train.iloc[:, 1]
weight_train = train.iloc[:, 0]

abdomen_test = test.iloc[:, 1]
weight_test = test.iloc[:, 0]

The plot below shows what the data in the two sets looks like.
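The plotting cell itself is not shown here, so here is a sketch of how it might look; it defines trace_scatter_train, which the cells below reuse (trace_scatter_test and split_layout are names introduced only for this sketch).

In [ ]:
# Scatter traces for the training and test sets
trace_scatter_train = go.Scatter(x=abdomen_train, y=weight_train,
                                 mode='markers', name='training data')
trace_scatter_test = go.Scatter(x=abdomen_test, y=weight_test,
                                mode='markers', name='test data')

split_layout = go.Layout(title='Train and Test Sets',
                         width=500, height=500,
                         xaxis=dict(title='Abdomen (cm)'),
                         yaxis=dict(title='Weight (lbs)'))
py.offline.iplot(go.Figure(data=[trace_scatter_train, trace_scatter_test],
                           layout=split_layout))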

Now the question is: how do we find a line that approximates the (training) data well? In the figure below, the orange line ('line 1' in the code) passing through the middle of the data appears to be a good approximation. However, the green line ('line 2'), which also passes through the center of the data, does not appear to be such a good approximation.

In [21]:
x_min = abdomen_train.min()
x_max = abdomen_train.max()
y_min = 120
y_max = 320
# Two candidate lines; despite the variable names, with Plotly's default
# colorway 'line 1' renders orange and 'line 2' renders green in the figure.
green_line = go.Scatter(x=[x_min, x_max], y=[y_min, y_max], mode='lines', name='line 1')
red_line = go.Scatter(x=[x_min, x_max], y=[80, 400], mode='lines', name='line 2')

scatter_layout = go.Layout( 
    title='Candidate Lines',
    autosize=True,
    width=500,
    height=500,
    xaxis=dict(title='Abdomen (cm)'),
    yaxis=dict(title='Weight (lbs)'))

fig = go.Figure(data=[trace_scatter_train,green_line, red_line], layout= scatter_layout)
py.offline.iplot(fig)

We need an objective measure of how good a line is. Since the whole point of fitting a line is to predict weight values from abdomen sizes, one way to measure the goodness of fit of a line is to take the difference between the actual and predicted weights for every abdomen size in the training set and add these differences up. This total is the sum of the lengths of the vertical red lines in the plot below.

In [23]:
green_slope = (y_max-y_min)/(x_max-x_min)
green_intercept = -x_min*green_slope+y_min

scatter_layout = go.Layout( 
    title='Prediction Errors',
    autosize=True,
    width=1000,
    height=500,
    xaxis=dict(title='Abdomen (cm)'),
    yaxis=dict(title='Weight (lbs)'),
    showlegend=False)

traces = [trace_scatter_train,green_line]

for x,y in zip(abdomen_train,weight_train):
    y_pred = green_slope * x + green_intercept
    traces.append(go.Scatter(x=[x, x], y=[y, y_pred], mode='lines', line=dict(color='rgb(255, 0, 0)')))

fig = go.Figure(data=traces, layout= scatter_layout)
py.offline.iplot(fig)

If for a given abdomen measurement $x_i$ the actual weight is $y_i$ and the weight predicted by our line is $\hat y_i$, the error of prediction is $\lvert y_i - \hat y_i \rvert$. Taking the absolute value ensures that errors with opposing signs add up instead of cancelling out. The total error is $\sum_{i=1}^n \lvert y_i - \hat y_i \rvert$ if there are $n$ (abdomen, weight) pairs in the data, and the average prediction error is $\frac{1}{n} \sum_{i=1}^n \lvert y_i - \hat y_i \rvert$. This mean absolute error is computed below using SymPy, but first an introduction to SymPy is in order.
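Before diving into SymPy, here is a quick plain-Python sanity check of this formula, a rough sketch that reuses the slope and intercept of 'line 1' computed above:

In [ ]:
# Mean absolute error of 'line 1' on the training data
abs_errors = [abs(wt - (green_slope * abd + green_intercept))
              for abd, wt in zip(abdomen_train, weight_train)]
print(sum(abs_errors) / len(abs_errors))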

Introduction to SymPy

As promised earlier, I will use algebra and calculus to find the desired line, and to do so I'm going to use a symbolic mathematics library for Python called SymPy. In SymPy we can define variables, combine them into expressions using operators and functions, and perform mathematical operations on these expressions to obtain new expressions, just as we do with pen and paper.

sympy.symbols() takes a string of variable names separated by spaces or commas and creates Symbols out of them. For example, to define the quadratic $y = a x^2 + b x + c$, we use the code below.

In [52]:
import sympy as sp

# Define the RHS variables as SymPy symbols.
a,b,c,x = sp.symbols('a b c x')
y = a*x*x + b*x + c

If I print the SymPy variable $y$, the expression gets pretty-printed.

In [55]:
sp.pretty_print(y)
   2          
a⋅x  + b⋅x + c

Suppose I know the value of $x$ is 80; I can then substitute it into the expression for $y$ as shown below.

In [56]:
z = y.subs([(x,80)])
print(z)
6400*a + 80*b + c

In our case we have the equation of a line, $y = m x + c$, and a sequence of data points of the form $(x_i, y_i)$, where the subscript $i$ stands for the $i$th data point. To find the absolute error for the $i$th data point, we substitute $x_i$ into the equation of our line. This gives us the predicted value $\hat y_i$ for that data point. We then subtract the predicted value from the actual value $y_i$ and take the absolute value of the difference to get the error of prediction.

In [74]:
x_i,y_i,m,c = sp.symbols('x_i,y_i,m,c')
yhat_i = m*x_i+c
abs_error_i = sp.Abs(y_i-yhat_i)
sp.pretty_print(abs_error_i)
│c + m⋅xᵢ - yᵢ│

Then, to find the mean absolute error (MAE), we sum the absolute error over all $(x_i, y_i)$ and divide by the number of data points.

In [112]:
MAE = 0
count = len(abdomen_train)
for abd,wt in zip(abdomen_train,weight_train):
    MAE += abs_error_i.subs([(x_i,abd),(y_i,wt)])
In [113]:
MAE =  sp.Mul(MAE,1/count,evaluate=False)
sp.pretty_print(MAE,num_columns=100,wrap_line=False)
0.00662251655629139⋅(│c + 70.4⋅m - 127.5│ + │c + 73.7⋅m - 133.5│ + │c + 73.9⋅m - 133.25│ + │c + 74.6⋅m - 131.5│ + │c + 76.0⋅m - 125.25│ + │c + 76.0⋅m - 125.0│ + │c + 76.3⋅m - 151.25│ + │c + 76.4⋅m - 140.25│ + │c + 76.5⋅m - 143.75│ + │c + 77.1⋅m - 159.75│ + │c + 77.6⋅m - 152.25│ + │c + 77.6⋅m - 136.25│ + │c + 77.9⋅m - 139.25│ + │c + 78.0⋅m - 152.25│ + │c + 78.4⋅m - 155.25│ + │c + 78.8⋅m - 162.5│ + │c + 79.1⋅m - 168.0│ + │c + 79.5⋅m - 148.5│ + │c + 79.6⋅m - 152.75│ + │c + 79.7⋅m - 159.25│ + │c + 80.0⋅m - 148.75│ + │c + 81.5⋅m - 164.25│ + │c + 81.8⋅m - 148.25│ + │c + 81.9⋅m - 156.0│ + │c + 82.0⋅m - 137.25│ + │c + 82.5⋅m - 191.0│ + │c + 82.8⋅m - 146.75│ + │c + 82.9⋅m - 160.75│ + │c + 83.0⋅m - 173.25│ + │c + 83.1⋅m - 165.25│ + │c + 83.3⋅m - 160.25│ + │c + 83.3⋅m - 143.0│ + │c + 83.4⋅m - 135.75│ + │c + 83.5⋅m - 160.75│ + │c + 83.6⋅m - 186.25│ + │c + 83.9⋅m - 151.5│ + │c + 84.1⋅m - 161.0│ + │c + 84.5⋅m - 160.25│ + │c + 84.6⋅m - 156.75│ + │c + 85.2⋅m - 154.25│ + │c + 85.3⋅m - 188.15│ + │c + 86.0⋅m - 176.75│ + │c + 86.1⋅m - 166.75│ + │c + 86.1⋅m - 151.5│ + │c + 86.4⋅m - 184.75│ + │c + 86.5⋅m - 176.0│ + │c + 86.6⋅m - 158.0│ + │c + 86.6⋅m - 154.75│ + │c + 86.7⋅m - 162.5│ + │c + 86.7⋅m - 158.25│ + │c + 87.0⋅m - 165.25│ + │c + 87.2⋅m - 168.25│ + │c + 87.3⋅m - 165.5│ + │c + 87.5⋅m - 187.5│ + │c + 87.9⋅m - 154.0│ + │c + 88.1⋅m - 157.75│ + │c + 88.2⋅m - 156.5│ + │c + 88.5⋅m - 176.0│ + │c + 88.5⋅m - 168.25│ + │c + 88.6⋅m - 198.25│ + │c + 88.7⋅m - 194.0│ + │c + 88.7⋅m - 182.0│ + │c + 88.7⋅m - 148.0│ + │c + 88.8⋅m - 184.25│ + │c + 89.0⋅m - 150.25│ + │c + 89.2⋅m - 172.75│ + │c + 89.4⋅m - 157.0│ + │c + 89.6⋅m - 183.75│ + │c + 89.6⋅m - 163.0│ + │c + 89.7⋅m - 167.0│ + │c + 89.9⋅m - 167.0│ + │c + 90.0⋅m - 179.0│ + │c + 90.0⋅m - 177.25│ + │c + 90.1⋅m - 171.75│ + │c + 90.3⋅m - 171.25│ + │c + 90.6⋅m - 170.75│ + │c + 90.7⋅m - 181.0│ + │c + 90.9⋅m - 216.0│ + │c + 90.9⋅m - 160.0│ + │c + 91.0⋅m - 167.0│ + │c + 91.5⋅m - 167.5│ + │c + 91.6⋅m - 188.75│ + │c + 91.6⋅m - 180.5│ + │c + 92.0⋅m - 173.75│ + │c + 92.1⋅m - 177.5│ + │c + 92.3⋅m - 168.5│ + │c + 92.4⋅m - 191.0│ + │c + 92.4⋅m - 175.25│ + │c + 92.8⋅m - 162.75│ + │c + 93.0⋅m - 173.25│ + │c + 93.1⋅m - 176.75│ + │c + 93.2⋅m - 179.75│ + │c + 93.5⋅m - 192.25│ + │c + 94.0⋅m - 197.0│ + │c + 94.4⋅m - 210.25│ + │c + 94.7⋅m - 178.75│ + │c + 94.9⋅m - 170.75│ + │c + 95.0⋅m - 198.5│ + │c + 95.0⋅m - 178.25│ + │c + 95.0⋅m - 177.75│ + │c + 95.4⋅m - 161.25│ + │c + 95.5⋅m - 196.75│ + │c + 95.6⋅m - 177.0│ + │c + 95.8⋅m - 163.75│ + │c + 95.9⋅m - 179.0│ + │c + 96.4⋅m - 195.75│ + │c + 96.4⋅m - 187.75│ + │c + 97.5⋅m - 209.25│ + │c + 97.5⋅m - 192.5│ + │c + 97.8⋅m - 190.25│ + │c + 98.1⋅m - 185.25│ + │c + 98.3⋅m - 179.75│ + │c + 98.6⋅m - 187.75│ + │c + 98.6⋅m - 177.0│ + │c + 98.6⋅m - 171.25│ + │c + 98.8⋅m - 200.5│ + │c + 98.8⋅m - 196.75│ + │c + 99.1⋅m - 208.5│ + │c + 99.2⋅m - 224.5│ + │c + 99.2⋅m - 206.5│ + │c + 99.7⋅m - 178.0│ + │c + 99.8⋅m - 197.0│ + │c + 99.8⋅m - 181.5│ + │c + 99.8⋅m - 168.0│ + │c + 100.0⋅m - 198.0│ + │c + 100.0⋅m - 184.25│ + │c + 100.3⋅m - 183.5│ + │c + 100.5⋅m - 218.5│ + │c + 100.5⋅m - 211.75│ + │c + 100.5⋅m - 206.5│ + │c + 100.9⋅m - 202.25│ + │c + 101.1⋅m - 186.0│ + │c + 101.6⋅m - 203.25│ + │c + 101.8⋅m - 205.25│ + │c + 102.4⋅m - 193.25│ + │c + 102.8⋅m - 200.25│ + │c + 104.2⋅m - 201.25│ + │c + 104.3⋅m - 212.0│ + │c + 104.3⋅m - 205.0│ + │c + 104.8⋅m - 216.0│ + │c + 105.0⋅m - 183.25│ + │c + 105.3⋅m - 202.5│ + │c + 105.5⋅m - 205.5│ + │c + 106.6⋅m - 212.75│ + │c + 106.8⋅m - 223.0│ + │c + 108.1⋅m - 203.0│ + │c + 111.2⋅m - 217.0│ + │c + 113.1⋅m - 191.75│ + │c + 115.6⋅m - 247.25│ 
+ │c + 126.2⋅m - 262.75│ + │c + 148.1⋅m - 363.15│)

The mean absolute error is a long, unwieldy expression. It is a function of $m$ and $c$, but because of the absolute values there is no way to simplify it, and it is not possible to minimize it analytically. This brings us to the mean squared error (MSE). The MSE differs from the MAE in that for every $(x_i, y_i)$ the error is calculated as the squared difference $(y_i - \hat y_i)^2$.

In [122]:
sqrd_error_i = (y_i-yhat_i)**2
sp.pretty_print(sqrd_error_i)
                2
(-c - m⋅xᵢ + yᵢ) 

The MSE for our data is calculated below. Just like the MAE above, the MSE is a sum of 151 terms, but because the individual terms can be expanded and the resulting expression simplified, we are left with a second-order polynomial in $m$ and $c$ that has just six terms.
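To see why the expression collapses, expand a single squared error term:

$$(y_i - m x_i - c)^2 = x_i^2\, m^2 + 2 x_i\, m c + c^2 - 2 x_i y_i\, m - 2 y_i\, c + y_i^2$$

Summing this over all data points only accumulates the numeric coefficients, so the total is always a quadratic in $m$ and $c$ with the same six kinds of terms: $m^2$, $mc$, $c^2$, $m$, $c$ and a constant.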

In [123]:
MSE = 0
count = len(abdomen_train)
for abd,wt in zip(abdomen_train,weight_train):
    MSE += sqrd_error_i.subs([(x_i,abd),(y_i,wt)])
MSE = sp.simplify(MSE)
MSE /= count
sp.pretty_print(MSE,num_columns=100,wrap_line=False)
 2                                                                 2                                        
c  + 183.202649006623⋅c⋅m - 354.560264900662⋅c + 8499.24754966888⋅m  - 32996.1349006623⋅m + 32233.7605629139

Although it is not evident from the simplified expression, the MSE can never be negative because in its original form it is a sum of squared terms. Furthermore, the MSE grows without bound for large positive or negative values of $m$ or $c$; its graph is an elliptic paraboloid. Because of these properties the MSE has a single minimum, which can be found by differentiating the MSE with respect to $m$ and $c$ and setting both derivatives to zero. The differentiation can be done in SymPy and yields two linear expressions.

In [126]:
# Find differentiation w.r.t m and c
grad_m = sp.diff(MSE, m)
grad_c = sp.diff(MSE, c)
sp.pretty_print(grad_m,num_columns=100,wrap_line=False)
sp.pretty_print(grad_c,num_columns=100,wrap_line=False)
183.202649006623⋅c + 16998.4950993378⋅m - 32996.1349006623
2⋅c + 183.202649006623⋅m - 354.560264900662

Setting these two expressions to zero and solving gives the exact values of $m$ and $c$ that minimize the MSE.

In [127]:
# Solve the two equations
solution = sp.solvers.solve([grad_m, grad_c])
sp.pretty_print(solution, use_unicode=True,num_columns=100)
{c: -41.4689755547575, m: 2.38805616830552}
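As a quick sanity check (a sketch), we can confirm that this critical point really is the minimum by verifying that the Hessian of the MSE is positive definite:

In [ ]:
# Hessian (matrix of second derivatives) of the MSE with respect to (m, c).
# Positive definite means the surface curves upward in every direction,
# so the critical point found above is the unique minimum.
H = sp.hessian(MSE, (m, c))
sp.pretty_print(H)
print(H.is_positive_definite)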

Geometrically, the MSE is (up to the factor $1/n$) the sum of squared lengths of the red error lines, and the values of $m$ and $c$ we have found minimize this sum of squared lengths.

Let us plot the best line. In the figure below, the left plot shows the training data while the right plot shows the cost function, which is another name for the error function. The best line can be seen in the left plot, and the cost corresponding to this line is marked by the tiny spherical marker in the surface plot on the right. You can play with the sliders and verify that the cost is indeed minimized at our solution $(m, c)$.

In [163]:
# reset_output and output_notebook come from bokeh.io; show_plots() and
# show_sliders() are custom helpers defined elsewhere in the notebook (not shown).
from bokeh.io import output_notebook, reset_output

reset_output()
output_notebook()
show_plots()
show_sliders()
Loading BokehJS ...
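Earlier we set aside 40% of the data as a test set. As a rough sketch, using the fitted values of $m$ and $c$ from the solution above, we can check how well the line predicts the held-out weights:

In [ ]:
# Plug the fitted slope and intercept into the line and compute the
# mean absolute prediction error on the held-out test set.
m_best = float(solution[m])
c_best = float(solution[c])
predicted_weight = m_best * abdomen_test + c_best
print((weight_test - predicted_weight).abs().mean())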

Understanding linear regression requires us to wrap our heads around this transformation from the data space $(x, y)$ to the parameter space $(m, c)$. The takeaway is that there are infinitely many lines that can be drawn in the data space. Each line has a unique slope $m$, a unique intercept $c$, and an associated cost or error. Each line is thus a point in the parameter space, where the parameters $m$ and $c$ can be thought of as the x and y coordinates and the cost as the z coordinate.

A brute-force method for finding the best line would be to try all values of $m$ and $c$ and pick the pair that minimizes the cost. However, $m$ and $c$ are real-valued and unbounded, so this brute-force method would never terminate.
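If we do bound and discretize the search, a coarse grid at least gives a concrete feel for the parameter space: every $(m, c)$ cell of the grid is one candidate line with one cost. The sketch below (with arbitrarily chosen grid ranges) is only an illustration of the transform, not a practical fitting method.

In [ ]:
import numpy as np

# Evaluate the MSE at every point of a coarse (m, c) grid
x_vals = abdomen_train.values
y_vals = weight_train.values
m_grid = np.linspace(1.0, 4.0, 50)       # arbitrary slope range
c_grid = np.linspace(-150.0, 50.0, 50)   # arbitrary intercept range
costs = np.array([[np.mean((y_vals - (mm * x_vals + cc)) ** 2) for mm in m_grid]
                  for cc in c_grid])

# The lowest point of this discrete cost surface should land near the analytic solution
i, j = np.unravel_index(costs.argmin(), costs.shape)
print(m_grid[j], c_grid[i])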

Fortunately, we were able to determine a formula for the cost in terms of $m$ and $c$ and find the best pair $(m, c)$ analytically. If we use the mean squared error as our cost, the cost is a closed-form function of $m$ and $c$, and the properties of this particular cost function let us find the cost-minimizing values of $m$ and $c$ exactly. We also saw that the mean absolute error (MAE) cannot be minimized in this fashion.