Linear regression
Linear regression is a method that is known by many names, but to put it in its simplest form: it is fitting a plane to some data points. In 2 dimensions this is a straight line; in 3 dimensions it is a flat plane, like a sheet of paper. If you have 2 points and draw a straight line between them, congratulations, you have just done a linear regression in your head. Of course, not all data is so simple. Generally there isn't an easily identifiable straight line that passes through all your data, and unless you have only 1 input feature, you'll likely be fitting a multi-dimensional plane to your data rather than a line (which is just a line in more dimensions anyway).
There are many ways to formulate and solve a linear regression problem. You may have already come across the trend lines you can fit in Excel or other packages, which are an example of linear regression. I would recommend that you first check out the inverse theory section, which covers the mathematical background of linear regression. Even though the terminology is framed from an inverse theory perspective, the foundations provided are just as relevant in data science. The most relevant section is available here.
In this section, we'll cover what linear regression means from a data science point of view. A point of note: although we have only mentioned linear regression so far, polynomial regression, or the fitting of any other type of curve such as an exponential or logarithm, is typically dealt with in data science via feature engineering. So, if you want to fit a second order polynomial (a quadratic function) to some data, then instead of passing in just the input features, you pass in the input features and the square of the input features as well. This is equivalent to how we form the data kernel in the inverse theory approach.
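As a quick illustration of that idea, here is a minimal sketch using scikit-learn's LinearRegression; the data and coefficient values are invented purely for illustration. A quadratic curve is fitted by passing in both the input feature and its square.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy quadratic data (values invented purely for illustration).
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 50)
y = 2.0 * x**2 - 1.5 * x + 0.5 + rng.normal(0, 0.5, size=x.shape)

# Feature engineering: pass both x and x**2 to the linear model,
# so the "linear" fit can describe a quadratic curve.
X = np.column_stack([x, x**2])

model = LinearRegression()  # fits an intercept by default
model.fit(X, y)
print(model.coef_, model.intercept_)  # roughly [-1.5, 2.0] and 0.5
```

The model itself is still linear in its parameters; the non-linearity lives entirely in the engineered features.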
There is an example of linear regression in the introductory section of basic data science solutions in python here that uses the tensorflow package to fit a single layer neural network with 10 neurons to the toy diabetes dataset from sci-kit learn. If you want an overview of what a neuron is in a neural network then see the neural networks building block page here.
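The linked example isn't reproduced here, but a model of that shape looks roughly like the following sketch (the layer configuration and training settings are assumptions, not the linked code):

```python
import tensorflow as tf
from sklearn.datasets import load_diabetes

# The toy diabetes dataset: 10 input features, one continuous target.
X, y = load_diabetes(return_X_y=True)

# A single hidden layer of 10 neurons feeding one output neuron.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=100, verbose=0)
```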
In this topic we'll go through:
- The basic steps of implementing linear regression to solve a data science problem, such as that given in the introductory section of basic data science solutions in python (here).
- An example of using basic feature engineering to solve a problem that looks anything but linear.
- Metrics that can be used to assess how well our linear regression models fit our data.
Linear regression - the basics
As has already been mentioned, the mathematics behind linear regression is covered in the inverse theory section here, although in practice more computationally efficient algorithms than inverting a large matrix are typically employed. We'll briefly recap the topic here.
When thinking of linear regression, it's easiest to start with thinking about fitting a straight line. The equation of a straight line is, in 2 dimensions:
$$ y = mx +c $$
Here, x is our input feature (we only have one in this example) and y is our target variable. m and c are the gradient and intercept of the straight line respectively, and represent the model parameters to be fit.
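As a minimal sketch, fitting m and c to some noisy points can be done with numpy alone; the example data below is made up.

```python
import numpy as np

# Noisy points scattered about the line y = 2x + 1 (values invented).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.shape)

# np.polyfit with degree 1 returns the best-fit gradient m and intercept c.
m, c = np.polyfit(x, y, deg=1)
print(f"m = {m:.2f}, c = {c:.2f}")
```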
If we had another input variable, which we'll call "z", then the equation can be extended to 3 dimensions rather easily:
$$ y = kz + mx +c $$
Now we have the extra input feature, z, and its associated gradient, denoted "k" here. However, for a large number of input features this equation would become rather cumbersome, so instead we amalgamate all the model variables into a vector, m, and the input features into a matrix, G. If we denote our target feature as the variable "d", then we arrive at the equation given in the inverse theory literature.
$$ d = Gm $$
The solution is given in the inverse theory section. What is important from a data science perspective is the understanding that the forward operator, G, contains all the input features we wish to use to model our target variable. Some packages, such as statsmodels, will require that you specifically account for an intercept by adding a constant input feature, whereas others, such as scikit-learn's LinearRegression, will fit an intercept by default (although you can turn this option off if required).
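To make that difference concrete, here is a sketch of the two approaches on some synthetic data (the feature values and coefficients are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Synthetic data: two input features and an intercept of 0.5 (values invented).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 + rng.normal(0, 0.1, size=100)

# statsmodels: the intercept must be added explicitly as a constant column.
ols_results = sm.OLS(y, sm.add_constant(X)).fit()
print(ols_results.params)          # [intercept, gradient_1, gradient_2]

# scikit-learn: an intercept is fitted by default (fit_intercept=True).
sk_model = LinearRegression().fit(X, y)
print(sk_model.intercept_, sk_model.coef_)
```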
We need to make sure that any input feature we want to use to regress against the target variable is included in the input feature set presented to the regression algorithm.
It's important to note that in data science there are many different names for linear regression, which typically vary only in the type of regularization applied to the model variables to be fit. For example, an L2 penalty on the size of the model variables, which would be called Tikhonov regularization in inverse theory, is referred to as Ridge regression. Similarly, a model which applies an L1 penalty is referred to as LASSO, and ElasticNet models apply both of those regularization techniques. Regularization is an important part of any fitting algorithm: not all data is recorded equally, and typically we have some idea of what defines a good model, so the choice of algorithm (or regularization) should be thought about carefully.
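In scikit-learn these regularized variants are available as drop-in replacements for LinearRegression. The sketch below (penalty strengths chosen arbitrarily for illustration) shows each applied to the same toy data:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = load_diabetes(return_X_y=True)

# Each estimator differs only in the penalty applied to the model coefficients.
models = {
    "Ridge (L2 penalty)": Ridge(alpha=1.0),
    "LASSO (L1 penalty)": Lasso(alpha=0.1),
    "ElasticNet (L1 and L2)": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, model.coef_.round(1))
```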
Let's have a look at how to apply feature engineering with a practical example.
Feature engineering for a more powerful linear regression
How good is our model? - Goodness of fit
Logistic regression
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.