Lab 21: Least Squares Lab¶
In this lab you will use least squares to obtain “lines of best fit” in regression tasks. You will need to import the following:
>>> import numpy as np
>>> import numpy.linalg as la
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import load_diabetes
In various scientific domains, experimental data is used to infer mathematical relationships between input variables and associated outputs. We have a set of data points \(\{(x_i, y_i)\}_{i=1}^n\) where \(x_i \in \mathbb{R}^d\) are the inputs and \(y_i \in \mathbb{R}\) are the associated outputs. We desire a function that approximates the mapping from inputs (i.e., independent variables) to outputs (i.e., dependent variables). This task is referred to as regression; it is ubiquitous in data-driven modeling and can be solved via the method of least squares.
Polynomial Regression¶
For simplicity, let’s consider degree \(p\) polynomial regression in the 1D case, where \(x_i \in \mathbb{R}\).
To review least squares for polynomial regression, let \(y := (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n\) be the vector of outputs and let \(X\) denote the polynomial data matrix
\[
X = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^p \\ 1 & x_2 & x_2^2 & \cdots & x_2^p \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^p \end{bmatrix} \in \mathbb{R}^{n \times (p+1)}.
\]
Any degree \(p\) polynomial function \(f_p\) can be identified by a set of coefficients \(c_0, c_1, \ldots, c_p \in \mathbb{R}\):
\[
f_p(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_p x^p.
\]
Task 1¶
Write a function compute_data_matrix(x, p) that accepts an input vector x and a non-negative integer power p, and returns the corresponding polynomial data matrix as a NumPy array.
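As a reference point, here is a minimal sketch of one possible implementation, using numpy.vander with increasing powers so that column \(j\) holds \(x^j\); your own implementation may be structured differently:

def compute_data_matrix(x, p):
    """Return the n x (p+1) polynomial data matrix [1, x, x^2, ..., x^p]."""
    x = np.asarray(x, dtype=float)
    # increasing=True puts the constant column first, then x, x^2, ..., x^p.
    return np.vander(x, N=p + 1, increasing=True)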
Task 2¶
Use the following plotting code to visualize a dataset of inputs x and outputs y:
def visualize_data(x, y):
    assert x.size == y.size
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    plt.show()
    return
Now, load in the following data, visualize it, and determine a reasonable power \(p \in \mathbb{N}\) for this dataset by running the following:
>>> data = np.load("polynomial_data.npz")
>>> x, y = data['x'], data['y']
>>> visualize_data(x, y)
(Here, power refers to the degree of the polynomial that fits the shape of the data.)
We will now compute the least squares fit for the data visualized previously.
We seek a set of coefficients \(c_0, c_1, \ldots, c_p \in \mathbb{R}\) such that the corresponding degree \(p\) polynomial coefficient vector \(c \in \mathbb{R}^{p+1}\) minimizes the residual error:
\[
\| X c - y \|_2^2 = \sum_{i=1}^n \left( f_p(x_i) - y_i \right)^2.
\]
As discussed in class, the normal equations that characterize the least squares solution \(\hat{c}\) are
\[
X^T X \hat{c} = X^T y,
\]
which is a linear system that we can solve with our linear algebra skills. Under certain assumptions, the matrix \(X^TX\) is invertible, and so we can write the least squares solution as
\[
\hat{c} = (X^T X)^{-1} X^T y.
\]
The least squares fit is then defined as the degree \(p\) polynomial
\[
f_p(x) = \hat{c}_0 + \hat{c}_1 x + \cdots + \hat{c}_p x^p,
\]
which, when applied to our set of inputs in the data matrix, gives \(f = X \hat{c} \in \mathbb{R}^n\).
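In code, the normal equations could be solved directly; the snippet below is only a sketch, assuming a data matrix X and output vector y as defined above:

>>> c_hat = la.solve(X.T @ X, X.T @ y)  # solve the normal equations X^T X c = X^T y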
In practice, NumPy’s numerical linear algebra library contains an optimized implementation of the least squares method, numpy.linalg.lstsq, and so we will use it; see the NumPy reference for details.
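As a sketch of the call signature (again assuming X and y as above), numpy.linalg.lstsq returns the solution together with some diagnostic information:

>>> c_hat, residuals, rank, sing_vals = la.lstsq(X, y, rcond=None)
>>> f = X @ c_hat  # least squares fit evaluated at the inputs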
Task 3¶
Complete the following function plot_least_squares(x, y, p) to compute the least squares solution and plot it against the data:
def plot_least_squares(x, y, p):
    assert x.size == y.size
    X = ...      # Compute data matrix with inputs x and polynomial power p
    c_hat = ...  # Compute the least squares solution using X and y
    f = ...      # Compute the least squares output, f = X \hat{c}
    fig, ax = plt.subplots()
    ax.scatter(x, y, label='orig data')   # this scatter plots the original data
    ax.plot(x, f, 'k--', label='LS fit')  # this plots the least squares fit line
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend()
    ax.set_title(f"Least Squares Polynomial fit with degree = {p}")
    plt.show()
    return
Description of inputs:

x: input vector
y: output vector
p: desired polynomial power
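One possible way to fill in the three placeholders, shown only as a sketch and assuming the compute_data_matrix function from Task 1, is:

X = compute_data_matrix(x, p)           # polynomial data matrix
c_hat, *_ = la.lstsq(X, y, rcond=None)  # least squares coefficients
f = X @ c_hat                           # fitted values at the inputs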
Using the data from Task 2, run your plot_least_squares function, trying different values of the polynomial power \(p\).
For example, try \(p = 1, 2, 3, 5, 15\).
What happens if you use a power \(p<P\), where \(P\) is the degree determined in Task 2?
What happens if you use a much larger power than \(P\)?
Computing Linear Regression for Diabetes Data¶
We now turn to a “real-world” regression task of predicting diabetes progression based on patient information and medical measurements. Let’s compute a linear regression fit to the data in order to predict the disease progression; that is, we will identify coefficients \(\beta_0, \beta_1, \ldots, \beta_d \in \mathbb{R}\) that approximate the mapping \(x_i \mapsto y_i\) via
\[
y_i \approx \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_d x_{i,d}.
\]
Thus, with a set of data \(\{(x_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}\), we are computing the least squares fit with data matrix
\[
X = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,d} \end{bmatrix} \in \mathbb{R}^{n \times (d+1)}
\]
and output vector \(y = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n\).
Task 4¶
Write a function compute_diabetes_fit() that computes the least squares fit coefficient vector \(\hat{\beta} \in \mathbb{R}^{d+1}\) using numpy.linalg.lstsq().
You may find the following code and guidelines to be useful:
>>> diabetes_data = load_diabetes() # download the diabetes dataset
>>> X, y = diabetes_data.data, diabetes_data.target # extract the data matrix (X) and targets (outputs) vector y
1. Extend the data matrix with a column of ones to model the offset. Hint: see the NumPy function numpy.hstack.
2. Compute the least squares fit coefficient vector via numpy.linalg.lstsq on this data.
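Putting these guidelines together, a minimal sketch of compute_diabetes_fit (assuming the imports listed at the top of the lab) could look like the following; treat it as a starting point rather than a reference solution:

def compute_diabetes_fit():
    """Compute the least squares coefficient vector for the diabetes data."""
    diabetes_data = load_diabetes()
    X, y = diabetes_data.data, diabetes_data.target
    # Prepend a column of ones so that beta_hat[0] models the offset.
    X_ext = np.hstack((np.ones((X.shape[0], 1)), X))
    beta_hat, *_ = la.lstsq(X_ext, y, rcond=None)
    return beta_hat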
To conclude, we now analyze the linear regression coefficient vector, \(\hat{\beta}\). The sign of an entry of this coefficient vector indicates whether there is a positive or negative correlation between the corresponding input variable and the output, conditioned on the other features. For example, if \(\hat{\beta}_j > 0\) and is large, then, holding all else equal, a small change in the \(j^{th}\) feature will lead to a large, positive change in the output.
Task 5¶
Extend the capability of your previous function compute_diabetes_fit()
to perform the following:
1. Extract the feature names of this dataset (feature_names = diabetes_data.feature_names).
2. Print out the feature names of the most positive and most negative regression coefficients. Record the names and values of those coefficients in CodeBuddy.
NOTE: Be careful to account for the indexing of the offset column in the data matrix X. The offset is not accounted for in the feature_names list.
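A sketch of the added logic, assuming the beta_hat vector from Task 4 with the offset stored in beta_hat[0], might look like:

feature_names = diabetes_data.feature_names
coeffs = beta_hat[1:]  # skip the offset entry at index 0
i_max, i_min = np.argmax(coeffs), np.argmin(coeffs)
print(f"Most positive: {feature_names[i_max]} = {coeffs[i_max]:.3f}")
print(f"Most negative: {feature_names[i_min]} = {coeffs[i_min]:.3f}")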
You can check out the description of all the different features (independent variables) in this diabetes dataset by printing out
>>> print(diabetes_data.DESCR)