Lab 24: Least Squares Lab
=========================

In this lab you will use least squares to obtain "lines of best fit" in regression tasks.
You will need to import the following:

>>> import numpy as np
>>> import numpy.linalg as la
>>> import matplotlib.pyplot as plt
>>> from sklearn.datasets import load_diabetes

In various scientific domains, experimental data are used to infer mathematical relationships between input variables and associated outputs.
We have a set of data points :math:`\{(x_i, y_i)\}_{i=1}^n`, where :math:`x_i \in \mathbb{R}^d` are the inputs and :math:`y_i \in \mathbb{R}` are the associated outputs.
We desire a function that approximates the mapping from inputs (i.e., independent variables) to outputs (i.e., dependent variables).
This task is referred to as **regression**; it is ubiquitous in data-driven modeling and can be solved via the method of least squares.

=====================
Polynomial Regression
=====================

For simplicity, let's consider degree :math:`p` **polynomial regression** in the 1-D case, where :math:`x_i \in \mathbb{R}`.
To review least squares for polynomial regression, let :math:`y := (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n` be the vector of outputs and let :math:`X` denote the polynomial data matrix

.. math::

    X = \begin{pmatrix}
    x_1^p & x_1^{p-1} & \ldots & x_1 & 1 \\
    x_2^p & x_2^{p-1} & \ldots & x_2 & 1 \\
    \vdots & \vdots & \ddots & \vdots & \vdots \\
    x_n^p & x_n^{p-1} & \ldots & x_n & 1
    \end{pmatrix} \in \mathbb{R}^{n \times (p+1)}.

Any degree :math:`p` polynomial function :math:`f_p` can be identified by a set of coefficients :math:`c_0, c_1, \ldots, c_p \in \mathbb{R}`:

.. math::

    f_p(x) = c_0 + c_1 x + \ldots + c_{p-1} x^{p-1} + c_p x^p = \sum_{j=0}^p c_j x^j.

Task 1
------
Write a function ``compute_data_matrix(x, p)`` that accepts an input vector ``x`` and a non-negative integer power ``p``, and returns the corresponding polynomial data matrix as a NumPy array.
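One possible sketch for such a function (not the only valid approach) leans on NumPy's built-in ``np.vander``, whose default decreasing-power column order matches the matrix :math:`X` defined above; you could equally well build the columns in a loop.

.. code-block:: python

    import numpy as np

    def compute_data_matrix(x, p):
        """Return the n x (p+1) polynomial data matrix [x^p, ..., x, 1]."""
        x = np.asarray(x, dtype=float)
        # np.vander with N = p + 1 produces columns in decreasing powers
        # (x^p down to x^0) by default, matching the definition of X above.
        return np.vander(x, N=p + 1)

For example, ``compute_data_matrix(np.array([1, 2, 3]), 2)`` yields a :math:`3 \times 3` matrix whose rows are :math:`(x_i^2, x_i, 1)`.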
Task 2
------
Use the following plotting code to visualize a dataset of inputs ``x`` and outputs ``y``::

    def visualize_data(x, y):
        assert x.size == y.size
        fig, ax = plt.subplots()
        ax.scatter(x, y)
        ax.set_xlabel('x')
        ax.set_ylabel('y')
        plt.show()
        return

Now, load in the following data, visualize it, and **determine a reasonable power** :math:`p \in \mathbb{N}` **for this dataset** by running the following:

>>> data = np.load("polynomial_data.npz")
>>> x, y = data['x'], data['y']
>>> visualize_data(x, y)

(Here, *power* refers to the degree of the polynomial that fits the shape of the data.)

We will now compute the least squares fit for the data visualized previously.
We seek a set of coefficients :math:`c_0, c_1, \ldots, c_p \in \mathbb{R}` such that the corresponding degree :math:`p` polynomial **coefficient vector** :math:`c \in \mathbb{R}^{p+1}` minimizes the **residual error**:

.. math::

    \|Xc - y\|^2 \leq \|Xw - y\|^2 \text{ for all } w \in \mathbb{R}^{p+1}.

As discussed in class, the **normal equations** that characterize the **least squares solution** :math:`\hat{c}` are

.. math::

    X^T X \hat{c} = X^T y,

which is a linear system that we can solve with our linear algebra skills.
Under certain assumptions, the matrix :math:`X^T X` is invertible, so we can write the least squares solution as

.. math::

    \hat{c} = (X^T X)^{-1} X^T y.

The least squares fit is then defined as the degree :math:`p` polynomial

.. math::

    \hat{f}_p(x) = \hat{c}_0 + \hat{c}_1 x + \ldots + \hat{c}_{p-1} x^{p-1} + \hat{c}_p x^p,

which, when applied to our set of inputs in the data matrix, gives :math:`f = X \hat{c} \in \mathbb{R}^n`.

Task 3
------
Replace any ``...`` to complete the function ``plot_least_squares(x, y, p)``, which computes the least squares solution and plots it against the data.
(Equations for :math:`\hat{c}` and :math:`f` are given above.)

.. code-block:: python

    def plot_least_squares(x, y, p):
        assert x.size == y.size
        X = ...
        # Compute data matrix with inputs x and polynomial power p
        # using your compute_data_matrix function
        c_hat = ...  # Compute the least squares solution using X and y
        f = ...      # Compute the least squares output, f = X @ c_hat

        fig, ax = plt.subplots()
        ax.scatter(x, y, label='orig data')   # scatter plot of the original data
        ax.plot(x, f, 'k--', label='LS fit')  # the least squares fit line
        ax.set_xlabel('x')
        ax.set_ylabel('y')
        ax.legend()
        ax.set_title(f"Least Squares Polynomial fit with degree = {p}")
        plt.show()
        return

Description of inputs:

* input vector ``x``
* output vector ``y``
* desired polynomial power ``p``

Using the data from Task 2, run your ``plot_least_squares`` function with different values of the polynomial power :math:`p`.
For example, try :math:`p = 1, 2, 3, 5, 15`.

* What happens if you use a power :math:`p` that is too large?
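As a self-contained sketch of the least squares computation itself (one plausible way to fill in the ``...`` lines above, using synthetic stand-in data rather than the lab's ``polynomial_data.npz``), the normal equations can be solved with ``la.solve`` instead of forming the inverse :math:`(X^T X)^{-1}` explicitly:

.. code-block:: python

    import numpy as np
    import numpy.linalg as la

    # Synthetic stand-in data: a noisy cubic (the lab's dataset is not reproduced here).
    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 50)
    y = 2 * x**3 - x + 0.1 * rng.standard_normal(x.size)

    p = 3
    X = np.vander(x, N=p + 1)           # polynomial data matrix [x^p, ..., x, 1]
    c_hat = la.solve(X.T @ X, X.T @ y)  # solve the normal equations X^T X c = X^T y
    f = X @ c_hat                       # least squares outputs f = X c_hat

In practice, ``la.lstsq(X, y, rcond=None)`` is preferable for large or ill-conditioned problems, since forming :math:`X^T X` squares the condition number of :math:`X`.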
=============================================
Computing Linear Regression for Diabetes Data
=============================================
We now turn to a "real-world" regression task of predicting diabetes progression based on patient information and medical measurements.
The data is from the `sklearn diabetes dataset