Lab 18: K-Means Clustering¶
Data science and machine learning are currently areas of high interest. Machine learning can roughly be divided into three main types: supervised learning, unsupervised learning, and reinforcement learning. Unsupervised learning, or pattern recognition, involves identifying useful patterns or structure in unlabeled data. One example is clustering, where the goal is to determine which points should be grouped together. The goal of this lab is to introduce the k-means algorithm, a simple and popular clustering method.
The objective of clustering is to find a partition of the data such that points in the same subset are “close” according to some measure of distance. There are many ways to measure how close two points are, but we will use the Euclidean distance introduced in your text.
More formally, suppose we have a collection of \(\mathbb R^K\)-valued observations \(X = \{x_1,x_2,\ldots,x_n\}\). Let \(N\in \mathbb N\) and let \(\mathcal S\) be the set of all \(N\)-partitions of \(X\), where an \(N\)-partition is a partition with exactly \(N\) nonempty elements. We can represent a typical partition in \(\mathcal S\) as \(S = \{S_1, S_2, \ldots, S_N\}\), where

\[X = \bigcup_{i=1}^N S_i\]

and

\[S_i \cap S_j = \emptyset \quad \text{for } i \neq j.\]
We seek the \(N\)-partition \(S^*\) that minimizes the within-cluster sum of squares, i.e.

\[S^* = \operatorname*{arg\,min}_{S \in \mathcal S} \sum_{i=1}^{N} \sum_{x_j \in S_i} \|x_j - \mu_i\|_2^2,\]
where \(\mu_i\) is the mean of the elements in \(S_i\), i.e.

\[\mu_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j.\]
The K-Means Algorithm¶
Finding the global minimizing partition \(S^*\) is generally intractable, since the number of \(N\)-partitions grows extremely quickly with the number of observations, but the k-means algorithm is a heuristic approach that can often provide reasonably accurate results.
We begin by specifying an initial cluster mean \(\mu_i^{(1)}\) for each \(i=1,\ldots,N\) (this can be done by random initialization, or according to some heuristic). For each iteration, we adopt the following procedure. Given a current set of cluster means \(\mu^{(t)}\), we find a partition \(S^{(t)}\) of the observations such that

\[S_i^{(t)} = \left\{x_j \,:\, \|x_j - \mu_i^{(t)}\|_2^2 \le \|x_j - \mu_\ell^{(t)}\|_2^2 \ \text{ for all } \ \ell = 1,\ldots,N \right\}.\]
We then update our cluster means by computing

\[\mu_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j\]

for each \(i=1,\ldots,N\). We continue to iterate in this manner until the partition ceases to change.
The figure below shows two different clusterings of the iris data produced by the k-means algorithm. Note that the quality of the clustering can depend heavily on the initial cluster means. We can use the within-cluster sum of squares as a measure of the quality of a clustering (a lower sum of squares is better). Where possible, it is advisable to run the clustering algorithm several times, each with a different initialization of the means, and keep the best clustering. Note also that it is possible to have very slow convergence. Thus, when implementing the algorithm, it is a good idea to terminate after some specified maximum number of iterations.

Two different K-Means clusterings for the iris dataset. Notice that the clustering on the left predicts the flower species to a high degree of accuracy, while the clustering on the right is less effective.¶
The algorithm can be summarized as follows.
Choose \(k\) initial cluster centers.
For \(i=0, \ldots, \texttt{max_iter}\),
Assign each data point to the cluster center that is closest, forming \(k\) clusters.
Recompute the cluster centers as the means of the new clusters.
If the old cluster centers and the new cluster centers are sufficiently close, terminate early.
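For concreteness, here is a minimal NumPy sketch of these steps. It is illustrative only; the function name, defaults, and empty-cluster handling are our own choices, not the provided starter code.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-8, seed=None):
    """Illustrative k-means on an (n, d) array X of observations."""
    rng = np.random.default_rng(seed)
    # 1. Choose k distinct data points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # 2. Assign each point to the closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each center as the mean of its new cluster.
        new_centers = centers.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:        # keep the old center if a cluster empties
                new_centers[i] = members.mean(axis=0)
        # 4. Terminate early once the centers stop moving.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```

Because the result depends on the initialization, running this several times with different seeds and keeping the clustering with the smallest within-cluster sum of squares guards against a bad starting point.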
Students planning to enroll in the ACME program, or those completing a degree in computer science, will likely have the opportunity to implement the k-means algorithm themselves as part of the program.
For this lab we will use the built-in version of k-means clustering that comes with the sklearn package.
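For reference, scikit-learn’s interface looks roughly like the following (toy data; the n_init parameter reruns the algorithm with different initial means and keeps the clustering with the lowest within-cluster sum of squares).

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)            # toy data: 100 points in R^2

model = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=42)
labels = model.fit_predict(X)          # cluster label for each point
centers = model.cluster_centers_       # the final cluster means
wcss = model.inertia_                  # within-cluster sum of squares
```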
Task 1¶
The K-means clustering algorithm is an unsupervised classification model. You will find an implementation of this algorithm here. Read through the implementation and the usage example to understand how to use this K-means clustering tool. Be sure to submit the unmodified starter code for this task to receive credit.
Detecting Active Earthquake Regions¶
Suppose we are interested in learning which regions are prone to frequent earthquake activity. We could make a map of all earthquakes over a given period of time and examine it ourselves, but this is an unsupervised learning problem that can be solved with our k-means clustering tool.

Earthquake epicenters over a 6 month period.¶
The file earthquake_coordinates.npy
contains earthquake data throughout the world from January 2010 through June 2010.
Each row represents a different earthquake; the columns are scaled longitude and latitude measurements.
We want to cluster this data into active earthquake regions.
For this task, we might be tempted to regard each epicenter as a point in \(\mathbb R^2\) with coordinates given by its latitude and longitude.
This, however, would be incorrect, because the earth is not flat.
Instead, latitude and longitude should be converted to spherical coordinates in \(\mathbb R^3\), which can then be clustered.
A simple way to accomplish this transformation is to first convert the latitude and longitude values to spherical coordinates, and then to Euclidean coordinates. Recall that a spherical coordinate in \(\mathbb R^3\) is a triple \((r,\theta,\varphi)\), where \(r\) is the distance from the origin, \(\theta\) is the radial angle in the \(xy\)-plane from the \(x\)-axis, and \(\varphi\) is the angle from the \(z\)-axis. In our earthquake data, once the longitude is converted to radians it is an appropriate \(\theta\) value; the latitude must be subtracted from \(90\) degrees (yielding the colatitude) and then converted to radians to obtain \(\varphi\). For simplicity we can take \(r=1\), since the Earth is roughly a sphere. We can then transform to Euclidean coordinates using the following relationships.

\[x = r\sin\varphi\cos\theta, \qquad y = r\sin\varphi\sin\theta, \qquad z = r\cos\varphi.\]
There is one last issue to solve before clustering. Each earthquake data point has norm \(1\) in Euclidean coordinates, since it lies on the surface of a sphere of radius \(1\). Therefore, the cluster centers should also have norm \(1\). Otherwise, the means can’t be interpreted as locations on the surface of the earth, and the k-means algorithm will struggle to find good clusters. A solution to this problem is to normalize the mean vectors at each iteration, so that they are always unit vectors.
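Concretely, if the cluster means are stored as the rows of an array, the extra normalization step might look like the following sketch (the provided implementation’s normalize=True option, used in Task 4, handles this for you).

```python
import numpy as np

def normalize_rows(centers):
    """Project each cluster mean (a row vector) back onto the unit sphere."""
    return centers / np.linalg.norm(centers, axis=1, keepdims=True)
```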
Task 2¶
We will be using the K-means algorithm to classify earthquake data; however, this data is recorded in longitude and latitude coordinates. This is problematic because the 2-norm (Euclidean distance) applied directly to these coordinates distorts distances on the earth’s surface and fails to measure what we expect. As such, we must convert these coordinates to 3-dimensional Euclidean coordinates before running the K-means algorithm.
Write a function, ll_to_euc(X), that takes an array of longitudinal and latitudinal (in that order) coordinates, X, and converts them to 3-dimensional Euclidean coordinates. Do this by writing and using two functions: a function, ll_to_sph(X), that takes an array of longitudinal and latitudinal (in that order) coordinates, X, and converts them into spherical coordinates; and a function, sph_to_euc(X), that takes an array of spherical coordinates, X, and converts them into 3-dimensional Euclidean coordinates.
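One possible implementation, assuming the two columns of X hold longitude and latitude in degrees (the scaling of the lab’s data file may differ):

```python
import numpy as np

def ll_to_sph(X):
    """(longitude, latitude) in degrees -> spherical (r, theta, phi), r = 1."""
    theta = np.radians(X[:, 0])          # longitude -> radial angle in xy-plane
    phi = np.radians(90 - X[:, 1])       # latitude -> angle from the z-axis
    r = np.ones_like(theta)              # treat the earth as a unit sphere
    return np.column_stack((r, theta, phi))

def sph_to_euc(X):
    """Spherical (r, theta, phi) -> Euclidean (x, y, z)."""
    r, theta, phi = X[:, 0], X[:, 1], X[:, 2]
    x = r * np.sin(phi) * np.cos(theta)
    y = r * np.sin(phi) * np.sin(theta)
    z = r * np.cos(phi)
    return np.column_stack((x, y, z))

def ll_to_euc(X):
    """(longitude, latitude) -> 3-dimensional Euclidean coordinates."""
    return sph_to_euc(ll_to_sph(X))
```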
Task 3¶
We would still like to plot our data in longitudinal and latitudinal coordinates. As such, we need to be able to convert from 3-dimensional coordinates back to longitudinal and latitudinal coordinates.
Write a function, euc_to_ll(X), that takes an array of 3-dimensional Euclidean coordinates, X, and converts them to longitudinal and latitudinal (in that order) coordinates. Do this by writing and using two functions: a function, euc_to_sph(X), that takes an array of 3-dimensional Euclidean coordinates, X, and converts them into spherical coordinates; and a function, sph_to_ll(X), that takes an array of spherical coordinates, X, and converts them into longitudinal and latitudinal (in that order) coordinates.
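A matching sketch for the inverse direction, under the same assumptions as the Task 2 sketch:

```python
import numpy as np

def euc_to_sph(X):
    """Euclidean (x, y, z) -> spherical (r, theta, phi)."""
    x, y, z = X[:, 0], X[:, 1], X[:, 2]
    r = np.linalg.norm(X, axis=1)
    theta = np.arctan2(y, x)             # radial angle in the xy-plane
    phi = np.arccos(z / r)               # angle from the z-axis
    return np.column_stack((r, theta, phi))

def sph_to_ll(X):
    """Spherical (r, theta, phi) -> (longitude, latitude) in degrees."""
    lon = np.degrees(X[:, 1])
    lat = 90 - np.degrees(X[:, 2])       # invert the colatitude offset
    return np.column_stack((lon, lat))

def euc_to_ll(X):
    """3-dimensional Euclidean coordinates -> (longitude, latitude)."""
    return sph_to_ll(euc_to_sph(X))
```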
Task 4¶
Use your code from the previous tasks alongside the provided K-means code to write a function, classify_geo(X, k, seed), that takes an array of geographical data, X, and does the following:
Convert the geographical data, X, to 3-dimensional Euclidean coordinates.
Fit a K-means classifier on the Euclidean data with the given k and seed.
Predict the clusters for the Euclidean data with the fitted classifier.
Convert the centers from the fitted classifier to longitudinal and latitudinal coordinates.
Plot the original data, X, colored by its classification labels.
Make sure to set normalize=True for your K-means classifier and to use seed=42. Use np.load to read .npy files and np.loadtxt to read .txt files into a np.ndarray.
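A sketch of how these pieces might fit together, assuming the conversion functions from Tasks 2 and 3 are in scope and that the provided class is called KMeans with fit and predict methods, a centers attribute, and the normalize and seed keywords named above; the starter code’s actual names may differ.

```python
import numpy as np
from matplotlib import pyplot as plt

def classify_geo(X, k, seed=42):
    """Cluster geographic (longitude, latitude) data into k regions."""
    euc = ll_to_euc(X)                       # Task 2: map onto the unit sphere

    # Hypothetical interface for the provided K-means class.
    model = KMeans(n_clusters=k, normalize=True, seed=seed)
    model.fit(euc)
    labels = model.predict(euc)

    centers_ll = euc_to_ll(model.centers)    # Task 3: centers back to (lon, lat)

    # Plot the original data, colored by cluster label.
    plt.scatter(X[:, 0], X[:, 1], c=labels, s=5)
    plt.xlabel("longitude")
    plt.ylabel("latitude")
    plt.show()
    return labels, centers_ll

# Example usage with the lab's data file (k chosen only for illustration):
# X = np.load("earthquake_coordinates.npy")
# classify_geo(X, k=15, seed=42)
```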
Color Quantization¶
The k-means algorithm uses the Euclidean metric, so it is natural to cluster geographic data. However, clustering can be done in any abstract vector space. The following application is one example.
Images are usually represented on computers as 3-dimensional arrays. Each 2-dimensional layer holds the red, green, or blue color values, so each pixel in the image is really a vector in \(\mathbb R^3\). Clustering the pixels in RGB space leads to one kind of image segmentation, which facilitates memory reduction. (If you would like to read more, see https://en.wikipedia.org/wiki/Color_quantization.)
Task 5a¶
Write a function, quantize_image(X, n_clusters, n_samples, seed), that takes a color image array, X (shape (m, n, 3)), a number of clusters, n_clusters, a number of samples, n_samples, and an optional seed for random number generation, seed, and does the following:
Flatten the image such that each row represents a single pixel (shape (m * n, 3)).
Set the seed for random number generation if given (use np.random.seed(seed)).
Randomly sample n_samples pixels (shape (n_samples, 3)) from the flattened image uniformly (use np.random.randint).
Fit a K-means classifier of n_clusters clusters to the random sample.
Predict the clusters for the entire flattened image using the fitted classifier.
Recolor the pixels in the flattened image to the colors of their corresponding cluster centers.
Unflatten the recolored image (shape (m, n, 3)).
Return the recolored image.
Do NOT change the original image during any part of this process (use np.copy or X.copy() before performing any of the above steps). Also, you will not need to specify the seed for your K-means classifier since you will set it manually. Your code will be tested using the file given in CodeBuddy.
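A sketch of one way to follow these steps, using sklearn.cluster.KMeans as a stand-in for the provided classifier (the starter code’s class may expose a slightly different interface):

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for the provided classifier

def quantize_image(X, n_clusters, n_samples, seed=None):
    """Recolor an (m, n, 3) image using n_clusters representative colors."""
    img = X.copy()                               # never modify the original image
    m, n, _ = img.shape
    flat = img.reshape(m * n, 3)                 # one pixel (RGB vector) per row

    if seed is not None:
        np.random.seed(seed)                     # reproducible pixel sampling
    idx = np.random.randint(0, m * n, size=n_samples)
    sample = flat[idx]                           # (n_samples, 3) random pixels

    model = KMeans(n_clusters=n_clusters).fit(sample)
    labels = model.predict(flat)                 # a cluster for every pixel

    # Replace each pixel with the center (mean color) of its cluster,
    # then cast back to the image's original dtype and shape.
    recolored = model.cluster_centers_[labels].astype(X.dtype)
    return recolored.reshape(m, n, 3)
```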
Task 5b¶
Use your code from the previous exercise to perform color quantization on one of the images in CodeBuddy until you generate an image you are satisfied with.