Seeing the bulk: a primer on dimensionality reduction
If you haven’t seen Interstellar, you might not get the reference in the title. Don’t sweat it: the bulk beings are an advanced civilization able to move outside the familiar 3D space that we are stuck in. A similar plot device was used in Edwin Abbott’s book Flatland.
Outside of entertainment, there is a lot of value in attempting to understand high dimensional spaces. Physics, machine learning, and engineering control systems all make use of higher-dimensional spaces in one way or another. I first got interested in higher dimensional spaces in high school, when I saw a talk by a Russian physicist who introduced 4-dimensional cubes and then made a passing comment that “there are six extra dimensions of space” before closing the lecture. That last part was confusing, but I later realized he was talking about Calabi-Yau manifolds from string theory.
In this post we will learn how datasets can be thought of as sets of points in a high dimensional space, and then we will learn how to map those sets into low (2 or 3)-dimensional spaces so that we can see them. The goal will be to preserve as much structure as possible, but still fit everything into a space we can see and feel.
Let’s start with 1D: real numbers are points on a line. Easy. In 2D, we can use a pair (x,y) to name any point in a plane. For 3D, we can use triplets (x,y,z) to name points in a space like the one we physically live in.
If we wanted to name the four corners of a tetrahedron, we could write out a table of data like this:
But it’s not clear what that data looks like until we visualize it in 3D space:
We can extend the basic idea of taking triples (x,y,z) and imagining them as points in a 3-dimensional space. Instead of triples, we can use n-tuples (x1, x2, …, xn) and imagine them as points in an n-dimensional space. We can then measure the distance between any two points (x1, x2, …, xn) and (y1, y2, …, yn) by using the Pythagorean theorem: square the difference in each coordinate, add the squares up, and take the square root. So all the tools we have for doing geometry with numbers (analytic geometry and linear algebra) work well, but we can’t directly visualize those spaces, since our brains evolved in a 3-dimensional environment.
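To make the distance idea concrete, here is a minimal sketch of the n-dimensional Pythagorean formula, d = sqrt((x1 - y1)^2 + … + (xn - yn)^2). The function name euclidean_distance is just a label chosen for this illustration.

```python
import math

def euclidean_distance(p, q):
    """Distance between two points given as equal-length sequences of numbers."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# 2D: the familiar 3-4-5 right triangle
print(euclidean_distance((0, 0), (3, 4)))              # 5.0
# 4D: opposite corners of a unit hypercube
print(euclidean_distance((0, 0, 0, 0), (1, 1, 1, 1)))  # 2.0
```

The same function works unchanged for 13 coordinates or 1,000, which is the point: the arithmetic scales to any number of dimensions even though our intuition does not.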
The good news is that we can “compress” higher dimensional shapes into 3D and still preserve plenty of information.
Principal Components Analysis
One technique to reduce the dimensionality of a dataset is called Principal Components Analysis. It starts by modeling the dataset as a matrix, and then finds the directions in which the data varies the most. Those directions that contain the most variation coincide with the “principal components” of the matrix. For an excellent visual introduction to this technique, check out Victor Powell’s post on PCA.
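As a rough sketch of what “finding the directions of greatest variation” can look like in code, the snippet below centers a data matrix and reads the directions off its singular value decomposition, which is one common way to compute PCA. The helper name principal_directions and the toy data are made up for illustration; this is not necessarily how any particular library implements it.

```python
import numpy as np

def principal_directions(data, k):
    """Return the top-k directions of greatest variance and the projected data."""
    centered = data - data.mean(axis=0)   # move the cloud's center to the origin
    # Rows of vt are unit-length directions, ordered from most to least variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    directions = vt[:k]
    projected = centered @ directions.T   # coordinates along those directions
    return directions, projected

# Toy data (invented for this example): points stretched along the line y = x.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
toy = np.column_stack([t, t + 0.05 * rng.normal(size=200)])

directions, projected = principal_directions(toy, k=1)
print(directions[0])    # close to the diagonal direction, up to an overall sign
print(projected.shape)  # (200, 1)
```

The sign of each direction is arbitrary, so two runs or two libraries may report the same axis pointing opposite ways.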
As an example, we will use a wine dataset [1] from 1991. It contains 13 properties (alcohol, malic acid, etc.) of 178 wines grown in the same region but derived from three different plant varieties. Each property can be seen as a dimension, so using our spatial metaphor, we will imagine the 178 wines as points in a 13-dimensional space. Applying PCA will help us find the three most important directions, and then we can visualize those 178 points in a familiar 3D space. I’ll add colors to distinguish the three wine varieties. We should expect to see some structure: namely, the different plant varieties should form clusters.
From the (Y,Z) chart you can see that there are two main clusters, the blue one and the combined red/green one, and from the (X,Z) chart you can see that the red points cluster more towards the center of mass while the green ones sit on the periphery. In the (X,Y) chart, the same pattern of “red inside”, “green outside” holds. This is a good example of how bringing your data into a lower dimensional space helps with understanding and solving certain problems. Data that have these clusters can be fed into machine learning algorithms to do classification, or they can be used to find natural categories that organize the data on its own terms.
How to use PCA in Python
It is very easy to use: just install scikit-learn and run the following lines of code:
    from sklearn.decomposition import PCA

    # X is the high dimensional data, an array with one row per point;
    # in our case a set of 178 13-dimensional points
    small_dim = 2  # the reduced dimensionality we will end up with; 2 and 3 are good choices
    pca = PCA(n_components=small_dim)
    # fit_transform returns one row per point, so transpose to get one array per axis
    x, y = pca.fit_transform(X).T
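For a version that runs end to end, the sketch below assumes that the copy of the wine data bundled with scikit-learn (load_wine) is the same 178-by-13 dataset used in this post; that is my assumption, not something the post states. It reduces the data to three dimensions and prints how much of the total variance each kept direction accounts for.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

wine = load_wine()
X = wine.data          # one row per wine, one column per measured property
labels = wine.target   # plant variety of each wine, encoded as 0, 1 or 2

pca = PCA(n_components=3)
points_3d = pca.fit_transform(X)   # one row per wine, three coordinates each

print(X.shape)          # (178, 13)
print(points_3d.shape)  # (178, 3)
# Fraction of the total variance captured by each of the three directions.
print(pca.explained_variance_ratio_)
```

The labels array is what you would use to color the points by variety when plotting. Note that this sketch feeds the raw columns to PCA; whether to rescale them first is a separate choice that the post does not specify.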
In a later post I will describe how PCA works mathematically; the intuition behind it helps in understanding its strengths and weaknesses.