{
"cells": [
{
"cell_type": "markdown",
"id": "62c7e6f5",
"metadata": {},
"source": [
"# Lecture 25 (5/23/2022)"
]
},
{
"cell_type": "markdown",
"id": "a1ea5e3d",
"metadata": {},
"source": [
"**Announcements**\n"
]
},
{
"cell_type": "markdown",
"id": "cac90fc6",
"metadata": {},
"source": [
"*Last time we covered:*\n",
"- Clustering: $k$-means and Gaussian Mixture Models\n",
"\n",
"**Today's agenda:**\n",
"- Dimensionality reduction\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa6ebecc",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"id": "c8e3d210",
"metadata": {},
"source": [
"# Dimensionality Reduction"
]
},
{
"cell_type": "markdown",
"id": "7712e22e",
"metadata": {},
"source": [
"## What is it?\n",
"\n",
"Broadly, the goal of dimensionality reduction is, as the name implies, to **take data that exists in a very high dimensional space and find a way to represent the points in that space as accurately as possible with as few dimensions as possible**.\n",
"\n",
"*Why would we want to do this?*\n",
"\n",
"- High dimensional data is harder to visualize, harder to make hypotheses about\n",
"- High dimensional data can make it more difficult to fit models\n",
" - Aside: the \"curse of dimensionality\": in machine learning settings, adding dimensions to your data will initially improve your models, but eventually, your data becomes *sparse* in very high dimensions, meaning it no longer clusters in useful ways. This will make your models **worse** or require much much more data to fit. Read about this on wikipedia [here](https://en.wikipedia.org/wiki/Curse_of_dimensionality) or on Medium [here](https://towardsdatascience.com/the-curse-of-dimensionality-50dc6e49aa1e).\n",
"\n",
"Broadly, dimensionality reduction techniques make it easier for us to visualize, reason about, and fit models to our data.\n",
"\n",
"As a result, dimensionality reduction is often used alongside other analysis techniques we've discussed in this class: before performing classification or clustering on your data, it may be helpful to first reduce it to a lower dimensional space and *then* classify based on the lower dimensional feature representation or cluster in the lower dimension."
]
},
{
"cell_type": "markdown",
"id": "302eb958",
"metadata": {},
"source": [
"## Example\n",
"\n",
"The easiest way to illustrate dimensionality reduction is with an example.\n",
"\n",
"Remember our ***gapminder*** dataset from earlier problem sets and lectures? In problem set 3, we looked at how income and population affect life expectancy. However, these are just three out of *many* different variables that gapminder tracks, including educational attainment, gender equality, internet users, the list goes on (you can play around with these different indicators on the site [here](https://www.gapminder.org/data/)). \n",
"\n",
"Now, let's say your job is to take a whole bunch of these indicators and try to figure out how different countries **cluster** across these variables. Or, maybe you want to **predict** something like life expectancy and you have all these variables at your disposal (the excellent blog [here](https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c) provides a similar motivation). The problem is, many of these predictors will be redundant and highly correlated with each other so you want to find the most useful ones for a model. \n",
"\n",
"*How do you go about doing this?*\n",
"\n",
"- *Feature elimination*: rule out features that aren't as predictive one by one\n",
"- *Feature extraction*: create *new* axes that combine the existing features in ways that help describe our data\n",
" - This is what PCA does!"
]
},
{
"cell_type": "markdown",
"id": "99ed0413",
"metadata": {},
"source": [
"## Solution: Principal Components Analysis (PCA)\n",
"\n",
"\n",
"**What is it**\n",
"\n",
"Principal Component Analysis is based on a fairly simple but really powerful intuition:\n",
"\n",
">If we plot a line through our data that minimizes how far our data points are from the line and maximizes the spread of our data points *along* the line, then we can do a pretty good job describing our data in just one dimension with that line.\n",
"\n",
"Take a look at the graph below to see how we might find this sort of line through our data.\n",
"\n",
"![pca](img/pca.gif)\n",
"\n",
"This image comes from a brief tutorial about PCA [here](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues) (the writing is sexist but the explanation of PCA itself is pretty good). In the image, the blue points are our original data, and the red points are our data points mapped onto the line for different possible lines through our data. The red lines show the \"error\" in this mapping. What this is showing is that when the line through our data points is aligned with the purple edges, the *projection* of our data points onto that line (in red) has the lowest error and the highest variance, which is what we want. \n",
"\n",
"**The line through the purple edges does a good job describing the overall trend in our data.**\n",
"\n",
"Now, if we can find a *second line* perpendicular to the first, we can capture *additional variance in our data* with the second line that isn't captured by the first line. PCA is often illustrated with a series of lines like this. These are called the **principal components**.\n",
"\n",
"![pca_static](img/pca_static.jpeg)\n",
"\n",
"This image comes from a really awesome tutorial [here](https://towardsdatascience.com/understanding-pca-fae3e243731d). I highly recommend reading it :)\n",
"\n",
"What defines the principal components is that they are lines that a) are perpendicular to each other, and b) capture the maximal amount of variance in our data. The *first component* captures the most variance, the second component captures the second most, etc.\n",
"\n",
"\n",
"**What are the principal components**\n",
"\n",
"*Isn't this a lot like regression??* Yes! Just like a regression line can be specified in terms of coefficients $\\beta_1$, $\\beta_2$, ..., $\\beta_n$ applied to our predictor variables $x_1$, $x_2$, ..., $x_n$, the principal components are described as a series of *weights* on each predictor variable. These weights map our data points ($x_1$, $x_2$, ..., $x_n$) onto the principal component line in the same way that regression coefficients map our data points onto the regression line. \n",
"\n",
"**However, there's one key difference** between regression and PCA. Our best fitting regression line was the one that *minimized sum of squared error in $y$*. Remember what that looked like? As you can see in the animation above, PCA is doing something slightly different; it's minimizing each data point's Euclidean distance to the line, rather than its distance in $y$ alone. This is an important difference and it means the two techniques aren't guaranteed to draw the same line. What we wanted with regression was the best way to predict $y$. What we want with PCA is the best way to predict our data across all dimensions.\n",
"\n",
"![pca_reg](img/residuals_pca.jpeg)\n",
"\n",
"This awesome graphic came from another brief tutorial about PCA [here](https://starship-knowledge.com/pca-vs-linear-regression).\n",
"\n",
"\n",
"**How to calculate the principal components**\n",
"\n",
"So how do we estimate these super handy *principal component* lines? It turns out there's a super handy *analytic* solution based on the eigenvectors and eigenvalues of our data (briefly: the principal components are the eigenvectors with the largest eigenvalues). We're not going to get into how that actually works here, but it's important that you know where they come from (we aren't *estimating* them but calculating them). If you want to read more about it, take a look at [this](https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c) tutorial or [this](https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/) one, or the derivation on the wikipedia page for PCA [here](https://en.wikipedia.org/wiki/Principal_component_analysis#Details).\n",
"\n",
"**Dimensionality reduction with the principal components**\n",
"\n",
"To tie this all together, once we've calculated our principal components, we can represent our data in fewer dimensions *by mapping each data point onto a small number of principal components*. In other words, we change each data point from being written as ($x_1$, $x_2$, ..., $x_n$) to something like ($pc_1$, $pc_2$, $pc_3$), where $pc_1$ is our data point's value when projected onto the first principal component line. Instead of the *axes* of our data being $x_1$, $x_2$, ..., we now represent our data on *new* axes $pc_1$, $pc_2$, ... that allow us to describe the pattern in our data more efficiently. "
]
},
{
"cell_type": "markdown",
"id": "65bd5317",
"metadata": {},
"source": [
"## PCA in python: getting started\n",
"\n",
"In this example, we're going to walk through how to do PCA in python. \n",
"\n",
"The sklearn `decomposition` module offers a range of other dimensionality reduction solutions. You can see the full list of those classes [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition) and a brief overview tutorial of each one [here](https://scikit-learn.org/stable/modules/decomposition.html#decompositions) (NOTE the tutorial has a lot of really spooky face images. You've been warned...).\n",
"\n",
"Let's start by revisiting an old dataset that we know has a pretty high-dimensional data representation. The `crimestats` data shows crime rates in cities throughout the US broken out by types of crimes. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "435f7ab3",
"metadata": {},
"outputs": [],
"source": [
"crime = pd.read_csv(\"https://raw.githubusercontent.com/erik-brockbank/css2_sp22-public/main/Datasets/crimestats-clean.csv\")\n",
"crime[crime.isna().any(axis = 1)]\n",
"crime = crime.dropna(how = 'any')\n",
"crime"
]
},
{
"cell_type": "markdown",
"id": "fec699ae",
"metadata": {},
"source": [
"Now, let's imagine we want to do some clustering to see whether there are notable criminal \"profiles\" of the different cities in this data. \n",
"\n",
"We'll extract just the columns with counts for each type of crime:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "912bb7dd",
"metadata": {},
"outputs": [],
"source": [
"subset_cols = ['Murder', 'Rape', 'Robbery', 'Assault', 'Burglary', 'Theft', 'GTA']\n",
"subset = crime.loc[:, subset_cols]\n",
"subset"
]
},
{
"cell_type": "markdown",
"id": "3205edf7",
"metadata": {},
"source": [
"*What does this data look like?*"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1037d0c",
"metadata": {},
"outputs": [],
"source": [
"# NB: this takes ~60s to run\n",
"sns.pairplot(subset, plot_kws = {\"alpha\": 0.5})"
]
},
{
"cell_type": "markdown",
"id": "7fbcb070",
"metadata": {},
"source": [
"Unfortunately, this data is high dimensional (7 $x$ variables) and they seem very highly correlated. \n",
"\n",
"This is a recipe for difficulties if we try to cluster it as-is.\n",
"\n",
"\n",
"\n",
"***... TIME FOR PCA!!!***"
]
},
{
"cell_type": "markdown",
"id": "351f4eb3",
"metadata": {},
"source": [
"Now, before we can apply PCA, there's an important step here, which is to *re-scale our data*.\n",
"\n",
"Remember way back in week 4 when we talked about z-scoring our data?\n",
"\n",
"Now we get to put that skill to work!\n",
"\n",
"***Why would we want to z-score our data before doing PCA?***\n",
"\n",
"- Want to make sure our data is represented in equivalent \"units\" across all dimensions\n",
"- That way, comparing spread is more equivalent"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1750257",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"subset_scaled = StandardScaler().fit_transform(subset)\n",
"subset_scaled = pd.DataFrame(subset_scaled, columns = subset.columns)\n",
"subset_scaled"
]
},
{
"cell_type": "markdown",
"id": "a3ffb98f",
"metadata": {},
"source": [
"**Now, let's fit the sklearn PCA class to our scaled data**\n",
"\n",
"[Here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)'s what this looks like."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95dc0034",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"pca = PCA()\n",
"\n",
"pca.fit(subset_scaled)"
]
},
{
"cell_type": "markdown",
"id": "7c9b5b0b",
"metadata": {},
"source": [
"*How did we do?*\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "ab97d5c0",
"metadata": {},
"source": [
"## PCA in python: interpreting results\n",
"\n",
"Let's start by understanding these results in aggregate."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0ee1e1d4",
"metadata": {},
"outputs": [],
"source": [
"# How many components were there? (default: same number as dimensions of data)\n",
"pca.n_components_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81e0a3f5",
"metadata": {},
"outputs": [],
"source": [
"# how much variance do we explain with each component? (remember our data has been z-scored)\n",
"pca.explained_variance_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1654817",
"metadata": {},
"outputs": [],
"source": [
"# what proportion of the total variance is each component explaining?\n",
"pca.explained_variance_ratio_"
]
},
{
"cell_type": "markdown",
"id": "8ec02d8e",
"metadata": {},
"source": [
"Let's take a look at this proportion of total variance result. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d4e3910",
"metadata": {},
"outputs": [],
"source": [
"sns.pointplot(x = np.arange(1, 8), y = pca.explained_variance_ratio_)\n",
"plt.xlabel(\"Principal component\")\n",
"plt.ylabel(\"Proportion of additional variance explained\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "506a9be0",
"metadata": {},
"source": [
"***What does the result above mean??***\n",
"\n",
"- We can describe most of the variance in our data with just a couple dimensions!"
]
},
{
"cell_type": "markdown",
"id": "6652de6f",
"metadata": {},
"source": [
"Now, let's take a closer look at our individual components:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2534f471",
"metadata": {},
"outputs": [],
"source": [
"pca.components_\n",
"pca.components_[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53b21924",
"metadata": {},
"outputs": [],
"source": [
"sns.barplot(x = pca.components_[0], y = subset_cols)\n",
"plt.title(\"Component 1\")\n",
"plt.show()\n",
"\n",
"sns.barplot(x = pca.components_[1], y = subset_cols)\n",
"plt.title(\"Component 2\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "74436a19",
"metadata": {},
"source": [
"***What do the plots above tell us?***\n",
"\n",
"..."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53816b60",
"metadata": {},
"outputs": [],
"source": [
"crime_transform = pca.transform(X = subset_scaled)\n",
"crime_transform = pd.DataFrame(crime_transform, columns = ['Component ' + str(i) for i in np.arange(1, 8)])\n",
"crime_transform['State'] = crime['State']\n",
"crime_transform['City'] = crime['City']\n",
"crime_transform"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b673f24d",
"metadata": {},
"outputs": [],
"source": [
"sns.scatterplot(data = crime_transform, x = \"Component 1\", y = \"Component 2\", alpha = 0.5)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "cddbb218",
"metadata": {},
"source": [
"***What does the plot above tell us?***\n",
"\n",
"..."
]
},
{
"cell_type": "markdown",
"id": "b23751ac",
"metadata": {},
"source": [
"## Wrapping up: a note on interpretability\n",
"\n",
"As you might have intuited from the examples above, principal components can sometimes be hard to interpret. \n",
"\n",
"It may require a lot of domain expertise or further analysis. For this reason, be thoughtful when using it. \n",
"\n",
"However, because dimensionality reduction is so powerful (and this solution works so well), having a smaller set of principal components that requires interpretation is likely better than trying to work with high dimensional data in its original form."
]
},
{
"cell_type": "markdown",
"id": "c539a8e2",
"metadata": {},
"source": [
"# Practicing PCA\n",
"\n",
"If we get through all the above and still have time left over, let's practice doing PCA on a different data set. \n",
"\n",
"In the code below, read in the `pokemon` dataset that we've used in prior assignments. \n",
"\n",
"Now, take a look at the columns used to evaluate a pokemon's effectiveness: `HP`, `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, and `Speed`. Each pokemon occupies a position in this high-dimensional space. This makes it difficult to evaluate them across all these measures.\n",
"\n",
"*Can you find a lower dimensional way of representing each pokemon for easier clustering or analysis?*"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "248e275d",
"metadata": {},
"outputs": [],
"source": [
"### YOUR CODE HERE\n",
"\n",
"# Read in the data\n",
"pokemon = pd.read_csv(\"https://raw.githubusercontent.com/erik-brockbank/css2_sp22-public/main/Datasets/Pokemon.csv\")\n",
"\n",
"# Use these columns as the basis for PCA\n",
"cols = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']\n",
"\n",
"\n",
"pokemon"
]
}
],
"metadata": {
"interpreter": {
"hash": "e774977668b7c0ae8309835a5187aa7fbf7669e7d0bb59755bc63e573643edcd"
},
"kernelspec": {
"display_name": "Python 3.7.4 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}