{
"cells": [
{
"cell_type": "markdown",
"id": "62c7e6f5",
"metadata": {},
"source": [
"# Lecture 16 (5/2/2022)"
]
},
{
"cell_type": "markdown",
"id": "a1ea5e3d",
"metadata": {},
"source": [
"**Announcements**\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "cac90fc6",
"metadata": {},
"source": [
"*Last time we covered:*\n",
"- Linear regression\n",
" - Simple linear regression\n",
" - Evaluating regression\n",
" - Polynomial and multiple regression\n",
"\n",
"**Today's agenda:**\n",
"- Multiple regression continued\n",
"- Overfitting\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa6ebecc",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"id": "c8e3d210",
"metadata": {},
"source": [
"# Multiple Regression\n",
"\n",
"\n",
"## Big idea: multiple predictors for $y$\n",
"\n",
"In real-world scenarios, it's common to explore the relationship between *multiple* predictor variables that all jointly impact the dependent variable. \n",
"\n",
"For this, we need **multiple regression**. \n",
"\n",
"Though the guiding principles are similar to *simple regression*, things can get much more complicated in multiple dimensions. Here, we'll provide enough of an overview for you to be able to use multiple regression as part of your modeling toolbox."
]
},
{
"cell_type": "markdown",
"id": "b9ad3133",
"metadata": {},
"source": [
"## Example: predicting life expectancy\n",
"\n",
"Let's start with a familiar example.\n",
"\n",
"In Problem Set 3 and elsewhere, we've looked at the *gapminder* dataset, which explores the relationship between GDP, population, and life expectancy at the national level and over time. \n",
"\n",
"Here, we'll just look at these variables for the most recent year (2007)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "90dbfdcb",
"metadata": {},
"outputs": [],
"source": [
"gap = pd.read_csv(\"https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/gapminder.csv\")\n",
"\n",
"# Let's keep just some of the variables (note for pset!)\n",
"# NOTE: we're also just restricting ourselves to 2007\n",
"gap_subset = gap.loc[gap['year'] == 2007, ('country', 'year', 'lifeExp', 'pop', 'gdpPercap')].reset_index(drop = True)\n",
"\n",
"# Add log transformed population and income\n",
"gap_subset['logPop'] = np.log10(gap_subset['pop'])\n",
"gap_subset['logGdpPercap'] = np.log10(gap_subset['gdpPercap'])\n",
"gap_subset"
]
},
{
"cell_type": "markdown",
"id": "8e9062b1",
"metadata": {},
"source": [
"In Problem Set 3, you generated a graph that explored the relationship between income (GDP per-capita on $x$) and life expectancy (in years on $y$), alongside two additional predictors: region (color) and population (size).\n",
"\n",
"![gap](img/gapminder.png)\n",
"\n",
"Here, we want to know formally what the relationship is between the three continuous variables (income, life expectancy, and population). In other words, **can we predict life expectancy using both income *and* population better than we could only using one of those variables?**\n",
"\n",
"Let's start by just examining each of the predictors in isolation to see if this is a plausible hypothesis."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1dafd813",
"metadata": {},
"outputs": [],
"source": [
"g = sns.lmplot(data = gap_subset, \n",
" x = \"logGdpPercap\", # x1 \n",
" y = \"lifeExp\"\n",
" )\n",
"plt.title(\"Life expectancy as a function of income\")\n",
"plt.xlabel(\"Income (log GDP / capita)\")\n",
"plt.ylabel(\"Life expectancy\")\n",
"plt.show()\n",
"\n",
"h = sns.lmplot(data = gap_subset, \n",
" x = \"logPop\", # x2 \n",
" y = \"lifeExp\"\n",
" )\n",
"plt.title(\"Life expectancy as a function of population\")\n",
"plt.xlabel(\"Population (log)\")\n",
"plt.ylabel(\"Life expectancy\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "a37200f8",
"metadata": {},
"source": [
"One of these variables has a strong positive relationship. The other one seems a bit less clear. \n",
"\n",
"How can we think about exploring the role that both of them play together in predicting life expectancy?"
]
},
{
"cell_type": "markdown",
"id": "771bdac7",
"metadata": {},
"source": [
"### Multiple regression: overview\n",
"\n",
"Multiple regression is like linear regression except that we assume our dependent variable $y_i$ is *jointly* predicted by multiple independent variables: \n",
"\n",
"$x_1$, $x_2$, ..., $x_n$.\n",
"\n",
"Simple linear regression generates a *predicted* $\\hat{y}_i$ from $x_i$:\n",
"\n",
"$\\hat{y}_i = \\beta_0 + \\beta_1 x_i + \\epsilon_i$\n",
"\n",
"With multiple regression, we now extend this model to include multiple predictors for our data ($x_{1i}$, $x_{2i}$, ..., $x_{ni}$):\n",
"\n",
"$\\hat{y}_i = \\beta_0 + \\beta_1 x_{1i} + \\beta_2 x_{2i} + \\ ... \\ + \\beta_n x_{ni} + \\epsilon_i $\n",
"\n",
"In most cases, multiple regression once again uses the same *Ordinary Least Squares* (OLS) parameter estimation as simple regression. However, interpreting the parameter estimates is a little less straightforward.\n",
"\n",
"*How would we interpret $\\beta_0 = 1$, $\\beta_1 = 2$, $\\beta_2 = 3$?*\n",
"\n",
"(Think of a simple example. What variables would you want to use to predict somebody's height *as accurately as possible*? Now imagine each of those variables is one of your $x_{ji}$ variables.)"
]
},
{
"cell_type": "markdown",
"id": "aa9a8290",
"metadata": {},
"source": [
"## Multiple regression in python\n",
"\n",
"The scikit-learn `LinearRegression` class is very easy to extend to multiple regression!\n",
"\n",
"We'll also demonstrate the `statsmodels` approach below for more robust statistical analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe8dd1f8",
"metadata": {},
"outputs": [],
"source": [
"# scikit learn approach\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"x_vals = np.array(gap_subset[['logGdpPercap', 'logPop']]) # Note: this double bracket syntax is important!\n",
"x_vals = x_vals.reshape(len(gap_subset), 2)\n",
"x_vals\n",
"\n",
"y_vals = np.array(gap_subset['lifeExp'])\n",
"y_vals\n",
"\n",
"mod = LinearRegression().fit(X = x_vals, y = y_vals)\n",
"\n",
"mod.intercept_\n",
"mod.coef_"
]
},
{
"cell_type": "markdown",
"id": "39b63367",
"metadata": {},
"source": [
"What do the coefficient estimates above suggest? \n",
"\n",
"How should we interpret them? "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0c1847b5",
"metadata": {},
"outputs": [],
"source": [
"# R^2 for our regression\n",
"mod.score(X = x_vals, y = y_vals)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "741b9e6f",
"metadata": {},
"outputs": [],
"source": [
"# What if we don't include the population predictor?\n",
"x_single = np.array(gap_subset['logGdpPercap']).reshape(len(gap_subset), 1)\n",
"x_single\n",
"\n",
"mod_simple = LinearRegression().fit(X = x_single, y = y_vals)\n",
"mod_simple.score(X = x_single, y = y_vals)\n",
"\n",
"# Doesn't seem like population helps much..."
]
},
{
"cell_type": "markdown",
"id": "98d9d518",
"metadata": {},
"source": [
"The statsmodels approach gives us a bit more insight here."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec729dad",
"metadata": {},
"outputs": [],
"source": [
"# statsmodels approach\n",
"import statsmodels.formula.api as smf\n",
"\n",
"multiple_reg = smf.ols('lifeExp ~ logGdpPercap + logPop', data = gap_subset).fit()\n",
"\n",
"# View the results\n",
"multiple_reg.summary()\n",
"\n",
"# Our population variable does have a significant positive slope, \n",
"# but it's pretty small and the effect may be driven by outliers."
]
},
{
"cell_type": "markdown",
"id": "6a4b2bb1",
"metadata": {},
"source": [
"# Regression: wrap-up\n",
"\n",
"Regression is one of the most powerful tools in our data science / analysis toolkit. \n",
"\n",
"However, it's important to be familiar with the *limitations* in regression as well. \n",
"\n",
"One class of these limitations is when **the data violate the assumptions of linear regression**. We're *not* going to get into these issues in this class, but it's good to be aware that not all data (even data that looks linear!) can be accurately described by linear regression. There are a number of tricks for diagnosing this, such as plotting your residuals (we may explore this on the problem set as a way to cover some of this material).\n",
"\n",
"However, an important limitation that we *will* discuss now is when **regression overfits the data**. Avoiding overfitting is something that arises in just about any modeling context. It's easiest to illustrate with regression, but the things we discuss here will be relevant throughout the remainder of the quarter."
]
},
{
"cell_type": "markdown",
"id": "f721881b",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"id": "3c0cf97d",
"metadata": {},
"source": [
"# Overfitting\n",
"\n",
"\n",
"## What is it?\n",
"\n",
"The basic principle behind overfitting is that when your modeling approach is sufficiently *unconstrained*, you can fit patterns in your data that *do not generalize*. \n",
"\n",
"What does it mean for a model to be unconstrained? What does it mean for the model not to generalize?\n",
"\n",
"...\n",
"\n",
"\n",
"*How does this happen?* \n",
"\n",
"There are lots of ways this can arise (and what it looks like differs across modeling techniques).\n",
"\n",
"Let's look at a canonical example just to get an intuition. For models that we learn about throughout the rest of the quarter, it's a useful exercise to think about what overfitting might look like."
]
},
{
"cell_type": "markdown",
"id": "8365fc05",
"metadata": {},
"source": [
"## Example\n",
"\n",
"Let's return to our `gapminder` dataset and the task of predicting *life expectancy* ($y$) based on *income* ($x$).\n",
"\n",
"What kind of relationship do we think these variables have in the data below?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b78a3b05",
"metadata": {},
"outputs": [],
"source": [
"gap_subset\n",
"\n",
"\n",
"g = sns.scatterplot(data = gap_subset, \n",
" x = \"logGdpPercap\", # x1 \n",
" y = \"lifeExp\",\n",
" color = \"r\",\n",
" alpha = 0.5\n",
" )\n",
"plt.title(\"Life expectancy as a function of income\")\n",
"plt.xlabel(\"Income (log GDP / capita)\")\n",
"plt.ylabel(\"Life expectancy\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "00a787c7",
"metadata": {},
"source": [
"Let's start with a **simple linear regression**:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b090fb9",
"metadata": {},
"outputs": [],
"source": [
"# Estimate the model\n",
"x_vals = np.array(gap_subset['logGdpPercap']).reshape(len(gap_subset), 1)\n",
"y_vals = np.array(gap_subset['lifeExp'])\n",
"deg1_fits = LinearRegression().fit(X = x_vals, y = y_vals)\n",
"\n",
"# Add predictions for this model to our dataframe\n",
"preds = gap_subset.loc[:, (\"logGdpPercap\", \"lifeExp\")].reset_index(drop = True)\n",
"simple_preds = deg1_fits.predict(X = x_vals)\n",
"\n",
"\n",
"preds['deg1_pred'] = simple_preds\n",
"\n",
"preds"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7bf8b4a6",
"metadata": {},
"outputs": [],
"source": [
"# Plot the fit\n",
"\n",
"sns.scatterplot(data = preds, \n",
" x = \"logGdpPercap\",\n",
" y = \"lifeExp\",\n",
" color = \"m\",\n",
" alpha = 0.5\n",
" )\n",
"sns.lineplot(data = preds,\n",
" x = \"logGdpPercap\",\n",
" y = \"deg1_pred\"\n",
" )\n",
"\n",
"\n",
"plt.title(\"Life expectancy as a function of income\")\n",
"plt.xlabel(\"Income (log GDP / capita)\")\n",
"plt.ylabel(\"Life expectancy\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "b7a90a38",
"metadata": {},
"source": [
"How did we do? Maybe this data is *quadratic* or *cubic*?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11c9ec86",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"# Fit degree 2 polynomial regression\n",
"poly2 = PolynomialFeatures(degree = 2, include_bias = False)\n",
"x2_features = poly2.fit_transform(x_vals)\n",
"mod2 = LinearRegression().fit(x2_features, y_vals)\n",
"\n",
"# Fit degree 3 polynomial regression\n",
"poly3 = PolynomialFeatures(degree = 3, include_bias = False)\n",
"x3_features = poly3.fit_transform(x_vals)\n",
"mod3 = LinearRegression().fit(x3_features, y_vals)\n",
"\n",
"# Add these to our predictions\n",
"preds['deg2_pred'] = mod2.predict(X = x2_features)\n",
"preds['deg3_pred'] = mod3.predict(X = x3_features)\n",
"preds\n",
"\n",
"# How do these predictions look at first glance? More accurate?"
]
},
{
"cell_type": "markdown",
"id": "0ee4681c",
"metadata": {},
"source": [
"Let's plot these lines and see how they look:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "059466ea",
"metadata": {},
"outputs": [],
"source": [
"# First, let's make our data tidy for easier plotting\n",
"preds_long = preds.melt(\n",
" id_vars = [\"logGdpPercap\", \"lifeExp\"],\n",
" var_name = \"degree\",\n",
" value_name = \"prediction\"\n",
")\n",
"\n",
"preds_long\n",
"\n",
"# Now, let's take a look\n",
"sns.scatterplot(data = preds_long, \n",
" x = \"logGdpPercap\",\n",
" y = \"lifeExp\",\n",
" color = \"m\",\n",
" alpha = 0.1\n",
" )\n",
"\n",
"sns.lineplot(\n",
" data = preds_long,\n",
" x = \"logGdpPercap\",\n",
" y = \"prediction\",\n",
" hue = \"degree\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b1818400",
"metadata": {},
"source": [
"These lines look pretty similar. Do we see any change in $R^2$ values for these models?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "af23f6e3",
"metadata": {},
"outputs": [],
"source": [
"# Get R^2 for each model\n",
"mod1_rsq = deg1_fits.score(X = x_vals, y = y_vals)\n",
"mod2_rsq = mod2.score(X = x2_features, y = y_vals)\n",
"mod3_rsq = mod3.score(X = x3_features, y = y_vals)\n",
"\n",
"print(\"Degree 1 R^2: {} \\nDegree 2 R^2: {} \\nDegree 3 R^2: {}\".format(\n",
" mod1_rsq, mod2_rsq, mod3_rsq\n",
"))"
]
},
{
"cell_type": "markdown",
"id": "c84e6fcb",
"metadata": {},
"source": [
"*What does this tell us?*\n",
"\n",
"Probably don't need 2nd or 3rd degree polynomials for this relationship. \n",
"\n",
"\n",
"*What happens if we try to fit a much higher-order polynomial function to this data?* Similar line or something else?\n",
"\n",
"???\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77ad8030",
"metadata": {},
"outputs": [],
"source": [
"# Fit degree 10 polynomial regression\n",
"poly10 = PolynomialFeatures(degree = 10, include_bias = False)\n",
"x10_features = poly10.fit_transform(x_vals)\n",
"mod10 = LinearRegression().fit(x10_features, y_vals)"
]
},
{
"cell_type": "markdown",
"id": "5ff1fbda",
"metadata": {},
"source": [
"Let's compare this one to our $R^2$ values above:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a59c601",
"metadata": {},
"outputs": [],
"source": [
"# Calculate R^2 for our degree 10 polynomial\n",
"mod10_rsq = mod10.score(X = x10_features, y = y_vals)\n",
"mod10_rsq"
]
},
{
"cell_type": "markdown",
"id": "d7e8dcc6",
"metadata": {},
"source": [
"Improvement! Let's keep going! "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5755582",
"metadata": {},
"outputs": [],
"source": [
"# Fit degree 25 polynomial regression\n",
"poly25 = PolynomialFeatures(degree = 25, include_bias = False)\n",
"x25_features = poly25.fit_transform(x_vals)\n",
"mod25 = LinearRegression().fit(x25_features, y_vals)\n",
"\n",
"# Check R^2\n",
"mod25_rsq = mod25.score(X = x25_features, y = y_vals)\n",
"mod25_rsq"
]
},
{
"cell_type": "markdown",
"id": "4f39fa87",
"metadata": {},
"source": [
"Okay maybe things are slowing down... \n",
"\n",
"Let's plot our functions and see how they look."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1eb7097",
"metadata": {},
"outputs": [],
"source": [
"# Add to predictions dataframe and re-format\n",
"preds['deg10_pred'] = mod10.predict(X = x10_features)\n",
"preds['deg25_pred'] = mod25.predict(X = x25_features)\n",
"\n",
"preds_long = preds.melt(\n",
" id_vars = [\"logGdpPercap\", \"lifeExp\"],\n",
" var_name = \"degree\",\n",
" value_name = \"prediction\"\n",
")\n",
"\n",
"preds_long\n",
"\n",
"# Now, let's take a look\n",
"sns.scatterplot(data = preds_long, \n",
" x = \"logGdpPercap\",\n",
" y = \"lifeExp\",\n",
" color = \"m\",\n",
" alpha = 0.1\n",
" )\n",
"\n",
"sns.lineplot(\n",
" data = preds_long,\n",
" x = \"logGdpPercap\",\n",
" y = \"prediction\",\n",
" hue = \"degree\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "66e8241d",
"metadata": {},
"source": [
"Hmmm...."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f8cdce5",
"metadata": {},
"outputs": [],
"source": [
"sns.lineplot(\n",
" data = preds,\n",
" x = \"logGdpPercap\",\n",
" y = \"deg10_pred\",\n",
" color = \"r\"\n",
")\n",
"plt.xlabel(\"Income\")\n",
"plt.ylabel(\"Predicted life expectancy\")\n",
"plt.show()\n",
"\n",
"sns.lineplot(\n",
" data = preds,\n",
" x = \"logGdpPercap\",\n",
" y = \"deg25_pred\",\n",
" color = \"b\"\n",
")\n",
"plt.xlabel(\"Income\")\n",
"plt.ylabel(\"Predicted life expectancy\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "53fdd74c",
"metadata": {},
"source": [
"This doesn't seem like it's capturing the underlying process that generated our data...\n",
"\n",
"![spaghetti](img/spaghetti.jpeg)\n",
"\n",
"Aside: overfitting is easiest to demonstrate with polynomial regression because it makes Least Squares Spaghetti. However, this problem can arise in almost any modeling context. Think about what this might look like with *multiple regression*."
]
},
{
"cell_type": "markdown",
"id": "c0e71129",
"metadata": {},
"source": [
"## Overfitting summary\n",
"\n",
"The high-degree polynomials above are **unconstrained**, leading them to fit our data *too* well.\n",
"\n",
"This creates worries that they won't **generalize**. \n",
"\n",
"*But how do we test this?*\n"
]
},
{
"cell_type": "markdown",
"id": "65daa0e5",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"id": "2e210c1e",
"metadata": {},
"source": [
"# Cross-validation: avoiding overfitting\n",
"\n",
"\n",
"There are many different techniques for *cross-validation*, one of which we discussed on Friday. \n",
"\n",
"However, at their heart, all of them involve the same underlying principle: *estimate your model parameters with only a subset (or repeated subsets) of your data, then evaluate your model on the remaining data to ensure generalizability*. \n",
"\n",
"This technique allows you to ensure that your model is a good fit, or do *model comparison* to choose among different candidate models (such as our polynomial models above).\n",
"\n",
"Rather than go through all of the different ways of doing this, I'll walk through a few of the more well known ones. This will give you a feel for how cross-validation works. \n",
"\n",
"1. ***Holdout* cross-validation**: this is the technique we reviewed on Friday\n",
" - Split data into *training* and *test* data (e.g. 75-25 split)\n",
" - Train model on *training data only*\n",
" - Evaluate model on *test data*\n",
"2. ***k-fold* cross-validation**: This is a more robust version of holdout cross-validation\n",
" - Split data into *k* equal sized buckets (randomly!)\n",
" - Choose one of the buckets to be *test* data and assign the rest to be *training* data\n",
" - Perform holdout cross-validation with this data\n",
" - Then, repeat the above *for each of the other buckets* and average your test results (e.g., MSE) over the multiple runs\n",
"3. ***Monte-carlo* cross-validation** (repeated random subsampling): Similar to k-fold, but *random*\n",
" - Split the data into training and test data randomly\n",
" - Evaluate model on test data\n",
" - Repeat this process many times\n",
"4. **Exhaustive cross-validation** (various methods): The most common of these are *leave-one-out* or *leave-p-out* cross-validation\n",
" - For every individual value in the data or set of $p$ values, train the model on all remaining data then evalute the model using this individual or set of $p$ values\n",
" - Note: this is *exhaustive* because every data point or set of $p$ data points is used in the test set, but is often computationally intractable!\n",
"5. **Other** methods: specific to particular types of data (e.g., time series data, classification data)\n",
"\n",
"\n",
"For more info on this, you can check out [this](https://towardsdatascience.com/understanding-8-types-of-cross-validation-80c935a4976d) blog, or the wikipedia [page](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) devoted to this topic, or many others like it online!"
]
},
{
"cell_type": "markdown",
"id": "5be4a087",
"metadata": {},
"source": [
"# Cross-Validation Summary\n",
"\n",
"The techniques above are used with many if not all of the modeling techniques we'll discuss this quarter. \n",
"\n",
"From here on out, you should **assume that some form of cross-validation is a *required* component of any model fitting process**, much like looking at your data, cleaning it, and other core aspects of analysis."
]
}
],
"metadata": {
"interpreter": {
"hash": "e774977668b7c0ae8309835a5187aa7fbf7669e7d0bb59755bc63e573643edcd"
},
"kernelspec": {
"display_name": "Python 3 (clean)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}