Lecture 12 (4/22/2022)

Announcements

Last time we covered:

  • Tidy data (wide / long format)

Today’s agenda:

  • Data transformations

Part 1: logarithmic transformations

Part 2: z-scoring

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Part 1: log transformations

Problem: highly skewed data

To get a sense of when we may need to transform our data, let’s take a look at our gapminder dataset once again:

gap = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/gapminder.csv")

gap
      Unnamed: 0      country continent  year  lifeExp       pop   gdpPercap
0              1  Afghanistan      Asia  1952   28.801   8425333  779.445314
1              2  Afghanistan      Asia  1957   30.332   9240934  820.853030
2              3  Afghanistan      Asia  1962   31.997  10267083  853.100710
3              4  Afghanistan      Asia  1967   34.020  11537966  836.197138
4              5  Afghanistan      Asia  1972   36.088  13079460  739.981106
...          ...          ...       ...   ...      ...       ...         ...
1699        1700     Zimbabwe    Africa  1987   62.351   9216418  706.157306
1700        1701     Zimbabwe    Africa  1992   60.377  10704340  693.420786
1701        1702     Zimbabwe    Africa  1997   46.809  11404948  792.449960
1702        1703     Zimbabwe    Africa  2002   39.989  11926563  672.038623
1703        1704     Zimbabwe    Africa  2007   43.487  12311143  469.709298

1704 rows × 7 columns

Below we plot the distribution of population across the countries in this dataset for the most recent year available.

What does it look like?

g = sns.histplot(
    data = gap[gap["year"] == 2007],
    x = "pop",
    bins = 20 # even with more bins the distribution is similar
)
g.set_xlabel("Population (billions)")
g.set_title("Population by country, 2007")
[Figure: histogram of population by country, 2007]
# How skewed is the data above?

gap["pop"][gap["year"] == 2007].mean() # ~44M
gap["pop"][gap["year"] == 2007].median() # ~10.5M


# ...big difference! (Note: only the last expression, the median, is echoed below.)
10517531.0
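
Beyond comparing the mean and median, pandas can quantify this asymmetry directly. A quick sketch using the built-in .skew() method (sample skewness: 0 means symmetric, large positive values mean a long right tail):

# Sample skewness of the 2007 population values (0 = symmetric, > 0 = right-skewed)
pop_2007 = gap["pop"][gap["year"] == 2007]
print(pop_2007.skew()) # strongly positive, confirming the heavy right skew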

Why is this a problem?

Many standard statistical methods assume that the variables being predicted, or serving as predictors, are (roughly) normally distributed. Our population data above clearly isn’t!

How common is this?

The pattern above isn’t unique to population. Many other common variables tend to have similarly shaped distributions.

Can you think of some others? (hint: remember Zipf’s Law back in pset 1?)
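
As a quick illustration (simulated data, not part of gapminder), heavy-tailed variables like these show the same mean-median gap we saw above:

# Simulate a heavy-tailed variable (log-normal), e.g. a toy model of incomes
rng = np.random.default_rng(0) # seeded for reproducibility
fake_incomes = rng.lognormal(mean = 10, sigma = 1, size = 1000)

print(np.mean(fake_incomes))   # the mean is pulled far to the right by the tail...
print(np.median(fake_incomes)) # ...well above the median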

Solution: with skewed data, use a log transform!

What do we mean by log transforming?

We want to convert our population data to a logarithmic scale rather than a linear scale.

We’ll illustrate this difference below.

# Our original histogram used bins spaced at fixed linear intervals. We can see the linear spacing from the x-axis ticks, printed below.

_ = sns.histplot(
    data = gap[gap["year"] == 2007],
    x = "pop",
    bins = 20
)

# Here are the x-axis ticks from our original histogram: note how they increase by a fixed amount
print("Axis ticks: {}".format([str(elem) for elem in plt.xticks()[0]]))

# Can print each element minus the previous element to show this
# [plt.xticks()[0][i+1] - plt.xticks()[0][i] for i in range(len(plt.xticks()[0])-1)]
Axis ticks: ['-200000000.0', '0.0', '200000000.0', '400000000.0', '600000000.0', '800000000.0', '1000000000.0', '1200000000.0', '1400000000.0']
[Figure: histogram of population by country, 2007, with linearly spaced x-axis ticks]
# Step 1: let's generate logarithmically spaced bins instead of the above.


# Instead of increasing by a fixed amount (200 million), they will increase by a fixed *multiple* (about 2.78x), from 100k to 1B
# Use `np.logspace` for this
log_bins = np.logspace(base = 10, # specify the base (10 is default)
                       start = 5, # start exponent: first value is base**start (here 10**5 = 100,000)
                       stop = 9, # stop exponent: last value is base**stop (here 10**9 = 1,000,000,000)
                       num = 10) # number of values (i.e., bin edges) to generate

[str(elem) for elem in log_bins]

# These don't increase by a fixed amount
# Can print each element minus the previous element as we did above to show this

# [log_bins[i+1] - log_bins[i] for i in range(len(log_bins)-1)]

# Instead, they increase by a fixed *multiple*
# Can print each element in log_bins *divided by* the previous element to show this

# [log_bins[i+1] / log_bins[i] for i in range(len(log_bins)-1)]
['100000.0',
 '278255.94022071257',
 '774263.6826811278',
 '2154434.6900318824',
 '5994842.503189409',
 '16681005.372000592',
 '46415888.33612773',
 '129154966.50148827',
 '359381366.3804626',
 '1000000000.0']
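We can confirm the fixed multiple directly: with 10 edges spanning 10**5 to 10**9, each value is 10**(4/9), roughly 2.78, times the previous one.

# Each consecutive ratio equals 10 ** ((9 - 5) / (10 - 1)) ≈ 2.783
print(log_bins[1:] / log_bins[:-1])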
# Now, let's use these logarithmic intervals as the basis for our histogram bins

g = sns.histplot(
    data = gap[gap["year"] == 2007],
    x = "pop",
    bins = log_bins # This is the key change
)

# NOTE: we need to also specify that the x axis should be drawn on a logarithmic scale 
# (try graphing without this to see what happens!)
g.set_xscale('log')
[Figure: histogram of population, 2007, using log-spaced bins and a log-scaled x-axis]

Our data looks roughly normally distributed when we plot it on a log scale. Woo hoo!

But we haven’t changed the underlying data.

Let’s log transform the data itself so its (log) values are normally distributed.

# To do this, use np.log10 (np.log uses the *natural log*)
gap['log_pop'] = np.log10(gap['pop'])

gap
      Unnamed: 0      country continent  year  lifeExp       pop   gdpPercap   log_pop
0              1  Afghanistan      Asia  1952   28.801   8425333  779.445314  6.925587
1              2  Afghanistan      Asia  1957   30.332   9240934  820.853030  6.965716
2              3  Afghanistan      Asia  1962   31.997  10267083  853.100710  7.011447
3              4  Afghanistan      Asia  1967   34.020  11537966  836.197138  7.062129
4              5  Afghanistan      Asia  1972   36.088  13079460  739.981106  7.116590
...          ...          ...       ...   ...      ...       ...         ...       ...
1699        1700     Zimbabwe    Africa  1987   62.351   9216418  706.157306  6.964562
1700        1701     Zimbabwe    Africa  1992   60.377  10704340  693.420786  7.029560
1701        1702     Zimbabwe    Africa  1997   46.809  11404948  792.449960  7.057093
1702        1703     Zimbabwe    Africa  2002   39.989  11926563  672.038623  7.076515
1703        1704     Zimbabwe    Africa  2007   43.487  12311143  469.709298  7.090298

1704 rows × 8 columns

Now what? Let’s take a look at our log transformed population variable.

Is it normally distributed?

g = sns.histplot(data = gap[gap['year'] == 2007], x = 'log_pop')
g.set_xlabel("Log transformed population")
[Figure: histogram of log-transformed population, 2007]
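
If we want a more formal check than eyeballing the histogram, one option (a sketch, assuming scipy is available) is a normality test or a Q-Q plot:

from scipy import stats

log_pop_2007 = gap["log_pop"][gap["year"] == 2007]

# D'Agostino-Pearson normality test: a large p-value means we can't reject normality
print(stats.normaltest(log_pop_2007))

# Q-Q plot: points falling near the line suggest the data is roughly normal
stats.probplot(log_pop_2007, plot = plt)
plt.show()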

Log transformations: Summary

  • Statistics and modeling solutions often assume that the underlying variables are normally distributed

  • You can count on many variables in the world being roughly normally distributed, so this assumption is often reasonable

  • But certain types of data are reliably not normally distributed (ex. income, wealth, population, number of Twitter followers, number of Spotify streams, …)

  • When your data looks like the examples above (rule of thumb: values spanning several orders of magnitude, with a very large right skew), it’s often the case that the logarithm of the data is normally distributed.

  • You can check whether this is true by plotting the data on a log scale as we did above. If it looks roughly normal, consider log transforming your data.

Note: using the log transformed values for a variable in statistics or regression changes how you interpret your results (for example, regression coefficients on a log-transformed variable X will reflect the impact of multiplicative changes to X on the output variable Y).
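
To make that concrete, here is a hypothetical example (illustrative coefficients, not fit to our data): if a regression estimates Y = a + b * log10(X), then multiplying X by 10 adds exactly b to the prediction, and doubling X adds b * log10(2), about 0.3 * b.

# Hypothetical regression coefficients for Y = a + b * log10(X)
a, b = 2.0, 5.0

def predict(x):
    return a + b * np.log10(x)

print(predict(1_000_000) - predict(100_000)) # 10x increase in X -> +b = 5.0
print(predict(200_000) - predict(100_000))   # 2x increase in X -> b * log10(2) ≈ 1.5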


Part 2: z-scoring

Problem: comparing data from different distributions

To get a sense of when we may need to z-score our data, let’s take a look at our pokemon dataset once again:

pk = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/Pokemon.csv")

pk
       #                   Name   Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0      1              Bulbasaur    Grass  Poison    318  45      49       49       65       65     45           1      False
1      2                Ivysaur    Grass  Poison    405  60      62       63       80       80     60           1      False
2      3               Venusaur    Grass  Poison    525  80      82       83      100      100     80           1      False
3      3  VenusaurMega Venusaur    Grass  Poison    625  80     100      123      122      120     80           1      False
4      4             Charmander     Fire     NaN    309  39      52       43       60       50     65           1      False
..   ...                    ...      ...     ...    ...  ..     ...      ...      ...      ...    ...         ...        ...
795  719                Diancie     Rock   Fairy    600  50     100      150      100      150     50           6       True
796  719    DiancieMega Diancie     Rock   Fairy    700  50     160      110      160      110    110           6       True
797  720    HoopaHoopa Confined  Psychic   Ghost    600  80     110       60      150      130     70           6       True
798  720     HoopaHoopa Unbound  Psychic    Dark    680  80     160       60      170      130     80           6       True
799  721              Volcanion     Fire   Water    600  80     110      120      130       90     70           6       True

800 rows × 13 columns

Let’s say we want to know whether each of our pokemon is going to be better off attacking opponents, or simply outlasting them.

The Attack variable indicates how strong they are at attacking, while HP indicates how long they can withstand attacks.

How would we evaluate which of these is stronger for each pokemon?

Strategy 1: look at which variable is larger

# "attacker" variable says whether a given pokemon has higher attack value than HP
pk['attacker'] = pk['Attack'] > pk['HP']

pk
       #                   Name   Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  attacker
0      1              Bulbasaur    Grass  Poison    318  45      49       49       65       65     45           1      False      True
1      2                Ivysaur    Grass  Poison    405  60      62       63       80       80     60           1      False      True
2      3               Venusaur    Grass  Poison    525  80      82       83      100      100     80           1      False      True
3      3  VenusaurMega Venusaur    Grass  Poison    625  80     100      123      122      120     80           1      False      True
4      4             Charmander     Fire     NaN    309  39      52       43       60       50     65           1      False      True
..   ...                    ...      ...     ...    ...  ..     ...      ...      ...      ...    ...         ...        ...       ...
795  719                Diancie     Rock   Fairy    600  50     100      150      100      150     50           6       True      True
796  719    DiancieMega Diancie     Rock   Fairy    700  50     160      110      160      110    110           6       True      True
797  720    HoopaHoopa Confined  Psychic   Ghost    600  80     110       60      150      130     70           6       True      True
798  720     HoopaHoopa Unbound  Psychic    Dark    680  80     160       60      170      130     80           6       True      True
799  721              Volcanion     Fire   Water    600  80     110      120      130       90     70           6       True      True

800 rows × 14 columns

# We can plot this variable across our generations
sns.countplot(data = pk,
              x = 'Generation',
              hue = 'attacker'
             )
[Figure: count of pokemon by Generation, split by attacker]

However, there’s one challenge with this strategy.

If most pokemon have higher Attack than HP values simply because of how the two variables are distributed, then looking at which raw value is higher won’t tell us whether a particular pokemon’s advantage lies in attacking or in withstanding an opponent.

To see this, take a look at the distribution of HP and Attack variables below.

# For the graph below, it's easiest to switch to tidy data!

pk_tidy = pk.melt(
    id_vars = ["#", "Name", "Type 1", "Type 2", "Generation", "Legendary", "attacker"]
)
pk_tidy


sns.histplot(
    data = pk_tidy[pk_tidy['variable'].isin(["HP", "Attack"])],
    x = "value",
    hue = "variable",
    kde = True,
    bins = 15
)
[Figure: overlapping histograms of HP and Attack values]

In the graph above, the Attack distribution is shifted right and seems more spread out than the HP distribution, so most pokemon will likely have higher Attack points.

What we really want to know might be something like: relative to the competition, does each pokemon have a more impressive Attack or HP value?

Solution: when comparing across (normal) distributions, try z-scores!

What is a z-score?

A z-score for a given observation \(x\) is the number of standard deviations \(x\) lies above or below the mean.

\(Z = \dfrac{x - \mu}{\sigma}\)
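
For example, using the values we compute below (mean Attack ≈ 79, standard deviation ≈ 32.5), Bulbasaur’s Attack of 49 gives \(Z = \dfrac{49 - 79}{32.5} \approx -0.92\): its Attack is about 0.92 standard deviations below the mean, matching the attack_z value in the table further down.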

# Let's pre-compute the mean and standard deviation of each distribution in our data (Attack and HP)

mean_attack = pk['Attack'].mean() # 79
mean_hp = pk['HP'].mean() # 69

sd_attack = pk['Attack'].std() # 32.5
sd_hp = pk['HP'].std() # 25.5

# Note that these are fairly different in roughly the way we described above.
# Now, let's evaluate the z score for each pokemon's Attack and HP values

pk['attack_z'] = (pk['Attack'] - mean_attack) / sd_attack
pk['hp_z'] = (pk['HP'] - mean_hp) / sd_hp

pk
       #                   Name   Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  attacker  attack_z      hp_z
0      1              Bulbasaur    Grass  Poison    318  45      49       49       65       65     45           1      False      True -0.924328 -0.950032
1      2                Ivysaur    Grass  Poison    405  60      62       63       80       80     60           1      False      True -0.523803 -0.362595
2      3               Venusaur    Grass  Poison    525  80      82       83      100      100     80           1      False      True  0.092390  0.420654
3      3  VenusaurMega Venusaur    Grass  Poison    625  80     100      123      122      120     80           1      False      True  0.646964  0.420654
4      4             Charmander     Fire     NaN    309  39      52       43       60       50     65           1      False      True -0.831899 -1.185007
..   ...                    ...      ...     ...    ...  ..     ...      ...      ...      ...    ...         ...        ...       ...       ...       ...
795  719                Diancie     Rock   Fairy    600  50     100      150      100      150     50           6       True      True  0.646964 -0.754220
796  719    DiancieMega Diancie     Rock   Fairy    700  50     160      110      160      110    110           6       True      True  2.495543 -0.754220
797  720    HoopaHoopa Confined  Psychic   Ghost    600  80     110       60      150      130     70           6       True      True  0.955061  0.420654
798  720     HoopaHoopa Unbound  Psychic    Dark    680  80     160       60      170      130     80           6       True      True  2.495543  0.420654
799  721              Volcanion     Fire   Water    600  80     110      120      130       90     70           6       True      True  0.955061  0.420654

800 rows × 16 columns
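
As an aside, we didn’t have to precompute the means and standard deviations ourselves. A sketch of an equivalent one-liner using scipy (note that scipy.stats.zscore defaults to the population standard deviation, ddof = 0, so we pass ddof = 1 to match pandas’ .std()):

from scipy import stats

# Equivalent computation; ddof = 1 matches pandas' sample standard deviation
attack_z_alt = stats.zscore(pk['Attack'], ddof = 1)
print(np.allclose(attack_z_alt, pk['attack_z'])) # True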

What do z-score distributions look like?

sns.kdeplot(data = pk, x = 'attack_z', color = 'red', alpha = 0.5)
sns.kdeplot(data = pk, x = 'hp_z', color = 'green', alpha = 0.5)

plt.xlabel("Z score")
[Figure: overlaid KDE plots of z-scored Attack and HP]

These are much more comparable!

Now, we can ask which pokemon have a higher z-scored Attack relative to their z-scored HP.

pk['z_attacker'] = pk['attack_z'] > pk['hp_z']

pk
       #                   Name   Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  attacker  attack_z      hp_z  z_attacker
0      1              Bulbasaur    Grass  Poison    318  45      49       49       65       65     45           1      False      True -0.924328 -0.950032        True
1      2                Ivysaur    Grass  Poison    405  60      62       63       80       80     60           1      False      True -0.523803 -0.362595       False
2      3               Venusaur    Grass  Poison    525  80      82       83      100      100     80           1      False      True  0.092390  0.420654       False
3      3  VenusaurMega Venusaur    Grass  Poison    625  80     100      123      122      120     80           1      False      True  0.646964  0.420654        True
4      4             Charmander     Fire     NaN    309  39      52       43       60       50     65           1      False      True -0.831899 -1.185007        True
..   ...                    ...      ...     ...    ...  ..     ...      ...      ...      ...    ...         ...        ...       ...       ...       ...         ...
795  719                Diancie     Rock   Fairy    600  50     100      150      100      150     50           6       True      True  0.646964 -0.754220        True
796  719    DiancieMega Diancie     Rock   Fairy    700  50     160      110      160      110    110           6       True      True  2.495543 -0.754220        True
797  720    HoopaHoopa Confined  Psychic   Ghost    600  80     110       60      150      130     70           6       True      True  0.955061  0.420654        True
798  720     HoopaHoopa Unbound  Psychic    Dark    680  80     160       60      170      130     80           6       True      True  2.495543  0.420654        True
799  721              Volcanion     Fire   Water    600  80     110      120      130       90     70           6       True      True  0.955061  0.420654        True

800 rows × 17 columns

# We can plot this variable across our generations
sns.countplot(data = pk,
              x = 'Generation',
              hue = 'z_attacker'
             )

# Note it's a little less dramatic than our previous plot!
[Figure: count of pokemon by Generation, split by z_attacker]

This plot paints a somewhat more balanced picture than our previous one.

We can also use the z-scores to compare each pokemon’s Attack and HP in a more fine-grained way:

# Let's see which pokemon have the largest difference (in standard deviation units) between their `Attack` and `HP`
pk['attack_diff_normalized'] = pk['attack_z'] - pk['hp_z']

pk.nlargest(10, 'attack_diff_normalized')
       #                     Name   Type 1    Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary  attacker  attack_z      hp_z  z_attacker  attack_diff_normalized
429  386       DeoxysAttack Forme  Psychic       NaN    600  50     180       20      180       20    150           3       True      True  3.111736 -0.754220        True                3.865956
796  719      DiancieMega Diancie     Rock     Fairy    700  50     160      110      160      110    110           6       True      True  2.495543 -0.754220        True                3.249763
316  292                 Shedinja      Bug     Ghost    236   1      90       45       30       30     40           3      False      True  0.338868 -2.673179        True                3.012047
428  386       DeoxysNormal Forme  Psychic       NaN    600  50     150       50      150       50    150           3       True      True  2.187446 -0.754220        True                2.941666
387  354      BanetteMega Banette    Ghost       NaN    555  64     165       75       93       83     75           3      False      True  2.649591 -0.205945        True                2.855537
232  214  HeracrossMega Heracross      Bug  Fighting    600  80     185      115       40      105     75           2      False      True  3.265784  0.420654        True                2.845131
527  475      GalladeMega Gallade  Psychic  Fighting    618  68     165       95       65      115    110           4      False      True  2.649591 -0.049296        True                2.698887
750  681     AegislashBlade Forme    Steel     Ghost    520  60     150       50      150       50     60           6      False      True  2.187446 -0.362595        True                2.550042
137  127        PinsirMega Pinsir      Bug    Flying    600  65     155      120       65       90    105           1      False      True  2.341495 -0.166783        True                2.508278
19    15    BeedrillMega Beedrill      Bug    Poison    495  65     150       40       15       80    145           1      False      True  2.187446 -0.166783        True                2.354230

We can plot this distribution and use it to evaluate the relative strength of a given pokemon’s Attack:

g = sns.kdeplot(data = pk,
            x = 'attack_diff_normalized'
           )


g.axvline(pk[pk['Name'] == 'Shedinja']['attack_diff_normalized'].iloc[0],
            color = 'r',
            ls = '--'
           )

g.text(3.25, 0.3, "Shedinja")

g.set_xlim(-5, 5)
g.set_title("Distribution of differences in Attack and HP z-scores")
g.set_xlabel("Normalized Attack - HP")
[Figure: distribution of differences in Attack and HP z-scores, with Shedinja marked]

Z-scores: Summary

  • Sometimes for analysis purposes, you’ll want to compare variables that are measured on different scales or whose underlying distributions differ (ex. how do we compare SAT and ACT scores? see the sketch after this list)

  • To make these comparisons, it’s often easier to first convert the data to a common scale.

  • A z-score converts each data point to a number of standard deviations above or below the mean, which makes it easier to compare two scores drawn from otherwise very different distributions, or to use these scores as the basis for regression or other analyses.
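
For example, here’s a sketch of the SAT vs. ACT comparison mentioned above (the means and standard deviations are hypothetical placeholders, not official test statistics):

# Hypothetical summary statistics, for illustration only
sat_mean, sat_sd = 1050, 200
act_mean, act_sd = 21, 5

sat_z = (1350 - sat_mean) / sat_sd # 1.5
act_z = (29 - act_mean) / act_sd   # 1.6

# The ACT score is slightly more impressive relative to its own distribution
print(sat_z, act_z)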