Lecture 6 (4/8/22)

Last time we covered:

  • numpy continuation

  • pandas basics

Today’s agenda:

  • Processing data with pandas

import pandas as pd
import numpy as np

Creating new dataframes

It’s worth noting: there are many different ways to do operations like dataframe creation in pandas.

This is probably the most intuitive approach, but you may come across others!
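For instance, a minimal sketch of one alternative you might encounter: building the dataframe from a list of row dictionaries (only two rows shown here for illustration).

# Hypothetical alternative: one dictionary per row instead of one list per column
uc_rows = pd.DataFrame([
    {"School": "Berkeley", "Date founded": 1868},
    {"School": "San Diego", "Date founded": 1960}
])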

# First let's initialize some data
# Typically, we want to think of our data as being lists of the stuff in each column
uc_schools = np.array(["Berkeley", "San Diego", "Los Angeles", "Santa Barbara", "San Francisco", 
                       "Irvine", "Davis", "Riverside", "Santa Cruz", "Merced"])
uc_founded = np.array([1868, 1960, 1919, 1909, 1864, 1965, 1905, 1954, 1965, 2005])
uc_students = np.array([45057, 42875, 45742, 26314, 3132, 35220, 40031, 25548, 19161, 8847])


# Now let's put it in a dataframe!
uc_data = pd.DataFrame({ # start by declaring a new data frame
    "School": uc_schools, # each column name is a dictionary key and the list of column data points is the value
    "Date founded": uc_founded,
    "Number of students": uc_students
})

uc_data
          School  Date founded  Number of students
0       Berkeley          1868               45057
1      San Diego          1960               42875
2    Los Angeles          1919               45742
3  Santa Barbara          1909               26314
4  San Francisco          1864                3132
5         Irvine          1965               35220
6          Davis          1905               40031
7      Riverside          1954               25548
8     Santa Cruz          1965               19161
9         Merced          2005                8847

We will likely find ourselves reading in data more often than creating new dataframes, but it’s occasionally useful and good to know how to do it!
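For comparison, reading in data is usually a one-liner; a minimal sketch, assuming a hypothetical file uc_schools.csv with the same columns:

# Hypothetical: load the same table from a CSV file instead of building it by hand
# uc_from_file = pd.read_csv("uc_schools.csv")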

Adding data to existing dataframes

Adding columns (common!)

We may often find ourselves adding columns to a dataframe, e.g., creating columns that represent our existing data in a new way.

# One option: when we have new data to add
uc_undergrads = np.array([31814, 33343, 31543, 23349, 0, 30222, 31162, 22055, 17207, 8151])

uc_data['Undergraduates'] = uc_undergrads # Use bracket syntax to declare a new column

uc_data
          School  Date founded  Number of students  Undergraduates
0       Berkeley          1868               45057           31814
1      San Diego          1960               42875           33343
2    Los Angeles          1919               45742           31543
3  Santa Barbara          1909               26314           23349
4  San Francisco          1864                3132               0
5         Irvine          1965               35220           30222
6          Davis          1905               40031           31162
7      Riverside          1954               25548           22055
8     Santa Cruz          1965               19161           17207
9         Merced          2005                8847            8151
# A second option: when we want to process existing data and form a new column

uc_data = uc_data.assign(Undergraduate_pct = uc_data['Undergraduates'] / uc_data['Number of students'])

uc_data # Note what happens if we don't do the re-assignment above

# For this sort of processing, we can also use similar syntax to the above (try it yourself!)
# but `assign` gives us some additional flexibility
          School  Date founded  Number of students  Undergraduates  Undergraduate_pct
0       Berkeley          1868               45057           31814           0.706083
1      San Diego          1960               42875           33343           0.777679
2    Los Angeles          1919               45742           31543           0.689585
3  Santa Barbara          1909               26314           23349           0.887322
4  San Francisco          1864                3132               0           0.000000
5         Irvine          1965               35220           30222           0.858092
6          Davis          1905               40031           31162           0.778447
7      Riverside          1954               25548           22055           0.863277
8     Santa Cruz          1965               19161           17207           0.898022
9         Merced          2005                8847            8151           0.921329
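As the comment above suggests, the same derived column can also be created with bracket syntax; a quick sketch of the equivalent:

# Equivalent bracket syntax: produces the same column as the `assign` call above
# uc_data['Undergraduate_pct'] = uc_data['Undergraduates'] / uc_data['Number of students']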

Adding rows

# Let's say the UC system decides it's long overdue to build a campus in Lake Tahoe near the Nevada border. 
# We want to add some projected data

# First, we make our new row into a dataframe of its own, with matching columns
uc_tahoe = pd.DataFrame({
    "School": ["Lake Tahoe"],
    "Date founded": [2022] # Note we don't need to know all the column info here
})

uc_tahoe
       School  Date founded
0  Lake Tahoe          2022
# Next, we use `concat` to add it to the previous dataframe
uc_data = pd.concat([uc_data, uc_tahoe])

uc_data
          School  Date founded  Number of students  Undergraduates  Undergraduate_pct
0       Berkeley          1868             45057.0         31814.0           0.706083
1      San Diego          1960             42875.0         33343.0           0.777679
2    Los Angeles          1919             45742.0         31543.0           0.689585
3  Santa Barbara          1909             26314.0         23349.0           0.887322
4  San Francisco          1864              3132.0             0.0           0.000000
5         Irvine          1965             35220.0         30222.0           0.858092
6          Davis          1905             40031.0         31162.0           0.778447
7      Riverside          1954             25548.0         22055.0           0.863277
8     Santa Cruz          1965             19161.0         17207.0           0.898022
9         Merced          2005              8847.0          8151.0           0.921329
0     Lake Tahoe          2022                 NaN             NaN                NaN
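Notice that Lake Tahoe kept its original index of 0, so the index now repeats. If you want a fresh 0..n index instead, concat can renumber the rows; a small sketch (not run here, so the outputs below keep the duplicate index):

# Variant: renumber the rows during the concat so the index stays unique
# uc_data = pd.concat([uc_data, uc_tahoe], ignore_index=True)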

Processing data: filtering, grouping, summarizing

First, note that some operations can be done on our dataframe without having to use filtering or grouping.

uc_data = uc_data.sort_values('Date founded', ascending = True)
uc_data

# Note this doesn't change the index at far left!
          School  Date founded  Number of students  Undergraduates  Undergraduate_pct
4  San Francisco          1864              3132.0             0.0           0.000000
0       Berkeley          1868             45057.0         31814.0           0.706083
6          Davis          1905             40031.0         31162.0           0.778447
3  Santa Barbara          1909             26314.0         23349.0           0.887322
2    Los Angeles          1919             45742.0         31543.0           0.689585
7      Riverside          1954             25548.0         22055.0           0.863277
1      San Diego          1960             42875.0         33343.0           0.777679
5         Irvine          1965             35220.0         30222.0           0.858092
8     Santa Cruz          1965             19161.0         17207.0           0.898022
9         Merced          2005              8847.0          8151.0           0.921329
0     Lake Tahoe          2022                 NaN             NaN                NaN
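If we do want the rows renumbered after sorting, reset_index handles it; a minimal sketch (not run here, so the outputs below keep the original index):

# Hypothetical follow-up: drop the old index and renumber the rows 0..n
# uc_data = uc_data.sort_values('Date founded').reset_index(drop=True)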
print(uc_data.max())
print(uc_data['Number of students'].min())
School                Santa Cruz
Date founded                2022
Number of students         45742
Undergraduates             33343
Undergraduate_pct       0.921329
dtype: object
3132.0
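Relatedly, describe() bundles several of these summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column; a quick sketch:

# One-stop numeric summary of the dataframe
uc_data.describe()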

Filtering: fetch rows that meet certain conditions

Filtering operations provided by pandas

We can do some kinds of filtering with built-in operations like nlargest.

top_3_students = uc_data.nlargest(3, 'Number of students')
top_3_students
        School  Date founded  Number of students  Undergraduates  Undergraduate_pct
2  Los Angeles          1919             45742.0         31543.0           0.689585
0     Berkeley          1868             45057.0         31814.0           0.706083
1    San Diego          1960             42875.0         33343.0           0.777679

Logical filtering

Most often, we filter by setting logical criteria over certain column values.

# Get quantiles of undergraduate sizes
q = uc_data['Undergraduates'].quantile([0.25, 0.75])
q[0.75]

# Fetch uc_data rows where uc_data['Undergraduates'] is > the 75th percentile value above
large_undergrads = uc_data[uc_data['Undergraduates'] > q[0.75]] 
large_undergrads
        School  Date founded  Number of students  Undergraduates  Undergraduate_pct
0     Berkeley          1868             45057.0         31814.0           0.706083
2  Los Angeles          1919             45742.0         31543.0           0.689585
1    San Diego          1960             42875.0         33343.0           0.777679

How does this work?

uc_data['Undergraduates'] > q[0.75]
4    False
0     True
6    False
3    False
2     True
7    False
1     True
5    False
8    False
9    False
0    False
Name: Undergraduates, dtype: bool

So in the code above, we’re basically saying “fetch the rows where this condition evaluates to True”.

Note this can get much more complicated…

# What's going on here?
max_undergrad = uc_data[uc_data['Undergraduates'] == uc_data['Undergraduates'].max()]
max_undergrad
      School  Date founded  Number of students  Undergraduates  Undergraduate_pct
1  San Diego          1960             42875.0         33343.0           0.777679
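We can also combine conditions with & (and) and | (or), wrapping each comparison in parentheses; a minimal sketch with hypothetical cutoffs:

# Hypothetical example: large undergraduate campuses founded after 1950
uc_data[(uc_data['Undergraduates'] > 20000) & (uc_data['Date founded'] > 1950)]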

Why is this useful?

Let’s ask: what is the average number of undergraduates at the schools with the most students overall?

q = uc_data['Number of students'].quantile([0.80])

# What's going on here?
uc_data[uc_data['Number of students'] >= q[0.8]]['Undergraduates'].mean()
31678.5

Here, we applied some statistics to a filtered subset of our data.

Imagine we wanted the same average for each quartile of student numbers, instead of just the top 20%.

This kind of thing is pretty common: this is where grouping comes in.
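As a preview, one way to get that quartile version is to bin the student counts with pd.qcut and group on the bins; a sketch, assuming we want four equal-sized bins:

# Hypothetical preview: bin campuses into quartiles of total students,
# then average the undergraduate counts within each bin
quartile_bins = pd.qcut(uc_data['Number of students'], 4)
uc_data.groupby(quartile_bins)['Undergraduates'].mean()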

Grouping and summarizing: analyze and graph your data

First let’s read in some more complex data.

What’s going on with this data?

# pokemon = pd.read_csv("../Datasets/Pokemon.csv")
# Use this code in class
pokemon = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/Pokemon.csv")
pokemon
       #                   Name   Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
  0    1              Bulbasaur    Grass  Poison    318  45      49       49       65       65     45           1      False
  1    2                Ivysaur    Grass  Poison    405  60      62       63       80       80     60           1      False
  2    3               Venusaur    Grass  Poison    525  80      82       83      100      100     80           1      False
  3    3  VenusaurMega Venusaur    Grass  Poison    625  80     100      123      122      120     80           1      False
  4    4             Charmander     Fire     NaN    309  39      52       43       60       50     65           1      False
 ..  ...                    ...      ...     ...    ...  ..     ...      ...      ...      ...    ...         ...        ...
795  719                Diancie     Rock   Fairy    600  50     100      150      100      150     50           6       True
796  719    DiancieMega Diancie     Rock   Fairy    700  50     160      110      160      110    110           6       True
797  720    HoopaHoopa Confined  Psychic   Ghost    600  80     110       60      150      130     70           6       True
798  720     HoopaHoopa Unbound  Psychic    Dark    680  80     160       60      170      130     80           6       True
799  721              Volcanion     Fire   Water    600  80     110      120      130       90     70           6       True

800 rows × 13 columns

Let’s say we want to know which Type 1 group has the most HP on average [any guesses?]

One solution:

  1. Look at each individual Type 1 value in our data

  2. Then, one by one, filter the data to match each unique Type 1 value

  3. Then, compute the average HP in that filtered data

  4. Then save it somewhere else to keep track of

  5. Then see which one is largest

This seems pretty tough…

Let’s start with something simpler: how many of each kind of Type 1 pokemon are there?

pokemon.groupby(['Type 1']).size().reset_index()
      Type 1    0
0        Bug   69
1       Dark   31
2     Dragon   32
3   Electric   44
4      Fairy   17
5   Fighting   27
6       Fire   52
7     Flying    4
8      Ghost   32
9      Grass   70
10    Ground   32
11       Ice   24
12    Normal   98
13    Poison   28
14   Psychic   57
15      Rock   44
16     Steel   27
17     Water  112

Above, we just asked pandas to tell us the value of size() applied to each unique group of Type 1 pokemon.
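Incidentally, the same counts are available from value_counts, just sorted by frequency instead of by type name:

# Equivalent counts, ordered by frequency rather than alphabetically
pokemon['Type 1'].value_counts()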

Can we do the same thing but for more complex operations than size()? You bet!

pokemon.groupby(['Type 1']).agg( # .agg is our friend here!
    mean_hp = ('HP', np.mean) # this defines a new statistic over each grouping. Apply `np.mean` to the `HP` column
).reset_index()
      Type 1    mean_hp
0        Bug  56.884058
1       Dark  66.806452
2     Dragon  83.312500
3   Electric  59.795455
4      Fairy  74.117647
5   Fighting  69.851852
6       Fire  69.903846
7     Flying  70.750000
8      Ghost  64.437500
9      Grass  67.271429
10    Ground  73.781250
11       Ice  72.000000
12    Normal  77.275510
13    Poison  67.250000
14   Psychic  70.631579
15      Rock  65.363636
16     Steel  65.222222
17     Water  72.062500

We don’t need to stop there. We can group by multiple variables and add multiple metrics!

type1_hp_summary = pokemon.groupby(
    ['Type 1', 'Legendary'] # Note we're now grouping by each combination of Type 1 and Legendary
).agg(
    mean_hp = ('HP', np.mean), # And we're compiling multiple statistics here
    min_hp = ('HP', np.min),
    max_hp = ('HP', np.max)
).reset_index()

type1_hp_summary
      Type 1  Legendary     mean_hp  min_hp  max_hp
0        Bug      False   56.884058       1      86
1       Dark      False   64.655172      35     110
2       Dark       True   98.000000      70     126
3     Dragon      False   72.650000      41     108
4     Dragon       True  101.083333      80     125
5   Electric      False   57.325000      20      90
6   Electric       True   84.500000      79      90
7      Fairy      False   70.875000      35     101
8      Fairy       True  126.000000     126     126
9   Fighting      False   69.851852      30     144
10      Fire      False   67.085106      38     110
11      Fire       True   96.400000      80     115
12    Flying      False   62.500000      40      85
13    Flying       True   79.000000      79      79
14     Ghost      False   58.733333      20     150
15     Ghost       True  150.000000     150     150
16     Grass      False   65.940299      30     123
17     Grass       True   97.000000      91     100
18    Ground      False   70.821429      10     115
19    Ground       True   94.500000      89     100
20       Ice      False   70.818182      36     110
21       Ice       True   85.000000      80      90
22    Normal      False   76.489583      30     255
23    Normal       True  115.000000     110     120
24    Poison      False   67.250000      35     105
25   Psychic      False   67.720930      20     190
26   Psychic       True   79.571429      50     106
27      Rock      False   65.125000      30     123
28      Rock       True   67.750000      50      91
29     Steel      False   60.434783      40      80
30     Steel       True   92.750000      80     100
31     Water      False   71.120370      20     170
32     Water       True   97.500000      90     100

Now we can apply the filtering we discussed above. For example:

type1_hp_summary.nlargest(5, 'mean_hp')
      Type 1  Legendary     mean_hp  min_hp  max_hp
15     Ghost       True  150.000000     150     150
8      Fairy       True  126.000000     126     126
23    Normal       True  115.000000     110     120
4     Dragon       True  101.083333      80     125
2       Dark       True   98.000000      70     126
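The same result could come from sorting; a quick sketch of the equivalent:

# Equivalent to nlargest: sort descending by mean_hp and keep the first five rows
type1_hp_summary.sort_values('mean_hp', ascending=False).head(5)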

Let’s practice!

In each Generation, how many different Type 1 and Type 2 types are there?

[HINT: use the nunique summary function]

# Write here














# SOLUTION
# pokemon.groupby('Generation').agg(
#     type1_types = ('Type 1', 'nunique'),
#     type2_types = ('Type 2', 'nunique')
# ).reset_index()

Make a new column called Composite_force that’s the average of each pokemon’s Attack, Defense, and Speed values.

Next, for each Type 1 type, what is the maximum of this new column?

# Write here
















# SOLUTION
# pokemon = pokemon.assign(
#     Composite_force = (pokemon['Attack'] + pokemon['Defense'] + pokemon['Speed'])/3
# )

# max_force = pokemon.groupby(['Type 1']).agg( # group by Type 1, per the question above
#     max_force = ('Composite_force', np.max)
# ).reset_index()