Lecture 6 (4/8/22)

Last time we covered:

  • numpy continuation

  • pandas basics

Today’s agenda:

  • Processing data with pandas

import pandas as pd
import numpy as np

Creating new dataframes

It’s worth noting: there are many different ways to do operations like dataframe creation in pandas.

This is probably the most intuitive approach, but you may come across others!
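For instance, a minimal sketch of one alternative you might encounter: building the dataframe from a list of row dictionaries (only two rows shown here for illustration).

# Hypothetical alternative: one dictionary per row instead of one list per column
uc_rows = pd.DataFrame([
    {"School": "Berkeley", "Date founded": 1868},
    {"School": "San Diego", "Date founded": 1960}
])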

# First let's initialize some data
# Typically, we want to think of our data as being lists of the stuff in each column
uc_schools = np.array(["Berkeley", "San Diego", "Los Angeles", "Santa Barbara", "San Francisco", 
                       "Irvine", "Davis", "Riverside", "Santa Cruz", "Merced"])
uc_founded = np.array([1868, 1960, 1919, 1909, 1864, 1965, 1905, 1954, 1965, 2005])
uc_students = np.array([45057, 42875, 45742, 26314, 3132, 35220, 40031, 25548, 19161, 8847])


# Now let's put it in a dataframe!
uc_data = pd.DataFrame({ # start by declaring a new data frame
    "School": uc_schools, # each column name is a dictionary key and the list of column data points is the value
    "Date founded": uc_founded,
    "Number of students": uc_students
})

uc_data
          School  Date founded  Number of students
0       Berkeley          1868               45057
1      San Diego          1960               42875
2    Los Angeles          1919               45742
3  Santa Barbara          1909               26314
4  San Francisco          1864                3132
5         Irvine          1965               35220
6          Davis          1905               40031
7      Riverside          1954               25548
8     Santa Cruz          1965               19161
9         Merced          2005                8847

We will likely find ourselves reading in data more often than creating new dataframes, but it’s occasionally useful and good to know how to do it!
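For comparison, reading in data is usually a one-liner; a minimal sketch, assuming a hypothetical file uc_schools.csv with the same columns:

# Hypothetical: load the same table from a CSV file instead of building it by hand
# uc_from_file = pd.read_csv("uc_schools.csv")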

Adding data to existing dataframes

Adding columns (common!)

We may often find ourselves adding columns to a dataframe, e.g., creating columns that represent our existing data in a new way.

# One option: when we have new data to add
uc_undergrads = np.array([31814, 33343, 31543, 23349, 0, 30222, 31162, 22055, 17207, 8151])

uc_data['Undergraduates'] = uc_undergrads # Use bracket syntax to declare a new column

uc_data
          School  Date founded  Number of students  Undergraduates
0       Berkeley          1868               45057           31814
1      San Diego          1960               42875           33343
2    Los Angeles          1919               45742           31543
3  Santa Barbara          1909               26314           23349
4  San Francisco          1864                3132               0
5         Irvine          1965               35220           30222
6          Davis          1905               40031           31162
7      Riverside          1954               25548           22055
8     Santa Cruz          1965               19161           17207
9         Merced          2005                8847            8151
# A second option: when we want to process existing data and form a new column

uc_data = uc_data.assign(Undergraduate_pct = uc_data['Undergraduates'] / uc_data['Number of students'])

uc_data # Note what happens if we don't do the re-assignment above

# For this sort of processing, we can also use similar syntax to the above (try it yourself!)
# but `assign` gives us some additional flexibility
          School  Date founded  Number of students  Undergraduates  Undergraduate_pct
0       Berkeley          1868               45057           31814           0.706083
1      San Diego          1960               42875           33343           0.777679
2    Los Angeles          1919               45742           31543           0.689585
3  Santa Barbara          1909               26314           23349           0.887322
4  San Francisco          1864                3132               0           0.000000
5         Irvine          1965               35220           30222           0.858092
6          Davis          1905               40031           31162           0.778447
7      Riverside          1954               25548           22055           0.863277
8     Santa Cruz          1965               19161           17207           0.898022
9         Merced          2005                8847            8151           0.921329
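As the comment above suggests, the same derived column can also be created with bracket syntax; a quick sketch of the equivalent:

# Equivalent bracket syntax: produces the same column as the `assign` call above
# uc_data['Undergraduate_pct'] = uc_data['Undergraduates'] / uc_data['Number of students']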

Adding rows

# Let's say the UC system decides it's long overdue to build a campus in Lake Tahoe near the Nevada border. 
# We want to add some projected data

# First, we make our new row into a dataframe of its own, with matching columns
uc_tahoe = pd.DataFrame({
    "School": ["Lake Tahoe"],
    "Date founded": [2022] # Note we don't need to know all the column info here
})

uc_tahoe
       School  Date founded
0  Lake Tahoe          2022
# Next, we use `concat` to add it to the previous dataframe
uc_data = pd.concat([uc_data, uc_tahoe])

uc_data
          School  Date founded  Number of students  Undergraduates  Undergraduate_pct
0       Berkeley          1868             45057.0         31814.0           0.706083
1      San Diego          1960             42875.0         33343.0           0.777679
2    Los Angeles          1919             45742.0         31543.0           0.689585
3  Santa Barbara          1909             26314.0         23349.0           0.887322
4  San Francisco          1864              3132.0             0.0           0.000000
5         Irvine          1965             35220.0         30222.0           0.858092
6          Davis          1905             40031.0         31162.0           0.778447
7      Riverside          1954             25548.0         22055.0           0.863277
8     Santa Cruz          1965             19161.0         17207.0           0.898022
9         Merced          2005              8847.0          8151.0           0.921329
0     Lake Tahoe          2022                 NaN             NaN                NaN
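Notice that Lake Tahoe kept its original index of 0, so the index now repeats. If you want a fresh 0..n index instead, concat can renumber the rows; a small sketch (not run here, so the outputs below keep the duplicate index):

# Variant: renumber the rows during the concat so the index stays unique
# uc_data = pd.concat([uc_data, uc_tahoe], ignore_index=True)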

Processing data: filtering, grouping, summarizing

First, note that some operations can be done on our dataframe without having to use filtering or grouping.

uc_data = uc_data.sort_values('Date founded', ascending = True)
uc_data

# Note this doesn't change the index at far left!
          School  Date founded  Number of students  Undergraduates  Undergraduate_pct
4  San Francisco          1864              3132.0             0.0           0.000000
0       Berkeley          1868             45057.0         31814.0           0.706083
6          Davis          1905             40031.0         31162.0           0.778447
3  Santa Barbara          1909             26314.0         23349.0           0.887322
2    Los Angeles          1919             45742.0         31543.0           0.689585
7      Riverside          1954             25548.0         22055.0           0.863277
1      San Diego          1960             42875.0         33343.0           0.777679
5         Irvine          1965             35220.0         30222.0           0.858092
8     Santa Cruz          1965             19161.0         17207.0           0.898022
9         Merced          2005              8847.0          8151.0           0.921329
0     Lake Tahoe          2022                 NaN             NaN                NaN
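If we do want the rows renumbered after sorting, reset_index handles it; a minimal sketch (not run here, so the outputs below keep the original index):

# Hypothetical follow-up: drop the old index and renumber the rows 0..n
# uc_data = uc_data.sort_values('Date founded').reset_index(drop=True)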
print(uc_data.max())
print(uc_data['Number of students'].min())
School                Santa Cruz
Date founded                2022
Number of students         45742
Undergraduates             33343
Undergraduate_pct       0.921329
dtype: object
3132.0
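Relatedly, describe() bundles several of these summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column; a quick sketch:

# One-stop numeric summary of the dataframe
uc_data.describe()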

Filtering: fetch rows that meet certain conditions

Filtering operations provided by pandas

We can do some kinds of filtering with built-in operations like nlargest.

top_3_students = uc_data.nlargest(3, 'Number of students')
top_3_students
        School  Date founded  Number of students  Undergraduates  Undergraduate_pct
2  Los Angeles          1919             45742.0         31543.0           0.689585
0     Berkeley          1868             45057.0         31814.0           0.706083
1    San Diego          1960             42875.0         33343.0           0.777679

Logical filtering

Most often, we filter by setting logical criteria over certain column values.

# Get quantiles of undergraduate sizes
q = uc_data['Undergraduates'].quantile([0.25, 0.75])
q[0.75]

# Fetch uc_data rows where uc_data['Undergraduates'] is > the 75th percentile value above
large_undergrads = uc_data[uc_data['Undergraduates'] > q[0.75]] 
large_undergrads
        School  Date founded  Number of students  Undergraduates  Undergraduate_pct
0     Berkeley          1868             45057.0         31814.0           0.706083
2  Los Angeles          1919             45742.0         31543.0           0.689585
1    San Diego          1960             42875.0         33343.0           0.777679

How does this work?

uc_data['Undergraduates'] > q[0.75]
4    False
0     True
6    False
3    False
2     True
7    False
1     True
5    False
8    False
9    False
0    False
Name: Undergraduates, dtype: bool

So in the code above, we’re basically saying “fetch the rows where this condition evaluates to True”.

Note this can get much more complicated…

# What's going on here?
max_undergrad = uc_data[uc_data['Undergraduates'] == uc_data['Undergraduates'].max()]
max_undergrad
      School  Date founded  Number of students  Undergraduates  Undergraduate_pct
1  San Diego          1960             42875.0         33343.0           0.777679
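We can also combine conditions with & (and) and | (or), wrapping each comparison in parentheses; a minimal sketch with hypothetical cutoffs:

# Hypothetical example: large undergraduate campuses founded after 1950
uc_data[(uc_data['Undergraduates'] > 20000) & (uc_data['Date founded'] > 1950)]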

Why is this useful?

Let’s ask: what is the average number of undergraduates at the schools with the most students overall?

q = uc_data['Number of students'].quantile([0.80])

# What's going on here?
uc_data[uc_data['Number of students'] >= q[0.8]]['Undergraduates'].mean()
31678.5

Here, we applied some statistics to a filtered subset of our data.

Imagine we wanted the same average for each quartile of student numbers, instead of just the top 20%.

This kind of thing is pretty common: this is where grouping comes in.
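As a preview, one way to get that quartile version is to bin the student counts with pd.qcut and group on the bins; a sketch, assuming we want four equal-sized bins:

# Hypothetical preview: bin campuses into quartiles of total students,
# then average the undergraduate counts within each bin
quartile_bins = pd.qcut(uc_data['Number of students'], 4)
uc_data.groupby(quartile_bins)['Undergraduates'].mean()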

Grouping and summarizing: analyze and graph your data

First let’s read in some more complex data.

What’s going on with this data?

# pokemon = pd.read_csv("../Datasets/Pokemon.csv")
# Use this code in class
pokemon = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/Pokemon.csv")
pokemon
       #                   Name   Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
  0    1              Bulbasaur    Grass  Poison    318  45      49       49       65       65     45           1      False
  1    2                Ivysaur    Grass  Poison    405  60      62       63       80       80     60           1      False
  2    3               Venusaur    Grass  Poison    525  80      82       83      100      100     80           1      False
  3    3  VenusaurMega Venusaur    Grass  Poison    625  80     100      123      122      120     80           1      False
  4    4             Charmander     Fire     NaN    309  39      52       43       60       50     65           1      False
 ..  ...                    ...      ...     ...    ...  ..     ...      ...      ...      ...    ...         ...        ...
795  719                Diancie     Rock   Fairy    600  50     100      150      100      150     50           6       True
796  719    DiancieMega Diancie     Rock   Fairy    700  50     160      110      160      110    110           6       True
797  720    HoopaHoopa Confined  Psychic   Ghost    600  80     110       60      150      130     70           6       True
798  720     HoopaHoopa Unbound  Psychic    Dark    680  80     160       60      170      130     80           6       True
799  721              Volcanion     Fire   Water    600  80     110      120      130       90     70           6       True

800 rows × 13 columns

Let’s say we want to know which Type 1 group has the most HP on average [any guesses?]

One solution:

  1. Look at each individual Type 1 value in our data

  2. Then, one by one, filter the data to match each unique Type 1 value

  3. Then, compute the average HP in that filtered data

  4. Then save it somewhere else to keep track of

  5. Then see which one is largest

This seems pretty tough…

Let’s start with something simpler: how many of each kind of Type 1 pokemon are there?

pokemon.groupby(['Type 1']).size().reset_index()
      Type 1    0
0        Bug   69
1       Dark   31
2     Dragon   32
3   Electric   44
4      Fairy   17
5   Fighting   27
6       Fire   52
7     Flying    4
8      Ghost   32
9      Grass   70
10    Ground   32
11       Ice   24
12    Normal   98
13    Poison   28
14   Psychic   57
15      Rock   44
16     Steel   27
17     Water  112

Above, we just asked pandas to tell us the value of size() applied to each unique group of Type 1 pokemon.
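Incidentally, the same counts are available from value_counts, just sorted by frequency instead of by type name:

# Equivalent counts, ordered by frequency rather than alphabetically
pokemon['Type 1'].value_counts()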

Can we do the same thing but for more complex operations than size()? You bet!

pokemon.groupby(['Type 1']).agg( # .agg is our friend here!
    mean_hp = ('HP', np.mean) # this defines a new statistic over each grouping. Apply `np.mean` to the `HP` column
).reset_index()
      Type 1    mean_hp
0        Bug  56.884058
1       Dark  66.806452
2     Dragon  83.312500
3   Electric  59.795455
4      Fairy  74.117647
5   Fighting  69.851852
6       Fire  69.903846
7     Flying  70.750000
8      Ghost  64.437500
9      Grass  67.271429
10    Ground  73.781250
11       Ice  72.000000
12    Normal  77.275510
13    Poison  67.250000
14   Psychic  70.631579
15      Rock  65.363636
16     Steel  65.222222
17     Water  72.062500

We don’t need to stop there. We can group by multiple variables and add multiple metrics!

type1_hp_summary = pokemon.groupby(
    ['Type 1', 'Legendary'] # Note we're now grouping by each combination of Type 1 and Legendary
).agg(
    mean_hp = ('HP', np.mean), # And we're compiling multiple statistics here
    min_hp = ('HP', np.min),
    max_hp = ('HP', np.max)
).reset_index()

type1_hp_summary
      Type 1  Legendary     mean_hp  min_hp  max_hp
0        Bug      False   56.884058       1      86
1       Dark      False   64.655172      35     110
2       Dark       True   98.000000      70     126
3     Dragon      False   72.650000      41     108
4     Dragon       True  101.083333      80     125
5   Electric      False   57.325000      20      90
6   Electric       True   84.500000      79      90
7      Fairy      False   70.875000      35     101
8      Fairy       True  126.000000     126     126
9   Fighting      False   69.851852      30     144
10      Fire      False   67.085106      38     110
11      Fire       True   96.400000      80     115
12    Flying      False   62.500000      40      85
13    Flying       True   79.000000      79      79
14     Ghost      False   58.733333      20     150
15     Ghost       True  150.000000     150     150
16     Grass      False   65.940299      30     123
17     Grass       True   97.000000      91     100
18    Ground      False   70.821429      10     115
19    Ground       True   94.500000      89     100
20       Ice      False   70.818182      36     110
21       Ice       True   85.000000      80      90
22    Normal      False   76.489583      30     255
23    Normal       True  115.000000     110     120
24    Poison      False   67.250000      35     105
25   Psychic      False   67.720930      20     190
26   Psychic       True   79.571429      50     106
27      Rock      False   65.125000      30     123
28      Rock       True   67.750000      50      91
29     Steel      False   60.434783      40      80
30     Steel       True   92.750000      80     100
31     Water      False   71.120370      20     170
32     Water       True   97.500000      90     100

Now we can apply the filtering we discussed above. For example:

type1_hp_summary.nlargest(5, 'mean_hp')
      Type 1  Legendary     mean_hp  min_hp  max_hp
15     Ghost       True  150.000000     150     150
8      Fairy       True  126.000000     126     126
23    Normal       True  115.000000     110     120
4     Dragon       True  101.083333      80     125
2       Dark       True   98.000000      70     126
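The same result could come from sorting; a quick sketch of the equivalent:

# Equivalent to nlargest: sort descending by mean_hp and keep the first five rows
type1_hp_summary.sort_values('mean_hp', ascending=False).head(5)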

Let’s practice!

In each Generation, how many different Type 1 and Type 2 types are there?

[HINT: use the nunique summary function]

# Write here














# SOLUTION
# pokemon.groupby('Generation').agg(
#     type1_types = ('Type 1', 'nunique'),
#     type2_types = ('Type 2', 'nunique')
# ).reset_index()

Make a new column called Composite_force that’s the average of each pokemon’s Attack, Defense, and Speed values.

Next, for each Type 1 type, what is the maximum of this new column?

# Write here
















# SOLUTION
# pokemon = pokemon.assign(
#     Composite_force = (pokemon['Attack'] + pokemon['Defense'] + pokemon['Speed'])/3
# )

# max_force = pokemon.groupby(['Type 1']).agg( # group by Type 1, per the question above
#     max_force = ('Composite_force', np.max)
# ).reset_index()