Lecture 6 (4/8/22)

Last time we covered:

  • numpy continuation

  • pandas basics

Today’s agenda:

  • Processing data with pandas

import pandas as pd
import numpy as np

Creating new dataframes

It’s worth noting that there are many different ways to do operations like dataframe creation in pandas.

The approach below is probably the most intuitive, but you may come across others!
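For example, here is a minimal sketch (with made-up values) of two other standard constructors you might encounter: building a dataframe from a list of row dictionaries, and pd.DataFrame.from_records.

# Hypothetical mini-examples of alternative constructors
rows = [
    {"Campus": "Berkeley", "Date_founded": 1868},
    {"Campus": "Merced", "Date_founded": 2005},
]
df_from_rows = pd.DataFrame(rows)  # list of row dicts -> one row per dict

records = [("Berkeley", 1868), ("Merced", 2005)]
df_from_records = pd.DataFrame.from_records(records, columns=["Campus", "Date_founded"])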

# First let's initialize some data
# Typically, we want to think of our data as lists of the values in each column.
# Note we're deliberately mixing containers (numpy array, plain list, pandas Series) -- DataFrame accepts any of these
uc_schools = np.array(["Berkeley", "San Diego", "Los Angeles", "Santa Barbara", "San Francisco", 
                       "Irvine", "Davis", "Riverside", "Santa Cruz", "Merced"])
uc_founded = [1868, 1960, 1919, 1909, 1864, 1965, 1905, 1954, 1965, 2005]
uc_students = pd.Series([45057, 42875, 45742, 26314, 3132, 35220, 40031, 25548, 19161, 8847])
uc_grads = np.zeros(10)  # placeholder column: graduate counts we haven't filled in yet

# Now let's put it in a dataframe!
uc_data = pd.DataFrame({ # start by declaring a new data frame
    "Campus": uc_schools, # each column name is a dictionary key and the list of column data points is the value
    "Date_founded": uc_founded,
    "Number_of_students": uc_students,
#     "grads": uc_grads
})

uc_data
          Campus  Date_founded  Number_of_students
0       Berkeley          1868               45057
1      San Diego          1960               42875
2    Los Angeles          1919               45742
3  Santa Barbara          1909               26314
4  San Francisco          1864                3132
5         Irvine          1965               35220
6          Davis          1905               40031
7      Riverside          1954               25548
8     Santa Cruz          1965               19161
9         Merced          2005                8847

We will likely find ourselves reading in data more often than creating new dataframes, but it’s occasionally useful and good to know how to do it!

Adding data to existing dataframes

Adding columns (common!)

We may often find ourselves adding columns to a dataframe, e.g., creating columns that represent our existing data in a new way.

# One option: when we have new data to add.
# Note this Series is indexed by campus name, in a *different* order than uc_data's rows
uc_undergrads = pd.Series([33343, 31543, 23349, 0, 30222, 31162, 22055, 17207, 8151, 31814],
                          index = ["San Diego", "Los Angeles", "Santa Barbara", "San Francisco", 
                                   "Irvine", "Davis", "Riverside", "Santa Cruz", "Merced", "Berkeley"])

# Give uc_data a matching index so pandas can line the two up by label
uc_data.index = ["Berkeley", "San Diego", "Los Angeles", "Santa Barbara", "San Francisco", 
                 "Irvine", "Davis", "Riverside", "Santa Cruz", "Merced"]

uc_data['Undergraduates'] = uc_undergrads # Use bracket syntax to declare a new column; values align by index label, not position

uc_data
                      Campus  Date_founded  Number_of_students  Undergraduates
Berkeley            Berkeley          1868               45057           31814
San Diego          San Diego          1960               42875           33343
Los Angeles      Los Angeles          1919               45742           31543
Santa Barbara  Santa Barbara          1909               26314           23349
San Francisco  San Francisco          1864                3132               0
Irvine                Irvine          1965               35220           30222
Davis                  Davis          1905               40031           31162
Riverside          Riverside          1954               25548           22055
Santa Cruz        Santa Cruz          1965               19161           17207
Merced                Merced          2005                8847            8151
# A second option: when we want to process existing data and form a new column

uc_data = uc_data.assign(Undergraduate_pct = uc_data['Undergraduates'] / uc_data['Number_of_students'])

# uc_data['Undergraduate_pct'] = uc_data['Undergraduates'] / uc_data['Number_of_students']

uc_data # Note what happens if we don't do the re-assignment above: `assign` returns a new dataframe rather than modifying in place

uc_data = uc_data.assign(Total = uc_data['Undergraduates'] + uc_data['Number_of_students'])

uc_data
# For this sort of processing, we can also use similar syntax to the above (try it yourself!)
# but `assign` gives us some additional flexibility
                      Campus  Date_founded  Number_of_students  Undergraduates  Undergraduate_pct  Total
Berkeley            Berkeley          1868               45057           31814           0.706083  76871
San Diego          San Diego          1960               42875           33343           0.777679  76218
Los Angeles      Los Angeles          1919               45742           31543           0.689585  77285
Santa Barbara  Santa Barbara          1909               26314           23349           0.887322  49663
San Francisco  San Francisco          1864                3132               0           0.000000   3132
Irvine                Irvine          1965               35220           30222           0.858092  65442
Davis                  Davis          1905               40031           31162           0.778447  71193
Riverside          Riverside          1954               25548           22055           0.863277  47603
Santa Cruz        Santa Cruz          1965               19161           17207           0.898022  36368
Merced                Merced          2005                8847            8151           0.921329  16998
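One piece of that extra flexibility: `assign` also accepts callables (functions that receive the dataframe), so a freshly created column can be used immediately within the same call. A minimal sketch, left commented out so the outputs below stay unchanged (column names here are made up for illustration):

# uc_data = uc_data.assign(
#     Grad_students = uc_data['Number_of_students'] - uc_data['Undergraduates'],
#     Grad_pct = lambda df: df['Grad_students'] / df['Number_of_students']  # uses the column created just above
# )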

Adding rows

# Let's say the UC system decides it's long overdue to build a campus in Lake Tahoe near the Nevada border. 
# We want to add some projected data

# First, we make our new row into a dataframe of its own, ideally with matching column names
uc_tahoe = pd.DataFrame({
    "Campus": ["Lake Tahoe"],
    "Date founded": [2022] # Note we don't need to know all the column info here -- but watch out: this name doesn't match uc_data's "Date_founded"!
})

uc_tahoe
       Campus  Date founded
0  Lake Tahoe          2022
# Next, we use `concat` to add it to the previous dataframe
uc_data = pd.concat([uc_data, uc_tahoe])

uc_data
                      Campus  Date_founded  Number_of_students  Undergraduates  Undergraduate_pct    Total  Date founded
Berkeley            Berkeley        1868.0             45057.0         31814.0           0.706083  76871.0           NaN
San Diego          San Diego        1960.0             42875.0         33343.0           0.777679  76218.0           NaN
Los Angeles      Los Angeles        1919.0             45742.0         31543.0           0.689585  77285.0           NaN
Santa Barbara  Santa Barbara        1909.0             26314.0         23349.0           0.887322  49663.0           NaN
San Francisco  San Francisco        1864.0              3132.0             0.0           0.000000   3132.0           NaN
Irvine                Irvine        1965.0             35220.0         30222.0           0.858092  65442.0           NaN
Davis                  Davis        1905.0             40031.0         31162.0           0.778447  71193.0           NaN
Riverside          Riverside        1954.0             25548.0         22055.0           0.863277  47603.0           NaN
Santa Cruz        Santa Cruz        1965.0             19161.0         17207.0           0.898022  36368.0           NaN
Merced                Merced        2005.0              8847.0          8151.0           0.921329  16998.0           NaN
0                 Lake Tahoe           NaN                 NaN             NaN                NaN      NaN        2022.0
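Two things to notice above: columns the new row didn't supply get filled with NaN, and because our new dataframe said "Date founded" rather than "Date_founded", concat created a separate column instead of reusing the existing one. Lake Tahoe also kept its old row index, 0. A minimal sketch of the tidier version (commented out), with the column name fixed and concat's ignore_index parameter renumbering the rows:

# uc_tahoe = pd.DataFrame({"Campus": ["Lake Tahoe"], "Date_founded": [2022]})
# uc_data = pd.concat([uc_data, uc_tahoe], ignore_index=True)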

Processing data: filtering, grouping, summarizing

First, note that some operations can be done on our dataframe without having to use filtering or grouping.

uc_data = uc_data.sort_values('Date_founded', ascending = True)
uc_data

# Note the index labels at far left travel with their rows -- sorting doesn't reset them!
                      Campus  Date_founded  Number_of_students  Undergraduates  Undergraduate_pct    Total  Date founded
San Francisco  San Francisco        1864.0              3132.0             0.0           0.000000   3132.0           NaN
Berkeley            Berkeley        1868.0             45057.0         31814.0           0.706083  76871.0           NaN
Davis                  Davis        1905.0             40031.0         31162.0           0.778447  71193.0           NaN
Santa Barbara  Santa Barbara        1909.0             26314.0         23349.0           0.887322  49663.0           NaN
Los Angeles      Los Angeles        1919.0             45742.0         31543.0           0.689585  77285.0           NaN
Riverside          Riverside        1954.0             25548.0         22055.0           0.863277  47603.0           NaN
San Diego          San Diego        1960.0             42875.0         33343.0           0.777679  76218.0           NaN
Irvine                Irvine        1965.0             35220.0         30222.0           0.858092  65442.0           NaN
Santa Cruz        Santa Cruz        1965.0             19161.0         17207.0           0.898022  36368.0           NaN
Merced                Merced        2005.0              8847.0          8151.0           0.921329  16998.0           NaN
0                 Lake Tahoe           NaN                 NaN             NaN                NaN      NaN        2022.0
print(uc_data.max()) # column-wise max; for string columns like Campus this is alphabetical
Campus                Santa Cruz
Date_founded                2005
Number_of_students         45742
Undergraduates             33343
Undergraduate_pct       0.921329
Total                      77285
Date founded                2022
dtype: object
print(uc_data['Number_of_students'].min())
3132.0

Filtering: fetch rows that meet certain conditions

Filtering operations given by pandas

We can do some kinds of filtering with built-in operations like nlargest

top_3_students = uc_data.nlargest(3, 'Number_of_students')
top_3_students
                    Campus  Date_founded  Number_of_students  Undergraduates  Undergraduate_pct    Total  Date founded
Los Angeles    Los Angeles        1919.0             45742.0         31543.0           0.689585  77285.0           NaN
Berkeley          Berkeley        1868.0             45057.0         31814.0           0.706083  76871.0           NaN
San Diego        San Diego        1960.0             42875.0         33343.0           0.777679  76218.0           NaN

Logical filtering

Most often, we filter by setting logical criteria over certain column values

# Get quantiles of undergraduate sizes.
# `quantile` returns a Series indexed by the quantile levels, so q[0.75] looks up by label
q = uc_data['Undergraduates'].quantile([0.25, 0.75])
q[0.75]
31447.75
# Fetch uc_data rows where uc_data['Undergraduates'] is > the 75th percentile value above
large_undergrads = uc_data[uc_data['Undergraduates'] > q[0.75]] 
large_undergrads
                    Campus  Date_founded  Number_of_students  Undergraduates  Undergraduate_pct    Total  Date founded
Berkeley          Berkeley        1868.0             45057.0         31814.0           0.706083  76871.0           NaN
Los Angeles    Los Angeles        1919.0             45742.0         31543.0           0.689585  77285.0           NaN
San Diego        San Diego        1960.0             42875.0         33343.0           0.777679  76218.0           NaN

How does this work?

uc_data['Undergraduates'] > q[0.75]
San Francisco    False
Berkeley          True
Davis            False
Santa Barbara    False
Los Angeles       True
Riverside        False
San Diego         True
Irvine           False
Santa Cruz       False
Merced           False
0                False
Name: Undergraduates, dtype: bool

So in the code above, we’re basically saying “fetch the rows where this condition evaluates to True”.

Note this can get much more complicated…
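For instance, conditions can be combined with & (and), | (or), and ~ (not); each comparison needs its own parentheses. A minimal sketch (commented out):

# Campuses founded before 1950 with more than 30,000 undergraduates
# uc_data[(uc_data['Date_founded'] < 1950) & (uc_data['Undergraduates'] > 30000)]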

# What's going on here? The inner expression computes a single number (the max),
# and the filter keeps the rows whose 'Undergraduates' value equals it
max_undergrad = uc_data[uc_data['Undergraduates'] == uc_data['Undergraduates'].max()]
max_undergrad
              Campus  Date_founded  Number_of_students  Undergraduates  Undergraduate_pct    Total  Date founded
San Diego  San Diego        1960.0             42875.0         33343.0           0.777679  76218.0           NaN

Why is this useful?

Let’s ask: what is the average number of undergraduates in the schools with the most students overall?

q = uc_data['Number_of_students'].quantile([0.2, 0.80])
# type(q)
# q.index
q[0.8]

# What's going on here? Filter to the rows above a size cutoff, grab one column, then average it.
# (Only the last expression's value gets displayed below.)
uc_data[uc_data['Number_of_students'] >= q[0.8]]['Undergraduates'].mean()
uc_data[uc_data['Number_of_students'] >= q[0.2]]['Undergraduates'].mean()
# uc_data[uc_data['Number_of_students'] >= q[0.4]]['Undergraduates'].mean()
27586.875

Here, we applied some statistics to a filtered subset of our data.

Imagine we wanted the same average for each quartile of student numbers, instead of just the top 20%.

This kind of thing is pretty common: this is where grouping comes in.
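As a teaser, here's a minimal sketch of that per-quartile version (commented out): pd.qcut bins a column into quantile groups, and groupby -- which we cover next -- computes a statistic within each group.

# quartiles = pd.qcut(uc_data['Number_of_students'], 4)  # bin campuses into 4 size quartiles
# uc_data.groupby(quartiles)['Undergraduates'].mean()    # mean undergraduates within each quartile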

Grouping and summarizing: analyze and graph your data

First let’s read in some more complex data.

What’s going on with this data?

# pokemon = pd.read_csv("../Datasets/Pokemon.csv")
# Use this code in class
pokemon = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/Pokemon.csv")
pokemon
       #                   Name   Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0      1              Bulbasaur    Grass  Poison    318  45      49       49       65       65     45           1      False
1      2                Ivysaur    Grass  Poison    405  60      62       63       80       80     60           1      False
2      3               Venusaur    Grass  Poison    525  80      82       83      100      100     80           1      False
3      3  VenusaurMega Venusaur    Grass  Poison    625  80     100      123      122      120     80           1      False
4      4             Charmander     Fire     NaN    309  39      52       43       60       50     65           1      False
..   ...                    ...      ...     ...    ...  ..     ...      ...      ...      ...    ...         ...        ...
795  719                Diancie     Rock   Fairy    600  50     100      150      100      150     50           6       True
796  719    DiancieMega Diancie     Rock   Fairy    700  50     160      110      160      110    110           6       True
797  720    HoopaHoopa Confined  Psychic   Ghost    600  80     110       60      150      130     70           6       True
798  720     HoopaHoopa Unbound  Psychic    Dark    680  80     160       60      170      130     80           6       True
799  721              Volcanion     Fire   Water    600  80     110      120      130       90     70           6       True

800 rows × 13 columns

Let’s say we want to know: which Type 1 group has the most HP on average? [any guesses?]

# First, what are we dealing with here?
pokemon['Type 1'].value_counts()
Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Electric     44
Rock         44
Ground       32
Ghost        32
Dragon       32
Dark         31
Poison       28
Steel        27
Fighting     27
Ice          24
Fairy        17
Flying        4
Name: Type 1, dtype: int64

One solution:

  1. Look at each individual Type 1 value in our data

  2. Then, one by one, filter the data to match each unique Type 1 value

  3. Then, compute the average HP in that filtered data

  4. Then save it somewhere else to keep track of

  5. Then see which one is largest

This seems pretty tough…
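For the curious, here's a minimal sketch of that manual approach (commented out) -- it works, but it's clunky compared to what's next:

# mean_hp = {}
# for t in pokemon['Type 1'].unique():          # step 1: each unique Type 1 value
#     subset = pokemon[pokemon['Type 1'] == t]  # step 2: filter to that type
#     mean_hp[t] = subset['HP'].mean()          # steps 3-4: compute the average and store it
# max(mean_hp, key=mean_hp.get)                 # step 5: which type has the largest mean HP?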

Let’s start with something simpler: how many of each kind of Type 1 pokemon are there?

pokemon.groupby(['Type 1']).size().reset_index() # the unnamed count column gets the label 0
      Type 1    0
0        Bug   69
1       Dark   31
2     Dragon   32
3   Electric   44
4      Fairy   17
5   Fighting   27
6       Fire   52
7     Flying    4
8      Ghost   32
9      Grass   70
10    Ground   32
11       Ice   24
12    Normal   98
13    Poison   28
14   Psychic   57
15      Rock   44
16     Steel   27
17     Water  112

Above, we just asked pandas to tell us the value of size() applied to each unique group of Type 1 pokemon. (Note this matches the value_counts() output earlier, just sorted by type.)

Can we do the same thing but for more complex operations than size()? You bet!

pokemon.groupby(['Type 1']).agg( # .agg is our friend here!
    mean_hp = ('HP', np.mean) # this defines a new statistic over each grouping: apply `np.mean` (the string 'mean' also works) to the `HP` column
).reset_index()
      Type 1    mean_hp
0        Bug  56.884058
1       Dark  66.806452
2     Dragon  83.312500
3   Electric  59.795455
4      Fairy  74.117647
5   Fighting  69.851852
6       Fire  69.903846
7     Flying  70.750000
8      Ghost  64.437500
9      Grass  67.271429
10    Ground  73.781250
11       Ice  72.000000
12    Normal  77.275510
13    Poison  67.250000
14   Psychic  70.631579
15      Rock  65.363636
16     Steel  65.222222
17     Water  72.062500

We don’t need to stop there. We can group by multiple variables and add multiple metrics!

type1_hp_summary = pokemon.groupby(
    ['Type 1', 'Legendary'] # Note we're now grouping by each combination of Type 1 and Legendary
).agg(
    mean_hp = ('HP', np.mean), # And we're compiling multiple statistics here
    min_hp = ('HP', np.min),
    max_hp = ('HP', np.max)
).reset_index()

type1_hp_summary
      Type 1  Legendary     mean_hp  min_hp  max_hp
0        Bug      False   56.884058       1      86
1       Dark      False   64.655172      35     110
2       Dark       True   98.000000      70     126
3     Dragon      False   72.650000      41     108
4     Dragon       True  101.083333      80     125
5   Electric      False   57.325000      20      90
6   Electric       True   84.500000      79      90
7      Fairy      False   70.875000      35     101
8      Fairy       True  126.000000     126     126
9   Fighting      False   69.851852      30     144
10      Fire      False   67.085106      38     110
11      Fire       True   96.400000      80     115
12    Flying      False   62.500000      40      85
13    Flying       True   79.000000      79      79
14     Ghost      False   58.733333      20     150
15     Ghost       True  150.000000     150     150
16     Grass      False   65.940299      30     123
17     Grass       True   97.000000      91     100
18    Ground      False   70.821429      10     115
19    Ground       True   94.500000      89     100
20       Ice      False   70.818182      36     110
21       Ice       True   85.000000      80      90
22    Normal      False   76.489583      30     255
23    Normal       True  115.000000     110     120
24    Poison      False   67.250000      35     105
25   Psychic      False   67.720930      20     190
26   Psychic       True   79.571429      50     106
27      Rock      False   65.125000      30     123
28      Rock       True   67.750000      50      91
29     Steel      False   60.434783      40      80
30     Steel       True   92.750000      80     100
31     Water      False   71.120370      20     170
32     Water       True   97.500000      90     100

Now we can apply, for example, the filtering we discussed above.

type1_hp_summary.nlargest(5, 'mean_hp')
     Type 1  Legendary     mean_hp  min_hp  max_hp
15    Ghost       True  150.000000     150     150
8     Fairy       True  126.000000     126     126
23   Normal       True  115.000000     110     120
4    Dragon       True  101.083333      80     125
2      Dark       True   98.000000      70     126

Let’s practice!

In each Generation, how many different Type 1 and Type 2 types are there? (Did they change across generations?)

[HINT: use the nunique summary function]

# Write here

# SOLUTION
# pokemon.groupby('Generation').agg(
#     type1_types = ('Type 1', 'nunique'),
#     type2_types = ('Type 2', 'nunique')
# ).reset_index()

Make a new column called Composite_force that’s the average of each pokemon’s Attack, Defense, and Speed values.

Next, for each Type 1 type, what is the maximum of this new column?

# Write here

# SOLUTION
# pokemon = pokemon.assign(
#     Composite_force = (pokemon['Attack'] + pokemon['Defense'] + pokemon['Speed'])/3
# )

# max_force = pokemon.groupby(['Type 1']).agg( # note: the prompt asks for each Type 1 type, not each Generation
#     max_force = ('Composite_force', np.max)
# ).reset_index()