Advanced Pandas
Dplyr style coding in Pandas
By Pallav Routh in Python
September 12, 2021
In this blog post, I demonstrate how to replicate dplyr-style data manipulation in pandas. A characteristic feature of dplyr is its ability to chain together multiple operations using the well-known %>% operator. I will show dplyr and pandas code side by side, which will further highlight the similarities and differences between the two packages. Hopefully, this helps users migrate from dplyr to pandas.
Loading pandas and dplyr
import pandas as pd
import numpy as np
suppressPackageStartupMessages(library(dplyr))
Data
We are going to use the popular cigarettes dataset. Let's read in the data set.
data = pd.read_csv('https://raw.githubusercontent.com/pallavrouth/AI-Bootcamp/main/Data/cigarettes.csv', index_col = 'Unnamed: 0')
Here is a quick overview:
glimpse(py$data)
## Rows: 96
## Columns: 9
## $ state <chr> "AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL", "GA", "IA",…
## $ year <dbl> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985,…
## $ cpi <dbl> 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.0…
## $ population <dbl> 3973000, 2327000, 3184000, 26444000, 3209000, 3201000, 6180…
## $ packs <dbl> 116.4863, 128.5346, 104.5226, 100.3630, 112.9635, 109.2784,…
## $ income <dbl> 46014968, 26210736, 43956936, 447102816, 49466672, 60063368…
## $ tax <dbl> 32.50000, 37.00000, 31.00000, 26.00000, 31.00000, 42.00000,…
## $ price <dbl> 102.18167, 101.47500, 108.57875, 107.83734, 94.26666, 128.0…
## $ taxs <dbl> 33.34834, 37.00000, 36.17042, 32.10400, 31.00000, 51.48333,…
Chaining in pandas
To demonstrate chaining in pandas, let's perform the following operations on the dataset: select 3 columns and filter rows corresponding to certain states.
In R:
py$data %>%
select(state,year,population) %>%
filter(state %in% c('AL','AZ','VT')) %>%
head(.,5)
## state year population
## 1 AL 1985 3973000
## 3 AZ 1985 3184000
## 44 VT 1985 530000
## 49 AL 1995 4262731
## 51 AZ 1995 4306908
In python:
(
data
.loc[:,['state','year','population']]
.loc[lambda d : d.state.isin(['AL','AZ','VT'])]
.head()
)
## state year population
## 1 AL 1985 3973000.0
## 3 AZ 1985 3184000.0
## 44 VT 1985 530000.0
## 49 AL 1995 4262731.0
## 51 AZ 1995 4306908.0
In python, you need to do two things to employ chaining: (1) enclose the whole set of operations in parentheses, and (2) start each subsequent operation with the '.' symbol, which acts as the replacement for %>%.

Now, let's look at some common sets of operations in dplyr.
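Parentheses and leading dots cover method chaining, but pandas also ships a DataFrame.pipe method that lets you slot your own functions into a chain, much like piping into a custom function in dplyr. Below is a minimal sketch on a toy frame (the keep_states helper is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'state': ['AL', 'AZ', 'VT', 'CA'],
                   'population': [3973000, 3184000, 530000, 26444000]})

# a hypothetical helper: keep only the rows for the given states
def keep_states(d, states):
    return d.loc[d.state.isin(states)]

# pipe() hands the DataFrame to the function as its first argument,
# much like x %>% f(y) means f(x, y) in dplyr
result = (df
          .pipe(keep_states, ['AL', 'VT'])
          .sort_values('population'))
print(result)
```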
Select, filter and sort
The first advanced operation is selecting columns, followed by filtering rows, followed by sorting. The sequence may not be the same in all scenarios.
In R:
py$data %>%
select(state,year,population) %>%
filter(state %in% c('AL','AZ','VT') & population > 10e3) %>%
arrange(population) %>%
head(.,5)
## state year population
## 44 VT 1985 530000
## 92 VT 1995 582827
## 3 AZ 1985 3184000
## 1 AL 1985 3973000
## 49 AL 1995 4262731
In python:
(data
.loc[:,['state','year','population']]
.loc[lambda d : (d.state.isin(['AL','AZ','VT'])) & (d.population > 10e3)]
.sort_values(['population'])
.head())
## state year population
## 44 VT 1985 530000.0
## 92 VT 1995 582827.0
## 3 AZ 1985 3184000.0
## 1 AL 1985 3973000.0
## 49 AL 1995 4262731.0
For multiple filter conditions, I prefer using query instead of loc.
(data
.loc[:,['state','year','population']]
.query('(population > 10e3) and (state in ["AL","AZ","VT"])')
.sort_values(['population'])
.head())
## state year population
## 44 VT 1985 530000.0
## 92 VT 1995 582827.0
## 3 AZ 1985 3184000.0
## 1 AL 1985 3973000.0
## 49 AL 1995 4262731.0
Mutate, groupby and summarize
Another common advanced operation is mutate, group by, and then summarize. In this type of operation, you first create a column and then summarize that column within each group. Let's group by state and find the maximum within each group.
In R:
py$data %>%
mutate(population_1000 = population / 1000) %>%
group_by(state) %>%
summarize(max(population_1000)) %>%
ungroup() %>%
head(.,5)
## # A tibble: 5 × 2
## state `max(population_1000)`
## <chr> <dbl>
## 1 AL 4263.
## 2 AR 2480.
## 3 AZ 4307.
## 4 CA 31494.
## 5 CO 3738.
In python:
(data[['state','year','population']] # <-- select
.assign(population_1000 = lambda d : d.population / 1000) # <-- mutate
.groupby('state')
.population_1000
.max()
.head())
## state
## AL 4262.731
## AR 2480.121
## AZ 4306.908
## CA 31493.524
## CO 3738.061
## Name: population_1000, dtype: float64
Using the aggregate function instead of max:
(data[['state','year','population','packs']]
.assign(population_1000 = lambda d: d.population / 1000)
.groupby('state')
.aggregate({'population_1000':'max'})
.head())
## population_1000
## state
## AL 4262.731
## AR 2480.121
## AZ 4306.908
## CA 31493.524
## CO 3738.061
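If you want to name the summary column yourself, the way dplyr's summarize(max_pop = max(population_1000)) does, pandas' named aggregation (agg with keyword = (column, function) pairs) gets you there. A minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'state': ['AL', 'AL', 'AZ', 'AZ'],
                   'population': [3973000, 4262731, 3184000, 4306908]})

# named aggregation: keyword = (column, function) pairs, so the
# output column carries the name you picked, as in summarize()
out = (df
       .groupby('state')
       .agg(max_pop_1000=('population', lambda s: s.max() / 1000))
       .reset_index())
print(out)
```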
Groupby and mutate
Another advanced operation is groupby followed by mutate. In this type of operation, you want to apply some operation within every group.
In R:
py$data %>%
select(state,year,population,packs) %>%
mutate(population_1000 = population / 1000) %>%
group_by(state) %>%
mutate(mean_packs = mean(packs)) %>%
ungroup() %>%
head(.,5)
## # A tibble: 5 × 6
## state year population packs population_1000 mean_packs
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AL 1985 3973000 116. 3973 109.
## 2 AR 1985 2327000 129. 2327 120.
## 3 AZ 1985 3184000 105. 3184 88.2
## 4 CA 1985 26444000 100. 26444 78.6
## 5 CO 1985 3209000 113. 3209 97.8
In python:
(data[['state','year','population','packs']]
.assign(population_1000 = lambda d : d.population / 1000,
mean_packs = lambda d : d.groupby('state')
.packs
.transform(lambda x : x.mean()))
.head())
## state year population packs population_1000 mean_packs
## 1 AL 1985 3973000.0 116.486282 3973.0 108.785858
## 2 AR 1985 2327000.0 128.534592 2327.0 119.788780
## 3 AZ 1985 3184000.0 104.522614 3184.0 88.238392
## 4 CA 1985 26444000.0 100.363037 26444.0 78.611172
## 5 CO 1985 3209000.0 112.963539 3209.0 97.773232
Alternatively,
(data[['state','year','population','packs']]
.assign(population_1000 = lambda d : d.population / 1000,
mean_packs = lambda d : d.groupby('state')[['packs']]
.transform(lambda x : x.mean()))
.head())
## state year population packs population_1000 mean_packs
## 1 AL 1985 3973000.0 116.486282 3973.0 108.785858
## 2 AR 1985 2327000.0 128.534592 2327.0 119.788780
## 3 AZ 1985 3184000.0 104.522614 3184.0 88.238392
## 4 CA 1985 26444000.0 100.363037 26444.0 78.611172
## 5 CO 1985 3209000.0 112.963539 3209.0 97.773232
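As a side note, transform also accepts the function's name as a string, which is shorter than the lambda used above. A small sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'state': ['AL', 'AL', 'AZ', 'AZ'],
                   'packs': [116.0, 101.0, 104.0, 72.0]})

# transform('mean') broadcasts each group's mean back onto its rows,
# mirroring group_by() %>% mutate(mean_packs = mean(packs))
out = df.assign(mean_packs=lambda d: d.groupby('state').packs.transform('mean'))
print(out)
```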
Conditional mutate - ifelse
Another common type of advanced operation is the conditional mutate. Here you want to create a column based on a condition that applies to one or more other columns.
In R:
py$data %>%
select(state,year,population,packs) %>%
mutate(population_1000 = population / 1000,
ifelsecol = ifelse(packs > 120,"> 120","< 120")) %>%
head(.,5)
## state year population packs population_1000 ifelsecol
## 1 AL 1985 3973000 116.4863 3973 < 120
## 2 AR 1985 2327000 128.5346 2327 > 120
## 3 AZ 1985 3184000 104.5226 3184 < 120
## 4 CA 1985 26444000 100.3630 26444 < 120
## 5 CO 1985 3209000 112.9635 3209 < 120
In python:
(data[['state','year','population','packs']]
.assign(population_1000 = lambda d : d.population / 1000,
ifelsecol = lambda d : np.where(d.packs > 120, "> 120","< 120"))
.head())
## state year population packs population_1000 ifelsecol
## 1 AL 1985 3973000.0 116.486282 3973.0 < 120
## 2 AR 1985 2327000.0 128.534592 2327.0 > 120
## 3 AZ 1985 3184000.0 104.522614 3184.0 < 120
## 4 CA 1985 26444000.0 100.363037 26444.0 < 120
## 5 CO 1985 3209000.0 112.963539 3209.0 < 120
In my humble opinion, the dplyr syntax here is far more intuitive than the pandas one.
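When the condition has more than two branches, dplyr users would reach for case_when; the usual pandas counterpart is np.select, which checks a list of conditions in order. A minimal sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'packs': [70.0, 110.0, 150.0]})

# np.select evaluates the conditions in order and picks the first
# match, falling back to `default` -- much like dplyr's case_when()
conditions = [df.packs < 90, df.packs < 120]
choices = ['low', 'medium']
df['packs_bin'] = np.select(conditions, choices, default='high')
print(df)
```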
Multiple mutate
Sometimes, one needs to apply the same mutating function to multiple columns. It's rather cumbersome to repeat the same operation individually for every column.
In R:
py$data %>%
select(population,packs,income,tax) %>%
mutate(across(population:tax,~log(. + 1),.names = "log_{col}")) %>%
head(.,5)
## population packs income tax log_population log_packs log_income
## 1 3973000 116.48628 46014968 32.50000 15.19503 4.766322 17.64448
## 2 2327000 128.53459 26210736 37.00000 14.66009 4.863948 17.08168
## 3 3184000 104.52261 43956936 31.00000 14.97365 4.658925 17.59872
## 4 26444000 100.36304 447102816 26.00000 17.09054 4.618708 19.91830
## 5 3209000 112.96354 49466672 31.00000 14.98147 4.735879 17.71681
## log_tax
## 1 3.511546
## 2 3.637586
## 3 3.465736
## 4 3.295837
## 5 3.465736
In python:
(data
.assign(population_1000 = lambda d : d.population / 1000,
ifelsecol = lambda d : np.where(d.packs > 120, "> 120","< 120"))
.head())
## state year cpi ... taxs population_1000 ifelsecol
## 1 AL 1985 1.076 ... 33.348335 3973.0 < 120
## 2 AR 1985 1.076 ... 37.000000 2327.0 > 120
## 3 AZ 1985 1.076 ... 36.170418 3184.0 < 120
## 4 CA 1985 1.076 ... 32.104000 26444.0 < 120
## 5 CO 1985 1.076 ... 31.000000 3209.0 < 120
##
## [5 rows x 11 columns]
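The snippet above applies two unrelated mutations in one assign; to mirror dplyr's across, i.e. apply the same function to several columns with generated names, one option is a dictionary comprehension unpacked into assign. A minimal sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'population': [3973000.0, 2327000.0],
                   'tax': [32.5, 37.0]})

cols = ['population', 'tax']

# build one lambda per column; the default argument c=c freezes the
# column name inside each lambda, and ** unpacks the dict into assign
out = df.assign(**{f'log_{c}': (lambda d, c=c: np.log(d[c] + 1)) for c in cols})
print(out)
```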