Advanced Pandas

Dplyr style coding in Pandas

By Pallav Routh in Python

September 12, 2021

In this blog post, I demonstrate how to replicate dplyr-style data manipulation in pandas. A characteristic feature of dplyr is its ability to chain together multiple operations using its famous %>% operator. I will show dplyr and pandas code side by side, which will further highlight the similarities and differences between the two packages. Hopefully, this helps users migrate from dplyr to pandas.

Loading pandas and dplyr

import pandas as pd
import numpy as np
suppressPackageStartupMessages(library(dplyr))

Data

We are going to use the popular cigarettes dataset. Let's read in the data.

data = pd.read_csv('https://raw.githubusercontent.com/pallavrouth/AI-Bootcamp/main/Data/cigarettes.csv', index_col = 'Unnamed: 0')

Here is a quick overview -

glimpse(py$data)
## Rows: 96
## Columns: 9
## $ state      <chr> "AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL", "GA", "IA",…
## $ year       <dbl> 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985, 1985,…
## $ cpi        <dbl> 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.076, 1.0…
## $ population <dbl> 3973000, 2327000, 3184000, 26444000, 3209000, 3201000, 6180…
## $ packs      <dbl> 116.4863, 128.5346, 104.5226, 100.3630, 112.9635, 109.2784,…
## $ income     <dbl> 46014968, 26210736, 43956936, 447102816, 49466672, 60063368…
## $ tax        <dbl> 32.50000, 37.00000, 31.00000, 26.00000, 31.00000, 42.00000,…
## $ price      <dbl> 102.18167, 101.47500, 108.57875, 107.83734, 94.26666, 128.0…
## $ taxs       <dbl> 33.34834, 37.00000, 36.17042, 32.10400, 31.00000, 51.48333,…
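
The glimpse above is run from R, accessing the Python session through reticulate's py object. For a similar overview without leaving pandas, you can use:

data.info() # prints the shape, column dtypes and non-null counts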

Chaining in pandas

To demonstrate chaining in pandas, let's perform the following operations on the dataset: select 3 columns and filter rows corresponding to certain states.

In R:

py$data %>% 
  select(state,year,population) %>% 
  filter(state %in% c('AL','AZ','VT')) %>% 
  head(.,5)
##    state year population
## 1     AL 1985    3973000
## 3     AZ 1985    3184000
## 44    VT 1985     530000
## 49    AL 1995    4262731
## 51    AZ 1995    4306908

In python:

(
  data
    .loc[:,['state','year','population']] 
    .loc[lambda d : d.state.isin(['AL','AZ','VT'])] 
    .head()
)
##    state  year  population
## 1     AL  1985   3973000.0
## 3     AZ  1985   3184000.0
## 44    VT  1985    530000.0
## 49    AL  1995   4262731.0
## 51    AZ  1995   4306908.0

In python, you need to do two things to employ chaining: (1) use parentheses to enclose the set of operations and (2) start each chained operation with the '.' symbol, which acts as the replacement for %>%.
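
If you would rather skip the enclosing parentheses, ending each line with a backslash also works. Here is a minimal sketch of the same selection and filter:

data \
  .loc[:,['state','year','population']] \
  .loc[lambda d : d.state.isin(['AL','AZ','VT'])] \
  .head()

I find the parenthesized version cleaner, since a backslash breaks if anything (even a trailing comment) follows it on the same line.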

Now, let's look at some common sets of operations in dplyr.

Select, filter and sort

The first advanced operation is selecting columns, followed by filtering some rows, followed by sorting. The sequence may not be the same in all scenarios.

In R:

py$data %>% 
  select(state,year,population) %>% 
  filter(state %in% c('AL','AZ','VT') & population > 10e3) %>% 
  arrange(population) %>% 
  head(.,5)
##    state year population
## 44    VT 1985     530000
## 92    VT 1995     582827
## 3     AZ 1985    3184000
## 1     AL 1985    3973000
## 49    AL 1995    4262731

In python:

(data
  .loc[:,['state','year','population']] 
  .loc[lambda d : (d.state.isin(['AL','AZ','VT'])) & (d.population > 10e3)] 
  .sort_values(['population'])
  .head()) 
##    state  year  population
## 44    VT  1985    530000.0
## 92    VT  1995    582827.0
## 3     AZ  1985   3184000.0
## 1     AL  1985   3973000.0
## 49    AL  1995   4262731.0

For multiple filtering conditions, I prefer using query instead of loc.

(data
  .loc[:,['state','year','population']] 
  .query('(population > 10e3) and (state in ["AL","AZ","VT"])') 
  .sort_values(['population'])
  .head())
##    state  year  population
## 44    VT  1985    530000.0
## 92    VT  1995    582827.0
## 3     AZ  1985   3184000.0
## 1     AL  1985   3973000.0
## 49    AL  1995   4262731.0
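
Another point in query's favor: it can reference Python variables from the enclosing scope using the @ prefix. A small sketch (states and min_pop are names I'm introducing here for illustration):

states = ['AL','AZ','VT']
min_pop = 10e3
(data
  .loc[:,['state','year','population']]
  .query('(population > @min_pop) and (state in @states)')
  .sort_values(['population'])
  .head())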

Mutate, groupby and summarize

Another common advanced operation is mutate, then groupby, then summarize. In this type of operation, you want to first create some column and then summarize that column within each group. Let's group by state and find the maximum within each group.

In R:

py$data %>% 
  mutate(population_1000 = population / 1000) %>% 
  group_by(state) %>% 
    summarize(max(population_1000)) %>% 
  ungroup() %>% 
  head(.,5)
## # A tibble: 5 × 2
##   state `max(population_1000)`
##   <chr>                  <dbl>
## 1 AL                     4263.
## 2 AR                     2480.
## 3 AZ                     4307.
## 4 CA                    31494.
## 5 CO                     3738.

In python:

(data[['state','year','population']] # <-- select
  .assign(population_1000 = lambda d : d.population / 1000) # <-- mutate
  .groupby('state') 
    .population_1000 
    .max()
  .head())
## state
## AL     4262.731
## AR     2480.121
## AZ     4306.908
## CA    31493.524
## CO     3738.061
## Name: population_1000, dtype: float64

Using the aggregate function instead of max:

(data[['state','year','population','packs']]
  .assign(population_1000 = lambda d: d.population / 1000)
  .groupby('state')
    .aggregate({'population_1000':'max'})
  .head())
##        population_1000
## state                 
## AL            4262.731
## AR            2480.121
## AZ            4306.908
## CA           31493.524
## CO            3738.061
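
In both versions above, state ends up in the index rather than as a column. To mimic dplyr's summarize followed by ungroup(), which returns state as a regular column, you can pass as_index = False to groupby:

(data
  .assign(population_1000 = lambda d : d.population / 1000)
  .groupby('state', as_index = False)
  .aggregate({'population_1000' : 'max'})
  .head())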

Groupby and mutate

Another advanced operation is groupby followed by mutate. In this type of operation, you want to apply some operation within every group.

In R:

py$data %>% 
  select(state,year,population,packs) %>% 
  mutate(population_1000 = population / 1000) %>% 
  group_by(state) %>% 
    mutate(mean_packs = mean(packs)) %>% 
  ungroup() %>% 
  head(.,5)
## # A tibble: 5 × 6
##   state  year population packs population_1000 mean_packs
##   <chr> <dbl>      <dbl> <dbl>           <dbl>      <dbl>
## 1 AL     1985    3973000  116.            3973      109. 
## 2 AR     1985    2327000  129.            2327      120. 
## 3 AZ     1985    3184000  105.            3184       88.2
## 4 CA     1985   26444000  100.           26444       78.6
## 5 CO     1985    3209000  113.            3209       97.8

In python:

(data[['state','year','population','packs']]
  .assign(population_1000 = lambda d : d.population / 1000,
          mean_packs = lambda d : d.groupby('state')  
                                      .packs 
                                      .transform(lambda x : x.mean()))
  .head())
##   state  year  population       packs  population_1000  mean_packs
## 1    AL  1985   3973000.0  116.486282           3973.0  108.785858
## 2    AR  1985   2327000.0  128.534592           2327.0  119.788780
## 3    AZ  1985   3184000.0  104.522614           3184.0   88.238392
## 4    CA  1985  26444000.0  100.363037          26444.0   78.611172
## 5    CO  1985   3209000.0  112.963539           3209.0   97.773232

Alternatively,

(data[['state','year','population','packs']]
  .assign(population_1000 = lambda d : d.population / 1000,
          mean_packs = lambda d : d.groupby('state')[['packs']]
                                      .transform(lambda x : x.mean()))
  .head())
##   state  year  population       packs  population_1000  mean_packs
## 1    AL  1985   3973000.0  116.486282           3973.0  108.785858
## 2    AR  1985   2327000.0  128.534592           2327.0  119.788780
## 3    AZ  1985   3184000.0  104.522614           3184.0   88.238392
## 4    CA  1985  26444000.0  100.363037          26444.0   78.611172
## 5    CO  1985   3209000.0  112.963539           3209.0   97.773232
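
Since mean is a built-in aggregation, you can also pass its name as a string to transform, which avoids the inner lambda entirely:

(data[['state','year','population','packs']]
  .assign(population_1000 = lambda d : d.population / 1000,
          mean_packs = lambda d : d.groupby('state').packs.transform('mean'))
  .head())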

Conditional mutate - ifelse

Another common type of advanced operation is conditional mutate. Here you want to create a column based on a condition applied to one or more other columns.

In R:

py$data %>% 
  select(state,year,population,packs) %>% 
  mutate(population_1000 = population / 1000,
         ifelsecol = ifelse(packs > 120,"> 120","< 120")) %>% 
  head(.,5)
##   state year population    packs population_1000 ifelsecol
## 1    AL 1985    3973000 116.4863            3973     < 120
## 2    AR 1985    2327000 128.5346            2327     > 120
## 3    AZ 1985    3184000 104.5226            3184     < 120
## 4    CA 1985   26444000 100.3630           26444     < 120
## 5    CO 1985    3209000 112.9635            3209     < 120

In python:

(data[['state','year','population','packs']]
  .assign(population_1000 = lambda d : d.population / 1000,
          ifelsecol = lambda d : np.where(d.packs > 120, "> 120","< 120"))
  .head())
##   state  year  population       packs  population_1000 ifelsecol
## 1    AL  1985   3973000.0  116.486282           3973.0     < 120
## 2    AR  1985   2327000.0  128.534592           2327.0     > 120
## 3    AZ  1985   3184000.0  104.522614           3184.0     < 120
## 4    CA  1985  26444000.0  100.363037          26444.0     < 120
## 5    CO  1985   3209000.0  112.963539           3209.0     < 120

In my humble opinion, dplyr's syntax here is way more intuitive than pandas'.
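
When a conditional mutate has more than two branches (what case_when handles in dplyr), np.select plays the same role as np.where. A minimal sketch with made-up cut points:

(data[['state','year','packs']]
  .assign(packs_bin = lambda d : np.select(
      [d.packs > 120, d.packs > 100], # conditions, checked in order
      ['> 120', '100-120'],           # label for the first condition that matches
      default = '<= 100'))
  .head())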

Multiple mutate

Sometimes, one needs to apply the same mutating function to multiple columns. It's rather cumbersome to repeat the same operation individually for every column.

In R:

py$data %>% 
  select(population,packs,income,tax) %>%
  mutate(across(population:tax,~log(. + 1),.names = "log_{col}")) %>% 
  head(.,5)
##   population     packs    income      tax log_population log_packs log_income
## 1    3973000 116.48628  46014968 32.50000       15.19503  4.766322   17.64448
## 2    2327000 128.53459  26210736 37.00000       14.66009  4.863948   17.08168
## 3    3184000 104.52261  43956936 31.00000       14.97365  4.658925   17.59872
## 4   26444000 100.36304 447102816 26.00000       17.09054  4.618708   19.91830
## 5    3209000 112.96354  49466672 31.00000       14.98147  4.735879   17.71681
##    log_tax
## 1 3.511546
## 2 3.637586
## 3 3.465736
## 4 3.295837
## 5 3.465736

In python:

pandas has no direct equivalent of across. One option, sketched below, is to build the new columns with a dictionary comprehension and unpack it into assign (the c = c default argument freezes the loop variable inside each lambda):

cols = ['population','packs','income','tax']
(data[cols]
  .assign(**{f'log_{c}' : (lambda d, c = c : np.log(d[c] + 1)) for c in cols})
  .head())

Since assign accepts callables, each lambda is evaluated on the chained dataframe, and the f-string mirrors the .names = "log_{col}" argument in the R code above.