Group by & Aggregate using Pandas

Last updated: 25th Mar 2017
Akshay Sehgal, www.akshaysehgal.com
Data downloadable here

Data Grouping is probably the most used concept in the field of data analysis. Almost every scripting language builds its foundation over grouping data by categories of a multi-dimensional variable. A data scientist uses this for summarizing data for analysis as well as changing the level at which data can be useful for a model. Example, transaction level data needs to be summarized at customer level data before predicting their spend. Usecases like these are where languages like SQL are very useful with their group by clauses. However python isn't too far behind. Pandas provides a large variety of methods which do so much more than the standard SQL grouping. This combined with the aggregate methods gives a Data Scientist a strong grasp over data handling. The objective of this notebook is to explore group by and aggregation methods on data using python library Pandas.

1. Introduction

SQL groupby is probably the most popular feature for data transformation and it helps to be able to replicate the same form of data manipulation techniques using python for designing more advance data science systems. As a result, its important to understand the basic components of a groupby clause.

  • Select - Is the list of aggregated features that the analyst is interested in
  • From - Source of the data
  • Group By - Feature(s) whose distinct values will be the basis of the grouping of selected aggregate features
  • Where - Any additional conditions that need to be checked on the raw data, before grouping up the data
  • Having - Any additional conditions that need to be checked on OUTPUT of the group by query, before displaying it

Keeping these concepts in mind, the Pandas groupby method will be explored in detail below.

In [2]:
#Set Work Directory
import os
os.chdir('C:\\Users\\Akshay\\Documents\\iPython\\Personal\\Practice')
os.getcwd()

#Import Titanic Data
import pandas as pd
df = pd.read_excel('titanic.xls')

2. Syntax

The core syntax can be broken down similar to the Select-From-Groupby-Where clause. Sample code is given below :

Table_name.groupby(['Group'])['Feature'].aggregation()

  • Table_name to specify the FROM
  • 'Group' is the list of GROUP BY variables
  • 'Feature' is the list of SELECT variables (with or without WHERE condition)
  • Aggregate() is to specify the aggregation
In [3]:
#Two step query to find sum of survived people, grouped by their passenger class (1 > 2 > 3)
group_survived = df.groupby(['Pclass'])
out_survived = group_survived['Survived'].sum()
print(out_survived)
Pclass
1    200
2    119
3    181
Name: Survived, dtype: int64
In [5]:
#Above snippet can be implemented in a single command as follows
out_survived = df.groupby(['Pclass'])['Survived'].sum()
print(out_survived)
Pclass
1    200
2    119
3    181
Name: Survived, dtype: int64

2.1 Adding more groups/levels

We can pass a list of features in the groupby() to increase the levels of divisions in data as below :

In [6]:
#Three level groupby to find mean of age
df.groupby(['Survived','Pclass','Sex'])['Age'].mean()
Out[6]:
Survived  Pclass  Sex   
0         1       female    35.200000
                  male      43.658163
          2       female    34.090909
                  male      33.092593
          3       female    23.418750
                  male      26.679598
1         1       female    37.109375
                  male      36.168240
          2       female    26.711051
                  male      17.449274
          3       female    20.814815
                  male      22.436441
Name: Age, dtype: float64

2.2 Adding more variables/features

Similarly, we can also pass a list of features after the groupby to increase the variables we want to aggregate, as below :

In [7]:
#Three level groupby to find mean of age and fares
#reset_index() just arranges the column names properly like a data frame
df.groupby(['Survived','Pclass','Sex'])['Age','Fare'].mean().reset_index()
Out[7]:
Survived Pclass Sex Age Fare
0 0 1 female 35.200000 117.054180
1 0 1 male 43.658163 68.436192
2 0 2 female 34.090909 19.125000
3 0 2 male 33.092593 19.857334
4 0 3 female 23.418750 18.367162
5 0 3 male 26.679598 12.134521
6 1 1 female 37.109375 109.137500
7 1 1 male 36.168240 72.697546
8 1 2 female 26.711051 23.759486
9 1 2 male 17.449274 20.183000
10 1 3 female 20.814815 12.166510
11 1 3 male 22.436441 13.977496

2.3 WHERE Clause

Adding a Where clause is quite intuitive as you can specify this as conditions before the groupby() method. This first applies the where condition on the dataframe, then groups it and aggregates given variables to throw results.

In [8]:
#Fare and Age average for only those who survived
df[df['Survived']==1].groupby(['Pclass','Sex'])['Age','Fare'].mean().reset_index()
Out[8]:
Pclass Sex Age Fare
0 1 female 37.109375 109.137500
1 1 male 36.168240 72.697546
2 2 female 26.711051 23.759486
3 2 male 17.449274 20.183000
4 3 female 20.814815 12.166510
5 3 male 22.436441 13.977496
In [9]:
#The same query above can be broken down into 3 steps for better understanding
df1 = df[df['Survived']==1]
grouped_data = df1.groupby(['Pclass','Sex'])
output = grouped_data['Age','Fare'].mean()
print(output.reset_index())
   Pclass     Sex        Age        Fare
0       1  female  37.109375  109.137500
1       1    male  36.168240   72.697546
2       2  female  26.711051   23.759486
3       2    male  17.449274   20.183000
4       3  female  20.814815   12.166510
5       3    male  22.436441   13.977496

2.4 Multiple Aggregations - Stepwise

Till now only one aggregation is being applied on variables in all the examples above. Next is how to create multiple types of aggregations on data. This task can be performed step by step with first grouping the table, next creating 1 aggregate variable at a time, then finally combining them into a single dataframe using pd.DataFrame()

In [10]:
##Step 1: Group by Gender
groupby1 = df.groupby(['Sex'])

##Step 2: Calculate different aggregations on 'Fare' variable
meanfare = groupby1['Fare'].mean()
maxfare = groupby1['Fare'].max()
minfare = groupby1['Fare'].min()
stdfare = groupby1['Fare'].std()
rangefare = maxfare-minfare  #Can also create custom aggregations

##Step 3: Combine into a single DataFrame
#Min, Mean, Max
farestats1 = pd.DataFrame({'meanfare':meanfare,'maxfare':maxfare,'minfare':minfare})
#Mean, Range, Standard deviation
farestats2 = pd.DataFrame({'meanfare':meanfare,'stdfare':stdfare,'rangefare':rangefare})

print(farestats1.reset_index())
print(farestats2.reset_index())
      Sex   maxfare   meanfare  minfare
0  female  512.3292  46.198097     6.75
1    male  512.3292  26.154601     0.00
      Sex   meanfare  rangefare    stdfare
0  female  46.198097   505.5792  63.292599
1    male  26.154601   512.3292  42.486877

2.5 Multiple Aggregations - using agg()

This is an advanced way of using multiple aggregations on different variables by use of AGG() and DICTIONARIES.

The difference between [ ] and { } parenthesis is that square brackets represent a list where each element is unique, while curly brackets represent a set(), where we have the ability to create dictionaries for later use. One such use of dictionaries is agg() method a.k.a aggregate method.

In [11]:
##First define the functions that need to be performed

#Dictionary 'f' uses 3 aggregations on same variable 'fare' 
f = {'Fare':['mean','max','min']}

#The dictionary is then passed into the aggregate() method
df.groupby(['Sex']).agg(f).reset_index()
Out[11]:
Sex Fare
mean max min
0 female 46.198097 512.3292 6.75
1 male 26.154601 512.3292 0.00
In [18]:
#Dictionary 'g' contains 2 separate aggregations on 2 different  variables 'fare' and 'age' respectively
g = {'fare':['mean','std'],'age':['mean','std']}
df.groupby(['sex']).agg(g).reset_index()
Out[18]:
sex fare age
mean std mean std
0 female 46.198097 63.292599 28.687071 14.576995
1 male 26.154601 42.486877 30.585233 14.280571

2.6 Renaming aggregated variables

Initialy while creating dictionaries, we used { } to define the first level of the dictionary, but the sub-levels were inputted still in [ ]. Here the only difference is that instead of passing a list [ ] into a dictionary element, we pass another dictionary to it, since we can associate labels to dictionary elements easily. For example :

  • { 'element1' : ['a','b'] } is a dictionary with list 'a','b' passed to element1.
  • { 'element1' : {'a','b'} } is a dictionary with dictionary 'a','b' passed to element1.
  • This allows adding labels to the dictionary inside as follows :

  • { 'element1' : { 'label1':'a' , 'label2':'b' }}
  • In [12]:
    #Dictionary h contains mean() as average, max() as maximas and min() as minimas, associated with variable 'fare'
    h = {'Fare':{'average':'mean','maximas':'max','minimas':'min'}}
    print(df.groupby(['Sex']).agg(h).reset_index())
    
          Sex       Fare                  
                 average   maximas minimas
    0  female  46.198097  512.3292    6.75
    1    male  26.154601  512.3292    0.00
    

    2.7 Custom Aggregations

    There are 2 ways of creating custom aggregations. One is using the step by step method above to create the custom aggregation (as shown previously with 'Rangefare' aggregation). The other method is using LAMBDA X to create the aggregation, as shown below

    In [13]:
    #Lambda function can be associated with a calculation as well as a label to create custom aggregations
    i = {'Fare':{'average':'mean','deviation':'std','range': lambda x : max(x)-min(x)}}
    print(df.groupby(['Sex']).agg(i).reset_index())
    
          Sex       Fare                     
                 average  deviation     range
    0  female  46.198097  63.292599  505.5792
    1    male  26.154601  42.486877  512.3292
    

    3. Pandas predefined methods

    This is a work in progress list of aggregate methods that can be used with groupby().

    In [14]:
    #Different methods can be called during pandas groupby and aggregate
    
    non_null_count = df.groupby(['Sex'])['Age'].count()
    summation = df.groupby(['Sex'])['Age'].sum()
    average = df.groupby(['Sex'])['Age'].mean()
    mean_absolute_dev = df.groupby(['Sex'])['Age'].mad()
    arithmetic_median = df.groupby(['Sex'])['Age'].median()
    maximum = df.groupby(['Sex'])['Age'].max()
    minimum = df.groupby(['Sex'])['Age'].min()
    product = df.groupby(['Sex'])['Age'].prod()
    unbiased_std_dev = df.groupby(['Sex'])['Age'].std()
    unbiased_variance = df.groupby(['Sex'])['Age'].var()
    unbiased_std_err_of_mean = df.groupby(['Sex'])['Age'].sem()
    unbiased_skewness_3rdmoment = df.groupby(['Sex'])['Age'].skew()
    
    #cumsum1 = df.groupby(['sex'])['age'].cumsum()
    #cumprod1 = df.groupby(['sex'])['age'].cumprod()
    #cummax1 = df.groupby(['sex'])['age'].cummax()
    #cummin1 = df.groupby(['sex'])['age'].cummin()
    #quantile = df.groupby(['sex'])['age'].quantile()
    #unbiased_kurtosis_4thmoment = df.groupby(['sex'])['age'].kurt()
    #arithmetic_mode = df.groupby(['sex'])['age'].mode()
    #absolute_value = df.groupby(['sex'])['age'].abs()
    
    print(arithmetic_median)
    
    Sex
    female    27.0
    male      28.0
    Name: Age, dtype: float64
    

    4. References

    • http://www.scipy-lectures.org/packages/statistics/index.html#hypothesis-testing-comparing-two-groups
    • https://www.simple-talk.com/sql/t-sql-programming/sql-group-by-basics/
    • http://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
    • http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html