pyjanitor
pyjanitor copied to clipboard
Feature Proposal: Auto Agg method
Brief Description
An auto_agg
method where you just put the columns you want and it automatically aggregates the dataframe.
For example, I could do something like:
df.auto_agg(['state', 'age_mean'])
On a dataframe that has a state
and age
column and it will automatically get the average age by state.
Example API
Here's some basic code:
import pandas as pd
colnames = ('id', 'state', 'state_abbr', 'amount')
df = pd.DataFrame.from_records((
('Person1', 'California', 'CA', 20),
('Person2', 'California', 'CA', 30),
('Person3', 'New York', 'NY', 15),
('Person4', 'New York', 'NY', 10),
('Person5', 'New York', 'NY', 45)), columns=colnames)
new_columns = ['state_abbr', 'state_count']
def flatten_dict(x):
return {k: v for d in x for k, v in d.items()}
def auto_agg(df, new_columns):
cols = set(df.columns)
new_columns_set = set(new_columns)
group_cols = cols.intersection(new_columns_set)
agg_cols = new_columns_set.difference(cols)
agg_groups = [{col: agg_col} for col in cols for agg_col in agg_cols if agg_col.startswith(col)]
this_dict = dict()
for i in agg_groups:
value = list(i.values())[0]
key = list(i.keys())[0]
this_dict[value] = ''
current_value = this_dict.get(value)
if len(key) > len(current_value):
this_dict[value] = key
agg_groups_rename = dict([(value, key) for key, value in this_dict.items()])
agg_groups_agg = [{i[0]: i[1].split(i[0])[1:][0][1:]} for i in agg_groups_rename.items()]
agg_groups_agg = flatten_dict(agg_groups_agg)
group_cols_list = list(group_cols)
return df.groupby(group_cols_list, as_index=False).agg(agg_groups_agg).rename(columns=agg_groups_rename)
auto_agg(df, new_columns)
There's a problem when column names are subsets of eachother, which is why I have an example with state_abbr
and state
. Sometimes the set math gets confused when trying to split the column names and the functions.
I think a basic workaround would be to be able to "force" the functions on specific columns without us needing to guess, like this:
new_columns = ['state', 'amount_sum', {'id': 'count'}]
auto_agg(df, new_columns)
# or df.auto_agg(new_columns)
If we say something like "if your choice of columns mess up, then be explicit" we might not need this part of the code, since we won't need to find the biggest or smallest version of the column name.
for i in agg_groups:
value = list(i.values())[0]
key = list(i.keys())[0]
this_dict[value] = ''
current_value = this_dict.get(value)
if len(key) > len(current_value):
this_dict[value] = key
@szuckerman Do you mind adding the expected output?
new_columns = ['state_abbr', 'state_count']
auto_agg(df, new_columns)
would result in:
state_abbr state_count
0 CA 2
1 NY 3
I.e. auto_agg
is a short form to writing this long groupby:
df1 = df.groupby('state_abbr', as_index=False).agg({'state': 'count'}).rename(columns={'state': 'state_count'})
df2 = auto_agg(df, new_columns)
df1.equals(df2)
>>>True
@szuckerman cool. would it be easier to test if state_count
already exists and raise an error? Also specify, or restrict the new column name to have a fixed separator _
?
The issue isn't whether state_count already exists, it's whether the grouping column name is a subset of another name.
For example, the way I'm parsing the columns, if someone wants state_abbr_count since there's a "state" column it will look for a function called "abbr_count" which doesn't exist.
If state_count is already in the dataframe then that column will be "grouped by" anyway.
But I think we could raise an error when it's ambiguous what column is wanted (similar to the ambiguous column error in SQL when joining tables).
@szuckerman I love it! This actually helps simplify the code that would otherwise get written by a ton!
Would you like to submit a PR for this?
So I think there's still quite a bit to add, namely:
- Ability to change delimiter between column name and aggregation metric
- Ability to have aggregation metric first (people might like count_state instead of state_count)
- raise an error when it can't be determined what column people want to use as the grouping column (usually is the case when there's a column name which is a subset of another column name in the dataframe)
- ability to explicitly choose aggregation metric (I.e. put {'state': 'count'} in the list instead of 'state_count')
- ability to have custom functions. This would work with the {'state': 'count'} syntax since one could just say {'state': custom_func} but making 'state_custom_func' work is a bit less trivial.
After these additions I think it will be ready for a PR.
@szuckerman I think those are great ideas! Though I'd also be in favour of getting something in first, getting it out into the wild, and then letting the feedback come back on what feature might be desired the most. (And then guiding the feature requester to submitting a PR too!) Gradual generalization is better; it's how I've approached most of the functions I've submitted. Your call, though! 😄
Btw, one of the things I like about the way you have the function setup is that I get a bit of "delightful magic" by specifying auto_agg(["state_count", "state_mean", "state_var"])
. For some reason, that feels delightful and magical simultaneously :).
@szuckerman just checking on this issue... Reread it and it is fascinating... Not sure how far you've gone with it... And if I can contribute