Brief Description

An auto_agg method where you just put the columns you want and it automatically aggregates the dataframe.

For example, I could do something like:

df.auto_agg(['state', 'age_mean'])

On a dataframe that has a state and age column and it will automatically get the average age by state.

Example API

Here's some basic code:


import pandas as pd

colnames = ('id', 'state', 'state_abbr', 'amount')

df = pd.DataFrame.from_records((
	('Person1', 'California', 'CA', 20),
	('Person2', 'California', 'CA', 30),
	('Person3', 'New York', 'NY', 15),
	('Person4', 'New York', 'NY', 10),
	('Person5', 'New York', 'NY', 45)), columns=colnames)


new_columns = ['state_abbr', 'state_count']


def flatten_dict(x):
    return {k: v for d in x for k, v in d.items()}


def auto_agg(df, new_columns):
    cols = set(df.columns)
    new_columns_set = set(new_columns)
    group_cols = cols.intersection(new_columns_set)
    agg_cols = new_columns_set.difference(cols)
    agg_groups = [{col: agg_col} for col in cols for agg_col in agg_cols if agg_col.startswith(col)]
    this_dict = dict()
    for i in agg_groups:
        value = list(i.values())[0]
        key = list(i.keys())[0]
        this_dict[value] = ''
        current_value = this_dict.get(value)
        if len(key) > len(current_value):
            this_dict[value] = key
    agg_groups_rename = dict([(value, key) for key, value in this_dict.items()])  
    agg_groups_agg = [{i[0]: i[1].split(i[0])[1:][0][1:]} for i in agg_groups_rename.items()]
    agg_groups_agg = flatten_dict(agg_groups_agg)
    group_cols_list = list(group_cols)
    return df.groupby(group_cols_list, as_index=False).agg(agg_groups_agg).rename(columns=agg_groups_rename)


auto_agg(df, new_columns)

There's a problem when column names are subsets of eachother, which is why I have an example with state_abbr and state. Sometimes the set math gets confused when trying to split the column names and the functions.

I think a basic workaround would be to be able to "force" the functions on specific columns without us needing to guess, like this:

new_columns = ['state', 'amount_sum', {'id': 'count'}]
auto_agg(df, new_columns)
# or df.auto_agg(new_columns)

If we say something like "if your choice of columns mess up, then be explicit" we might not need this part of the code, since we won't need to find the biggest or smallest version of the column name.

for i in agg_groups:
        value = list(i.values())[0]
        key = list(i.keys())[0]
        this_dict[value] = ''
        current_value = this_dict.get(value)
        if len(key) > len(current_value):
            this_dict[value] = key

Sep 04 '20 23:09 szuckerman

@szuckerman Do you mind adding the expected output?

Sep 05 '20 09:09 samukweku

new_columns = ['state_abbr', 'state_count']
auto_agg(df, new_columns)

would result in:

  state_abbr  state_count
0         CA            2
1         NY            3

I.e. auto_agg is a short form to writing this long groupby:


df1 = df.groupby('state_abbr', as_index=False).agg({'state': 'count'}).rename(columns={'state': 'state_count'})
df2 = auto_agg(df, new_columns)

df1.equals(df2)
>>>True

Sep 06 '20 01:09 szuckerman

@szuckerman cool. would it be easier to test if state_count already exists and raise an error? Also specify, or restrict the new column name to have a fixed separator _ ?

Sep 06 '20 20:09 samukweku

The issue isn't whether state_count already exists, it's whether the grouping column name is a subset of another name.

For example, the way I'm parsing the columns, if someone wants state_abbr_count since there's a "state" column it will look for a function called "abbr_count" which doesn't exist.

If state_count is already in the dataframe then that column will be "grouped by" anyway.

But I think we could raise an error when it's ambiguous what column is wanted (similar to the ambiguous column error in SQL when joining tables).

Sep 06 '20 20:09 szuckerman

@szuckerman I love it! This actually helps simplify the code that would otherwise get written by a ton!

Would you like to submit a PR for this?

Sep 06 '20 20:09 ericmjl

So I think there's still quite a bit to add, namely:

Ability to change delimiter between column name and aggregation metric
Ability to have aggregation metric first (people might like count_state instead of state_count)
raise an error when it can't be determined what column people want to use as the grouping column (usually is the case when there's a column name which is a subset of another column name in the dataframe)
ability to explicitly choose aggregation metric (I.e. put {'state': 'count'} in the list instead of 'state_count')
ability to have custom functions. This would work with the {'state': 'count'} syntax since one could just say {'state': custom_func} but making 'state_custom_func' work is a bit less trivial.

After these additions I think it will be ready for a PR.

Sep 06 '20 21:09 szuckerman

@szuckerman I think those are great ideas! Though I'd also be in favour of getting something in first, getting it out into the wild, and then letting the feedback come back on what feature might be desired the most. (And then guiding the feature requester to submitting a PR too!) Gradual generalization is better; it's how I've approached most of the functions I've submitted. Your call, though! 😄

Sep 06 '20 22:09 ericmjl

Btw, one of the things I like about the way you have the function setup is that I get a bit of "delightful magic" by specifying auto_agg(["state_count", "state_mean", "state_var"]). For some reason, that feels delightful and magical simultaneously :).

Sep 06 '20 22:09 ericmjl

@szuckerman just checking on this issue... Reread it and it is fascinating... Not sure how far you've gone with it... And if I can contribute

Sep 18 '21 12:09 samukweku

pyjanitor
pyjanitor copied to clipboard

Feature Proposal: Auto Agg method

Brief Description

Example API

pyjanitor pyjanitor copied to clipboard

Feature Proposal: Auto Agg method

Brief Description

Example API

pyjanitor
pyjanitor copied to clipboard