danfojs icon indicating copy to clipboard operation
danfojs copied to clipboard

NaN values don't get skipped on groupby aggregations.

Open kitsunde opened this issue 2 years ago • 4 comments

Hello, maybe I'm missing something but groupby aggregations behave inconsistently with DataFrame aggregations, which is a challenge since setting NaN values to a default like 0 behaves differently depending on the aggregations.

I will explore using .apply to work around this as mentioned in https://github.com/javascriptdata/danfojs/issues/187#issuecomment-827531989

To Reproduce

const testDf = new dfd.DataFrame({ group: ['A', 'A'], val1: [1, NaN], val2: [2, 3] });
testDf.groupby(['group']).mean().print();

results in val1_mean being NaN. I.e.

╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
║            │ group             │ val1_mean         │ val2_mean         ║
╟────────────┼───────────────────┼───────────────────┼───────────────────╢
║ 0          │ A                 │ NaN               │ 2.5               ║
╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

Expected behavior

To aggregate like:

testDf.loc({ columns: ['val1', 'val2'] }).mean({ axis: 0 }).print();

i.e.

╔══════╤═════╗
║ val1 │ 1   ║
╟──────┼─────╢
║ val2 │ 2.5 ║
╚══════╧═════╝

kitsunde avatar Mar 09 '22 18:03 kitsunde

The current behaviour I think is fine. Pandas does give an dropna option. It could be implemented, so the behaviour could be extended to this:

testDf.groupby(['group'], { dropna: true }).mean().print();
// ╔════════════╤═══════════════════╤═══════════════════╤═══════════════════╗
// ║            │ group             │ val1_mean         │ val2_mean         ║
// ╟────────────┼───────────────────┼───────────────────┼───────────────────╢
// ║ 0          │ A                 │ 1.0               │ 2.5               ║
// ╚════════════╧═══════════════════╧═══════════════════╧═══════════════════╝

I don't know is something like that would be what you need.

Anyway, why do you say that groupby aggregations behave inconsistently? Can you give another example with a inconsistent result?

igonro avatar Mar 10 '22 06:03 igonro

Ah sorry, I copy-pasted my code poorly into the expected example, I've updated it. What I mean is that.

testDf.loc({ columns: ['val1', 'val2'] }).mean({ axis: 0 }).print();

Will ignore NaN but

testDf.groupby(['group']).mean().print();

will not. The challenge is that if I need to do something like:

agg({
  installs: ['sum'],
  value: ['mean', 'sum']
});

Different columns can have NaN on different rows, and it seems like dropna() will drop rows or columns if there's a NaN in any position, while to be able to use agg it would need to cover the values only on the column being aggregated.

The pandas option seem different from what I'm suggesting. That will drop columns and rows, but I'm just specifically talking about the aggregation behaviour of NaN values. If installs is NaN, but value is not. I would still want to aggregate on the value of that row.

Does this make sense?

kitsunde avatar Mar 10 '22 06:03 kitsunde

In pandas:

import pandas as pd

d = {'group': ['A', 'A'], 'col1': [1, None], 'col2': [2, 3]}
df = pd.DataFrame(data=d)
df.groupby(['group']).mean()
df.groupby(['group']).agg({
  'col1': ['mean'],
  'col2': ['mean']
})

Both mean and agg here skips None values by default.

Repl: https://replit.com/@kitsunde/pandas-none-handling-in-groupby

kitsunde avatar Mar 10 '22 07:03 kitsunde