
mean() with option `na_rm=False` does not work

Open GitHunter0 opened this issue 4 years ago • 8 comments

Please consider the MWE below:

from datar.all import *
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A']*2 + ['B']*2,
    'date':['2020-01-01','2020-02-01']*2,
    'value': [2,np.nan,3,3]
}) 
df

df_mean = (df 
    >> group_by(f.id)
    >> summarize(
        # value_np_nanmean = np.nanmean(f.value),
        value_np_mean = np.mean(f.value),
        value_datar_mean = mean(f.value, na_rm=False)
    )
)
df_mean 


In df_mean, the first observation of value_np_mean and value_datar_mean should be NaN instead of 2. This is the same issue found in pandas, which automatically discards NaN / None observations in grouped calculations. The only workaround I found is this: https://stackoverflow.com/questions/54106112/pandas-groupby-mean-not-ignoring-nans/54106520

GitHunter0 avatar Oct 04 '21 21:10 GitHunter0

pandas ignores NAs anyway in a groupby -> agg chain:

>>> df.groupby('id').agg(np.mean)
       value
            
id <float64>
A        2.0
B        3.0
>>> df.groupby('id').agg(np.nanmean)
       value
            
id <float64>
A        2.0
B        3.0

Strictly speaking, the NAs in the first case should not be ignored, but pandas ignores them anyway.

I think this is also related:

https://github.com/pandas-dev/pandas/issues/15675
https://github.com/pandas-dev/pandas/issues/15674
https://github.com/pandas-dev/pandas/issues/20824

The current solution for datar is to not optimize the call using agg when na_rm is False.
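To make the behavior concrete, here is a minimal sketch in plain pandas (not datar's internal code) that contrasts the built-in group mean with a per-group callable that passes skipna=False. The variable names are only illustrative, and the DataFrame is a trimmed copy of the MWE.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["A"] * 2 + ["B"] * 2,
    "value": [2, np.nan, 3, 3],
})

grouped = df.groupby("id")["value"]

# Built-in group mean: the NaN in group A is skipped, so A comes out as 2.0.
skipped = grouped.mean()

# Per-group callable with skipna=False: the NaN propagates, so A is NaN.
# This goes through apply, so it does not use pandas' optimized mean.
kept = grouped.apply(lambda s: s.mean(skipna=False))

print(skipped)
print(kept)
```

The second form is the slower but NA-preserving path, which is the trade-off discussed in this thread.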

pwwang avatar Oct 04 '21 21:10 pwwang

Hey @pwwang , thanks for the very good feedback as always.

Yes, it is the standard behavior of Pandas, which is wrong and dreadful in my opinion...

Since datar uses Pandas under the hood, I suppose it will be difficult for you to solve this issue, right?

Python would be much better if someday in the future https://github.com/h2oai/datatable could replace Pandas as the default data library

GitHunter0 avatar Oct 04 '21 22:10 GitHunter0

It's fixed by https://github.com/pwwang/datar/commit/ba8b3e712a3dc4dfc9a0dae43e396aa94aa774e7, and will be released in the next version. It's just a matter of how we ask pandas to do it. For example, wrapping mean in a lambda keeps the NAs:

>>> df.groupby('id').agg(value=('value', lambda x: mean(x)))
       value
            
id <float64>
A        NaN
B        3.0

But then we lose pandas' optimization on mean. With this fix, people who still want to take advantage of the optimization can do:

df >> group_by(f.id) >> summarise(m=mean(f.value, na_rm=True))
# since pandas loses NAs anyway

This needs to be documented, for sure.

For the datatable backend, I need to dive into it to see whether we can (or need to) replace pandas with it.

pwwang avatar Oct 04 '21 23:10 pwwang

Great man! Thank you

GitHunter0 avatar Oct 04 '21 23:10 GitHunter0

Hey @pwwang , I believe this issue regressed.

In the latest datar version, this

import datar.all as d
from datar import f
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A']*2 + ['B']*2,
    'date':['2020-01-01','2020-02-01']*2,
    'value': [2,np.nan,3,3]
}) 

df >> d.group_by(f.id) >> d.summarise(m=d.mean(f.value, na_rm=True))

raises:

TypeError: mean() got an unexpected keyword argument 'na_rm'

Also, this used to work:

df_mean = (df 
    >> d.group_by(f.id)
    >> d.summarize(
        value_np_nanmean = np.nanmean(f.value),
        value_np_mean = np.mean(f.value),
    )
)

But now the two expressions throw these errors, respectively:

ValueError: invalid __array_struct__

TypeError: GroupBy.mean() got an unexpected keyword argument 'axis'

GitHunter0 avatar Oct 10 '23 01:10 GitHunter0

It's all because GroupBy.mean() and the like don't have a skipna argument, and pandas won't keep NAs anyway for grouped data.

See: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html

Let's say we have gf = df >> d.group_by(f.id). The NAs are ignored anyway even when we do gf.value.agg(np.nanmean) (ridiculous, huh?)

In the old days, datar used apply on grouped data, which made it easier to customize but sacrificed performance. Now mean(f.value) is actually transformed into gf.value.agg('mean') to maintain performance, at the cost of functionality (keeping NAs, for example).

Now, with datar v0.15.3 and datar-pandas v0.5.3, value_np_nanmean = np.nanmean(f.value) should work, but keep in mind that it is implemented with apply, so performance may be compromised.
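For anyone who needs NAs kept but wants to stay on the built-in aggregation, one possible approach in plain pandas is to compute the fast group mean and then blank out every group that contains a missing value. This is only a sketch of a workaround, not something datar does; the names fast_mean, has_na and strict_mean are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["A"] * 2 + ["B"] * 2,
    "value": [2, np.nan, 3, 3],
})

# Built-in group mean (NaN is skipped inside each group).
fast_mean = df.groupby("id")["value"].mean()

# True for every group that holds at least one missing value.
has_na = df["value"].isna().groupby(df["id"]).any()

# Replace the mean with NaN wherever the group had a missing value,
# which mimics na_rm=False without a per-group Python callable.
strict_mean = fast_mean.mask(has_na)

print(strict_mean)
```

Both intermediate results are indexed by the group key, so mask aligns them by label.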

pwwang avatar Oct 10 '23 22:10 pwwang

We should be able to support the na_rm argument; we just need to expand:

func_bootstrap(mean, func=np.mean, kind="agg")

https://github.com/pwwang/datar-pandas/blob/779a272a15c0d82e37e9025312f45836ccb10210/datar_pandas/api/base/arithm.py#L63

into functions on different types of objects (i.e. Series, SeriesGroupBy). (func_bootstrap just does it automatically.)

I am just short of time to do that.

Pandas3 will require pyarrow as the backend, which supports Nullable datatypes. I believe this won't be a problem then.
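The expansion described above could look roughly like the following. This is a hypothetical sketch using functools.singledispatch from the standard library, not datar-pandas' actual registration mechanism; mean_na and its na_rm parameter are illustrative names and not part of any released API.

```python
from functools import singledispatch

import numpy as np
import pandas as pd
from pandas.core.groupby import SeriesGroupBy


@singledispatch
def mean_na(x, na_rm=False):
    """Fallback for plain sequences and arrays (hypothetical helper)."""
    arr = np.asarray(x, dtype=float)
    return float(np.nanmean(arr)) if na_rm else float(np.mean(arr))


@mean_na.register(pd.Series)
def _(x, na_rm=False):
    # Series.mean supports skipna directly.
    return x.mean(skipna=na_rm)


@mean_na.register(SeriesGroupBy)
def _(x, na_rm=False):
    if na_rm:
        # Fast path: the built-in group mean already skips NAs.
        return x.mean()
    # Slow path: go through apply so that NAs propagate per group.
    return x.apply(lambda s: s.mean(skipna=False))


df = pd.DataFrame({
    "id": ["A"] * 2 + ["B"] * 2,
    "value": [2, np.nan, 3, 3],
})
strict = mean_na(df.groupby("id")["value"])              # na_rm=False
relaxed = mean_na(df.groupby("id")["value"], na_rm=True)
print(strict)
print(relaxed)
```

The point of dispatching on type is that only the na_rm=False case on grouped data pays for apply; every other case keeps pandas' optimized path.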

pwwang avatar Oct 10 '23 22:10 pwwang

The NAs are ignored anyway even when we do gf.value.agg(np.nanmean) (ridiculous, huh?)

@pwwang, yes, man, this is ridiculous, and another reason why I hate pandas. The tidyverse design is on a completely different level; it's a work of art as software, and that's why a package like datar is so important in Python.

I am just short of time to do that.

No problem.

Pandas3 will require pyarrow as the backend, which supports Nullable datatypes. I believe this won't be a problem then.

Yes, let's see if it will be finally solved.

Thank you

GitHunter0 avatar Oct 11 '23 02:10 GitHunter0