datar
mean() with option `na_rm=False` does not work
Please consider the MWE below:
from datar.all import *
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'id': ['A'] * 2 + ['B'] * 2,
    'date': ['2020-01-01', '2020-02-01'] * 2,
    'value': [2, np.nan, 3, 3],
})
df
df_mean = (df
    >> group_by(f.id)
    >> summarize(
        # value_np_nanmean = np.nanmean(f.value),
        value_np_mean = np.mean(f.value),
        value_datar_mean = mean(f.value, na_rm=False),
    )
)
df_mean

In df_mean, the first observation of value_np_mean and value_datar_mean should be NaN instead of 2.0.
This is the same issue found in pandas, which silently discards NaN/None observations during calculations.
The only workaround I found is this: https://stackoverflow.com/questions/54106112/pandas-groupby-mean-not-ignoring-nans/54106520
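For reference, that Stack Overflow workaround boils down to bypassing the groupby reduction and calling Series.mean(skipna=False) per group via apply. A plain-pandas sketch (no datar involved):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A'] * 2 + ['B'] * 2,
    'value': [2, np.nan, 3, 3],
})

# groupby().mean() always skips NaN; routing each group's raw Series
# through apply() with skipna=False lets the NaN in group A propagate
kept = df.groupby('id')['value'].apply(lambda s: s.mean(skipna=False))
```

Here kept['A'] is NaN and kept['B'] is 3.0, matching R's mean(na.rm=FALSE) semantics.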
pandas ignores NAs anyway in a groupby -> agg chain:
>>> df.groupby('id').agg(np.mean)
        value
id  <float64>
A         2.0
B         3.0
>>> df.groupby('id').agg(np.nanmean)
        value
id  <float64>
A         2.0
B         3.0
Actually, the NAs in the first case should not be ignored, but pandas does that anyway.
I think this is also related:
https://github.com/pandas-dev/pandas/issues/15675 https://github.com/pandas-dev/pandas/issues/15674 https://github.com/pandas-dev/pandas/issues/20824
The current solution for datar is: don't try to optimize with agg when na_rm is False.
Hey @pwwang , thanks for the very good feedback as always.
Yes, it is the standard behavior of Pandas, which is wrong and dreadful in my opinion...
Since datar uses Pandas under the hood, I suppose it will be difficult for you to solve this issue, right?
Python would be much better if, someday in the future, https://github.com/h2oai/datatable could replace pandas as the default data library.
It's fixed by https://github.com/pwwang/datar/commit/ba8b3e712a3dc4dfc9a0dae43e396aa94aa774e7, and will be released in the next version. It's just a matter of how we want pandas to do it. If we do it like:
>>> df.groupby('id').agg(value=('value', lambda x: mean(x)))
        value
id  <float64>
A         NaN
B         3.0
But then we lose pandas' optimization on mean. With this fix, if people still want to take advantage of the optimization, one could do:
df >> group_by(f.id) >> summarise(m=mean(f.value, na_rm=True))
# since pandas loses NAs anyway
This needs to be documented, for sure.
For the datatable backend, I need to dive into it to see whether we can/need to replace pandas with it.
Great man! Thank you
Hey @pwwang , I believe this issue regressed.
In the latest datar version, this
import datar.all as d
from datar import f
import numpy as np
import pandas as pd
df = pd.DataFrame({
'id': ['A']*2 + ['B']*2,
'date':['2020-01-01','2020-02-01']*2,
'value': [2,np.nan,3,3]
})
df >> d.group_by(f.id) >> d.summarise(m=d.mean(f.value, na_rm=True))
returns:
TypeError: mean() got an unexpected keyword argument 'na_rm'
Also, this used to work:
df_mean = (df
>> d.group_by(f.id)
>> d.summarize(
value_np_nanmean = np.nanmean(f.value),
value_np_mean = np.mean(f.value),
)
)
But now it throws these errors, respectively:
ValueError: invalid __array_struct__
TypeError: GroupBy.mean() got an unexpected keyword argument 'axis'
It's all because GroupBy.mean() and the like don't have a skipna argument, and pandas won't keep NAs anyway for groupby data.
See: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.mean.html
Let's say we have gf = df >> d.group_by(f.id). The NAs are ignored anyway even when we do gf.value.agg(np.nanmean) (ridiculous, huh?)
In the old days, datar used apply on grouped data, which made it easier to customize but sacrificed performance. Now mean(f.value) is actually transformed into gf.value.agg('mean') to maintain performance, at the sacrifice of functionality (keeping NAs, for example).
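The agg-vs-apply trade-off can be seen with plain pandas alone: agg ends up skipping NaN regardless of the callable passed, while apply hands each group's raw Series to the function, so NaN can survive. A small demo (names via_agg/via_apply are just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A'] * 2 + ['B'] * 2,
    'value': [2, np.nan, 3, 3],
})
gf = df.groupby('id')['value']

# agg: NaN is skipped no matter what function we pass
via_agg = gf.agg(np.nanmean)

# apply: the function sees each group's raw Series, so NaN propagates
via_apply = gf.apply(lambda s: s.mean(skipna=False))
```

via_agg reports 2.0 for group A, while via_apply keeps it as NaN.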
Now, with datar v0.15.3, datar-pandas v0.5.3, value_np_nanmean = np.nanmean(f.value) should work, but keep in mind that it is implemented with apply, performance may be compromised.
We should be able to support the na_rm argument; we just need to expand:
func_bootstrap(mean, func=np.mean, kind="agg")
https://github.com/pwwang/datar-pandas/blob/779a272a15c0d82e37e9025312f45836ccb10210/datar_pandas/api/base/arithm.py#L63
into functions on different types of objects (i.e. Series, SeriesGroupBy). The func_bootstrap just does it automatically.
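A minimal sketch of what such an expansion could look like, in plain pandas. The name mean_ and its body are hypothetical, not the actual datar-pandas implementation: it dispatches on the object type, keeps the fast agg path when na_rm=True, and falls back to apply when NaNs must be preserved:

```python
import numpy as np
import pandas as pd
from pandas.core.groupby import SeriesGroupBy

def mean_(x, na_rm: bool = False):
    """Hypothetical na_rm-aware mean over a Series or SeriesGroupBy."""
    if isinstance(x, SeriesGroupBy):
        if na_rm:
            # fast cythonized path; skips NaN, which is what na_rm=True wants
            return x.agg('mean')
        # slower apply path, but NaN propagates like R's mean(na.rm=FALSE)
        return x.apply(lambda s: s.mean(skipna=False))
    return pd.Series(x).mean(skipna=na_rm)

df = pd.DataFrame({'id': ['A'] * 2 + ['B'] * 2,
                   'value': [2, np.nan, 3, 3]})
gf = df.groupby('id')['value']
fast = mean_(gf, na_rm=True)   # A -> 2.0, B -> 3.0
slow = mean_(gf, na_rm=False)  # A -> NaN, B -> 3.0
```

The same dispatch idea would apply to sum, median, etc.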
I am just short of time to do that.
Pandas3 will require pyarrow as the backend, which supports Nullable datatypes. I believe this won't be a problem then.
The NAs are ignored anyway even when we do gf.value.agg(np.nanmean) (ridiculous, huh?)
@pwwang , Yes, man, this is ridiculous; another reason why I hate pandas. The tidyverse design is on a completely different level, that software is a work of art, and that's why a package like datar is so important in Python.
I am just short of time to do that.
No problem.
Pandas3 will require pyarrow as the backend, which supports Nullable datatypes. I believe this won't be a problem then.
Yes, let's see if it will be finally solved.
Thank you