
BUG: `pd.Grouper` creates empty groups (and in doing so is inconsistent with `groupby`) with `pd.DatetimeIndex`

Open bollard opened this issue 1 year ago • 14 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd


# note, does not include today
day_offset = [-2, -1, 1, 2]
timestamps = [pd.Timestamp.now() + pd.Timedelta(days=d) for d in day_offset]

df_ts = pd.DataFrame(
    data={"offset": day_offset},
    index=pd.Index(timestamps, name="timestamp")
)

print("\n===============")
print("{}\n".format(df_ts))

# want to group by the day
grouper = pd.Grouper(level="timestamp", freq="D")
grouped = df_ts.groupby(grouper)
print("[Grouper] Groups: {}, Indices: {}\n".format(len(grouped.groups), len(grouped.indices)))

for key, df_group in grouped:
    print("{}: {}".format(key, len(df_group)))
    
    if key.date() == pd.Timestamp.now().date():
        print("...?") # unexpected!

df_date = pd.DataFrame(
    data={"offset": day_offset},
    index=pd.Index(df_ts.index.date, name='date')
)

print("\n===============")
print("{}\n".format(df_date))

grouped = df_date.groupby('date')
print("[Groupby] Groups: {}, Indices: {}\n".format(len(grouped.groups), len(grouped.indices)))

for key, df_group in grouped:
    print("{}: {}".format(key, len(df_group)))

Issue Description

Hello,

In the example above, I am trying to group datetimes down to dates using pd.Grouper(..., freq="D"); however, this creates a key for a date which doesn't exist in the data. This behaviour differs from using groupby on dates directly, which does not create the missing key. Note this also produces a mismatch between the sizes of .groups and .indices, as the latter does not include the empty group.

This creates unexpected behaviour when you try to loop over the groups later (and have more groups than you'd expect).

This is also not particularly easy to handle on the user end (if we do not wish to reconstruct the groupby object). Instead I have resorted to looping over grouped.indices.keys() and then using get_group to avoid the empty group.
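The workaround described above can be sketched as follows, reusing the reproducer's data (the group counts in the comments assume the four offsets around today, i.e. five daily bins with "today" empty):

```python
import pandas as pd

day_offset = [-2, -1, 1, 2]
timestamps = [pd.Timestamp.now() + pd.Timedelta(days=d) for d in day_offset]
df_ts = pd.DataFrame(
    {"offset": day_offset},
    index=pd.Index(timestamps, name="timestamp"),
)

grouped = df_ts.groupby(pd.Grouper(level="timestamp", freq="D"))

# .groups has 5 keys (one per daily bin, including the empty "today"),
# while .indices only has keys that map to at least one row, so
# iterating over .indices skips the empty bin
for key in grouped.indices:
    df_group = grouped.get_group(key)
    print(key, len(df_group))  # every group printed here is non-empty
```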

Expected Behavior

The empty group would not be created with the pd.Grouper

Installed Versions

commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.10.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 1.4.3
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 61.2.0
pip : 22.1.2
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

bollard avatar Aug 04 '22 11:08 bollard

I believe the issue is 2-fold:

  • Using pd.Grouper(..., freq="D") creates an empty group when a date is missing. This, I think, is by design because we set freq="D"; using pd.Grouper(..., freq=None) does not lead to this issue. Setting freq="D" makes this more like a resampling operation, e.g. df_ts.resample("D").sum(), which will also yield the empty date.
  • The second issue is the mismatch between .groups and .indices. This too I believe is not a bug: it is in line with the same operations on .resample("1D").
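The resample parallel can be shown with a small fixed example (illustrative dates chosen here, not taken from the original reproducer): a one-day gap in the index yields an empty daily bin from .resample("D"), just as it does from pd.Grouper(freq="D"):

```python
import pandas as pd

# four rows on four dates, with a gap on 2022-08-04
idx = pd.to_datetime(["2022-08-02", "2022-08-03", "2022-08-05", "2022-08-06"])
df = pd.DataFrame({"offset": [-2, -1, 1, 2]}, index=idx)

# resample('D') emits a bin for every calendar day spanned by the index,
# including an empty bin for the missing 2022-08-04
counts = df.resample("D").size()
print(counts)
```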

pratyushsharan avatar Aug 08 '22 07:08 pratyushsharan

Thanks for your comments and taking a look at the issue.

On the first point, I think it is at least unexpected that pd.Grouper(..., freq="D") creates an empty group on a missing date. I did not see any mention of this behaviour in the documentation. Further, using pd.Grouper(..., freq=None) does not solve the issue, as I want to group by the date component.

I can't comment on the design of .resample('1D'), but having just looked at the documentation I again see no mention of "missing" or "empty", so again at the very least I find the behaviour somewhat surprising.

I also think it is quite surprising to the user that the behaviour of pd.Grouper(..., freq="D") and .groupby("date") are different when both are resampling operations (especially when the first example in the docs for pd.Grouper starts with "Syntactic sugar for df.groupby('A')").

bollard avatar Aug 08 '22 07:08 bollard

Further, using pd.Grouper(..., freq=None) does not solve the issue, as I want to group by the date component.

Can you please elaborate? pd.Grouper(..., freq=None) will group by whatever ... corresponds to, in this case the timestamps/dates. Are you suggesting that if the indices are at a different frequency (say, tick-by-tick data) and we want to group by days, then pd.Grouper(..., freq='D') will include any possible empty days? If yes, I believe this is still in line with this:

Syntactic sugar for df.groupby('A')

This still holds true if freq=None.

pratyushsharan avatar Aug 08 '22 08:08 pratyushsharan

Sure, apologies if my point here isn't clear.

In the above code snippet, I have pd.DataFrame with timestamp indices (df_ts), and another one with date indices (df_date). In both cases I am trying to aggregate the data (the day_offset column) by the date (i.e. ignoring any time component).

In the df_ts example I use pd.Grouper(..., freq="D") to group by the date; in the df_date example, I use .groupby("date") directly (where .date is simply df_ts.index.date).

My expectation is that these operations should produce the same result, as logically in both cases I am doing the same thing: grouping the data by the date. However, the example shows they in fact produce different results.

bollard avatar Aug 08 '22 08:08 bollard

I see. My point was that you can replicate groupby simply by setting freq=None in Grouper. Setting freq='D' results in a more resampling-like approach. I believe this is more a documentation issue than a bug.
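The claimed equivalence can be sketched like this (illustrative data; freq=None is the Grouper default, so omitting freq has the same effect):

```python
import pandas as pd

dates = pd.to_datetime(["2022-08-02", "2022-08-03", "2022-08-05"]).date
df = pd.DataFrame({"offset": [1, 2, 3]}, index=pd.Index(dates, name="date"))

# with freq=None, pd.Grouper groups by the index values as-is,
# matching .groupby("date") exactly -- no empty bins are invented
g_grouper = df.groupby(pd.Grouper(level="date"))
g_plain = df.groupby("date")  # "date" resolves to the index level name

assert list(g_grouper.groups) == list(g_plain.groups)  # same 3 keys
```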

pratyushsharan avatar Aug 08 '22 08:08 pratyushsharan

Yes, you can replicate a plain .groupby() with a bare pd.Grouper(..., freq=None), but this interoperability fails when you actually try to do something useful...

It still feels like a bug to me (surely both .groupby and pd.Grouper are resampling?), but agreed that at the very minimum the documentation is not clear.

bollard avatar Aug 08 '22 08:08 bollard

Can you give an example of where pd.Grouper(..., freq=None) would fail to match with .groupby()?

as surely both .groupby and pd.Grouper are resampling?

I wouldn't equate both of these to resampling. Resampling is more a frequency conversion operation, as documented here.

pratyushsharan avatar Aug 08 '22 09:08 pratyushsharan

By "fails when you actually try to do something useful", I mean pd.Grouper(..., freq="D") versus .groupby("date"). To me it seems strange that the two are understood to operate the same way when freq=None, but differently when freq != None. To me this is the essence of the issue.

bollard avatar Aug 08 '22 09:08 bollard

Maybe we should just add to the documentation that setting freq != None results in a more resampling-like approach.

pratyushsharan avatar Aug 08 '22 09:08 pratyushsharan

Sure, if that is the expected behaviour then I think that should be

  • Documented (complete with exactly what a "resampling like approach" means for the result)
  • Documented in all places where that behaviour is expected (pd.Grouper, groupby, resample etc. as appropriate)

bollard avatar Aug 08 '22 09:08 bollard

@jreback any thoughts if this should be treated as a bug or a documentation issue (enhancement)?

pratyushsharan avatar Aug 08 '22 09:08 pratyushsharan

I'm running into this too. I think a concrete example might help to set expectations. Take the following data in data.csv (yoinked from a hacker news data set)

id,by,author,time,time_ts,text,parent,deleted,dead,ranking
9734136,,,1434565400,2015-06-17 18:23:20.000000 UTC,,9733698,true,,0
4921158,,,1355496966,2012-12-14 14:56:06.000000 UTC,,4921100,true,,0

I can group this dataset (that consists of two lines) using the following snippet.

import pandas as pd

df = pd.read_csv('./data.csv')

date_col = 'time_ts'

df[date_col] = pd.to_datetime(df[date_col])
grouped = df.set_index(date_col).groupby(pd.Grouper(freq='D'))

for date_group, dataframe in grouped:
    if len(dataframe) == 0:
        print(f'Empty group for {date_group}')
    else:
        print(f'Data for {date_group} : {len(dataframe)}')

print(f'result count {len(grouped)}')

There will be 916 results because every day between the two days that actually exist is included. All but 2 of those 916 results are just empty data frames. That is definitely not what I expected to happen.

I'm not sure what was meant by a "more resampling-like approach", but the documentation definitely makes it seem like this will just group the data by date, and it doesn't mention creating every possible date as well. But even if it did mention creating every possible date, why would anyone want that? I must be very confused about the purpose of the API if that behavior is desirable. At best I could see it being a no-op for cases where you just use aggregation functions and the math happens to work out such that this isn't an issue.
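For completeness, one way to sidestep the 916 bins entirely (a sketch of a possible workaround, not an officially recommended pattern) is to group by the date component of the index rather than a frequency-based Grouper, mirroring the two-row CSV above:

```python
import pandas as pd

# two comments nearly three years apart, as in the CSV above
df = pd.DataFrame(
    {"id": [9734136, 4921158]},
    index=pd.to_datetime(["2015-06-17 18:23:20", "2012-12-14 14:56:06"]),
)

# grouping by the date component only creates keys for dates that
# actually occur, instead of one daily bin spanning the whole range
grouped = df.groupby(df.index.date)
print(len(grouped))  # 2 groups rather than 916
```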

naddeoa avatar Nov 20 '22 06:11 naddeoa

I've read through this issue and I agree with @pratyushsharan that this works as intended, in that it works similarly to resample, though I can also see a case for not getting empty groups out of this operation.

So this is not a bug, but I could see this as a request for a change in the API and/or making the docs clearer. My suggestion is to either rename this issue appropriately or close it and open a new one specifying what is desired.

topper-123 avatar May 08 '23 18:05 topper-123