
groupby apply is limited to 5000 rows

Open xstraven opened this issue 2 years ago • 3 comments

apologies if this counts as a duplicate of https://github.com/jmcarpenter2/swifter/issues/202

setting up the data:

```python
import numpy as np
import pandas as pd

data = {f'col{l}': [np.array([i, j, k, l]) for i in range(11) for j in range(31) for k in range(15)] for l in range(4)}
df = pd.DataFrame(data, index=index)  # `index` is not defined in this snippet
```

Now this fails: `df.iloc[:5001].swifter.groupby(level=0, group_keys=False).apply(lambda x: x)`, but it succeeds with only 5000 rows.


```
File ~/projects/backend/.venv/lib/python3.11/site-packages/swifter/swifter.py:638, in GroupBy.apply(self, func, *args, **kwds)
    635     return self._obj_pd.groupby(self._by, axis=self._axis, **self._grpby_kwargs).apply(func, *args, **kwds)
    637 # Swifter logic can't accurately estimate groupby applies, so always parallelize
--> 638 return self._ray_apply(func, *args, **kwds)

File ~/projects/backend/.venv/lib/python3.11/site-packages/swifter/swifter.py:622, in GroupBy._ray_apply(self, func, *args, **kwds)
    619 def _ray_apply(self, func, *args, **kwds):
    620     import ray
--> 622     chunks = self._get_chunks()
    623     ray_submit_apply = partial(self._ray_submit_apply, chunks=chunks, func=func, *args, **kwds)
    624     apply_chunks = (
    625         self._ray_progress_apply(ray_submit_apply, len(chunks)) if self._progress_bar else ray_submit_apply()
    626     )

File ~/projects/backend/.venv/lib/python3.11/site-packages/swifter/swifter.py:591, in GroupBy._get_chunks(self)
    590 def _get_chunks(self):
--> 591     subset_df = self._obj_pd.index if self._grpby_index else self._obj_pd[self._by[0]]
    592     unique_groups = subset_df.unique()
    593     n_splits = min(len(unique_groups), self._npartitions)

TypeError: 'NoneType' object is not subscriptable
```

Any insight appreciated :)
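
One plausible reading of the traceback, sketched in plain Python: small inputs take the plain pandas path and never touch `by`, while larger inputs reach the chunking step, where `by[0]` is evaluated even though `by` is `None` when only `level=` was passed. Everything below is a hypothetical stand-in, not swifter's code. The function names and the 5000-row cutoff are assumptions inferred from the behaviour reported above.

```python
# Hypothetical stand-in for the dispatch seen in the traceback.
# ASSUMED_PANDAS_ROW_LIMIT is inferred from the report (5000 rows work,
# 5001 fail); it is not taken from swifter's source.
ASSUMED_PANDAS_ROW_LIMIT = 5000


def get_group_labels(index, columns, by, group_on_index):
    # Same shape as the failing line: the else branch subscripts `by`.
    return index if group_on_index else columns[by[0]]


def grouped_apply_path(n_rows, index, columns, by=None, group_on_index=False):
    if n_rows <= ASSUMED_PANDAS_ROW_LIMIT:
        # Small input: plain pandas path, `by` is never subscripted.
        return "pandas"
    # Larger input: the chunking step needs the group labels first.
    get_group_labels(index, columns, by, group_on_index)
    return "parallel"


index = list(range(5001))
columns = {"col0": [0] * 5001}

print(grouped_apply_path(5000, index[:5000], columns))  # takes the pandas path

try:
    # by=None, as it would be for groupby(level=0)
    grouped_apply_path(5001, index, columns)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

If this reading is right, the 5000-row boundary is not a limit on groupby apply itself. It is simply the point at which the parallel code path, and with it the `by`-only lookup, starts to run.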

xstraven Jul 11 '23 08:07

Hey @davhin, thanks for raising this issue and providing a reproducible example. This is an oversight in my implementation of the groupby apply: I only ensured that the `by` parameter worked and failed to incorporate the `level` parameter appropriately. I really appreciate you finding this. I will work on a patch shortly.
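
For illustration only, here is a minimal sketch of what choosing group labels from either `by` or `level` could look like. It is not the maintainer's patch and does not use swifter's internals. The function `select_group_labels` and its plain list/dict inputs are hypothetical stand-ins for a DataFrame's index and columns.

```python
def select_group_labels(index, columns, by=None, level=None):
    """Return one grouping label per row, taken from `by` or from `level`.

    `index` is a list of index entries (tuples for a multi-level index) and
    `columns` maps column names to lists of values. Both are simplified,
    hypothetical stand-ins for a DataFrame.
    """
    if by is not None:
        # Accept a single column name or a list of names (first one is used).
        key = by[0] if isinstance(by, (list, tuple)) else by
        return list(columns[key])
    if level is not None:
        # Multi-level entries are tuples: pick the requested level.
        return [entry[level] if isinstance(entry, tuple) else entry for entry in index]
    raise TypeError("either 'by' or 'level' must be given")


index = [(0, 0), (0, 1), (1, 0)]
columns = {"key": ["a", "b", "a"]}

labels_from_level = select_group_labels(index, columns, level=0)
labels_from_by = select_group_labels(index, columns, by="key")
print(labels_from_level, labels_from_by)
```

The point of the sketch is the explicit `level` branch: the label lookup no longer assumes that `by` is always set.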

jmcarpenter2 Jul 20 '23 16:07

Oh, thank you so much! Glad the example was of service

xstraven Jul 20 '23 19:07

Any updates on this? Found that this still applies with 1.4.0

KeremAslan Aug 19 '24 11:08