groupby apply is limited to 5000 rows
Apologies if this counts as a duplicate of https://github.com/jmcarpenter2/swifter/issues/202
Setting up the data:
import numpy as np
import pandas as pd

data = {f'col{l}': [np.array([i, j, k, l]) for i in range(11) for j in range(31) for k in range(15)] for l in range(4)}
df = pd.DataFrame(data, index=index)
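Note that `index` is not defined in the snippet above. A minimal sketch of one way to build it, assuming it was a three-level MultiIndex over the same three ranges (this is a guess that matches the 11 * 31 * 15 rows in `data`, not something the report confirms); it would need to run before the `pd.DataFrame` call:

```python
import pandas as pd

# ASSUMPTION: the report does not show how `index` was created.
# A MultiIndex over the same ranges used for `data` has a matching length,
# and its first level (values 0..10) is what groupby(level=0) would group on.
index = pd.MultiIndex.from_product([range(11), range(31), range(15)])

print(len(index))  # 5115
```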
Now this fails:
df.iloc[:5001].swifter.groupby(level=0, group_keys=False).apply(lambda x: x)
But it succeeds with only 5000 rows. The traceback for the failing call:
```
File ~/projects/backend/.venv/lib/python3.11/site-packages/swifter/swifter.py:638, in GroupBy.apply(self, func, *args, **kwds)
    635 return self._obj_pd.groupby(self._by, axis=self._axis, **self._grpby_kwargs).apply(func, *args, **kwds)
    637 # Swifter logic can't accurately estimate groupby applies, so always parallelize
--> 638 return self._ray_apply(func, *args, **kwds)

File ~/projects/backend/.venv/lib/python3.11/site-packages/swifter/swifter.py:622, in GroupBy._ray_apply(self, func, *args, **kwds)
    619 def _ray_apply(self, func, *args, **kwds):
    620 import ray
--> 622 chunks = self._get_chunks()
    623 ray_submit_apply = partial(self._ray_submit_apply, chunks=chunks, func=func, *args, **kwds)
    624 apply_chunks = (
    625 self._ray_progress_apply(ray_submit_apply, len(chunks)) if self._progress_bar else ray_submit_apply()
    626 )

File ~/projects/backend/.venv/lib/python3.11/site-packages/swifter/swifter.py:591, in GroupBy._get_chunks(self)
    590 def _get_chunks(self):
--> 591 subset_df = self._obj_pd.index if self._grpby_index else self._obj_pd[self._by[0]]
    592 unique_groups = subset_df.unique()
    593 n_splits = min(len(unique_groups), self._npartitions)

TypeError: 'NoneType' object is not subscriptable
```
Any insight appreciated :)
Hey @davhin, thanks for raising this issue and providing a reproducible example. This is an oversight in my implementation of the groupby apply: I only ensured that the `by` parameter worked and failed to incorporate the `level` parameter appropriately. I really appreciate you finding this. I will work on a patch shortly.
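For illustration, here is a rough sketch of what chunking on an index level (rather than on a `by` column) could look like in plain pandas. The helper `chunks_by_level` and its parameters are hypothetical names made up for this example; it is not swifter's implementation or the eventual patch, only the general idea of taking the group keys from `level` and splitting them into partitions without breaking up any group:

```python
import numpy as np
import pandas as pd


def chunks_by_level(df, level=0, npartitions=4):
    """Split df into at most npartitions frames, keeping each group whole.

    Hypothetical helper for illustration only; not part of swifter.
    """
    # Take the group keys from the index level instead of a `by` column.
    keys = df.index.get_level_values(level)
    unique_groups = keys.unique().to_numpy()
    n_splits = min(len(unique_groups), npartitions)
    # Partition the unique keys, then select the rows that belong to each part.
    parts = np.array_split(unique_groups, n_splits)
    return [df[keys.isin(part)] for part in parts]


# Small example: 3 groups on level 0, 2 rows each, split into 2 chunks.
index = pd.MultiIndex.from_product([range(3), range(2)])
df = pd.DataFrame({"x": range(6)}, index=index)

chunks = chunks_by_level(df, level=0, npartitions=2)
result = pd.concat(
    [chunk.groupby(level=0, group_keys=False).apply(lambda g: g) for chunk in chunks]
)
```

Each chunk can then be grouped with `level=` independently, which is the part a parallel backend would distribute.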
Oh, thank you so much! Glad the example was of service
Any updates on this? I found that this still occurs with 1.4.0.