tidypandas
Error in slice_head when n exceeds the size of any group. Behavior differs from dplyr.
In [1]: import tidypandas.tidy_accessor as tp
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({"a":[1,1,1,2], "b": [1,2,3,4]})
In [4]: df.tp.slice_head(n=2, by="a")
Minimum group size is 1
/Users/a0r0qfj/py_envs/python3.10.7/lib/python3.10/site-packages/astroid/node_classes.py:94: DeprecationWarning: The 'astroid.node_classes' module is deprecated and will be replaced by 'astroid.nodes' in astroid 3.0.0
warnings.warn(
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In [4], line 1
----> 1 df.tp.slice_head(n=2, by="a")
File ~/py_envs/python3.10.7/lib/python3.10/site-packages/tidypandas/tidy_accessor.py:382, in tp.slice_head(self, n, prop, rounding_type, by)
375 def slice_head(self
376 , n = None
377 , prop = None
378 , rounding_type = "round"
379 , by = None
380 ):
381 tf = tidyframe(self._obj, copy = False, check = False)
--> 382 return tf.slice_head(n = n
383 , prop = prop
384 , rounding_type = rounding_type
385 , by = by
386 ).to_pandas(copy = False)
File ~/py_envs/python3.10.7/lib/python3.10/site-packages/tidypandas/tidyframe_class.py:4124, in tidyframe.slice_head(self, n, prop, rounding_type, by)
4122 if n > min_group_size:
4123 print("Minimum group size is ", min_group_size)
-> 4124 assert n <= min_group_size,\
4125 "arg 'n' should not exceed the size of any chunk after grouping"
4127 ro_name = _generate_new_string(cn)
4128 res = (self.group_modify(lambda x: x.slice(np.arange(n))
4129 , by = by
4130 , preserve_row_order = True
4131 , row_order_column_name = ro_name
4132 )
4133 )
AssertionError: arg 'n' should not exceed the size of any chunk after grouping
The same operation in R doesn't throw an error. Instead, each group returns min(size of the group, n) rows:
> library(tidyverse)
> df = tibble(a=c(1,1,1,2), b=c(1,2,3,4))
> df %>% group_by(a) %>% slice_head(n=2) %>% ungroup()
# A tibble: 3 × 2
      a     b
  <dbl> <dbl>
1     1     1
2     1     2
3     2     4
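For comparison, plain pandas already behaves like dplyr here. This is a minimal sketch using only `DataFrame.groupby(...).head(n)`, not tidypandas itself; it keeps at most n rows per group, so it can serve as a workaround until the behavior is settled.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2], "b": [1, 2, 3, 4]})

# groupby(...).head(n) keeps at most n rows per group and preserves
# the original row order; groups smaller than n are returned whole.
result = df.groupby("a").head(2).reset_index(drop=True)
print(result)
```

The group a == 2 has only one row, so it contributes that single row and the result has three rows, matching the tibble above.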
This was done intentionally.
Design question: if a user asks for, say, 5 rows per group and we can't provide them, should we raise an error saying so, or silently return what we can?
I think head should give what it can. That makes it easier when you are doing data exploration. Perhaps you could add an argument that changes the behavior so that it raises an error if n is larger than the size of any group. This would be useful inside functions, so you know what to expect.
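One way the suggested argument could look is sketched below. The helper `slice_head_by` and its `strict` parameter are hypothetical names used only for illustration; they are not part of tidypandas, and the sketch is written against plain pandas rather than the tidyframe internals.

```python
import pandas as pd


def slice_head_by(df, n, by, strict=False):
    """Return the first n rows of each group (hypothetical helper).

    strict=False: return min(group size, n) rows per group, like dplyr.
    strict=True:  raise if any group has fewer than n rows.
    """
    if strict:
        min_group_size = int(df.groupby(by).size().min())
        if n > min_group_size:
            raise ValueError(
                f"n={n} exceeds the minimum group size ({min_group_size})"
            )
    # head(n) keeps the original row order and truncates short groups
    return df.groupby(by).head(n).reset_index(drop=True)


df = pd.DataFrame({"a": [1, 1, 1, 2], "b": [1, 2, 3, 4]})

# Lenient default: the single-row group is returned as-is
lenient = slice_head_by(df, n=2, by="a")
print(lenient)
```

Calling `slice_head_by(df, n=2, by="a", strict=True)` on the same frame raises a `ValueError`, which is close to the current assertion-based behavior. Defaulting to the lenient mode favors interactive exploration, while the strict mode gives callers inside functions a predictable row count.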