[BUG] `cats` argument does not behave correctly for `cudf.get_dummies`

Open isVoid opened this issue 2 years ago • 1 comment

Describe the bug

cudf.get_dummies does not function correctly when the `cats` argument is specified 🐱.

Steps/Code to reproduce bug

In [1]: import cudf
In [2]: df = cudf.DataFrame({'col': list('abcdef')})
In [4]: cudf.get_dummies(df, cats={"a": ['a', 'c', 'f']})
Out[4]: 
   col_a  col_b  col_c  col_d  col_e  col_f
0      1      0      0      0      0      0
1      0      1      0      0      0      0
2      0      0      1      0      0      0
3      0      0      0      1      0      0
4      0      0      0      0      1      0
5      0      0      0      0      0      1

Expected behavior

Per the documentation, get_dummies should only encode a, c, and f. However, since pd.get_dummies does not support this argument, my suggestion is that we remove it.
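For reference, one pandas-compatible way to get the restricted encoding without `cats` is to cast the column to a categorical dtype that lists only the wanted categories. The sketch below uses pandas; whether cudf.get_dummies treats categorical columns the same way is an assumption and has not been verified here.

```python
import pandas as pd

df = pd.DataFrame({"col": list("abcdef")})

# Declare only the categories that should be encoded.
# Values outside this list (b, d, e) become missing after the cast.
wanted = pd.CategoricalDtype(categories=["a", "c", "f"])
restricted = df.assign(col=df["col"].astype(wanted))

# get_dummies emits one indicator column per declared category.
encoded = pd.get_dummies(restricted, columns=["col"])
print(list(encoded.columns))
```

Rows whose value is not among the declared categories end up with all indicator columns set to zero, which matches the behavior the documentation describes for `cats`.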

Additional Information

Since the libcudf one-hot-encoding API encodes the column in a single contiguous buffer, the size of the returned buffer is limited by the maximum addressable column size of libcudf, std::numeric_limits<cudf::size_type>::max(). Removing this argument may prevent users from doing manual batching. Instead, the Python API should gracefully handle this situation internally.
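To illustrate the batching concern, here is a rough sketch of how categories could be split so that each encoded batch stays under the limit. The helper name `plan_category_batches` is hypothetical and not part of cudf. It assumes, per the description above, that the encoded buffer holds n_rows * n_categories elements.

```python
# cudf::size_type is a 32-bit signed integer, so the limit is 2**31 - 1 elements.
SIZE_TYPE_MAX = 2**31 - 1


def plan_category_batches(n_rows, categories, limit=SIZE_TYPE_MAX):
    """Split categories into chunks so that n_rows * len(chunk) <= limit.

    Hypothetical helper for illustration only; not a cudf API.
    """
    cats = list(categories)
    if n_rows <= 0 or not cats:
        # Nothing to encode, so no batching is required.
        return [cats] if cats else []
    per_batch = limit // n_rows
    if per_batch == 0:
        raise ValueError("encoding a single category already exceeds the limit")
    return [cats[i:i + per_batch] for i in range(0, len(cats), per_batch)]


# Small limit chosen only to make the chunking visible.
batches = plan_category_batches(n_rows=4, categories=list("abcdef"), limit=8)
print(batches)
```

Each batch could then be encoded separately and the resulting columns concatenated, which is roughly what a user would otherwise do by hand with `cats`.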

isVoid, Aug 02 '22 20:08

cc @VibhuJawa for potential downstream influences.

isVoid, Aug 02 '22 20:08

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot], Sep 02 '22 17:09