[BUG] `cats` argument does not behave correctly for `cudf.get_dummies`
Describe the bug
`cudf.get_dummies` does not function correctly when specified with `cats` 🐱.
Steps/Code to reproduce bug
In [1]: import cudf
In [2]: df = cudf.DataFrame({'col': list('abcdef')})
In [4]: cudf.get_dummies(df, cats={"a": ['a', 'c', 'f']})
Out[4]:
   col_a  col_b  col_c  col_d  col_e  col_f
0      1      0      0      0      0      0
1      0      1      0      0      0      0
2      0      0      1      0      0      0
3      0      0      0      1      0      0
4      0      0      0      0      1      0
5      0      0      0      0      0      1
Expected behavior
Per the documentation, `get_dummies` should only encode `a`, `c`, and `f`. However, since `pd.get_dummies` does not support this argument, my suggestion is that we remove it.
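To make the documented expectation concrete, here is a minimal pure-Python sketch of encoding only the requested categories. `encode_selected` is a hypothetical helper written for illustration; it is not part of cudf or pandas and says nothing about how cudf implements `cats`.

```python
def encode_selected(values, categories, prefix):
    """Return {column_name: [0/1, ...]} with one column per requested category.

    Hypothetical reference helper: values not listed in `categories`
    simply produce a row of zeros.
    """
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


encoded = encode_selected(list("abcdef"), ["a", "c", "f"], prefix="col")
for name, column in encoded.items():
    print(name, column)
```

Under this reading, the result has only the columns `col_a`, `col_c`, and `col_f`, and the rows for `b`, `d`, and `e` are all zeros.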
Additional Information
Since the libcudf one-hot-encoding API encodes the column into a single contiguous buffer, the size of the returned buffer is limited by the maximum addressable column size of libcudf, `std::numeric_limits<cudf::size_type>::max()`. Removing this argument may prevent users from doing manual batching. Instead, the Python API should gracefully handle this situation internally.
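As a rough illustration of what internal batching could look like, the sketch below splits the category list so that each batch stays under the size limit. It assumes `cudf::size_type` is a 32-bit signed integer and that the contiguous buffer holds `n_rows * n_categories` elements; `category_batches` and `SIZE_TYPE_MAX` are hypothetical names, not cudf APIs.

```python
SIZE_TYPE_MAX = 2**31 - 1  # assumption: cudf::size_type is a 32-bit signed int


def category_batches(categories, n_rows, max_elements=SIZE_TYPE_MAX):
    """Split `categories` into chunks whose encoded size fits in one buffer.

    Assumes each encoded batch needs n_rows * len(batch) elements.
    """
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    per_batch = max_elements // n_rows  # categories that fit in one buffer
    if per_batch == 0:
        raise ValueError("a single encoded column already exceeds the limit")
    return [
        categories[i:i + per_batch]
        for i in range(0, len(categories), per_batch)
    ]


# Small limit chosen only to make the batching visible.
batches = category_batches(list("abcdef"), n_rows=4, max_elements=10)
print(batches)
```

Each batch could then be encoded separately and the resulting columns concatenated. Whether that is the right strategy inside cudf is a design question for the maintainers.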
cc @VibhuJawa for potential downstream influences.
This issue has been labeled `inactive-30d` due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled `inactive-90d` if there is no activity in the next 60 days.