[BUG] dask_cudf pivot_table function is broken: TypeError: StringIndex object is not iterable.
Describe the bug
`pivot_table` fails on a dask_cudf DataFrame because the cuDF `StringIndex` does not support iteration:
Steps/Code to reproduce bug
import cudf
import dask_cudf

ddf = dask_cudf.from_cudf(cudf.DataFrame(
    data={
        "A": ["foo", "bar", "bar"],
        "B": ["one", "two", "one"],
        "C": [1, 2, 3],
    }
), npartitions=1)
ddf = ddf.categorize("B")
ddf.pivot_table(index="A", columns="B", values="C")
Error:
TypeError Traceback (most recent call last)
Cell In[3], line 9
1 ddf = dask_cudf.from_cudf(cudf.DataFrame(
2 data={
3 "A": ["foo", "bar", "bar"],
(...)
6 }
7 ), npartitions=1)
8 ddf = ddf.categorize("B")
----> 9 ddf.pivot_table(index="A", columns="B", values="C")
File lib/python3.10/site-packages/dask/dataframe/core.py:6373, in DataFrame.pivot_table(self, index, columns, values, aggfunc)
6352 """
6353 Create a spreadsheet-style pivot table as a DataFrame. Target ``columns``
6354 must have category dtype to infer result's ``columns``.
(...)
6369 table : DataFrame
6370 """
6371 from dask.dataframe.reshape import pivot_table
-> 6373 return pivot_table(
6374 self, index=index, columns=columns, values=values, aggfunc=aggfunc
6375 )
File lib/python3.10/site-packages/dask/dataframe/reshape.py:233, in pivot_table(df, index, columns, values, aggfunc)
226 raise ValueError(
227 "aggfunc must be either " + ", ".join(f"'{x}'" for x in available_aggfuncs)
228 )
230 # _emulate can't work for empty data
231 # the result must have CategoricalIndex columns
--> 233 columns_contents = pd.CategoricalIndex(df[columns].cat.categories, name=columns)
234 if is_scalar(values):
235 new_columns = columns_contents
File lib/python3.10/site-packages/pandas/core/indexes/category.py:234, in CategoricalIndex.__new__(cls, data, categories, ordered, dtype, copy, name)
231 if is_scalar(data):
232 raise cls._scalar_data_error(data)
--> 234 data = Categorical(
235 data, categories=categories, ordered=ordered, dtype=dtype, copy=copy
236 )
238 return cls._simple_new(data, name=name)
File lib/python3.10/site-packages/pandas/core/arrays/categorical.py:410, in Categorical.__init__(self, values, categories, ordered, dtype, fastpath, copy)
408 dtype = CategoricalDtype(values.categories, dtype.ordered)
409 elif not isinstance(values, (ABCIndex, ABCSeries, ExtensionArray)):
--> 410 values = com.convert_to_list_like(values)
411 if isinstance(values, list) and len(values) == 0:
412 # By convention, empty lists result in object dtype:
413 values = np.array([], dtype=object)
File lib/python3.10/site-packages/pandas/core/common.py:541, in convert_to_list_like(values)
539 return values
540 elif isinstance(values, abc.Iterable) and not isinstance(values, str):
--> 541 return list(values)
543 return [values]
File lib/python3.10/site-packages/cudf/utils/utils.py:242, in NotIterable.__iter__(self)
235 def __iter__(self):
236 """
237 Iteration is unsupported.
238
239 See :ref:`iteration <pandas-comparison/iteration>` for more
240 information.
241 """
--> 242 raise TypeError(
243 f"{self.__class__.__name__} object is not iterable. "
244 f"Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` "
245 f"if you wish to iterate over the values."
246 )
TypeError: StringIndex object is not iterable. Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` if you wish to iterate over the values.
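The last two frames show the mechanism: pandas checks `isinstance(values, abc.Iterable)` and then calls `list(values)`, and a class that defines `__iter__` only to raise still passes that check. Below is a minimal, GPU-free sketch of this. `FakeStringIndex` is a hypothetical stand-in rather than cuDF's class, and `convert_to_list_like` is a simplified copy of the branch shown in the traceback, not the full pandas function.

```python
from collections import abc


class FakeStringIndex:
    """Hypothetical stand-in for a GPU index that forbids iteration."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        # Mirrors the behaviour in the traceback: defined, but always raises.
        raise TypeError(f"{self.__class__.__name__} object is not iterable.")


def convert_to_list_like(values):
    # Simplified from the pandas frame in the traceback (assumption: other
    # branches of the real function are omitted here).
    if isinstance(values, abc.Iterable) and not isinstance(values, str):
        return list(values)
    return [values]


idx = FakeStringIndex(["one", "two"])

# The ABC check only looks for an __iter__ method, so it passes.
passes_check = isinstance(idx, abc.Iterable)

try:
    convert_to_list_like(idx)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

So the object looks iterable to pandas, and the error only surfaces once `list()` actually starts iterating.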
Expected behavior
`pivot_table` succeeds as documented.
Environment overview (please complete the following information)
Installed cuDF with pip from the stable release:
pip install \
--extra-index-url=https://pypi.nvidia.com \
cudf-cu12==23.12.* dask-cudf-cu12==23.12.* cuml-cu12==23.12.* \
cugraph-cu12==23.12.* cuspatial-cu12==23.12.* cuproj-cu12==23.12.* \
cuxfilter-cu12==23.12.* cucim-cu12==23.12.* pylibraft-cu12==23.12.* \
raft-dask-cu12==23.12.*
Environment details
<details><summary>Click here to see environment details</summary><pre>
**git***
fatal: your current branch 'master' does not have any commits yet
**git submodules***
***OS Information***
NAME="Red Hat Enterprise Linux"
VERSION="8.8 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"
Red Hat Enterprise Linux release 8.8 (Ootpa)
Red Hat Enterprise Linux release 8.8 (Ootpa)
Linux c1000a-s23.ufhpc 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Thu Aug 31 10:29:22 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
***GPU Information***
Tue Jan 30 11:09:21 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | 0 |
| N/A 25C P0 56W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:0F:00.0 Off | 0 |
| N/A 26C P0 57W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:47:00.0 Off | 0 |
| N/A 24C P0 54W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:4E:00.0 Off | 0 |
| N/A 24C P0 56W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:87:00.0 Off | 0 |
| N/A 29C P0 67W / 400W | 583MiB / 81920MiB | 40% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:90:00.0 Off | 0 |
| N/A 45C P0 177W / 400W | 775MiB / 81920MiB | 94% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:B7:00.0 Off | 0 |
| N/A 60C P0 338W / 400W | 76523MiB / 81920MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:BD:00.0 Off | 0 |
| N/A 28C P0 54W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 4 N/A N/A 2669759 C python3 570MiB |
| 5 N/A N/A 1903237 C pmemd.cuda_SPFP 762MiB |
| 6 N/A N/A 1446394 C python 76510MiB |
+---------------------------------------------------------------------------------------+
***CPU***
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7742 64-Core Processor
Stepping: 0
CPU MHz: 3386.055
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
BogoMIPS: 4491.84
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
NUMA node2 CPU(s): 32-47
NUMA node3 CPU(s): 48-63
NUMA node4 CPU(s): 64-79
NUMA node5 CPU(s): 80-95
NUMA node6 CPU(s): 96-111
NUMA node7 CPU(s): 112-127
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
***CMake***
/apps/jupyter/6.5.4/bin/cmake
./print_env.sh: /apps/jupyter/6.5.4/bin/cmake: /apps/jupyter/6.5.4/bin/python3.11: bad interpreter: No such file or directory
***g++***
/usr/bin/g++
g++ (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
***nvcc***
/apps/compilers/cuda/12.2.2/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
***Python***
/blue/ptighe-rapidsai/pvnick/rapids-test/rapids-test/bin/python
Python 3.10.12
***Environment Variables***
PATH : /apps/compilers/cuda/12.2.2/bin:/blue/ptighe-rapidsai/pvnick/rapids-test/rapids-test/bin:/opt/slurm/bin:/usr/local/cuda/bin:/opt/bin:/apps/jupyter/6.5.4/bin:/apps/ufrc/ufhpc/bin:/apps/git/2.30.1/bin:/home/pvnick/.local/bin:/home/pvnick/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin:/bin
LD_LIBRARY_PATH : /apps/compilers/cuda/12.2.2/lib64:/opt/slurm/lib64::
NUMBAPRO_NVVM :
NUMBAPRO_LIBDEVICE :
CONDA_PREFIX :
PYTHON_PATH :
conda not found
***pip packages***
/blue/ptighe-rapidsai/pvnick/rapids-test/rapids-test/bin/pip
Package Version
------------------------- ---------------
aiohttp 3.9.3
aiosignal 1.3.1
anyio 4.2.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
asttokens 2.4.1
async-lru 2.0.4
async-timeout 4.0.3
attrs 23.2.0
Babel 2.14.0
beautifulsoup4 4.12.3
bleach 6.1.0
bokeh 3.3.4
cachetools 5.3.2
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
click-plugins 1.1.1
cligj 0.7.2
cloudpickle 3.0.0
colorcet 3.0.1
comm 0.2.1
contourpy 1.2.0
cucim-cu12 23.12.1
cuda-python 12.3.0
cudf-cu12 23.12.1
cugraph-cu12 23.12.0
cuml-cu12 23.12.0
cuproj-cu12 23.12.1
cupy-cuda12x 13.0.0
cuspatial-cu12 23.12.1
cuxfilter-cu12 23.12.0
dask 2023.11.0
dask-cuda 23.12.0
dask-cudf-cu12 23.12.0
datashader 0.16.0
debugpy 1.8.0
decorator 5.1.1
defusedxml 0.7.1
distributed 2023.11.0
exceptiongroup 1.2.0
executing 2.0.1
fastjsonschema 2.19.1
fastrlock 0.8.2
fiona 1.9.5
fqdn 1.5.1
frozenlist 1.4.1
fsspec 2023.12.2
geopandas 0.14.2
holoviews 1.18.1
idna 3.6
imageio 2.33.1
importlib-metadata 7.0.1
ipykernel 6.29.0
ipython 8.20.0
ipywidgets 8.1.1
isoduration 20.11.0
jedi 0.19.1
Jinja2 3.1.3
joblib 1.3.2
json5 0.9.14
jsonpointer 2.4
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
jupyter 1.0.0
jupyter_client 8.6.0
jupyter-console 6.6.3
jupyter_core 5.7.1
jupyter-events 0.9.0
jupyter-lsp 2.2.2
jupyter_server 2.12.5
jupyter_server_proxy 4.1.0
jupyter_server_terminals 0.5.2
jupyterlab 4.0.11
jupyterlab_pygments 0.3.0
jupyterlab_server 2.25.2
jupyterlab-widgets 3.0.9
lazy_loader 0.3
linkify-it-py 2.0.2
llvmlite 0.40.1
locket 1.0.0
Markdown 3.5.2
markdown-it-py 3.0.0
MarkupSafe 2.1.4
matplotlib-inline 0.1.6
mdit-py-plugins 0.4.0
mdurl 0.1.2
mistune 3.0.2
msgpack 1.0.7
multidict 6.0.4
multipledispatch 1.0.0
nbclient 0.9.0
nbconvert 7.14.2
nbformat 5.9.2
nest-asyncio 1.6.0
networkx 3.2.1
notebook 7.0.7
notebook_shim 0.2.3
numba 0.57.1
numpy 1.24.4
nvtx 0.2.8
overrides 7.7.0
packaging 23.2
pandas 1.5.3
pandocfilters 1.5.1
panel 1.3.8
param 2.0.2
parso 0.8.3
partd 1.4.1
pexpect 4.9.0
pillow 10.2.0
pip 23.0.1
platformdirs 4.1.0
prometheus-client 0.19.0
prompt-toolkit 3.0.43
protobuf 4.25.2
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 14.0.2
pycparser 2.21
pyct 0.5.0
Pygments 2.17.2
pylibcugraph-cu12 23.12.0
pylibraft-cu12 23.12.0
pynvml 11.4.1
pyproj 3.6.1
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2023.4
pyviz_comms 3.0.1
PyWavelets 1.5.0
PyYAML 6.0.1
pyzmq 25.1.2
qtconsole 5.5.1
QtPy 2.4.1
raft-dask-cu12 23.12.0
rapids-dask-dependency 23.12.1
referencing 0.33.0
requests 2.31.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 13.7.0
rmm-cu12 23.12.0
rpds-py 0.17.1
scikit-image 0.21.0
scipy 1.12.0
Send2Trash 1.8.2
setuptools 65.5.0
shapely 2.0.2
simpervisor 1.0.0
six 1.16.0
sniffio 1.3.0
sortedcontainers 2.4.0
soupsieve 2.5
stack-data 0.6.3
tblib 3.0.0
terminado 0.18.0
tifffile 2024.1.30
tinycss2 1.2.1
tomli 2.0.1
toolz 0.12.1
tornado 6.4
tqdm 4.66.1
traitlets 5.14.1
treelite 3.9.1
treelite-runtime 3.9.1
types-python-dateutil 2.8.19.20240106
typing_extensions 4.9.0
uc-micro-py 1.0.2
ucx-py-cu12 0.35.0
uri-template 1.3.0
urllib3 2.1.0
wcwidth 0.2.13
webcolors 1.13
webencodings 0.5.1
websocket-client 1.7.0
widgetsnbextension 4.0.9
xarray 2024.1.1
xyzservices 2023.10.1
yarl 1.9.4
zict 3.0.0
zipp 3.17.0
[notice] A new release of pip is available: 23.0.1 -> 23.3.2
[notice] To update, run: pip install --upgrade pip
</pre></details>
Hi @pvnick, thanks for the report. We'll investigate and follow up on this issue.
Related: https://github.com/rapidsai/cudf/issues/15179
For context, the decision to disallow iteration over GPU objects is intentional -- it keeps users from accidentally triggering many host-device transfers (e.g. in a for loop) that are highly inefficient. This is problematic in some cases when column names are part of an object on the GPU that needs to be iterated over. The solution to this will likely require some code change in dask-cudf to convert the StringIndex into a type that is supported on the host.
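A rough sketch of the kind of conversion described above, assuming the fix is a small helper that moves the categories to the host before they are handed to pandas. `as_host` and `FakeDeviceIndex` are hypothetical names, not part of dask-cudf or cuDF. The only real API mirrored is the `.to_pandas()` method named in the error message, and the fake returns a plain list so the example runs without a GPU.

```python
class FakeDeviceIndex:
    """Hypothetical GPU-backed index; iteration is disabled on purpose."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        raise TypeError("FakeDeviceIndex object is not iterable.")

    def to_pandas(self):
        # Stands in for a single bulk device-to-host copy.
        return list(self._values)


def as_host(categories):
    """Return a host-side object, converting only if a to_pandas() hook exists."""
    to_pandas = getattr(categories, "to_pandas", None)
    return to_pandas() if callable(to_pandas) else categories


# Device-backed input is converted once; host input passes through untouched.
host_categories = as_host(FakeDeviceIndex(["one", "two"]))
unchanged = as_host(["one", "two"])
```

Duck-typing on `to_pandas` keeps the helper agnostic to the backend, so the same call site would work for both pandas and cuDF inputs.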
While it is inefficient to iterate row-wise over the dataframe, it's pretty difficult to adapt all of dask-dataframe to do something different based on cudf/pandas. Note we can't really do this in dask-cudf without monkey-patching and/or reimplementing dask.dataframe.pivot_table.
I'm not sure the iteration is that inefficient. If we implemented it as follows (for a StringIndex):
    def __iter__(self):
        return iter(self.to_pandas())
there would be only one device-to-host copy.
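To illustrate the copy-count argument, here is a GPU-free sketch that counts simulated device-to-host copies. `CountingIndex` and `element_to_host` are hypothetical and only model the cost. Nothing here is cuDF's implementation.

```python
class CountingIndex:
    """Hypothetical index that counts simulated device-to-host copies."""

    def __init__(self, values):
        self._device_values = list(values)
        self.copies = 0

    def to_pandas(self):
        # One bulk copy of the whole index (simulated).
        self.copies += 1
        return list(self._device_values)

    def element_to_host(self, i):
        # One copy per element: the pattern the iteration ban guards against.
        self.copies += 1
        return self._device_values[i]

    def __len__(self):
        return len(self._device_values)

    def __iter__(self):
        # The proposal from the comment above: copy once, iterate on the host.
        return iter(self.to_pandas())


bulk = CountingIndex(["a", "b", "c"])
bulk_values = list(bulk)
bulk_copies = bulk.copies

naive = CountingIndex(["a", "b", "c"])
naive_values = [naive.element_to_host(i) for i in range(len(naive))]
naive_copies = naive.copies
```

With the delegating `__iter__`, the copy count stays at one regardless of length, while element-wise access grows linearly with the number of entries.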
I am leaning towards the same view as Lawrence here. We've had these disabled code paths for a long time, and while I understand the rationale, I think at this point I'm OK with relaxing this behavior. Especially in light of cudf.pandas or dask integration, disabling a code path in a way that breaks those integrations seems less favorable than it may once have been.
I’m okay with that proposal. My comments above were primarily to establish historical context — I am alright with changing the behavior to solve compatibility issues.
This was marked as closed by https://github.com/rapidsai/cudf/pull/16786, but reading through the comments here I think the consensus was to make cudf.Index iterable:
In [10]: list(cudf.Index(['a', 'b']))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[10], line 1
----> 1 list(cudf.Index(['a', 'b']))
File ~/cudf/python/cudf/cudf/utils/utils.py:245, in NotIterable.__iter__(self)
238 def __iter__(self):
239 """
240 Iteration is unsupported.
241
242 See :ref:`iteration <pandas-comparison/iteration>` for more
243 information.
244 """
--> 245 raise TypeError(
246 f"{self.__class__.__name__} object is not iterable. "
247 f"Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` "
248 f"if you wish to iterate over the values."
249 )
TypeError: Index object is not iterable. Consider using `.to_arrow()`, `.to_pandas()` or `.values_host` if you wish to iterate over the values.
#16786 was fixing a related, but not identical issue.
I'll reopen this, but feel free to close it if it was intended to be closed.