DataFrame.pivot_table not supported in cuDF

Open nurmanmus opened this issue 1 year ago • 5 comments

Missing Pandas Feature Request: DataFrame.pivot_table is not supported in cuDF.

Profiler Output:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Function                  ┃ GPU ncalls ┃ GPU cumtime ┃ GPU percall ┃ CPU ncalls ┃ CPU cumtime ┃ CPU percall ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ DataFrame.pivot_table     │ 0          │ 0.000       │ 0.000       │ 1          │ 0.076       │ 0.076       │
│ DataFrame.reset_index     │ 1          │ 0.003       │ 0.003       │ 0          │ 0.000       │ 0.000       │
│ merge                     │ 1          │ 1.164       │ 1.164       │ 0          │ 0.000       │ 0.000       │
│ DataFrame.drop_duplicates │ 1          │ 0.170       │ 0.170       │ 0          │ 0.000       │ 0.000       │
│ DataFrame                 │ 1          │ 0.000       │ 0.000       │ 0          │ 0.000       │ 0.000       │
│ DataFrame.__repr__        │ 1          │ 0.539       │ 0.539       │ 0          │ 0.000       │ 0.000       │
└───────────────────────────┴────────────┴─────────────┴─────────────┴────────────┴─────────────┴─────────────┘

Not all pandas operations ran on the GPU. The following functions required CPU fallback:

  • DataFrame.pivot_table

nurmanmus avatar Mar 09 '24 07:03 nurmanmus

@nurmanmus Thanks for this issue. Could you share the function parameters you passed? The pivot table function exists but does not support the full range of arguments on the GPU. See if you are using any of the options marked as unsupported in the cudf docs, and that will help us narrow down what features to prioritize: https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.dataframe.pivot_table/#cudf.DataFrame.pivot_table

bdice avatar Mar 09 '24 12:03 bdice

Sure, here are the code and parameters:

# Create a pivot table with 'tag' as columns and 'name', 'fp', 'ddate', 'qtrs' as index
pivot_df = filtered_df.pivot_table(index=['name', 'fp', 'ddate', 'qtrs'], columns='tag', values='value', aggfunc='first').reset_index()
pivot_df

nurmanmus avatar Mar 12 '24 22:03 nurmanmus

Thanks. Unfortunately, without a sample dataset I can't yet reproduce this pivot table failure. Could you share a more complete example that fails? That is, one that loads a dataset to produce filtered_df as a cudf DataFrame and then runs the pivot-table command, which should raise an error (that error is why you're seeing the fallback).

This is what I did, which does work:

import cudf
# make a trivial dataframe
df = cudf.DataFrame({"a": ["a", "a", "b", "c"], "b": [1, 1, 2, 3], "c": [1, 2, 3, 4], "d": [1, 4, 7, 8]})
pivoted = df.pivot_table(index=["a", "b"], columns="c", values="d", aggfunc="first")

My guess is that the value column in your dataframe is a list or struct column, for which the "first" aggregation is unsupported.
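
For anyone hitting the same fallback, a self-contained reproducer along these lines may help narrow it down. This is only a sketch: the column names come from the pivot_table call earlier in the thread, but the rows are invented for illustration, and the nested-value check is a pandas-side proxy for a list or struct column rather than a cuDF API. It uses plain pandas so it runs anywhere.

```python
import pandas as pd

# Hypothetical stand-in for filtered_df: the column names come from the
# thread, the values are made up purely for illustration.
filtered_df = pd.DataFrame(
    {
        "name": ["ACME", "ACME", "ACME", "GLOBEX"],
        "fp": ["FY", "FY", "FY", "Q1"],
        "ddate": [20231231, 20231231, 20231231, 20240331],
        "qtrs": [4, 4, 4, 1],
        "tag": ["Revenues", "Assets", "Revenues", "Revenues"],
        "value": [100.0, 250.0, 101.0, 40.0],
    }
)

# Inspect the dtypes first, since the guess is that "value" holds nested data.
print(filtered_df.dtypes)

# Proxy check (an assumption, not a cuDF call): do any cells hold a list or dict?
has_nested_values = bool(
    filtered_df["value"].map(lambda v: isinstance(v, (list, dict))).any()
)
print("nested values in 'value':", has_nested_values)

# Same call as in the report; "first" keeps the first value per group.
pivot_df = filtered_df.pivot_table(
    index=["name", "fp", "ddate", "qtrs"],
    columns="tag",
    values="value",
    aggfunc="first",
).reset_index()
print(pivot_df)
```

Saving this as a script (the name repro.py is hypothetical) and running it with `python -m cudf.pandas repro.py` should exercise the same call through the pandas accelerator. Swapping the invented rows for a small slice of the real data is what would show whether the dtype of the value column is what triggers the CPU fallback.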

wence- avatar Mar 13 '24 10:03 wence-

@nurmanmus any thoughts on the above question?

vyasr avatar May 17 '24 20:05 vyasr

Noted, I will get back to you on this.

nurmanmus avatar May 18 '24 10:05 nurmanmus

I'm going to close this thread for now, but please reopen if any further information surfaces. Thanks!

vyasr avatar Jan 30 '25 23:01 vyasr