
[QST] NVTabular function is not supported for this dtype: size

Open LoMarrujo opened this issue 1 year ago • 9 comments

I tried running NVTabular code related to this and this, but I could not get past the line of code with the Workflow.

The error is:

File "groupby.pyx", line 192, in cudf._lib.groupby.GroupBy.aggregate TypeError: function is not supported for this dtype: size

which occurs after calling Categorify.

Is there something I need to check in order to get NVTabular working? Is there any additional information I can provide to help resolve this issue?

Thanks!

LoMarrujo avatar Jul 08 '24 23:07 LoMarrujo

I have been struggling with the exact same issue for the last few days. Here is example code that I wrote while trying to debug:

import pandas as pd
import nvtabular as nvt
from nvtabular import ops
import cudf

# Sample Data
data = {
    'user_id': [16908, 16908, 16908, 16908, 16908],
    'item_id': [174, 78, 94, 174, 78],
    'timestamp': [
        '2024-01-03 14:49:27',
        '2024-01-03 15:33:31',
        '2024-01-03 16:01:57',
        '2024-01-04 18:57:33',
        '2024-01-04 18:59:41'
    ],
    'event_type': [
        'example1',
        'example2',
        'example3',
        'example4',
        'example5'
    ]
}
df = pd.DataFrame(data)
df['user_id'] = df['user_id'].astype('int64')
df['item_id'] = df['item_id'].astype('int64')
df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')
print(df.head())
print(df.dtypes)

cdf = cudf.DataFrame.from_pandas(df)

cat_features = ['item_id'] >> ops.Categorify()

cat_workflow = nvt.Workflow(cat_features)
cat_dataset = nvt.Dataset(cdf)

try:
    cat_transformed = cat_workflow.fit_transform(cat_dataset).to_ddf().compute()
    print("After Categorify:")
    print(cat_transformed.head())
except Exception as e:
    print(f"Error during Categorify: {e}")

print("Unique values in item_id:")
print(cdf['item_id'].unique())

Output:

Failed to fit operator <nvtabular.ops.categorify.Categorify object at 0x7fa86ddac1f0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py", line 532, in fit_phase
    stats.append(node.op.fit(node.input_columns, transformed_ddf))
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 400, in fit
    dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1551, in _category_stats
    return _groupby_to_disk(ddf, _write_uniques, options)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1406, in _groupby_to_disk
    _grouped_meta = _top_level_groupby(ddf._meta, options=options)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1017, in _top_level_groupby
    gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py", line 631, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "groupby.pyx", line 192, in cudf._lib.groupby.GroupBy.aggregate
TypeError: function is not supported for this dtype: size


  user_id  item_id           timestamp  \
0    16908      174 2024-01-03 14:49:27   
1    16908       78 2024-01-03 15:33:31   
2    16908       94 2024-01-03 16:01:57   
3    16908      174 2024-01-04 18:57:33   
4    16908       78 2024-01-04 18:59:41   


                                          event_type  
0  example1
1  example2
2  example3
3  example4
4  example5
user_id                int64
item_id                int64
timestamp     datetime64[ns]
event_type            object
dtype: object
Error during Categorify: function is not supported for this dtype: size
Unique values in item_id:
0    174
1     78
2     94
Name: item_id, dtype: int64

ohorban avatar Jul 09 '24 17:07 ohorban

same issue

Chevolier avatar Aug 15 '24 05:08 Chevolier

@ohorban can you please try your pipeline without this line (i.e., remove it): df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')

In our examples, we feed a df with an integer-dtype timestamp column to the NVT pipelines, like here.
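
For example, a minimal sketch of what that could look like, reusing the column names from your snippet (this is illustrative, not copied verbatim from our examples):

import pandas as pd
import nvtabular as nvt
from nvtabular import ops

df = pd.DataFrame({
    'user_id': [16908, 16908, 16908],
    'item_id': [174, 78, 94],
    'timestamp': ['2024-01-03 14:49:27', '2024-01-03 15:33:31', '2024-01-03 16:01:57'],
})

# Keep the timestamp as integer epoch seconds instead of datetime64[s]
df['timestamp'] = pd.to_datetime(df['timestamp']).astype('int64') // 10**9

cat_features = ['item_id'] >> ops.Categorify()
workflow = nvt.Workflow(cat_features)
transformed = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
print(transformed.head())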

rnyak avatar Aug 15 '24 12:08 rnyak

Error during Categorify: function is not supported for this dtype: size

anuragreddygv323 avatar Sep 23 '24 22:09 anuragreddygv323

@anuragreddygv323 can you please provide more details?

  • What image are you using?
  • Is this error coming from one of our examples or from your custom code?
  • What are the dtypes of your data?

Also, we need a minimal reproducible example to reproduce your error. Thanks.

rnyak avatar Sep 24 '24 14:09 rnyak

CUDA 12.1, Python 3.11

I installed cudf with:

pip install
--extra-index-url=https://pypi.nvidia.com
cudf-cu12==24.8.* dask-cudf-cu12==24.8.* cuml-cu12==24.8.*
cugraph-cu12==24.8.* cuspatial-cu12==24.8.* cuproj-cu12==24.8.*
cuxfilter-cu12==24.8.* cucim-cu12==24.8.* pylibraft-cu12==24.8.*
raft-dask-cu12==24.8.* cuvs-cu12==24.8.* nx-cugraph-cu12==24.8.*

I'm trying to run the Transformers4Rec tutorial, and when I try to Categorify it throws the above error.

I ran the example from the documentation and it gives me the same error:

import cudf
import nvtabular as nvt

# Create toy dataset
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103],
    'label': [0, 0, 1, 1, 1, 0, 0]
})
dataset = nvt.Dataset(df)

# Define pipeline
CATEGORICAL_COLUMNS = ['author', 'productID']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(
    freq_threshold={"author": 3, "productID": 2},
    num_buckets={"author": 10, "productID": 20})

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()

# Print results
print(ddf.compute())

anuragreddygv323 avatar Sep 24 '24 14:09 anuragreddygv323

@anuragreddygv323 we don't support cudf 24.8 (yet). You can use one of our Docker images:

this one: nvcr.io/nvidia/merlin/merlin-tensorflow:23.08 or this one: nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
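
For reference, a minimal sketch of how you might start one of these containers (the port and volume flags are only an illustration; adjust them to your setup):

docker run --gpus all --rm -it \
  -p 8888:8888 \
  -v $(pwd):/workspace/host \
  nvcr.io/nvidia/merlin/merlin-tensorflow:23.08 /bin/bash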

rnyak avatar Sep 24 '24 19:09 rnyak

pip install
--extra-index-url=https://pypi.nvidia.com
cudf-cu11==23.08

is throwing an error, @rnyak

anuragreddygv323 avatar Sep 24 '24 20:09 anuragreddygv323

Installing cudf is not enough; you need dask-cudf as well. The cudf and dask-cudf versions in the 23.08 image are as follows:

cudf 23.4.0
dask 2023.1.1
dask-cuda 23.4.0
dask-cudf 23.4.0
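
If you still want to try pip instead of Docker, a sketch of pinning matching cudf and dask-cudf versions could look like the following (the exact package names and version spellings on the NVIDIA index may differ, so treat this as illustrative rather than a supported install path):

pip install --extra-index-url=https://pypi.nvidia.com \
    cudf-cu11==23.4.* dask-cudf-cu11==23.4.*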

That said, I recommend using the Docker images.

Please refer to this page to install cudf: https://docs.rapids.ai/install/#pip

Your driver version should be compatible with the CUDA version, and therefore with the cudf version.

You can ask cudf-related questions (like installation issues) in the rapids/cudf GitHub repo.

rnyak avatar Sep 24 '24 22:09 rnyak