
[QST] NVTabular function is not supported for this dtype: size

Open LoMarrujo opened this issue 1 year ago • 9 comments

I tried running NVTabular code related to this and this, but I could not get past the line of code with the Workflow.

The error is:

File "groupby.pyx", line 192, in cudf._lib.groupby.GroupBy.aggregate TypeError: function is not supported for this dtype: size

which occurs after calling Categorify.

Is there something I need to check in order to get NVTabular working? Is there any additional information I can provide to help resolve this issue?

Thanks!

LoMarrujo avatar Jul 08 '24 23:07 LoMarrujo

I have been struggling with the exact same issue for the last few days. Here is example code that I wrote while trying to debug:

import pandas as pd
import nvtabular as nvt
from nvtabular import ops
import cudf

# Sample Data
data = {
    'user_id': [16908, 16908, 16908, 16908, 16908],
    'item_id': [174, 78, 94, 174, 78],
    'timestamp': [
        '2024-01-03 14:49:27',
        '2024-01-03 15:33:31',
        '2024-01-03 16:01:57',
        '2024-01-04 18:57:33',
        '2024-01-04 18:59:41'
    ],
    'event_type': [
        'example1',
        'example2',
        'example3',
        'example4',
        'example5'
    ]
}
df = pd.DataFrame(data)
df['user_id'] = df['user_id'].astype('int64')
df['item_id'] = df['item_id'].astype('int64')
df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')
print(df.head())
print(df.dtypes)

cdf = cudf.DataFrame.from_pandas(df)

cat_features = ['item_id'] >> ops.Categorify()

cat_workflow = nvt.Workflow(cat_features)
cat_dataset = nvt.Dataset(cdf)

try:
    cat_transformed = cat_workflow.fit_transform(cat_dataset).to_ddf().compute()
    print("After Categorify:")
    print(cat_transformed.head())
except Exception as e:
    print(f"Error during Categorify: {e}")

print("Unique values in item_id:")
print(cdf['item_id'].unique())

Output:

Failed to fit operator <nvtabular.ops.categorify.Categorify object at 0x7fa86ddac1f0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/merlin/dag/executors.py", line 532, in fit_phase
    stats.append(node.op.fit(node.input_columns, transformed_ddf))
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 400, in fit
    dsk, key = _category_stats(ddf, self._create_fit_options_from_columns(columns))
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1551, in _category_stats
    return _groupby_to_disk(ddf, _write_uniques, options)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1406, in _groupby_to_disk
    _grouped_meta = _top_level_groupby(ddf._meta, options=options)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nvtabular/ops/categorify.py", line 1017, in _top_level_groupby
    gb = df_gb.groupby(cat_col_selector.names, dropna=False).agg(agg_dict)
  File "/opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/cudf/core/groupby/groupby.py", line 631, in agg
    ) = self._groupby.aggregate(columns, normalized_aggs)
  File "groupby.pyx", line 192, in cudf._lib.groupby.GroupBy.aggregate
TypeError: function is not supported for this dtype: size


  user_id  item_id           timestamp  \
0    16908      174 2024-01-03 14:49:27   
1    16908       78 2024-01-03 15:33:31   
2    16908       94 2024-01-03 16:01:57   
3    16908      174 2024-01-04 18:57:33   
4    16908       78 2024-01-04 18:59:41   


                                          event_type  
0  example1
1  example2
2  example3
3  example4
4  example5
user_id                int64
item_id                int64
timestamp     datetime64[ns]
event_type            object
dtype: object
Error during Categorify: function is not supported for this dtype: size
Unique values in item_id:
0    174
1     78
2     94
Name: item_id, dtype: int64

ohorban avatar Jul 09 '24 17:07 ohorban

same issue

Chevolier avatar Aug 15 '24 05:08 Chevolier

@ohorban can you please try your pipeline without this line (i.e., remove it): df['timestamp'] = pd.to_datetime(df['timestamp']).astype('datetime64[s]')

In our examples, we feed a df with an integer-dtype timestamp column to the NVT pipelines, like here.
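
For example, a minimal sketch of what that could look like, reusing the column names from your snippet (this is illustrative, not copied verbatim from our examples):

import pandas as pd
import nvtabular as nvt
from nvtabular import ops

df = pd.DataFrame({
    'user_id': [16908, 16908, 16908],
    'item_id': [174, 78, 94],
    'timestamp': ['2024-01-03 14:49:27', '2024-01-03 15:33:31', '2024-01-03 16:01:57'],
})

# Keep the timestamp as integer epoch seconds instead of datetime64[s]
df['timestamp'] = pd.to_datetime(df['timestamp']).astype('int64') // 10**9

cat_features = ['item_id'] >> ops.Categorify()
workflow = nvt.Workflow(cat_features)
transformed = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
print(transformed.head())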

rnyak avatar Aug 15 '24 12:08 rnyak

Error during Categorify: function is not supported for this dtype: size

anuragreddygv323 avatar Sep 23 '24 22:09 anuragreddygv323

@anuragreddygv323 can you please provide more details?

  • What image are you using?
  • Is this error coming from one of our examples or from your custom code?
  • What are the dtypes of your data?

Also, we need a minimal reproducible example to reproduce your error. Thanks.

rnyak avatar Sep 24 '24 14:09 rnyak

CUDA 12.1, Python 3.11

I installed cudf with:

pip install
--extra-index-url=https://pypi.nvidia.com
cudf-cu12==24.8.* dask-cudf-cu12==24.8.* cuml-cu12==24.8.*
cugraph-cu12==24.8.* cuspatial-cu12==24.8.* cuproj-cu12==24.8.*
cuxfilter-cu12==24.8.* cucim-cu12==24.8.* pylibraft-cu12==24.8.*
raft-dask-cu12==24.8.* cuvs-cu12==24.8.* nx-cugraph-cu12==24.8.*

I'm trying to run the Transformers4Rec tutorial, and when I try to Categorify it throws the above error.

I ran the example from the documentation and it gives me the same error:

import cudf
import nvtabular as nvt

# Create toy dataset
df = cudf.DataFrame({
    'author': ['User_A', 'User_B', 'User_C', 'User_C', 'User_A', 'User_B', 'User_A'],
    'productID': [100, 101, 102, 101, 102, 103, 103],
    'label': [0, 0, 1, 1, 1, 0, 0]
})
dataset = nvt.Dataset(df)

# Define pipeline
CATEGORICAL_COLUMNS = ['author', 'productID']
cat_features = CATEGORICAL_COLUMNS >> nvt.ops.Categorify(
    freq_threshold={"author": 3, "productID": 2},
    num_buckets={"author": 10, "productID": 20})

# Initialize the workflow and execute it
proc = nvt.Workflow(cat_features)
proc.fit(dataset)
ddf = proc.transform(dataset).to_ddf()

# Print results
print(ddf.compute())

anuragreddygv323 avatar Sep 24 '24 14:09 anuragreddygv323

@anuragreddygv323 we don't support cudf 24.8 (yet). You can use one of our Docker images:

this one: nvcr.io/nvidia/merlin/merlin-tensorflow:23.08 or this one: nvcr.io/nvidia/merlin/merlin-tensorflow:nightly
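
For reference, a minimal sketch of how you might start one of these containers (the port and volume flags are only an illustration; adjust them to your setup):

docker run --gpus all --rm -it \
  -p 8888:8888 \
  -v $(pwd):/workspace/host \
  nvcr.io/nvidia/merlin/merlin-tensorflow:23.08 /bin/bash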

rnyak avatar Sep 24 '24 19:09 rnyak

pip install
--extra-index-url=https://pypi.nvidia.com
cudf-cu11==23.08

is throwing an error, @rnyak

anuragreddygv323 avatar Sep 24 '24 20:09 anuragreddygv323

Installing cudf is not enough; you need dask-cudf as well. The cudf and dask-cudf versions in the 23.08 image are as follows:

cudf 23.4.0
dask 2023.1.1
dask-cuda 23.4.0
dask-cudf 23.4.0
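
If you still want to try pip instead of Docker, a sketch of pinning matching cudf and dask-cudf versions could look like the following (the exact package names and version spellings on the NVIDIA index may differ, so treat this as illustrative rather than a supported install path):

pip install --extra-index-url=https://pypi.nvidia.com \
    cudf-cu11==23.4.* dask-cudf-cu11==23.4.*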

That said, I recommend using the Docker images.

Please refer to this page to install cudf: https://docs.rapids.ai/install/#pip

Your driver version should be compatible with the CUDA version, and therefore with the cudf version.

You can ask cudf-related questions (like installation issues) in the rapids/cudf GitHub repo.

rnyak avatar Sep 24 '24 22:09 rnyak