
[QST] TypeError: unhashable type: 'numpy.ndarray'

Open dking21st opened this issue 2 years ago • 4 comments

What is your question?

I'm trying to use Merlin to build a two-tower NN model. However, when I use an NVTabular workflow to fit my dataset, it raises an error.

user_features = (
    ["user_history_1", "user_history_2", "user_gender", "user_age",
     "platform", "object_section", "hour"]
    >> HashBucket({
        "user_history_1": 500000,
        "user_history_2": 100000,
        "user_gender": 3,
        "user_age": 10,
        "platform": 3,
        "object_section": 6,
        "hour": 24,
    })
    >> TagAsUserFeatures()
)

outputs = user_id + item_id + item_hash_features + item_dense_features + user_features
workflow = nvt.Workflow(outputs)
train_dataset = nvt.Dataset(train_data)
workflow.fit(train_dataset)

Calling the fit method raises:

TypeError: unhashable type: 'numpy.ndarray'

TypeError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 workflow.fit(train_dataset)

Only two of the features, user_history_1 and user_history_2, are numpy arrays; each contains the IDs of the items the user visited.

e.g. [1705022, 1806090, 1801039, 1005001]

When I excluded user_history_1 and user_history_2 from the input features, the fit method succeeded, so I suspect these two features are the cause of the error.

Since the message says numpy.ndarray is unhashable, I converted the arrays to lists. However, I still see the same error message.
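For what it's worth, the failure reproduces at the pandas level without any NVTabular code: pandas' object hashing tries to factorize the column, and neither ndarray nor list elements are hashable. A minimal sketch (the data here is made up, standing in for a user_history column):

```python
import numpy as np
import pandas as pd

# A Series whose elements are arrays of item IDs, like user_history_1.
ser = pd.Series([np.array([1705022, 1806090]), np.array([1801039, 1005001])])

try:
    pd.util.hash_pandas_object(ser)
except TypeError as e:
    print(e)  # unhashable type: 'numpy.ndarray'

# Converting the arrays to lists does not help: lists are unhashable too.
try:
    pd.util.hash_pandas_object(ser.apply(list))
except TypeError as e:
    print(e)  # unhashable type: 'list'
```

This suggests the list-to-ndarray conversion cannot fix it, because pandas cannot hash any mutable container element.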

Does anyone have a suggestion for debugging?

dking21st avatar Nov 21 '23 07:11 dking21st

I observe a similar problem in Categorify with the hashing of infrequent items. Here is the minimal example:

import nvtabular as nvt
import pandas as pd

df = pd.DataFrame({"items": [[1, 2, 3], [1, 2], [1, 2, 4, 4]]})
dataset = nvt.Dataset(df)

feats = [
    "items",
] >> nvt.ops.Categorify(
    freq_threshold=2,
    num_buckets=10,
)

workflow = nvt.Workflow(feats)
processed_ds = workflow.fit_transform(dataset)

Error:

File .../lib/python3.10/site-packages/nvtabular/ops/categorify.py:510, in Categorify.transform(self, col_selector, df)
    508 path = self.categories[storage_name]
--> 510 encoded = _encode(
    511     use_name,
    512     storage_name,
    513     path,
    514     df,
    515     self.cat_cache,
    516     freq_threshold=self.freq_threshold[name]
    517     if isinstance(self.freq_threshold, dict)
    518     else self.freq_threshold,
    519     search_sorted=self.search_sorted,
    520     buckets=self.num_buckets,
    521     encode_type=self.encode_type,
    522     cat_names=column_names,
    523     max_size=self.max_size,
    524     dtype=self.output_dtype,
    525     split_out=(
    526         self.split_out.get(storage_name, 1)
    527         if isinstance(self.split_out, dict)
    528         else self.split_out
    529     ),
    530     single_table=self.single_table,
    531 )
    532 new_df[name] = encoded

File .../lib/python3.10/site-packages/nvtabular/ops/categorify.py:1717, in _encode(name, storage_name, path, df, cat_cache, freq_threshold, search_sorted, buckets, encode_type, cat_names, max_size, dtype, split_out, single_table)
   1714 if buckets and storage_name in buckets:
   1715     # apply hashing for "infrequent" categories
   1716     indistinct = (
-> 1717         _hash_bucket(df, buckets, selection_l.names, encode_type=encode_type)
   1718         + bucket_encoding_offset
   1719     )
   1721     if use_collection:
   1722         # Manual broadcast merge

File .../lib/python3.10/site-packages/nvtabular/ops/categorify.py:1844, in _hash_bucket(df, num_buckets, col, encode_type)
   1843     nb = num_buckets[col[0]]
-> 1844     encoded = dispatch.hash_series(df[col[0]]) % nb
   1845 elif encode_type == "combo":

File .../lib/python3.10/site-packages/merlin/core/dispatch.py:294, in hash_series(ser)
    288 if isinstance(ser, pd.Series):
    289     # Using pandas hashing, which does not produce the
    290     # same result as cudf.Series.hash_values().  Do not
    291     # expect hash-based data transformations to be the
    292     # same on CPU and GPU.  TODO: Fix this (maybe use
    293     # murmurhash3 manually on CPU).
--> 294     return hash_object_dispatch(ser).values
    295 elif cudf and isinstance(ser, cudf.Series):

File .../lib/python3.10/site-packages/dask/utils.py:642, in Dispatch.__call__(self, arg, *args, **kwargs)
    641 meth = self.dispatch(type(arg))
--> 642 return meth(arg, *args, **kwargs)

File .../lib/python3.10/site-packages/dask/dataframe/backends.py:502, in hash_object_pandas(obj, index, encoding, hash_key, categorize)
    498 @hash_object_dispatch.register((pd.DataFrame, pd.Series, pd.Index))
    499 def hash_object_pandas(
    500     obj, index=True, encoding="utf8", hash_key=None, categorize=True
    501 ):
--> 502     return pd.util.hash_pandas_object(
    503         obj, index=index, encoding=encoding, hash_key=hash_key, categorize=categorize
    504     )

File .../lib/python3.10/site-packages/pandas/core/util/hashing.py:126, in hash_pandas_object(obj, index, encoding, hash_key, categorize)
    125 elif isinstance(obj, ABCSeries):
--> 126     h = hash_array(obj._values, encoding, hash_key, categorize).astype(
    127         "uint64", copy=False
    128     )
    129     if index:

File .../lib/python3.10/site-packages/pandas/core/util/hashing.py:308, in hash_array(vals, encoding, hash_key, categorize)
    303     raise TypeError(
    304         "hash_array requires np.ndarray or ExtensionArray, not "
    305         f"{type(vals).__name__}. Use hash_pandas_object instead."
    306     )
--> 308 return _hash_ndarray(vals, encoding, hash_key, categorize)

File .../lib/python3.10/site-packages/pandas/core/util/hashing.py:346, in _hash_ndarray(vals, encoding, hash_key, categorize)
    340 from pandas import (
    341     Categorical,
    342     Index,
    343     factorize,
    344 )
--> 346 codes, categories = factorize(vals, sort=False)
    347 cat = Categorical(
    348     codes, Index._with_infer(categories), ordered=False, fastpath=True
    349 )

File .../lib/python3.10/site-packages/pandas/core/algorithms.py:822, in factorize(values, sort, na_sentinel, use_na_sentinel, size_hint)
    820             values = np.where(null_mask, na_value, values)
--> 822     codes, uniques = factorize_array(
    823         values,
    824         na_sentinel=na_sentinel_arg,
    825         size_hint=size_hint,
    826     )
    828 if sort and len(uniques) > 0:

File .../lib/python3.10/site-packages/pandas/core/algorithms.py:578, in factorize_array(values, na_sentinel, size_hint, na_value, mask)
    577 table = hash_klass(size_hint or len(values))
--> 578 uniques, codes = table.factorize(
    579     values,
    580     na_sentinel=na_sentinel,
    581     na_value=na_value,
    582     mask=mask,
    583     ignore_na=ignore_na,
    584 )
    586 # re-cast e.g. i8->dt64/td64, uint8->bool

File pandas/_libs/hashtable_class_helper.pxi:5943, in pandas._libs.hashtable.PyObjectHashTable.factorize()

File pandas/_libs/hashtable_class_helper.pxi:5857, in pandas._libs.hashtable.PyObjectHashTable._unique()

TypeError: unhashable type: 'list'
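The traceback bottoms out in pandas trying to hash whole list objects. A pandas-level workaround sketch that illustrates the root cause: exploding the list column first makes pandas hash scalar ints, which works. This is only an illustration of what a CPU code path could do, not NVTabular API; `num_buckets` below is an illustrative value:

```python
import pandas as pd

df = pd.DataFrame({"items": [[1, 2, 3], [1, 2], [1, 2, 4, 4]]})

# Explode the list column so pandas hashes scalar ints instead of lists,
# then regroup the per-element bucket ids by the original row index.
exploded = df["items"].explode()
num_buckets = 10  # illustrative bucket count
bucket_ids = pd.util.hash_pandas_object(exploded, index=False) % num_buckets
per_row = bucket_ids.groupby(level=0).agg(list)
print(per_row)  # one list of bucket ids per original row
```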

Do you think it should be a bug report?

EDIT:

  • NVT version: 23.08.00
  • from the docker: nvcr.io/nvidia/merlin/merlin-pytorch:23.08

piojanu avatar Nov 22 '23 13:11 piojanu

Solution that worked for me: although a GPU was present, my notebook was running on CPU, which forced NVTabular into CPU mode. After installing RAPIDS (https://docs.rapids.ai/install#pip) and restarting the kernel, it started picking up the GPU automatically and the issue is resolved.

dking21st avatar Nov 23 '23 01:11 dking21st

I can confirm my code only errors out on the CPU too. On the GPU it works fine. Still, this is a bug.

piojanu avatar Nov 25 '23 14:11 piojanu

OK, this error is occurring again, even after installing RAPIDS... can someone help?

I ran the following code to check whether a GPU exists:

device_name = tf.test.gpu_device_name()
if len(device_name) > 0:
    print("Found GPU at: {}".format(device_name))
else:
    device_name = "/device:CPU:0"
    print("No GPU, using {}.".format(device_name))

and it returns

Found GPU at: /device:GPU:0

so a GPU is present, but the NVTabular dataset keeps falling back to CPU.
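Note that `tf.test.gpu_device_name()` only tells you what TensorFlow sees. As the traceback above shows, NVTabular/Merlin dispatch to cudf for GPU execution, so a quicker check of which backend NVTabular will use is whether cudf imports in the same environment (a sketch, assuming the merlin/RAPIDS stack):

```python
# TensorFlow finding a GPU device does not mean cudf is importable in the
# same environment; NVTabular falls back to pandas when cudf is missing.
try:
    import cudf  # noqa: F401
    print("cudf importable -> NVTabular can run in GPU mode")
except ImportError as e:
    print(f"cudf not importable -> NVTabular falls back to CPU (pandas) mode: {e}")
```

If cudf fails to import here, the pip-installed RAPIDS likely does not match your CUDA driver or Python version, which would explain NVTabular silently staying in CPU mode.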

dking21st avatar Dec 13 '23 01:12 dking21st