[QST] TypeError: unhashable type: 'numpy.ndarray'
What is your question?
I'm trying to use Merlin to build a two-tower NN model. However, when I try to fit my dataset with an NVTabular workflow, it raises an error.
user_features = (
    ["user_history_1", "user_history_2", "user_gender", "user_age",
     "platform", "object_section", "hour"]
    >> HashBucket({
        "user_history_1": 500000,
        "user_history_2": 100000,
        "user_gender": 3,
        "user_age": 10,
        "platform": 3,
        "object_section": 6,
        "hour": 24,
    })
    >> TagAsUserFeatures()
)
outputs = user_id + item_id + item_hash_features + item_dense_features + user_features
workflow = nvt.Workflow(outputs)
train_dataset = nvt.Dataset(train_data)
workflow.fit(train_dataset)
Calling the fit method returns this error:
TypeError: unhashable type: 'numpy.ndarray'
TypeError Traceback (most recent call last) Cell In[18], line 1 ----> 1 workflow.fit(train_dataset)
Only two features, user_history_1 and user_history_2, are numpy arrays: each contains the itemIds the user visited.
e.g. [1705022, 1806090, 1801039, 1005001]
When I excluded user_history_1 and user_history_2 from the input features, the fit method succeeded, so I suspect these two features are the cause of the error.
Since the message says numpy.ndarray is unhashable, I converted them to lists, but I still see the same error.
Does anyone have a suggestion for debugging?
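For what it's worth, the failure can be reproduced with pandas alone, without NVTabular: on the CPU path, hashing an object Series goes through factorize(), which requires hashable elements, so a Series of lists (or arrays) fails the same way. A minimal sketch:

```python
import pandas as pd

# A Series whose values are Python lists (like the user_history columns).
ser = pd.Series([[1705022, 1806090], [1801039, 1005001]])

# pandas' hashing categorizes object values via factorize(), which needs
# hashable elements -- lists and numpy arrays are not hashable.
try:
    pd.util.hash_pandas_object(ser)
except TypeError as e:
    print(e)  # unhashable type: 'list'
```

This is why converting the arrays to lists does not help: both are unhashable as far as pandas' hash table is concerned.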
I observe a similar problem in Categorify with the hashing of infrequent items. Here is the minimal example:
import nvtabular as nvt
import pandas as pd
df = pd.DataFrame({"items": [[1, 2, 3], [1, 2], [1, 2, 4, 4]]})
dataset = nvt.Dataset(df)
feats = [
"items",
] >> nvt.ops.Categorify(
freq_threshold=2,
num_buckets=10,
)
workflow = nvt.Workflow(feats)
processed_ds = workflow.fit_transform(dataset)
Error:
File .../lib/python3.10/site-packages/nvtabular/ops/categorify.py:510, in Categorify.transform(self, col_selector, df)
508 path = self.categories[storage_name]
--> 510 encoded = _encode(
511 use_name,
512 storage_name,
513 path,
514 df,
515 self.cat_cache,
516 freq_threshold=self.freq_threshold[name]
517 if isinstance(self.freq_threshold, dict)
518 else self.freq_threshold,
519 search_sorted=self.search_sorted,
520 buckets=self.num_buckets,
521 encode_type=self.encode_type,
522 cat_names=column_names,
523 max_size=self.max_size,
524 dtype=self.output_dtype,
525 split_out=(
526 self.split_out.get(storage_name, 1)
527 if isinstance(self.split_out, dict)
528 else self.split_out
529 ),
530 single_table=self.single_table,
531 )
532 new_df[name] = encoded
File .../lib/python3.10/site-packages/nvtabular/ops/categorify.py:1717, in _encode(name, storage_name, path, df, cat_cache, freq_threshold, search_sorted, buckets, encode_type, cat_names, max_size, dtype, split_out, single_table)
1714 if buckets and storage_name in buckets:
1715 # apply hashing for "infrequent" categories
1716 indistinct = (
-> 1717 _hash_bucket(df, buckets, selection_l.names, encode_type=encode_type)
1718 + bucket_encoding_offset
1719 )
1721 if use_collection:
1722 # Manual broadcast merge
File .../lib/python3.10/site-packages/nvtabular/ops/categorify.py:1844, in _hash_bucket(df, num_buckets, col, encode_type)
1843 nb = num_buckets[col[0]]
-> 1844 encoded = dispatch.hash_series(df[col[0]]) % nb
1845 elif encode_type == "combo":
File .../lib/python3.10/site-packages/merlin/core/dispatch.py:294, in hash_series(ser)
288 if isinstance(ser, pd.Series):
289 # Using pandas hashing, which does not produce the
290 # same result as cudf.Series.hash_values(). Do not
291 # expect hash-based data transformations to be the
292 # same on CPU and CPU. TODO: Fix this (maybe use
293 # murmurhash3 manually on CPU).
--> 294 return hash_object_dispatch(ser).values
295 elif cudf and isinstance(ser, cudf.Series):
File .../lib/python3.10/site-packages/dask/utils.py:642, in Dispatch.__call__(self, arg, *args, **kwargs)
641 meth = self.dispatch(type(arg))
--> 642 return meth(arg, *args, **kwargs)
File .../lib/python3.10/site-packages/dask/dataframe/backends.py:502, in hash_object_pandas(obj, index, encoding, hash_key, categorize)
498 @hash_object_dispatch.register((pd.DataFrame, pd.Series, pd.Index))
499 def hash_object_pandas(
500 obj, index=True, encoding="utf8", hash_key=None, categorize=True
501 ):
--> 502 return pd.util.hash_pandas_object(
503 obj, index=index, encoding=encoding, hash_key=hash_key, categorize=categorize
504 )
File .../lib/python3.10/site-packages/pandas/core/util/hashing.py:126, in hash_pandas_object(obj, index, encoding, hash_key, categorize)
125 elif isinstance(obj, ABCSeries):
--> 126 h = hash_array(obj._values, encoding, hash_key, categorize).astype(
127 "uint64", copy=False
128 )
129 if index:
File .../lib/python3.10/site-packages/pandas/core/util/hashing.py:308, in hash_array(vals, encoding, hash_key, categorize)
303 raise TypeError(
304 "hash_array requires np.ndarray or ExtensionArray, not "
305 f"{type(vals).__name__}. Use hash_pandas_object instead."
306 )
--> 308 return _hash_ndarray(vals, encoding, hash_key, categorize)
File .../lib/python3.10/site-packages/pandas/core/util/hashing.py:346, in _hash_ndarray(vals, encoding, hash_key, categorize)
340 from pandas import (
341 Categorical,
342 Index,
343 factorize,
344 )
--> 346 codes, categories = factorize(vals, sort=False)
347 cat = Categorical(
348 codes, Index._with_infer(categories), ordered=False, fastpath=True
349 )
File .../lib/python3.10/site-packages/pandas/core/algorithms.py:822, in factorize(values, sort, na_sentinel, use_na_sentinel, size_hint)
820 values = np.where(null_mask, na_value, values)
--> 822 codes, uniques = factorize_array(
823 values,
824 na_sentinel=na_sentinel_arg,
825 size_hint=size_hint,
826 )
828 if sort and len(uniques) > 0:
File .../lib/python3.10/site-packages/pandas/core/algorithms.py:578, in factorize_array(values, na_sentinel, size_hint, na_value, mask)
577 table = hash_klass(size_hint or len(values))
--> 578 uniques, codes = table.factorize(
579 values,
580 na_sentinel=na_sentinel,
581 na_value=na_value,
582 mask=mask,
583 ignore_na=ignore_na,
584 )
586 # re-cast e.g. i8->dt64/td64, uint8->bool
File pandas/_libs/hashtable_class_helper.pxi:5943, in pandas._libs.hashtable.PyObjectHashTable.factorize()
File pandas/_libs/hashtable_class_helper.pxi:5857, in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'list'
Do you think it should be a bug report?
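If running on the GPU is not an option, one possible workaround is to hash-bucket the list columns yourself before handing them to NVTabular, hashing element-wise so pandas never sees a list value. This is only a rough sketch of the idea (explode, hash, modulo, regroup) with plain pandas, not NVTabular's actual HashBucket implementation:

```python
import pandas as pd

num_buckets = 10
df = pd.DataFrame({"items": [[1, 2, 3], [1, 2], [1, 2, 4, 4]]})

# Flatten the list column to one scalar per row (original row index is kept).
exploded = df["items"].explode()

# Hash the scalar values and fold them into the bucket range.
hashed = pd.util.hash_pandas_object(exploded, index=False) % num_buckets

# Regroup the bucketed values back into one list per original row.
df["items_bucketed"] = hashed.groupby(level=0).agg(list)
print(df["items_bucketed"])
```

Note that pandas hashing differs from cudf.Series.hash_values() (as the comment in merlin/core/dispatch.py warns), so bucket assignments will not match between CPU and GPU runs.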
EDIT:
- NVT version: 23.08.00
- Docker image: nvcr.io/nvidia/merlin/merlin-pytorch:23.08
A solution worked for me: although a GPU was present, my notebook was running on the CPU, which forced NVTabular into CPU mode. After installing RAPIDS (https://docs.rapids.ai/install#pip) and restarting the kernel, it started to use the GPU automatically and the issue was resolved.
I can confirm my code only errors out on the CPU too. On the GPU it works fine. Still, this is a bug.
OK, this error is occurring again, even after installing RAPIDS... can someone help?
I ran the following code to check whether a GPU exists:
device_name = tf.test.gpu_device_name()
if len(device_name) > 0:
print("Found GPU at: {}".format(device_name))
else:
device_name = "/device:CPU:0"
print("No GPU, using {}.".format(device_name))
and it returns
Found GPU at: /device:GPU:0
so a GPU is present, but the NVTabular Dataset keeps falling back to the CPU.
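Since the thread suggests NVTabular silently falls back to pandas when cuDF is unusable, it may help to probe from NVTabular's side rather than TensorFlow's: TensorFlow seeing the GPU does not mean RAPIDS is installed and working. A sketch of such a probe, assuming cuDF importability is what drives the backend choice:

```python
def gpu_backend_available() -> bool:
    """Probe whether cuDF (the RAPIDS DataFrame library) is importable
    and actually usable; NVTabular only uses its GPU backend when it is."""
    try:
        import cudf  # part of RAPIDS; if missing, NVTabular falls back to pandas
        cudf.Series([0])  # exercise the GPU runtime, not just the import
        return True
    except Exception:
        return False

print("cuDF/GPU backend usable:", gpu_backend_available())
```

If this prints False while tf.test.gpu_device_name() reports a device, TensorFlow can see the GPU but the RAPIDS install is missing or broken, which matches the earlier fix. Recent nvt.Dataset versions also accept a cpu= flag, which can make the backend choice explicit instead of relying on autodetection.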