[BUG] Error with prepare_aliccp()
I ran into this error when running prepare_aliccp() on the downloaded Ali-CCP datasets.
Traceback (most recent call last):
  File "/share/suh-scrap/zh338/aliccp/preprocess.py", line 13, in <module>
    prepare_aliccp(DATA_DIR, convert_train=False, convert_test=True)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 164, in prepare_aliccp
    _convert_data(
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 449, in _convert_data
    merlin.io.Dataset(tmp_files, dtypes=dtypes).to_parquet(out_dir)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 380, in __init__
    self.infer_schema()
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 1240, in infer_schema
    dtypes = self.sample_dtypes(n=n, annotate_lists=True)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 1264, in sample_dtypes
    _real_meta = _set_dtypes(_real_meta, self.dtypes)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 1301, in _set_dtypes
    chunk[col] = chunk[col].astype(dtype)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 448, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 230, in astype_array
    values = astype_nansafe(values, dtype, copy=copy)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 170, in astype_nansafe
    return arr.astype(dtype, copy=True)
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
I saw another issue (#507) describing a similar problem, but it didn't mention a solution or workaround. Is there a workaround to avoid this error?
Thanks!
Same issue here. Any updates?
The dataset contains None values, as you can see if you display the head of the dataset.
I solved it by changing
https://github.com/NVIDIA-Merlin/models/blob/eb1e54196a64a70950b2a7e7744d2150e052d53e/merlin/datasets/ecommerce/aliccp/dataset.py#L448
to
dtypes = {f.name: "Int32" for f in _Features().features}
("Int32" with a capital I is the pandas nullable integer dtype, which can hold missing values; lowercase "int32" cannot.)
with the new dtypes
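For anyone who wants to see the difference in isolation, here is a minimal sketch in plain pandas, outside of Merlin. The sample Series is made up for illustration and is not taken from the Ali-CCP files; it only assumes the problem column is an object column that mixes integers and None, which is what the traceback above suggests.

```python
import pandas as pd

# Hypothetical sample: an object column holding integers and one None,
# standing in for a feature column from the converted data.
raw = pd.Series([1, None, 3], dtype="object")

# Casting to the plain NumPy dtype fails because None cannot become an int.
try:
    raw.astype("int32")
    plain_cast_failed = False
except (TypeError, ValueError) as exc:
    plain_cast_failed = True
    print(f"int32 cast failed: {exc}")

# Casting to the nullable extension dtype keeps the missing value as <NA>.
nullable = raw.astype("Int32")
print(nullable.dtype)
print(nullable.isna().sum(), "missing value(s) preserved")
```

Note that downstream code then sees pd.NA in those columns, so anything that expects plain NumPy integers may still need the missing values filled or dropped first.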