category encoder fails when there is a value in valid which was not present in train

Open kegl opened this issue 1 year ago • 3 comments

I have some sparse boolean columns with very few Trues. When a column happens to have no True in train but some in validation, the category encoder replaces those Trues with NaNs, and the later cast to integer fails. It took me a while to figure out the source of the error. The trace is this:

File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pytorch_tabular/tabular_model.py", line 754, in fit
    datamodule = self.prepare_dataloader(

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pytorch_tabular/tabular_model.py", line 537, in prepare_dataloader
    datamodule.setup("fit")

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pytorch_tabular/tabular_datamodule.py", line 510, in setup
    self._cache_dataset()

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pytorch_tabular/tabular_datamodule.py", line 456, in _cache_dataset
    validation_dataset = TabularDataset(

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pytorch_tabular/tabular_datamodule.py", line 78, in __init__
    self.categorical_X = self.categorical_X.astype(np.int64).values

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pandas/core/generic.py", line 6534, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 414, in astype
    return self.apply(

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 354, in apply
    applied = getattr(b, f)(**kwargs)

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 616, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 238, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 183, in astype_array
    values = _astype_nansafe(values, dtype, copy=copy)

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 101, in _astype_nansafe
    return _astype_float_to_int_nansafe(arr, dtype, copy)

  File "/home/scripts/mbrl-tools/.venv/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 146, in _astype_float_to_int_nansafe
    raise IntCastingNaNError(

pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
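
For illustration, the failure mode described above can be sketched with pandas alone. This is not pytorch_tabular's encoder code: the `train` and `valid` frames, the `flag` column, and the dictionary `mapping` are hypothetical stand-ins for an ordinal encoder fitted only on the training data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the boolean column has no True in train but one in valid.
train = pd.DataFrame({"flag": [False, False, False]})
valid = pd.DataFrame({"flag": [False, True, False]})

# Stand-in for an ordinal category encoder fitted on train only
# (an assumption for illustration, not the library's actual implementation).
mapping = {bool(value): code for code, value in enumerate(train["flag"].unique(), start=1)}

# Values missing from the mapping become NaN, so the column turns into floats.
encoded_valid = valid["flag"].map(mapping)

# Casting a float column that contains NaN to int64 raises, as in the trace above.
try:
    encoded_valid.astype(np.int64)
    cast_error = None
except ValueError as exc:  # recent pandas raises IntCastingNaNError, a ValueError subclass
    cast_error = exc

print(encoded_valid.isna().sum(), type(cast_error).__name__)
```

If the encoder really behaves like this mapping, any category seen only in validation would end up as NaN before the `astype(np.int64)` call in `TabularDataset.__init__`.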

kegl avatar Feb 17 '24 16:02 kegl

This shouldn't be the case. The category encoder is supposed to be robust enough to catch this. Let me take a look. #406 also seems to be related to the same issue.

manujosephv avatar Mar 09 '24 02:03 manujosephv

@kegl Is there any way you can share a reproducible, self-contained minimal example?

https://stackoverflow.com/help/minimal-reproducible-example

manujosephv avatar Mar 09 '24 12:03 manujosephv

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 08 '24 18:05 stale[bot]