LabelEncoder() with add_noise = True does not support 'category' pandas dtype

Open R-Palazzo opened this issue 2 years ago • 1 comment

Problem Description

Currently, LabelEncoder(add_noise=True) does not support the pandas 'category' dtype. For instance, if I run the following code:

import pandas as pd
from rdt import HyperTransformer
from rdt.transformers import LabelEncoder

# a single column, cast to the pandas 'category' dtype
data_test = pd.DataFrame({'A': ['a', 'b', 'a', 'a', 'c']})
data_test = data_test.astype('category')

transformer = HyperTransformer()
transformer.set_config({
  'sdtypes': {
    'A': "categorical"
  },
  'transformers': {
    'A': LabelEncoder(add_noise=True),
  }
})
transformer.fit(data_test)
transformer.transform(data_test)

I get the error: TypeError: unsupported operand type(s) for +: 'Categorical' and 'int'

This comes from these lines in rdt/transformers/categorical.py (the traceback points at line 527):

if self.add_noise:
    mapped = np.random.uniform(mapped, mapped + 1)
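
The failure can be reproduced with pandas alone. The sketch below assumes that `mapped` is produced by mapping the column through a label-to-integer dictionary (the `codes` dictionary and the `can_add_one` helper are hypothetical, not part of RDT). Mapping a 'category' column this way typically yields another categorical Series, and pandas does not define arithmetic on categoricals, so `mapped + 1` raises the TypeError shown above.

```python
import pandas as pd


def can_add_one(series: pd.Series) -> bool:
    """Return True if `series + 1` works, False if pandas raises TypeError."""
    try:
        series + 1
    except TypeError:
        return False
    return True


labels = pd.Series(['a', 'b', 'a', 'a', 'c'])
codes = {'a': 0, 'b': 1, 'c': 2}  # illustrative mapping, not RDT's internal one

# Mapping an object column gives an ordinary integer Series.
mapped_object = labels.map(codes)

# Mapping a 'category' column usually gives a categorical Series instead.
mapped_category = labels.astype('category').map(codes)

print(mapped_object.dtype, can_add_one(mapped_object))
print(mapped_category.dtype, can_add_one(mapped_category))
```

This would also be consistent with LabelEncoder(add_noise=False) working, since that path never evaluates `mapped + 1`.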

Expected behavior

LabelEncoder(add_noise=True) should support the pandas 'category' dtype.

Additional context

However, the code above works with FrequencyEncoder(add_noise=True/False) and with LabelEncoder(add_noise=False).

R-Palazzo, Feb 24 '23 17:02

Workaround

For now, convert your column to the object dtype. (This is usually the default dtype for text columns when you read from a CSV or other data source anyway.)

In the above code:

# convert from the 'category' dtype to an object
data_test['A'] = data_test['A'].astype(object)

# now the transformations will work
transformer.fit(data_test)
transformed = transformer.transform(data_test)
reversed_data = transformer.reverse_transform(transformed)
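
If a table has several 'category' columns, the same conversion can be applied to all of them at once. This is a minimal sketch using only pandas; the helper name `categories_to_object` and the extra column 'B' are hypothetical and only serve the illustration.

```python
import pandas as pd


def categories_to_object(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with every 'category' column cast to object."""
    converted = df.copy()
    category_columns = converted.select_dtypes(include='category').columns
    for column in category_columns:
        converted[column] = converted[column].astype(object)
    return converted


data_test = pd.DataFrame({
    'A': ['a', 'b', 'a', 'a', 'c'],
    'B': [1, 2, 3, 4, 5],
}).astype({'A': 'category'})

clean = categories_to_object(data_test)
print(clean.dtypes)
```

The helper returns a copy, so the original frame keeps its 'category' columns; pass `clean` to transformer.fit and transformer.transform instead.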

npatki, Jun 09 '23 20:06