RDT
LabelEncoder() with add_noise=True does not support the 'category' pandas dtype
Problem Description
Currently, LabelEncoder(add_noise=True) does not support columns with the pandas 'category' dtype. For instance, if I run the following code:
import pandas as pd
from rdt import HyperTransformer
from rdt.transformers import LabelEncoder

# build a single-column table and cast it to the 'category' dtype
data_test = pd.DataFrame({'A': ['a', 'b', 'a', 'a', 'c']})
data_test = data_test.astype('category')

transformer = HyperTransformer()
transformer.set_config({
    'sdtypes': {
        'A': 'categorical'
    },
    'transformers': {
        'A': LabelEncoder(add_noise=True),
    }
})
transformer.fit(data_test)
transformer.transform(data_test)  # raises the TypeError shown below
I get the error:
TypeError: unsupported operand type(s) for +: 'Categorical' and 'int'
This comes from these lines in rdt/transformers/categorical.py:
        if self.add_noise:
--> 527     mapped = np.random.uniform(mapped, mapped + 1)
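The failure can be reproduced with pandas and NumPy alone, which may help when narrowing it down. The sketch below is not RDT's code: the `labels` dictionary is a hypothetical stand-in for whatever category-to-integer mapping the encoder builds internally, and it assumes the encoder applies that mapping with `Series.map`. Under those assumptions, mapping a 'category' column yields another categorical, and pandas does not allow arithmetic on categoricals, so `mapped + 1` fails.

```python
import numpy as np
import pandas as pd

# Same values as the report, stored with the 'category' dtype.
column = pd.Series(['a', 'b', 'a', 'a', 'c'], dtype='category')

# Hypothetical label mapping; an assumption, not taken from RDT's source.
labels = {'a': 0, 'b': 1, 'c': 2}

# A one-to-one map over a categorical column stays categorical.
mapped = column.map(labels)
print(mapped.dtype)

# Arithmetic on a categorical raises TypeError, so the noise step fails.
try:
    np.random.uniform(mapped, mapped + 1)
    failed = False
except TypeError as error:
    failed = True
    print(error)

# With an object column the mapped values are plain integers,
# so the same noise step succeeds.
mapped_obj = column.astype(object).map(labels)
noisy = np.random.uniform(mapped_obj, mapped_obj + 1)
print(noisy)
```

Each noisy value falls in the half-open interval from its integer label up to the next integer, which is why the label can still be recovered afterwards.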
Expected behavior
LabelEncoder(add_noise=True) should support the 'category' pandas dtype.
Additional context
The code above does work with FrequencyEncoder (with add_noise set to either True or False) and with LabelEncoder(add_noise=False).
Workaround
For now, convert your column to the object dtype. (This is usually the default dtype when you read from a CSV or other data source anyway.)
In the above code:
# convert from the 'category' dtype to an object
data_test['A'] = data_test['A'].astype(object)
# now the transformations will work
transformer.fit(data_test)
transformed = transformer.transform(data_test)
# avoid naming the result 'reversed', which would shadow the Python built-in
reversed_data = transformer.reverse_transform(transformed)
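When a table has several 'category' columns, the same conversion can be applied to all of them in one pass. The helper below is a small sketch using only pandas; the name `categories_to_object` is hypothetical and not part of RDT. It returns a converted copy and leaves the input table untouched.

```python
import pandas as pd


def categories_to_object(data):
    """Return a copy of `data` with every 'category' column cast to object."""
    converted = data.copy()
    # select_dtypes picks out only the columns stored as 'category'.
    for column in converted.select_dtypes(include='category').columns:
        converted[column] = converted[column].astype(object)
    return converted


# Example table: one 'category' column and one integer column.
example = pd.DataFrame({
    'A': pd.Series(['a', 'b', 'a', 'a', 'c'], dtype='category'),
    'B': [1, 2, 3, 4, 5],
})
converted_example = categories_to_object(example)
print(converted_example.dtypes)
```

The converted table can then be passed to fit and transform as in the workaround above.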