RDT
[dtypes] `FloatFormatter` reverse transform does not support new pandas dtypes
Error Description
`FloatFormatter` crashes during reverse transformation because of hard-coded logic that casts the stored dtype to a numpy data type (line 191: `is_integer = np.dtype(self._dtype).kind == 'i'`). If the dtype is not one that `np.dtype` can interpret, as is the case for the pandas nullable dtypes, this raises an error. This is the root cause of `FloatFormatter` not handling the new pandas data types.
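The failure can be shown in isolation, without RDT. The snippet below is a minimal sketch: `kind_check` is a hypothetical helper that only mimics the expression quoted above, not the actual RDT method.

```python
import numpy as np
import pandas as pd


def kind_check(dtype):
    """Hypothetical helper mimicking the quoted check; not RDT code."""
    return np.dtype(dtype).kind == 'i'


# A plain numpy integer dtype passes through np.dtype() without problems.
numpy_result = kind_check(np.dtype('int64'))

# A pandas nullable dtype is an extension dtype, not a numpy dtype,
# so np.dtype() cannot interpret it and raises TypeError.
nullable_dtype = pd.Series([1, 2, -3], dtype='Int8').dtype
try:
    kind_check(nullable_dtype)
    raised = False
except TypeError:
    raised = True

print(numpy_result, raised)
```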
Steps to reproduce
from rdt.transformers import FloatFormatter
import pandas as pd
data = {
    'Int8': pd.Series([1, 2, -3], dtype='Int8'),
    'Int16': pd.Series([1, 2, -3], dtype='Int16'),
    'Int32': pd.Series([1, 2, -3], dtype='Int32'),
    'Int64': pd.Series([1, 2, -3], dtype='Int64'),
    'Float32': pd.Series([1.1, 2.2, 3.3], dtype='Float32'),
    'Float64': pd.Series([1.1, 2.2, 3.3], dtype='Float64'),
}
df = pd.DataFrame(data)
ff = FloatFormatter()
ff.fit(df, 'Int8')
transformed = ff.transform(df)
ff.reverse_transform(transformed)
Expected behavior
- We should support these new dtypes. The best approach is to use `is_integer_dtype` from pandas, which also supports `numpy.dtypes` and `uints`.
- Add an integration test to make sure that we support these new dtypes with `Null` values.
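A rough sketch of what the proposed check could look like, assuming the fix simply swaps the numpy `kind` comparison for `pandas.api.types.is_integer_dtype`. The `restore_dtype` helper is hypothetical and only illustrates the idea; it is not the RDT implementation. Note that, unlike the `kind == 'i'` comparison, `is_integer_dtype` also returns `True` for unsigned integers.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype


def restore_dtype(values, dtype):
    """Hypothetical helper: round and cast back only for integer dtypes."""
    if is_integer_dtype(dtype):
        # Round first so the float values can be cast to an integer dtype.
        return values.round().astype(dtype)
    return values.astype(dtype)


# The check works for numpy dtypes, unsigned ints, and pandas nullable dtypes.
checks = {
    'numpy int64': is_integer_dtype(np.dtype('int64')),
    'numpy uint8': is_integer_dtype(np.dtype('uint8')),
    'numpy float64': is_integer_dtype(np.dtype('float64')),
    'pandas Int8': is_integer_dtype(pd.Int8Dtype()),
    'pandas Float32': is_integer_dtype(pd.Float32Dtype()),
}

# Float data with a missing value, as it might look before reversing.
transformed = pd.Series([1.2, np.nan, -2.7])
restored = restore_dtype(transformed, pd.Int8Dtype())

print(checks)
print(restored)
```

An integration test along these lines would fit a `FloatFormatter` on a nullable column that contains missing values and check that the reverse-transformed column keeps both the dtype and the missing entries.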
Additional Context
Once this is fixed, we should be able to fit and sample from SDV (it is important to confirm sampling since fitting is already confirmed).