[dtypes] `FloatFormatter` reverse transform does not support new pandas dtypes

Open pvk-developer opened this issue 1 year ago • 0 comments

Error Description

The `FloatFormatter` crashes because hard-coded logic tries to convert the stored dtype to a NumPy data type (line 191: `is_integer = np.dtype(self._dtype).kind == 'i'`). If the stored dtype is not one that `np.dtype` can interpret, this raises an error during reverse transformation. This is the root cause of `FloatFormatter` not handling the new pandas data types.
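The failure can be seen without RDT at all. The snippet below is a minimal sketch that assumes `self._dtype` holds the dtype of the fitted column; `kind_via_numpy` is a hypothetical helper written for illustration, not part of RDT.

```python
import numpy as np
import pandas as pd


def kind_via_numpy(dtype):
    """Mimic the check quoted above; return the NumPy kind or the error name."""
    try:
        return np.dtype(dtype).kind
    except TypeError as error:
        # NumPy cannot interpret pandas extension dtypes such as Int8Dtype.
        return type(error).__name__


# A plain NumPy integer dtype passes through the check.
numpy_kind = kind_via_numpy(np.dtype('int64'))

# A nullable pandas dtype (assumed here to be what the transformer stored) does not.
pandas_kind = kind_via_numpy(pd.Series([1, 2, -3], dtype='Int8').dtype)
```

With a nullable column, the conversion itself fails before the `.kind == 'i'` comparison is ever evaluated.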

Steps to reproduce

import pandas as pd
from rdt.transformers import FloatFormatter

# One column per nullable pandas dtype.
data = {
    'Int8': pd.Series([1, 2, -3], dtype='Int8'),
    'Int16': pd.Series([1, 2, -3], dtype='Int16'),
    'Int32': pd.Series([1, 2, -3], dtype='Int32'),
    'Int64': pd.Series([1, 2, -3], dtype='Int64'),
    'Float32': pd.Series([1.1, 2.2, 3.3], dtype='Float32'),
    'Float64': pd.Series([1.1, 2.2, 3.3], dtype='Float64'),
}
df = pd.DataFrame(data)

ff = FloatFormatter()
ff.fit(df, 'Int8')
transformed = ff.transform(df)
ff.reverse_transform(transformed)  # the error is raised here

Expected behavior

  • We should support these new dtypes. The best approach is to use `is_integer_dtype` from `pandas.api.types`, which also handles NumPy dtypes and unsigned integers.
  • Add an integration test to make sure that we support these new dtypes when the data contains null values.
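The two points above can be sketched as follows. This is an illustration of the proposed check, not the actual RDT fix: `restore_dtype` is a hypothetical helper, and the rounding step is an assumption about what the integer branch needs to do before casting back.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype


def restore_dtype(values, original_dtype):
    """Hypothetical helper: round only for integer dtypes, then cast back."""
    values = pd.Series(values)
    if is_integer_dtype(original_dtype):
        values = values.round()
    return values.astype(original_dtype)


# is_integer_dtype accepts NumPy dtypes, unsigned integers and nullable pandas dtypes.
checks = {
    'numpy int64': is_integer_dtype(np.dtype('int64')),
    'numpy uint8': is_integer_dtype(np.dtype('uint8')),
    'pandas Int8': is_integer_dtype(pd.Int8Dtype()),
    'pandas Float32': is_integer_dtype(pd.Float32Dtype()),
}

# Round trip with a missing value, as the integration test would need to cover.
restored_int = restore_dtype([1.2, np.nan, -3.0], pd.Int8Dtype())
restored_float = restore_dtype([1.1, np.nan, 3.3], pd.Float32Dtype())
```

An integration test along these lines would repeat the round trip for each of the six dtypes in the reproduction script, with at least one null value per column.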

Additional Context

Once this is fixed, we should be able to fit and sample from SDV (it is important to confirm sampling since fitting is already confirmed).

pvk-developer, Jul 31 '24 14:07