SDV Support of pandas dtypes (needed for integers with missing values)

Open nuldertien opened this issue 2 years ago • 2 comments

Problem Description

I have a column in my dataset that contains integers and NaN values. To handle integers (no decimals) alongside missing values, I currently convert the column to the 'Int64' dtype, i.e. pd.Int64Dtype(). However, after training an SDV model on data with this dtype, sampling raises an error ("Cannot interpret 'Int64Dtype()' as a data type").

Expected behavior

SDV should support pandas dtypes such as Int64, so that I can train and sample on this kind of data.

Additional context

I converted the column with .astype('Int64'), specifically with round(pd.to_numeric(dataframe['column1'], errors='coerce')).astype('Int64'). The result looks like {'column1': [123500, 56832, <NA>]}, where the type() of each value is [np.int64, np.int64, pandas._libs.missing.NAType], respectively. The metadata used is provided below.

"fields": { "column1": { "type": "numerical", "subtype": "integer" } }
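
For readers who want to reproduce the input data, the conversion described above can be sketched in pandas alone. This is a minimal sketch and not the original dataset: the raw string values are invented for illustration, and no SDV call is made, so it only shows how the column ends up with the nullable 'Int64' dtype.

```python
import pandas as pd

# Illustrative raw data (assumed): two numeric strings and one unparseable value.
dataframe = pd.DataFrame({'column1': ['123500', '56832', 'n/a']})

# errors='coerce' turns unparseable entries into NaN, giving a float64 column.
numeric = pd.to_numeric(dataframe['column1'], errors='coerce')

# Round, then cast to the nullable integer dtype; NaN becomes <NA>.
dataframe['column1'] = numeric.round().astype('Int64')

print(dataframe['column1'].dtype)
print(dataframe['column1'].isna().sum())
```

A column built this way is what triggers the sampling error reported above.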

Dec 23 '22 08:12 nuldertien

Thanks for filing @nuldertien -- we'll keep this issue open for tracking purposes and communicating progress.

For anyone seeing this issue for the first time, here is a suggestion in the meantime:

The SDV is smart enough to recognize that all values in the column are whole numbers. So even if you leave the column as float64 for now, any decimals you see should always end in .0. While this is not ideal in terms of data representation, it should hopefully still give you usable synthetic data.
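
One way to apply this suggestion is to cast the column to float64 before training and cast the sampled output back to 'Int64' afterwards. The snippet below is only a sketch of those two casts in pandas: the `synthetic` frame is a hand-written stand-in for whatever the model would sample (an assumption, since no SDV call is shown here), and the cast back relies on the sampled values being whole numbers, as described above.

```python
import numpy as np
import pandas as pd

# Real data with the nullable integer dtype from the original report.
real = pd.DataFrame({'column1': pd.array([123500, 56832, pd.NA], dtype='Int64')})

# Cast to float64 before training; <NA> becomes NaN.
training = real.astype({'column1': 'float64'})

# Stand-in for sampled output (assumed): whole-number floats plus a NaN.
synthetic = pd.DataFrame({'column1': [101234.0, np.nan, 98765.0]})

# Cast back to the nullable integer dtype; NaN becomes <NA> again.
restored = synthetic.astype({'column1': 'Int64'})
print(restored['column1'].dtype)
```

If a sampled value had a fractional part, the final cast would raise instead of silently truncating, so rounding first may be needed in that case.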

Dec 23 '22 19:12 npatki

I just got this same error, so I'd like to point out that this is an ongoing issue.

May 30 '24 20:05 ryantimjohn
