Add a "mixed data" transformer to RDT, for use in CTGAN.
I've come across the paper for CTAB-GAN, which implements a transformer I would very much like to see in RDT, to use with SDV's synthesizers.
They created a mixed-type encoder to deal with continuous variables that have some categorical property. I've been running into this issue myself. When dealing with loan data, the amount of outstanding debt is treated as a continuous variable by CTGAN, but this approach misses some of the nuances.
In many cases, the outstanding debt is 0. Exactly 0. I've found that CTGAN has a hard time grasping this idea when using the FloatFormatter. The synthesized data will have lots of values close to, but not exactly, zero. Post-processing would be an option, but I feel like this does not solve the underlying problem. Plausibly, the occurrence of such mixed variables makes things very easy for the discriminator and difficult for the generator, because "exactly 0" in some columns might arise as an easy-to-spot characteristic of the real data.
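One way to quantify the effect described above is to compare the share of exact zeros with the share of near-zero values in the real and synthetic columns. The snippet below is a minimal sketch: the `zero_profile` helper, the tolerance of 1.0, and the toy numbers are all made up for illustration and are not part of SDV or RDT.

```python
import pandas as pd


def zero_profile(values: pd.Series, tol: float = 1.0) -> dict:
    """Share of exact zeros vs. values that are merely close to zero."""
    values = values.dropna()
    exact = values == 0
    # "Near" means within the tolerance but not exactly zero
    near = (values.abs() <= tol) & ~exact
    return {"exact_zero": float(exact.mean()), "near_zero": float(near.mean())}


# Toy data (made up): the real column has a spike at exactly 0,
# while the "synthetic" column mostly only gets close to it.
real = pd.Series([0.0, 0.0, 0.0, 1500.0, 320.5, 0.0, 8700.0, 0.0])
synthetic = pd.Series([0.03, -0.4, 0.12, 1480.2, 310.9, 0.0, 8650.7, 0.7])

real_profile = zero_profile(real)
synthetic_profile = zero_profile(synthetic)
print(real_profile)
print(synthetic_profile)
```

A large gap between the two `exact_zero` shares, with the difference showing up under `near_zero` in the synthetic data, would be consistent with the behaviour described here.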
I'm very interested in what your opinion is on this. Particularly, do you think this would have an impact on the CTGAN loss function? Is there, currently, an easy way to mimic the idea of CTAB-GAN, using just the SDV package? Would the implementation of this mixed-encoder be a valuable addition to the SDV ecosystem?
Kind regards, Wilco
Hi @wilcovanvorstenbosch, thanks for providing an explanation and linking the paper. Very interesting topic!
It seems like "mixed type data" is referring not so much to the type (int, float, etc.) of the data, but rather the groups that it can represent. For example, in the outstanding debt column you mention:
- a 0 value itself could be considered as one type of data (a category representing no debt)
- any other value is a different type (a numerical value representing some debt)
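The two groups in the list above can be sketched as a flag column plus a continuous column. This is only an illustration of the idea in plain pandas: `split_mixed` and `merge_mixed` are hypothetical helpers, not RDT transformers, and the sketch does not reproduce the encoder from the CTAB-GAN paper.

```python
import numpy as np
import pandas as pd


def split_mixed(values: pd.Series, special: float = 0.0) -> pd.DataFrame:
    """Encode a mixed column as a flag plus a continuous part."""
    is_special = values == special
    # Hide the special value so it does not distort the continuous part
    continuous = values.where(~is_special, np.nan)
    return pd.DataFrame({"is_special": is_special, "continuous": continuous})


def merge_mixed(encoded: pd.DataFrame, special: float = 0.0) -> pd.Series:
    """Inverse of split_mixed: the flag wins over the continuous part."""
    return encoded["continuous"].where(~encoded["is_special"], special)


debt = pd.Series([0.0, 1500.0, 0.0, 320.5], name="outstanding_debt")
encoded = split_mixed(debt)
restored = merge_mixed(encoded)
print(encoded)
print(restored.tolist())
```

Because the flag decides the outcome when the columns are merged, a model only has to learn when the special value occurs, and the restored column contains it exactly rather than approximately.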
I'd like to note that SDV is designed to handle mixed type data specifically when it comes to missing values (nulls). For example, the FloatFormatter has provisions to handle nulls as a special category while dealing with non-null values separately. You can even supply missing_value_generation='from_column' for even higher synthetic data quality.
One quick test you can do: in your loans dataset, what happens if you replace the exact 0 values of outstanding debt with missing values instead? I suspect your synthetic data quality may improve quite a bit because SDV recognizes this type of mixed data.
import numpy as np
from sdv.single_table import GaussianCopulaSynthesizer
# replace the exact 0 values with missing values
real_data_copy = real_data.copy()
real_data_copy['outstanding_debt'] = real_data_copy['outstanding_debt'].replace(0.0, np.nan)
# now run the data with missing values through your synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data_copy)
synthetic_data = synthesizer.sample(num_rows=len(real_data_copy))
# now convert the synthesized missing values back to 0.0 values
synthetic_data['outstanding_debt'] = synthetic_data['outstanding_debt'].replace(np.nan, 0.0)
If this does end up working out, it would make for a very interesting feature in RDT. Even better if RDT could proactively identify that 0 is a separate type from the others.
Hey @npatki,
That is indeed what I meant. It's not a different dtype, but a similar idea to 'from_column', albeit with a different solution. I expect the same, performance-wise!
Notice that the paper describes using a gaussian mixture model. But in the case of RDT, considering users are used to the 'from_column' attribute, I think it would be valuable to create a similar mechanism for dealing with exact 0s, or any other value defined by the user.
If you know any way to easily implement this myself, let me know :)
Kind regards, Wilco
Hi @wilcovanvorstenbosch, yes we're definitely on the same page.
> the paper describes using a gaussian mixture model
RDT's ClusterBasedNormalizer does use Bayesian GMMs to create clusters. This might be something to try too. However, we'd be reliant on the GMM understanding that exactly 0 belongs to its own cluster.
> If you know any way to easily implement this myself, let me know :)
As a workaround, were you able to follow the code snippet I provided in my previous reply? Here, I am showing how to replace your 0 values with missing values (NaNs) for the purposes of using SDV. You can replace 0 with any other constant value that you have.
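For a constant other than 0, the same workaround could be wrapped in two small helpers. This is a sketch in plain pandas: `mask_constant`, `restore_constant`, the `credit_limit` column, and the `-1` sentinel are hypothetical names chosen for illustration. The guard reflects one caveat of the approach: if the column already contains missing values, they cannot be told apart from the masked constant when converting back.

```python
import pandas as pd


def mask_constant(data: pd.DataFrame, column: str, constant: float) -> pd.DataFrame:
    """Return a copy in which `constant` in `column` is replaced by NaN."""
    if data[column].isna().any():
        # Existing NaNs would be indistinguishable from the masked constant
        raise ValueError(f"{column!r} already contains missing values")
    masked = data.copy()
    masked[column] = masked[column].mask(masked[column] == constant)
    return masked


def restore_constant(data: pd.DataFrame, column: str, constant: float) -> pd.DataFrame:
    """Return a copy in which NaN in `column` is turned back into `constant`."""
    restored = data.copy()
    restored[column] = restored[column].fillna(constant)
    return restored


# Hypothetical loan data in which -1 marks "no credit limit"
loans = pd.DataFrame({"credit_limit": [-1.0, 5000.0, -1.0, 12000.0]})
masked = mask_constant(loans, "credit_limit", -1.0)
# ... fit a synthesizer on `masked` and sample from it here ...
round_trip = restore_constant(masked, "credit_limit", -1.0)
print(round_trip)
```

In practice `restore_constant` would be applied to the sampled synthetic data rather than to `masked`; the round trip here only shows that the two helpers are inverses of each other.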
I'm moving this into the RDT library since it's ultimately an RDT feature request (that would be used with CTGAN). Will keep this open as a feature request. We can use it to track any solutions/updates.