CTGAN
Gaussian approximation of continuous variables clearly visible in non-gaussian/non-multimodal data
Columns where the continuous data is distributed in a way that is hard to approximate with gaussians (e.g. dates that increase in frequency and follow a line) are not well approximated by the GMM. I haven't used the Bayesian GMM much, because it is much slower, so if this does not occur there, please correct me. With a regular GMM, however, the following pattern occurs. The plots show the cumulative distribution.
Here you can clearly see the several gaussians that are fit to the curve, resulting in a fit that is not horrible but definitely not great. Do you have any thoughts on how this could be improved?
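For reference, the effect can be reproduced outside of CTGAN with something like the sketch below. This is a standalone illustration using sklearn, not CTGAN's actual transformer, and the triangular column is a stand-in for the "dates increasing in frequency" case:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for a "dates increasing in frequency" column: the density rises
# linearly, so the distribution is neither gaussian nor multimodal.
real = rng.triangular(left=0.0, mode=1.0, right=1.0, size=10_000)

# Fit a handful of gaussians, then sample from the mixture.
gmm = GaussianMixture(n_components=5, random_state=0).fit(real.reshape(-1, 1))
fake = gmm.sample(10_000)[0].ravel()

# Empirical cumulative distributions of real vs. mixture samples; the
# individual gaussian components show up as bends in the fitted curve.
for data, label in [(real, "real"), (fake, "GMM samples")]:
    xs = np.sort(data)
    plt.plot(xs, np.linspace(0.0, 1.0, len(xs)), label=label)
plt.xlabel("value")
plt.ylabel("cumulative fraction")
plt.legend()
plt.show()
```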
In TGAN, this problem was much less pronounced, and the curves looked as follows. In preprocessing, I think the only difference is using 4 × std instead of 2 × std. Apart from the architectural differences, I can't immediately think of a reason for this behaviour.
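To make that preprocessing difference concrete, here is a simplified sketch of mode-specific normalization with a configurable std multiplier. `std_mult` and `mode_specific_normalize` are my own names; both real implementations differ in details such as soft vs. hard mode assignment and the exact clipping range:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mode_specific_normalize(column, n_components=5, std_mult=4.0):
    """Scale each value by the std of the GMM mode it belongs to.

    Per the 4 x std vs. 2 x std difference mentioned above, std_mult=4
    roughly matches the CTGAN-style preprocessing and std_mult=2 the
    TGAN-style one; this is a simplification of both.
    """
    x = np.asarray(column, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(x)
    modes = gmm.predict(x)  # hard assignment, for illustration only
    means = gmm.means_.ravel()[modes]
    stds = np.sqrt(gmm.covariances_.ravel())[modes]
    # Normalize within the assigned mode and clip to [-1, 1].
    scaled = np.clip((x.ravel() - means) / (std_mult * stds), -1.0, 1.0)
    return scaled, modes
```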
Hi, I'm not sure I understand your plots correctly. Are these plots of the cumulative distribution for synthetic data (generated by the GAN) and real (training) data?
Both: orange is synthetic, blue is real. I'll upload some clearer plots tomorrow.
This is a bit clearer with a distribution plot. For all plots: blue is real data of one column, orange is fake data of the same column. The fake data in this plot was generated with CTGAN.
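(For reference, overlays like these can be produced with something like the snippet below; `compare_column` and the DataFrame arguments are placeholder names of mine, not part of either library.)

```python
import matplotlib.pyplot as plt
import seaborn as sns

def compare_column(real_df, fake_df, column):
    """Overlay the real (blue) and fake (orange) distributions of one column."""
    sns.kdeplot(real_df[column], color="tab:blue", label="real")
    sns.kdeplot(fake_df[column], color="tab:orange", label="fake")
    plt.title(column)
    plt.legend()
    plt.show()
```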
For example, this was from data generated with TGAN:
And from my WGAN adaptation of TGAN:
So across these plots, we see a clear decrease in the spikiness of the generated data. I'm trying to figure out what causes this, because the data in the TGAN-WGAN version is modelled quite well, while the data in CTGAN and TGAN is quite clearly composed of several smaller distributions.
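To go beyond eyeballing the curves, the per-column fit could also be quantified, e.g. with a two-sample Kolmogorov-Smirnov statistic. This is just a diagnostic suggestion on my side, not something from either codebase:

```python
from scipy.stats import ks_2samp

def cdf_distance(real_values, fake_values):
    """Max distance between the two empirical CDFs: 0 = identical, 1 = disjoint."""
    statistic, p_value = ks_2samp(real_values, fake_values)
    return statistic
```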
@Baukebrenninkmeijer just to confirm, the TGAN-WGAN implementation you are talking about is https://github.com/Baukebrenninkmeijer/On-the-Generation-and-Evaluation-of-Synthetic-Tabular-Data-using-GANs/tree/master/tgan_wgan_gp?
@shreyanshs Yes, that is correct. However, it's quite old now.