CTGAN

Gaussian approximation of continuous variables really clear in non-gaussian/non-multimodal data

Open Baukebrenninkmeijer opened this issue 5 years ago • 5 comments

Columns whose continuous data is hard to approximate with Gaussians (e.g. dates that increase in frequency and follow a line) are not well approximated by the GMM. I have not used the BGMT much because it is much slower, so if this does not occur there, please correct me. With a GMM, however, the following pattern occurs. The plots show the cumulative distribution.
image

You can clearly see the several Gaussians that are fit to the curve, resulting in a fit that is not horrible but definitely not great. Do you have any thoughts on how this can be improved?
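To make the effect concrete, here is a minimal sketch, not CTGAN's own code, that fits a small 1-D Gaussian mixture with plain EM to a column whose density rises linearly, then samples from the fit. The triangular data is only a stand-in for the date column described above, and `fit_gmm_1d` and `sample_gmm_1d` are hypothetical helper names.

```python
import math
import random


def fit_gmm_1d(data, k, iters=30):
    """Fit a 1-D Gaussian mixture with plain EM (illustrative, not optimized)."""
    lo, hi = min(data), max(data)
    # Spread the initial means evenly over the data range.
    means = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    stds = [max((hi - lo) / k, 1e-3)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        # The 1/sqrt(2*pi) constant is omitted because it cancels in the ratio.
        resp = []
        for x in data:
            dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / s
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = max(math.sqrt(var), 1e-3)
    return weights, means, stds


def sample_gmm_1d(weights, means, stds, n, rng):
    """Draw n samples: pick a component by weight, then sample its Gaussian."""
    comps = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(means[c], stds[c]) for c in comps]


rng = random.Random(0)
# Stand-in column: density increases linearly on [0, 1] (mode at the upper end).
real = [rng.triangular(0.0, 1.0, 1.0) for _ in range(1000)]
weights, means, stds = fit_gmm_1d(real, k=3)
synth = sample_gmm_1d(weights, means, stds, len(real), rng)
print(sorted(zip(means, stds, weights)))
```

Plotting the sorted `real` and `synth` values as cumulative curves is one way to check whether the individual components show up as bumps around the smooth real curve.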

In TGAN, this problem was much less pronounced, and the curves looked as follows. In preprocessing, I think the only difference is using 4 x std instead of 2 x std. Apart from the different architecture, I can't immediately think of a reason for this behaviour.

image
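As a rough illustration of what the 2 x std versus 4 x std choice changes, the sketch below assumes a value is normalized within its mode as `(x - mean) / (k * std)` and then clipped to [-1, 1]. Both the clipping step and the function names are assumptions made for this example, not code taken from either library.

```python
import random


def normalize(x, mean, std, k):
    """Scale x by k standard deviations around the mode mean, then clip to [-1, 1]."""
    v = (x - mean) / (k * std)
    return max(-1.0, min(1.0, v))


def clipped_fraction(values, mean, std, k):
    """Fraction of values that fall at or beyond k standard deviations (and so saturate)."""
    hits = sum(1 for x in values if abs((x - mean) / (k * std)) >= 1.0)
    return hits / len(values)


rng = random.Random(0)
values = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

frac_2 = clipped_fraction(values, 0.0, 1.0, k=2)  # divide by 2 x std
frac_4 = clipped_fraction(values, 0.0, 1.0, k=4)  # divide by 4 x std
print(frac_2, frac_4)
```

For Gaussian data, a few percent of values saturate with a 2 x std divisor and almost none with 4 x std, at the cost of squeezing most values into a narrower part of the [-1, 1] range.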

Baukebrenninkmeijer, Nov 26 '19 15:11

Hi, I'm not sure if I understand your plots correctly. Are these plots about the cumulative distribution for synthetic data (generated by GAN) and real (training) data?

leix28, Dec 02 '19 16:12

Both, orange is synthetic, blue is real. I'll upload some clearer plots tomorrow.

Baukebrenninkmeijer, Dec 02 '19 21:12

This is a bit clearer with a distribution plot. For all plots: blue is real data of one column, orange is fake data of the same column. The fake data in this plot was generated with CTGAN.

image

For example, this was from data generated with TGAN:

image

And from my WGAN adaptation of TGAN:

image

So in these plots, we see a clear decrease in spikiness of the generated data. I'm trying to figure out what causes this, because the data from TGAN-WGAN is modelled quite well, while the data from CTGAN and TGAN quite clearly consists of several smaller distributions.
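To compare the generators with a number rather than by eye, one option is the largest vertical gap between the empirical cumulative distributions of the real and fake column (a Kolmogorov-Smirnov style statistic). This is a minimal sketch; `max_cdf_gap` is a hypothetical helper name, and the metric is a suggestion rather than something used for the plots above.

```python
from bisect import bisect_right


def max_cdf_gap(real, synth):
    """Largest vertical gap between the empirical CDFs of two samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synth)
    gap = 0.0
    # Both CDFs are step functions, so the largest gap occurs at a data point.
    for x in real_sorted + synth_sorted:
        f_real = bisect_right(real_sorted, x) / len(real_sorted)
        f_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(f_real - f_synth))
    return gap
```

A value of 0 means the two empirical CDFs coincide and a value of 1 means the samples do not overlap at all. Because it only tracks the single largest gap, it can miss fine-grained spikiness that a histogram shows.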

Baukebrenninkmeijer, Dec 03 '19 14:12

@Baukebrenninkmeijer just to confirm, the TGAN-WGAN implementation you are talking about is https://github.com/Baukebrenninkmeijer/On-the-Generation-and-Evaluation-of-Synthetic-Tabular-Data-using-GANs/tree/master/tgan_wgan_gp?

shreyanshs, Nov 12 '20 05:11

@shreyanshs Yes, that is correct. However, it's quite old now.

Baukebrenninkmeijer, Nov 13 '20 15:11