CTGAN DataTransformer init parameters

This PR solves issue #7. It allows two things:

The ability to fit the Gaussian mixtures on a subsample (helps scaling to big datasets while losing only a little accuracy).
The ability to pass init arguments to the DataTransformer through CTGANSynthesizer.fit (and so to change other parameters such as max_clusters).
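
For readers who want to see the subsampling idea in isolation, here is a minimal sketch. It is not the code in this PR: the helper name `fit_gm_on_subsample` and its `max_gm_samples` / `max_clusters` / `seed` arguments are hypothetical, and the `BayesianGaussianMixture` settings are only assumed to resemble what the transformer uses. The mixture is fitted on a random subset of rows and can then be applied to the full column.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def fit_gm_on_subsample(column, max_gm_samples=10_000, max_clusters=10, seed=0):
    """Fit a Bayesian GMM on at most `max_gm_samples` rows of a 1-D column.

    Hypothetical helper for illustration; the names are assumptions.
    """
    column = np.asarray(column, dtype=float).reshape(-1, 1)

    if len(column) > max_gm_samples:
        # Draw row indices without replacement so no row is counted twice.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(column), size=max_gm_samples, replace=False)
        sample = column[idx]
    else:
        sample = column

    gm = BayesianGaussianMixture(
        n_components=max_clusters,
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=0.001,
        n_init=1,
        random_state=seed,
    )
    gm.fit(sample)
    return gm


# Toy bimodal column: the fit sees 2,000 rows, the transform sees all 20,000.
rng = np.random.default_rng(42)
values = np.concatenate([rng.normal(-5, 1, 10_000), rng.normal(5, 1, 10_000)])
gm = fit_gm_on_subsample(values, max_gm_samples=2_000)
probs = gm.predict_proba(values.reshape(-1, 1))
```

The cost of the fit then depends on `max_gm_samples` rather than on the dataset size, while the per-row mode probabilities are still computed for every row.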

Apr 16 '21 15:04 FlorentRamb

All committers have signed the CLA.

Apr 16 '21 15:04 CLAassistant

@fealho @pvk-developer Can we merge this? It's basically impossible to fit CTGAN on a large dataset because the Gaussian mixture is a huge bottleneck (even when using dozens of CPUs). This PR would speed up the Gaussian mixture step. Thanks

Feb 09 '23 16:02 candalfigomoro

@npatki not sure what you want to do with this?

Feb 09 '23 18:02 fealho

Meanwhile, the library code has changed, so the PR should be updated.

For example, the _fit_continuous method now receives a pandas DataFrame, so np.random.choice() can be replaced by something like data = data.sample(self._max_gm_samples, replace=False, random_state=SEED).
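
A rough sketch of that replacement, assuming a module-level `SEED` constant and a `max_gm_samples` setting (both names are placeholders, not existing library attributes). `DataFrame.sample` raises a `ValueError` when `n` exceeds the number of rows and `replace=False`, so the call needs a size guard:

```python
import pandas as pd

SEED = 0  # placeholder seed, not a constant defined by the library


def subsample(data, max_gm_samples, seed=SEED):
    """Return at most `max_gm_samples` rows of `data`, drawn without replacement."""
    if max_gm_samples is not None and len(data) > max_gm_samples:
        return data.sample(n=max_gm_samples, replace=False, random_state=seed)
    # Small frames (or no limit) are passed through unchanged.
    return data


data = pd.DataFrame({"amount": range(1000)})
small = subsample(data, 100)
unchanged = subsample(data, 5000)
```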

Also, I wonder if ClusterBasedNormalizer could optionally be replaced by a power transform, which might be faster (although it might affect the quality of the generated data). See https://github.com/sdv-dev/RDT/issues/613
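
To make the power-transform idea concrete, here is a small sketch using scikit-learn's `PowerTransformer` (Yeo-Johnson). This only illustrates the alternative; it is not something RDT or CTGAN provides today. Unlike the cluster-based approach, it maps the column to a single roughly Gaussian mode, so multimodal columns may be represented less faithfully.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Skewed toy column, shaped (n_samples, 1) as scikit-learn expects.
x = rng.exponential(scale=2.0, size=(5_000, 1))

# Estimate the Yeo-Johnson lambda and standardize the result.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
z = pt.fit_transform(x)

# The transform is invertible, which is needed to map generated values back.
restored = pt.inverse_transform(z)
```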

Feb 13 '23 13:02 candalfigomoro