DataTransformer init parameters
This PR solves issue #7. It allows two things:
- ability to fit Gaussian mixtures on a subsample (helps scale to big datasets while losing only a little accuracy)
- ability to pass init arguments to the DataTransformer through CTGANSynthesizer.fit (and so to change other parameters such as max_clusters).
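To illustrate the second point, here is a minimal sketch of the forwarding pattern. ToyDataTransformer, ToySynthesizer, the transformer_kwargs argument and the default values are hypothetical stand-ins for illustration only; they are not the actual CTGAN classes or signatures.

```python
# Hypothetical stand-ins, NOT the real CTGAN classes or signatures.
class ToyDataTransformer:
    def __init__(self, max_clusters=10, max_gm_samples=None):
        # Illustrative defaults; the real defaults may differ.
        self.max_clusters = max_clusters
        self.max_gm_samples = max_gm_samples


class ToySynthesizer:
    def fit(self, data, transformer_kwargs=None):
        # Forward any user-supplied init arguments to the transformer,
        # falling back to the transformer's own defaults when none are given.
        self.transformer = ToyDataTransformer(**(transformer_kwargs or {}))
        return self


synth = ToySynthesizer().fit(
    data=[],
    transformer_kwargs={"max_clusters": 5, "max_gm_samples": 10_000},
)
default_synth = ToySynthesizer().fit(data=[])
```

With this shape, callers that pass nothing keep the existing behaviour, so the change stays backward compatible.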
@fealho @pvk-developer Can we merge this? It's basically impossible to fit CTGAN on a large dataset because the Gaussian mixture is a huge bottleneck (even when using dozens of CPUs). This PR would make it possible to speed up the Gaussian mixture step. Thanks
@npatki not sure what you want to do with this?
Meanwhile the library code has changed so the PR should be updated.
For example, the _fit_continuous method now receives a pandas DataFrame, so np.random.choice() can be replaced by something like data = data.sample(self._max_gm_samples, replace=False, random_state=SEED).
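A rough sketch of that replacement, assuming pandas and NumPy are available. The subsample helper, the SEED value and the toy frame are made up for illustration. Note the guard: DataFrame.sample raises a ValueError when n exceeds the number of rows and replace=False, so small inputs should be passed through unchanged.

```python
import numpy as np
import pandas as pd

SEED = 0  # assumed fixed seed, only for reproducibility of the sketch


def subsample(data, max_gm_samples=None, seed=SEED):
    """Return at most max_gm_samples rows of data, drawn without replacement."""
    # Nothing to do if no cap is set or the data is already small enough.
    if max_gm_samples is None or len(data) <= max_gm_samples:
        return data
    return data.sample(n=max_gm_samples, replace=False, random_state=seed)


frame = pd.DataFrame({"value": np.arange(1000, dtype=float)})
small = subsample(frame, max_gm_samples=100)
untouched = subsample(frame, max_gm_samples=5000)
```

The Gaussian mixture would then be fitted on the subsample only, while the transform step can still be applied to the full column.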
Also, I wonder whether ClusterBasedNormalizer could optionally be replaced by a power transform, which might be faster (although it might affect the quality of the generated data); see https://github.com/sdv-dev/RDT/issues/613
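For reference, this is roughly what a power transform on a continuous column looks like with scikit-learn's PowerTransformer. It is only a sketch of the general technique on synthetic data, not something RDT or CTGAN does today, and the choice of Yeo-Johnson is an assumption.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic skewed column, shaped (n_samples, 1) as scikit-learn expects.
column = rng.lognormal(mean=0.0, sigma=1.0, size=(2000, 1))

# Yeo-Johnson fits a single lambda per column, then standardizes the result.
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
normalized = transformer.fit_transform(column)

# Generated values would be mapped back to the original scale like this.
restored = transformer.inverse_transform(normalized)
```

Unlike a Gaussian mixture, this fits a single parameter per column, so it cannot represent multimodal columns, which is where the quality concern comes from.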