DataTransformer init parameters

Open FlorentRamb opened this issue 4 years ago • 4 comments

This PR solves issue #7. It allows two things:

  1. fitting the Gaussian mixtures on a subsample (this helps scale to big datasets while losing only a little accuracy)
  2. passing init arguments to the DataTransformer through CTGANSynthesizer.fit (and so changing other parameters such as max_clusters).
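To illustrate the first point, here is a minimal sketch of fitting a mixture on a subsample. It is not the code in this PR: the function name `fit_mixture_on_subsample` and its defaults are hypothetical, and it assumes scikit-learn's `BayesianGaussianMixture` as the mixture model.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def fit_mixture_on_subsample(values, max_gm_samples=10_000, max_clusters=10, seed=0):
    """Fit a Bayesian Gaussian mixture on at most `max_gm_samples` values.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    values = np.asarray(values, dtype=float).reshape(-1, 1)

    # Only subsample when the column is larger than the cap.
    if len(values) > max_gm_samples:
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(values), size=max_gm_samples, replace=False)
        sample = values[idx]
    else:
        sample = values

    gm = BayesianGaussianMixture(
        n_components=max_clusters,
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=0.001,
        n_init=1,
        random_state=seed,
    )
    gm.fit(sample)
    return gm, len(sample)
```

The fitted model can still be applied to the full column afterwards (for example with `gm.predict`), so only the fitting cost depends on the subsample size.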

FlorentRamb avatar Apr 16 '21 15:04 FlorentRamb

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Apr 16 '21 15:04 CLAassistant

@fealho @pvk-developer Can we merge this? It's basically impossible to fit CTGAN on a large dataset because the Gaussian mixture is a huge bottleneck (even using dozens of CPUs). This PR would speed up the Gaussian mixture step. Thanks

candalfigomoro avatar Feb 09 '23 16:02 candalfigomoro

@npatki not sure what you want to do with this?

fealho avatar Feb 09 '23 18:02 fealho

Meanwhile, the library code has changed, so the PR needs to be updated.

For example, the _fit_continuous method now receives a pandas DataFrame, so np.random.choice() can be replaced by something like data = data.sample(self._max_gm_samples, replace=False, random_state=SEED).
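A rough sketch of that replacement, assuming a standalone helper rather than the actual _fit_continuous body (the name `subsample_column` and the `SEED` value are hypothetical). Note that `DataFrame.sample` with `replace=False` raises a `ValueError` when asked for more rows than exist, so the cap is checked first:

```python
import pandas as pd

SEED = 42  # hypothetical fixed seed, for reproducibility only


def subsample_column(data, max_gm_samples, seed=SEED):
    """Return at most `max_gm_samples` rows of `data`, sampled without replacement."""
    # No cap configured, or the data already fits under the cap: use it as-is.
    if max_gm_samples is None or len(data) <= max_gm_samples:
        return data
    return data.sample(n=max_gm_samples, replace=False, random_state=seed)
```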

Also, I wonder whether ClusterBasedNormalizer could optionally be replaced by a power transform, which might be faster (although it might affect the quality of the generated data); see https://github.com/sdv-dev/RDT/issues/613
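To make the idea concrete, a minimal sketch using scikit-learn's `PowerTransformer` (Yeo-Johnson) on a single column. This only shows the forward and inverse transform; it is an assumption about what such a replacement could look like, not anything RDT or CTGAN currently does, and `power_transform_column` is a hypothetical name.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer


def power_transform_column(values):
    """Map a skewed column to an approximately Gaussian one, and keep the
    fitted transformer so that generated values can be mapped back."""
    values = np.asarray(values, dtype=float).reshape(-1, 1)
    pt = PowerTransformer(method="yeo-johnson", standardize=True)
    transformed = pt.fit_transform(values)
    return pt, transformed


# Example: round-trip a skewed (log-normal) column.
rng = np.random.default_rng(0)
column = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
pt, transformed = power_transform_column(column)
restored = pt.inverse_transform(transformed).ravel()
```

Unlike a mixture model, this yields a single continuous value per row and no cluster indicator, so multimodal columns would be represented less faithfully.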

candalfigomoro avatar Feb 13 '23 13:02 candalfigomoro