
Adding a seed argument to the ClusterBasedNormalizer

Open AndresAlgaba opened this issue 3 years ago • 0 comments

Problem Description

Fitting the ClusterBasedNormalizer involves some randomness, and there is currently no argument to control it. This also causes reproducibility issues in the other SDV libraries, e.g., https://github.com/sdv-dev/CTGAN/issues/213.

Expected behavior

The BayesianGaussianMixture used to fit the distribution has a random_state argument that the ClusterBasedNormalizer could expose for reproducibility purposes (see https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html).
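As a rough illustration of the idea, the sketch below fits scikit-learn's BayesianGaussianMixture twice with the same random_state and checks that the fitted parameters match. It does not touch RDT itself: the helper name fit_mixture, the synthetic data, and the mixture settings are assumptions made for the example, not the transformer's actual code.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def fit_mixture(data, random_state=None):
    """Fit a Bayesian GMM on a 1-D column (hypothetical helper, not RDT code)."""
    bgm = BayesianGaussianMixture(
        n_components=10,  # illustrative settings, assumed for this sketch
        weight_concentration_prior_type='dirichlet_process',
        weight_concentration_prior=0.001,
        n_init=1,
        random_state=random_state,  # the argument this issue proposes to expose
    )
    bgm.fit(data.reshape(-1, 1))
    return bgm.means_.ravel(), bgm.weights_


# Synthetic bimodal column, generated with a fixed seed so the input is stable.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])

# Two independent fits with the same seed should yield the same parameters.
means_a, weights_a = fit_mixture(data, random_state=42)
means_b, weights_b = fit_mixture(data, random_state=42)
same_seed_matches = bool(
    np.allclose(means_a, means_b) and np.allclose(weights_a, weights_b)
)
print(same_seed_matches)
```

With random_state left as None, the k-means initialization draws from NumPy's global random state, so repeated fits are not guaranteed to agree.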

Additional context

I have only looked at the ClusterBasedNormalizer, but other transformers may be able to use the same approach for reproducibility purposes.

AndresAlgaba · Nov 02 '22 12:11