RDT
Adding a seed argument to the ClusterBasedNormalizer
Problem Description
Fitting the ClusterBasedNormalizer involves some randomness, so the fit is not reproducible from one run to the next. This also causes reproducibility issues in other SDV libraries, e.g., https://github.com/sdv-dev/CTGAN/issues/213.
Expected behavior
The BayesianGaussianMixture used to fit the distribution has a random_state argument (see https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html). The ClusterBasedNormalizer could accept a seed argument and pass it through as random_state to make fitting reproducible.
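To illustrate the idea, here is a minimal sketch that uses scikit-learn directly rather than RDT. The helper `fit_means`, its `seed` parameter, the toy data, and `n_components=5` are assumptions made for this example and are not taken from the RDT code. It only shows that fixing `random_state` makes two fits on the same data agree.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def fit_means(data, seed=None):
    """Fit a BayesianGaussianMixture on a 1-D array and return its component means.

    Hypothetical helper for illustration only; `seed` is forwarded to
    scikit-learn's `random_state` argument.
    """
    model = BayesianGaussianMixture(n_components=5, random_state=seed)
    model.fit(data.reshape(-1, 1))
    return model.means_


# Toy bimodal column (illustrative data, not from the issue).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)])

# Two fits with the same seed should give matching component means.
means_a = fit_means(data, seed=42)
means_b = fit_means(data, seed=42)
same = np.allclose(means_a, means_b)
```

With `seed=None`, the initialization draws from NumPy's global random state, so repeated fits are not guaranteed to match unless that state is fixed elsewhere.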
Additional context
I have only looked at the ClusterBasedNormalizer, but other methods may be able to use the same approach for reproducibility.