CTGAN
[HELP] Does CTGAN support reproducibility?
Environment details
If you are already running CTGAN, please indicate the following details about the environment in which you are running it:
- CTGAN version: 0.10.0
- Python version: 3.9.5
- Operating System: Ubuntu 20.04
Problem description
import numpy as np
import torch
from ctgan import CTGAN
from ctgan import load_demo
real_data = load_demo()
# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]
ctgan = CTGAN(epochs=1, verbose = True)
ctgan.set_random_state(123)
ctgan.fit(real_data, discrete_columns)
# set seed
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
SEED_VALUE = 42
np.random.seed(SEED_VALUE)
torch.manual_seed(SEED_VALUE)
# Create synthetic data
#ctgan.set_random_state(123)
synthetic_data1 = ctgan.sample(1000)
#ctgan.set_random_state(123)
synthetic_data2 = ctgan.sample(1000)
# ctgan.set_random_state(123)
# synthetic_data1 & synthetic_data2 comparison
if np.array_equal(synthetic_data1, synthetic_data2):
    print("synthetic_data1 & synthetic_data2 is equal.")
else:
    print("synthetic_data1 & synthetic_data2 is not equal.")
I have tried this many times, but synthetic_data1 and synthetic_data2 are still not equal.
Hi there @limhasic, I'm not able to reproduce this. With both 1 and 10 epochs, I was able to generate exactly the same data from 2 different CTGAN models.
from ctgan import CTGAN
from ctgan import load_demo
real_data = load_demo()
# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]
ctgan = CTGAN(epochs=1, verbose = True)
ctgan.set_random_state(123)
ctgan.fit(real_data, discrete_columns)
ctgan2 = CTGAN(epochs=1, verbose = True)
ctgan2.set_random_state(123)
ctgan2.fit(real_data, discrete_columns)
a = ctgan.sample(100)
b = ctgan2.sample(100)
a.equals(b)
^ The last line returns True and you can also visually inspect and see that the datasets are the same.
Is it possible to share your environment? I got False again.
I ran it on:
python 3.8.10
ctgan 0.9.1
numpy 1.24.4
torch 1.10.1+cu111
ubuntu 20.04...
I ran my code in Google Colab: https://colab.research.google.com/
Python 3.10.12
ctgan 0.10.0
numpy 1.25.2
torch 2.2.1
Ubuntu 18.04.3 LTS (I believe, based on what Google said for Colab)
A few things to consider:
- Have you tried this with SDV's CTGANSynthesizer instead of using CTGAN directly?
- When you inspect both dataframes, where are the differences? Specific rows? Specific column? Number of rows? Etc
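One way to answer the questions above is to summarize where two samples disagree. The sketch below is only an illustration: `summarize_differences` is a hypothetical helper, not part of CTGAN or SDV, and it assumes each sample has been converted to a list of row dictionaries (for a pandas DataFrame, `df.to_dict("records")` gives that shape). The toy values are made up.

```python
def summarize_differences(rows_a, rows_b):
    """Report row counts plus the row indices and columns where two tables differ."""
    summary = {
        "row_count_a": len(rows_a),
        "row_count_b": len(rows_b),
        "differing_rows": [],
        "differing_columns": set(),
    }
    # Compare rows position by position; rows beyond the shorter table
    # are only reflected in the row counts.
    for index, (row_a, row_b) in enumerate(zip(rows_a, rows_b)):
        columns = row_a.keys() | row_b.keys()
        changed = {col for col in columns if row_a.get(col) != row_b.get(col)}
        if changed:
            summary["differing_rows"].append(index)
            summary["differing_columns"] |= changed
    return summary


# Made-up toy rows standing in for two synthetic samples.
table_a = [
    {"age": 39, "sex": "Male"},
    {"age": 50, "sex": "Female"},
    {"age": 38, "sex": "Male"},
]
table_b = [
    {"age": 39, "sex": "Male"},
    {"age": 51, "sex": "Female"},
    {"age": 38, "sex": "Female"},
]

report = summarize_differences(table_a, table_b)
print(report)
```

If only a handful of rows or a single column shows up in the report, that points to a narrower cause than the whole sample being regenerated from a different random state.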
@limhasic after some more investigation, it turns out we actually don't support reproducibility when fitting a synthesizer. The reproducibility we do support right now is only during sampling (generating 2 samples from the same synthesizer with the same random state).
Out of curiosity, what's the motivation to have reproducibility during model fitting itself?
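To illustrate what sampling-level reproducibility means in practice, here is a minimal sketch that uses only Python's standard library. `ToySynthesizer` is a hypothetical stand-in, not the CTGAN implementation. It assumes the synthesizer keeps its own random state and that each `sample` call advances it. That assumption would explain why two consecutive `sample` calls in the original report differ while re-seeding before each call gives identical output.

```python
import random


class ToySynthesizer:
    """Hypothetical stand-in for a fitted synthesizer (NOT the CTGAN API)."""

    def __init__(self):
        self._rng = random.Random()

    def set_random_state(self, seed):
        # Replace the private generator with a freshly seeded one.
        self._rng = random.Random(seed)

    def sample(self, n):
        # Every call consumes random numbers, so the generator state advances.
        return [self._rng.random() for _ in range(n)]


synth = ToySynthesizer()

# Seed once, sample twice: the second call continues from the advanced state.
synth.set_random_state(123)
first = synth.sample(5)
second = synth.sample(5)
seeded_once_equal = first == second

# Re-seed before each call: both calls start from the same state.
synth.set_random_state(123)
third = synth.sample(5)
synth.set_random_state(123)
fourth = synth.sample(5)
reseeded_equal = third == fourth

print(seeded_once_equal, reseeded_equal)
```

In this toy model the seeded-once pair comes out unequal and the re-seeded pair comes out equal. If CTGAN behaves the same way, uncommenting the `ctgan.set_random_state(123)` calls before each `ctgan.sample(1000)` in the original script would be the thing to try.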
@srinify I am working on synthetic data, so I am very interested in evaluation metrics that compare original and synthetic data, and in generation methods.
However, when I generated data with CTGAN for evaluation, I got different results each time.
Since sampling was not reproducible, I started thinking about seed control for fitting.
Since it is still morning, I will test it in the Colab environment you sent.
Also:
- Have you tried this with SDV's CTGANSynthesizer instead of using CTGAN directly? -> I tried both while changing environments.
- When you inspect both dataframes, where are the differences? Specific rows? Specific column? Number of rows? Etc -> As far as I can tell, specific rows are different.
I will close this after checking sampling reproducibility in the latest version of CTGANSynthesizer.
Sampling is reproducible on simple data, but once the number of columns exceeds 25, reproducibility is lost. When I wake up, I observe the generator emitting different data.
Thanks for sharing context on your use case, @limhasic. I've opened this feature request to add reproducibility at the model fitting level with your use case: https://github.com/sdv-dev/SDV/issues/2022
DataCebo is a very small team and we use community interest to help us prioritize what to work on! So we hope more people will add their use cases to that issue over time.
Closing this issue out, as the software is working as intended right now.