CTGAN [HELP] CTGAN has Reproducibility?

Environment details

If you are already running CTGAN, please indicate the following details about the environment in which you are running it:

CTGAN version: 0.10.0
Python version: 3.9.5
Operating System: ubuntu 20.04

Problem description

import numpy as np
import torch

from ctgan import CTGAN
from ctgan import load_demo

real_data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGAN(epochs=1, verbose=True)
ctgan.set_random_state(123)

ctgan.fit(real_data, discrete_columns)

# Seed the global NumPy and PyTorch generators
seed = 42

np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# Ask cuDNN for deterministic kernels
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Create synthetic data
#ctgan.set_random_state(123)
synthetic_data1 = ctgan.sample(1000)
#ctgan.set_random_state(123)
synthetic_data2 = ctgan.sample(1000)
# ctgan.set_random_state(123) 

# synthetic_data1 & synthetic_data2 comparison
if np.array_equal(synthetic_data1, synthetic_data2):
    print("synthetic_data1 & synthetic_data2 is equal.")
else:
    print("synthetic_data1 & synthetic_data2 is not equal.")

I tried this a thousand times, but synthetic_data1 and synthetic_data2 are still not equal.

May 08 '24 00:05 limhasic

Hi there @limhasic, I'm not able to reproduce this. With both 1 and 10 epochs, I was able to generate exactly the same data from 2 different CTGAN models.

from ctgan import CTGAN
from ctgan import load_demo

real_data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGAN(epochs=1, verbose=True)
ctgan.set_random_state(123)
ctgan.fit(real_data, discrete_columns)

ctgan2 = CTGAN(epochs=1, verbose=True)
ctgan2.set_random_state(123)
ctgan2.fit(real_data, discrete_columns)

a = ctgan.sample(100)
b = ctgan2.sample(100)

a.equals(b)

^ The last line returns True and you can also visually inspect and see that the datasets are the same.

May 09 '24 19:05 srinify

Is it possible to share your environment? I got False again.

I ran it on:

python 3.8.10
ctgan 0.9.1
numpy 1.24.4
torch 1.10.1+cu111
ubuntu 20.04...

May 10 '24 00:05 limhasic

I ran my code in Google Colab: https://colab.research.google.com/

Python 3.10.12
ctgan 0.10.0
numpy 1.25.2
torch 2.2.1
Ubuntu 18.04.3 LTS (I believe, based on what Google said for Colab)

A few things to consider:

Have you tried this with SDV's CTGANSynthesizer instead of using CTGAN directly?
When you inspect both dataframes, where are the differences? Specific rows? Specific columns? Number of rows? etc.
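
One way to answer the second question is a small comparison helper. This is a minimal sketch that assumes pandas is installed; `summarize_differences` and the two tiny frames are hypothetical names made up for illustration, not part of CTGAN or SDV.

```python
import pandas as pd


def summarize_differences(a, b):
    """Report how many rows and which columns differ between two DataFrames."""
    # Frames with different shapes or column names cannot be compared cell by cell.
    if a.shape != b.shape or list(a.columns) != list(b.columns):
        return {"comparable": False, "shape_a": a.shape, "shape_b": b.shape}

    # Align on position so that differing index labels do not matter.
    a = a.reset_index(drop=True)
    b = b.reset_index(drop=True)

    # A cell differs when the values are unequal and not both missing.
    mask = (a != b) & ~(a.isna() & b.isna())

    return {
        "comparable": True,
        "differing_rows": int(mask.any(axis=1).sum()),
        "differing_columns": [col for col in mask.columns if mask[col].any()],
    }


# Hypothetical example data standing in for two synthetic samples.
left = pd.DataFrame({"age": [39, 50, 38], "sex": ["Male", "Male", "Female"]})
right = pd.DataFrame({"age": [39, 51, 38], "sex": ["Male", "Male", "Female"]})

report = summarize_differences(left, right)
print(report)
```

Passing the two sampled dataframes instead of `left` and `right` would show whether the mismatch is confined to a few rows or columns or spread across the whole sample.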

May 13 '24 13:05 srinify

@limhasic After some more investigation, it turns out we don't actually support reproducibility when fitting a synthesizer. The only reproducibility we support right now is during sampling (generating 2 samples from the same synthesizer with the same random state).
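
A toy illustration of that distinction, using only Python's standard library. `ToySampler` is a hypothetical stand-in, not CTGAN's implementation. It assumes the synthesizer keeps a private random state that advances on every `sample` call, which would also explain why seeding the global generators between calls changes nothing.

```python
import random


class ToySampler:
    """Hypothetical stand-in for a synthesizer that owns its random state."""

    def __init__(self):
        self._rng = random.Random()

    def set_random_state(self, seed):
        # Replace the private generator with a freshly seeded one.
        self._rng = random.Random(seed)

    def sample(self, n):
        # Every call draws from the private generator, so its state advances.
        return [self._rng.random() for _ in range(n)]


sampler = ToySampler()
sampler.set_random_state(123)

# Seeding the global generator does not touch the sampler's private state,
# so two consecutive calls continue the same stream and differ.
random.seed(42)
first = sampler.sample(5)
random.seed(42)
second = sampler.sample(5)
print(first == second)

# Resetting the sampler's own state before each call restarts the stream.
sampler.set_random_state(123)
third = sampler.sample(5)
sampler.set_random_state(123)
fourth = sampler.sample(5)
print(third == fourth)
```

Under that assumption, the equivalent step in the original report would be calling `ctgan.set_random_state(123)` immediately before each `ctgan.sample(1000)`, which is what the commented-out lines there appear to be trying.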

Out of curiosity, what's the motivation to have reproducibility during model fitting itself?

May 13 '24 18:05 srinify

@srinify I am working on synthetic data.

So I am very interested in evaluation metrics that compare original and synthetic data, and in generation methods.

However, when I generated data with CTGAN for evaluation, I got different results each time.

Since sampling did not appear to be reproducible, I started looking into seed control for fitting.

It is still morning here, so I will test it in the Colab environment you mentioned.

Also:

Have you tried this with SDV's CTGANSynthesizer instead of using CTGAN directly? -> I tried both while changing environments.
When you inspect both dataframes, where are the differences? Specific rows? Specific column? Number of rows? Etc -> To start with, specific rows are different, so I consider the outputs different.

May 14 '24 00:05 limhasic

Closing: I confirmed that sampling is reproducible in the latest version of CTGANSynthesizer.

May 14 '24 01:05 limhasic

Reproducibility holds for simple data, but it is lost once the number of columns exceeds 25. When I wake up, I see the generator emitting different data.

May 16 '24 00:05 limhasic

Thanks for sharing context on your use case, @limhasic. I've opened this feature request to add reproducibility at the model fitting level, citing your use case: https://github.com/sdv-dev/SDV/issues/2022

DataCebo is a very small team and we use community interest to help us prioritize what to work on! So we hope more people will add their use cases to that issue over time.

Closing this issue out, as the software is working as intended right now.

May 21 '24 13:05 srinify
