Doubts on the usage of conditional sampling

Open tonydp03 opened this issue 1 year ago • 3 comments

Environment details

If you are already running CTGAN, please indicate the following details about the environment in which you are running it:

  • CTGAN version: 0.7.5
  • Python version: 3.10.11
  • Operating System: MacOS 14.0 Sonoma

Problem description

I'm trying to generate samples from the example dataset adult.csv, conditioned on the column "sex" with value "Female". However, it doesn't seem to work.

What I already tried

I tried wrapping the value in double quotes ("...") or single quotes ('...'), and tried other categories and values, but the result doesn't change. The one-hot vector that is generated contains only zeros.

The command is the following:

python ctgan/__main__.py examples/csv/adult.csv examples/csv/synthetic_adult_cond.csv --save test_model_cond.p -d workclass,education,marital-status,occupation,relationship,race,sex,native-country,income --verbose --epochs 10 --sample_condition_column sex --sample_condition_column_value Female

(Note that the number of epochs is deliberately very low, just to test the command and reproduce the error.) The traceback is the following:

Traceback (most recent call last):
  File "[...]/CTGAN/ctgan/__main__.py", line 103, in <module>
    main()
  File "[...]/CTGAN/ctgan/__main__.py", line 91, in main
    sampled = model.sample(
  File "[...]/ctgan/lib/python3.10/site-packages/ctgan/synthesizers/base.py", line 50, in wrapper
    return function(self, *args, **kwargs)
  File "[...]/ctgan/lib/python3.10/site-packages/ctgan/synthesizers/ctgan.py", line 465, in sample
    condition_info = self._transformer.convert_column_name_value_to_id(
  File "[...]/ctgan/lib/python3.10/site-packages/ctgan/data_transformer.py", line 260, in convert_column_name_value_to_id
    raise ValueError(f"The value `{value}` doesn't exist in the column `{column_name}`.")
ValueError: The value `Female` doesn't exist in the column `sex`.

Any hint? Am I using it wrong?

tonydp03 avatar Oct 31 '23 17:10 tonydp03

Hi @tonydp03,

Nice to meet you. From looking at the raw CSV of input data, it seems that there is a leading space before every value. So in this case, the value you're conditioning on should be " Female" (with the leading space) instead of Female (with no space).
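One way to spot this kind of stray whitespace before training is to inspect the string columns with pandas. The sketch below is only an illustration: it uses a tiny inline sample as a stand-in for adult.csv, and the helper name `columns_with_padding` is hypothetical. `skipinitialspace=True` is a standard `pandas.read_csv` option that drops spaces following the delimiter.

```python
import io

import pandas as pd

# Tiny stand-in for adult.csv (assumption): values carry a space after each comma.
RAW = "age,sex,income\n39, Male, <=50K\n28, Female, >50K\n"


def columns_with_padding(df):
    """Return names of text columns where some value has leading/trailing whitespace."""
    padded = []
    for name in df.columns:
        col = df[name]
        is_text = pd.api.types.is_object_dtype(col) or pd.api.types.is_string_dtype(col)
        if not is_text:
            continue
        values = col.dropna().astype(str)
        if (values != values.str.strip()).any():
            padded.append(name)
    return padded


# Read as-is: the values keep the space that follows each comma.
raw_df = pd.read_csv(io.StringIO(RAW))

# Read with skipinitialspace=True: spaces after the delimiter are dropped.
clean_df = pd.read_csv(io.StringIO(RAW), skipinitialspace=True)

print(columns_with_padding(raw_df))
print(columns_with_padding(clean_df))
```

If the data is used unmodified, the condition value has to include the space, for example `--sample_condition_column_value " Female"`. If the file is cleaned first, the value without the space should match.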

BTW if your project allows for it, I would recommend accessing the CTGAN model through the SDV library. The SDV is a publicly available Python SDK that allows you to generate synthetic data using a variety of synthesizers such as CTGAN. It also provides convenient wrappers for data pre- and post-processing, should you want to modify that. And you can use conditional sampling with it too.

Some resources:

npatki avatar Oct 31 '23 21:10 npatki

Hi @npatki,

thanks for your answer. I simply assumed the test dataset could be used "out-of-the-box" and didn't notice the leading space at the beginning of the column value. I will give it another try, for sure.

Thanks for the resources too. For the moment, we were just testing the usage of CTGAN to generate synthetic data, as we were positively impressed by the results shown in the paper. In parallel, we're also testing the usage of the SDV library, as it seems an interesting tool.

tonydp03 avatar Nov 02 '23 09:11 tonydp03

One more thing: is it correct that, in `__main__.py`, the function fit is called even when the model is loaded? I expected it to be called only when I'm creating a new model that has not been trained yet.
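For illustration, the control flow expected here could be sketched as below. This is not the code in the repository: `DummyModel` and `load_or_train` are hypothetical names, and the stand-in class only records how many times fit was called.

```python
import json
import os
import tempfile


class DummyModel:
    """Hypothetical stand-in for a synthesizer; it only counts fit calls."""

    def __init__(self):
        self.fit_calls = 0

    def fit(self, data):
        self.fit_calls += 1

    def save(self, path):
        with open(path, "w") as handle:
            json.dump({"fit_calls": self.fit_calls}, handle)

    @classmethod
    def load(cls, path):
        model = cls()
        with open(path) as handle:
            model.fit_calls = json.load(handle)["fit_calls"]
        return model


def load_or_train(data, load_path=None):
    """Load a saved model if a path is given; otherwise create and fit a new one."""
    if load_path is not None and os.path.exists(load_path):
        return DummyModel.load(load_path)  # already trained: fit is skipped
    model = DummyModel()
    model.fit(data)  # fit runs only for a freshly created model
    return model


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_model_cond.p")
    first = load_or_train(data=[1, 2, 3])  # no saved model yet: trains
    first.save(path)
    second = load_or_train(data=[1, 2, 3], load_path=path)  # loads, no extra fit
```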

tonydp03 avatar Nov 02 '23 14:11 tonydp03

Hi @tonydp03, my apologies for the late reply.

The current recommended approach is to use CTGAN via the SDV library as described above. I can answer your usage questions and help you troubleshoot any issues with your project.

Unfortunately, I'm unable to go through any detailed lines of code with you. Please also note that some code in the repo may be deprecated or unsupported, so I would always recommend consulting the docs for the latest supported usage.

Thanks and please feel free to file a new issue with additional questions or feature requests.

npatki avatar Apr 17 '24 03:04 npatki