CTGAN
TypeError while ctgan.fit()
Environment Details
Google Colab
Error Description
TypeError Traceback (most recent call last)
6 frames
/usr/local/lib/python3.10/dist-packages/rdt/transformers/base.py in _set_seed(self, data)
    365 hash_value = self.columns[0]
    366 for value in data.head(5):
--> 367     hash_value += str(value)
    368
    369 hash_value = int(hashlib.sha256(hash_value.encode('utf-8')).hexdigest(), 16)
TypeError: unsupported operand type(s) for +=: 'int' and 'str'
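The traceback shows string concatenation onto `self.columns[0]`, which only works when the first column name is already a string. The following is a minimal sketch of that logic, based solely on the lines visible in the traceback. The function name `seed_from` and its arguments are hypothetical, and this is not rdt's actual implementation.

```python
import hashlib


def seed_from(first_column_name, values):
    """Hypothetical stand-in for the lines shown in the traceback."""
    hash_value = first_column_name
    for value in values:
        # Fails when first_column_name is not a str.
        hash_value += str(value)
    return int(hashlib.sha256(hash_value.encode('utf-8')).hexdigest(), 16)


# A string column name works and yields an integer seed.
seed = seed_from('1', [10, 20, 30])

# An integer column name raises the TypeError from the report.
try:
    seed_from(1, [10, 20, 30])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

If this sketch matches what happens inside the library, any DataFrame whose first column label is an integer would hit the same failure.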
Steps to reproduce
!pip install ctgan
import pandas as pd
from ctgan import CTGAN
data = pd.read_csv(...)
ctgan = CTGAN(epochs=100)
ctgan.fit(data)
I was facing the same problem. There may be a problem with your column names: they should be strings.
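One quick way to check for this before fitting is to list any column labels that are not strings. This is a sketch using plain pandas; the example frame and the name `non_string_columns` are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one integer label and one string label.
df = pd.DataFrame({1: [10, 20], 'b': [30, 40]})

# Collect every column label that is not a str.
non_string_columns = [col for col in df.columns if not isinstance(col, str)]

if non_string_columns:
    print(f'Non-string column names found: {non_string_columns}')
```

An empty list means every column name is already a string, so this particular cause can be ruled out.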
Hi @AT9991 and @aarishmaqsood, would either of you be able to share some CSV data that we can use to replicate this?
BTW instead of using the CTGAN library directly, I would highly recommend using the SDV library. You can access the CTGAN Synthesizer via SDV. Doing so will allow you to make use of additional features -- such as better data pre-processing, customizations such as constraints, and conditional sampling.
I actually wonder whether you would still encounter this bug in SDV, since there is a lot more data validation and checking we do there. Here is a tutorial that uses CTGAN via the SDV library.
@npatki Thank you for your response. I have fixed my problem. In the future I will use your suggested solution.
Great to hear @aarishmaqsood. Could you describe what fixed your problem? In case others have the same issue, I can refer them here. Thanks.
@npatki Here is the Colab link, where I have replicated the error and provided the solution as well. This problem occurs in SDV version 1.5.0. Below are the code snippets that illustrate both the problem and the solution.
Reproducing the Error
!pip install sdv==1.5.0
import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer
# Generate sample data
num_rows = 100
num_cols = 20
data = {i+1: np.random.randint(0, 100, size=num_rows) for i in range(num_cols)}
df = pd.DataFrame(data)
# create metadata from the DataFrame
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
# Initialize the synthesizer (this is where the error occurs)
synthesizer = CTGANSynthesizer(metadata=metadata)
Solution
# Convert column names to strings
df.columns = ['col_' + str(i) for i in range(1, len(df.columns) + 1)]
# Re-create metadata for the table with corrected column names
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
# Initialize the synthesizer with corrected metadata
synthesizer = CTGANSynthesizer(metadata=metadata)
Hi @aarishmaqsood, very much appreciate the detailed code and notebook.
Note that I have replicated this issue on the latest SDV (1.12.0) also. Here are a few things I discovered:
- The metadata auto-detection no longer works on SDV 1.12.0. I have filed an issue for it at SDV #1933
- The fit problem isn't isolated to CTGAN. None of the SDV synthesizers work with this type of data and all produce the same error. I have filed a generic issue at SDV #1935
Since we now have the above two issues filed in our main SDV library, I will mark this one as a duplicate.
In the meantime, for anyone else running into the issue, I suggest using @aarishmaqsood's simple workaround that converts the column names from integers to strings.
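If the original integer labels need to come back on the synthetic output, a variant of the workaround can keep a mapping from the new string names to the old labels. This is a sketch using plain pandas only. The helper `stringify_columns` is hypothetical, not part of SDV or CTGAN, and the synthesizer step is left as a comment.

```python
import pandas as pd


def stringify_columns(df):
    """Return a copy with string column names and a map back to the originals."""
    new_names = [str(col) for col in df.columns]
    if len(set(new_names)) != len(new_names):
        raise ValueError('column names collide after conversion to str')
    restore = dict(zip(new_names, df.columns))
    converted = df.copy()
    converted.columns = new_names
    return converted, restore


original = pd.DataFrame({1: [1, 2], 2: [3, 4]})
converted, restore = stringify_columns(original)

# ... build metadata from `converted`, fit a synthesizer, and sample here ...

# Apply the saved mapping to put the original labels back.
restored = converted.rename(columns=restore)
```

The collision check guards against frames that contain both `1` and `'1'` as labels, which would become indistinguishable after conversion.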
Thanks all for helping uncover this. For any related discussion, please feel free to comment on either of the SDV issues linked above.