Error when running a custom model using benchmark_single_table

Open T0217 opened this issue 1 year ago • 2 comments

Environment Details

  • SDGym version: 0.8.0
  • Python version: 3.11.5
  • Operating System: Windows 11

Error Description

Running the same code as in #321 raises the following error.

[image: screenshot of the error]

Steps to reproduce

import os
import shutil
import sdgym
from sdgym import create_single_table_synthesizer
from sdgym.synthesizers import (UniformSynthesizer,
                                GaussianCopulaSynthesizer,
                                TVAESynthesizer)
import warnings
warnings.filterwarnings('ignore')

synthesizers = [
    UniformSynthesizer,
    GaussianCopulaSynthesizer,
    TVAESynthesizer
]


# Custom synthesizer: CTGAN from YData (ydata-synthetic)
def ctgan_get_trained_synthesizer(data, metadata):
    from ydata_synthetic.synthesizers.regular import RegularSynthesizer
    from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

    ctgan_args = ModelParameters(batch_size=500, lr=2e-4, betas=(0.5, 0.9))
    train_args = TrainParameters(epochs=2)

    synthesizer = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)

    num_cols = [col for col, sdtype in metadata['columns'].items() if sdtype['sdtype'] in ['numerical', 'datetime']]
    cat_cols = [col for col, sdtype in metadata['columns'].items() if sdtype['sdtype'] == 'categorical']

    synthesizer.fit(data=data,
                    train_arguments=train_args,
                    num_cols=num_cols,
                    cat_cols=cat_cols)

    return synthesizer


def sample_from_synthesizer(synthesizer, n_rows):
    synthetic_data = synthesizer.sample(n_rows)
    return synthetic_data


YData_CTGANSynthesizer = create_single_table_synthesizer(
    get_trained_synthesizer_fn=ctgan_get_trained_synthesizer,
    sample_from_synthesizer_fn=sample_from_synthesizer,
    display_name='YData-CTGAN'
)


custom_synthesizers = [YData_CTGANSynthesizer]

# Remove the results folder if it is left over from a previous run
detailed_results_folder = r"C:\Users\18840\Desktop\result"

if os.path.isdir(detailed_results_folder):
    print('The folder for intermediate files already exists; deleting it.')
    shutil.rmtree(detailed_results_folder, ignore_errors=True)
    print('-' * 50)

results = sdgym.benchmark_single_table(
    synthesizers=synthesizers,
    custom_synthesizers=custom_synthesizers,
    show_progress=True,
    multi_processing_config={
        'package_name': 'multiprocessing',
        'num_workers': 8
    },
    sdv_datasets=['adult'],
    detailed_results_folder=detailed_results_folder
)

T0217 • Aug 04 '24 02:08

Hi there @T0217 👋 Do you mind updating SDGym and related libraries in our ecosystem to see if you're still running into this issue? We released some changes, so I'm always curious to validate if it's still relevant!

Second -- this is a bit challenging for us to debug because we aren't the authors of Custom:YData-CTGAN. Were you able to figure out the source of the error since posting this issue?

srinify • Sep 13 '24 00:09

Thanks for the feedback. I updated SDGym and re-ran the test. The TypeError with the YData CTGAN model persists. It appears to be caused by weak references: some attributes or components inside the model likely hold weak references, which pickle cannot serialize. Switching from pickle to dill for serialization, as suggested in #328, or using the CTGAN model from the SDV library instead, can resolve this problem. However, the issue reported in #321 remains unresolved regardless of whether the SDV or YData model is used.

T0217 • Sep 13 '24 12:09
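
The weak-reference failure described in the last comment can be reproduced with the standard library alone. The sketch below is only an illustration, not the YData model itself: `ModelWithWeakRef`, `Node`, and `find_unpicklable_attributes` are hypothetical names introduced here, and the assumption is that the real synthesizer holds a weak reference somewhere among its attributes. It shows that pickle rejects a weak reference, and how one might locate the offending attribute on an object before deciding whether to switch serializers.

```python
import pickle
import weakref


class Node:
    """Plain object used only as the target of a weak reference."""


class ModelWithWeakRef:
    """Hypothetical stand-in for a model that keeps a weak reference internally."""

    def __init__(self, target):
        self.epochs = 2                         # ordinary, picklable attribute
        self._target_ref = weakref.ref(target)  # pickle cannot serialize this


def find_unpicklable_attributes(obj):
    """Return {attribute name: error text} for attributes that pickle rejects."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as error:  # report any serialization failure
            failures[name] = f'{type(error).__name__}: {error}'
    return failures


node = Node()
model = ModelWithWeakRef(node)

# Pickling the whole object fails because of the weak reference inside it.
try:
    pickle.dumps(model)
    pickle_error = None
except (TypeError, pickle.PicklingError) as error:
    pickle_error = error

failures = find_unpicklable_attributes(model)
print('pickle.dumps(model) failed with:', pickle_error)
print('unpicklable attributes:', failures)
```

Running the same helper on a trained synthesizer returned by `ctgan_get_trained_synthesizer` may point to the attribute responsible, although it only inspects top-level attributes and will not descend into nested objects.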