[Gretel] Optimize config and training parameters for gretel-synthetics

Open • zredlined opened this issue 2 years ago • 1 comment

Problem Description

Some of the parameters in the gretel-synthetics implementation in SDGym can cause the model to fail during evaluation, and others can be tuned to generate better synthetic data (details below).

Expected behavior

In sdgym/synthesizers/gretel.py there are a few updates I'd recommend making:

  • Add learning_rate as a parameter, set default to 0.001 as per Gretel docs.
  • Add field_cluster_size as a tunable parameter
  • On batcher.generate_all_batch_lines, set a default max_invalid. For larger datasets, the default value of 1000 can cause the model to terminate unnecessarily during sampling.
  • epochs can be set to a standard value (e.g. 100), no need to set epochs based on the number of columns. Early stopping and a validation set can be used to prevent overfitting.
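
The list above could be sketched roughly as follows. This is only an illustration, not the actual code in sdgym/synthesizers/gretel.py: the `GretelParams` class, the `resolve_max_invalid` helper, the `field_cluster_size` default of 20, and the rule of scaling `max_invalid` with the row count are all assumptions made for the example. Only the `learning_rate` default of 0.001 and the fixed `epochs` value of 100 come from the issue text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GretelParams:
    """Hypothetical container for the tunables proposed in this issue."""

    epochs: int = 100               # fixed value, not derived from column count
    learning_rate: float = 0.001    # default suggested in the issue
    field_cluster_size: int = 20    # placeholder default, not from the issue
    max_invalid: Optional[int] = None  # None means "derive from dataset size"


def resolve_max_invalid(params: GretelParams, num_rows: int) -> int:
    """Pick the max_invalid value to pass to generate_all_batch_lines.

    Assumed rule for illustration: never go below 1000, and allow at least
    one invalid line per requested row on larger datasets.
    """
    if params.max_invalid is not None:
        return params.max_invalid
    return max(1000, num_rows)


if __name__ == "__main__":
    params = GretelParams()
    # A 50,000-row dataset gets a larger budget than the 1000 default.
    print(resolve_max_invalid(params, 50_000))
```

An explicit `max_invalid` always wins over the derived value, so a benchmark run can still pin the setting for comparison against the baseline config.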

Additional context

I'm happy to submit a PR with fixes and to compare against baseline config for tests, let me know if this would be okay. Cheers!

zredlined • May 13 '22 17:05

Hi @zredlined, thanks for taking a look at the Gretel implementation in SDGym! Please feel free to submit a PR with the updates you have proposed. You can link it to this issue and we'll be happy to take a look.

katxiao • May 18 '22 20:05