SDGym
[Gretel] Optimize config and training parameters for gretel-synthetics
Problem Description
Some of the parameters in the gretel-synthetics implementation in SDGym can cause the model to fail during evaluation, and can be optimized for generating synthetic data (details below).
Expected behavior
In sdgym/synthesizers/gretel.py there are a few updates I'd recommend making:
- Add `learning_rate` as a parameter, with a default of `0.001` as per the Gretel docs.
- Add `field_cluster_size` as a tunable parameter.
- On `batcher.generate_all_batch_lines`, set a default `max_invalid`. For larger datasets, the default value of `1000` can cause the model to terminate unnecessarily during sampling.
- `epochs` can be set to a standard value (e.g. 100); there is no need to set epochs based on the number of columns. Early stopping and a validation set can be used to prevent overfitting.
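To make the proposal concrete, here is a minimal sketch of the defaults described above. The kwargs mirror gretel-synthetics' configuration names, but the specific values for `field_cluster_size` and `max_invalid` are illustrative placeholders, not values from the SDGym code:

```python
# Proposed defaults for the SDGym Gretel synthesizer (illustrative sketch,
# not a drop-in patch for sdgym/synthesizers/gretel.py).
config_updates = {
    "learning_rate": 0.001,    # expose as a parameter; default per Gretel docs
    "epochs": 100,             # fixed standard value instead of column-based
    "early_stopping": True,    # with a validation set, guards against overfitting
    "field_cluster_size": 20,  # hypothetical default; make it tunable
}

# During sampling, raise max_invalid so large datasets don't terminate early,
# e.g. (hypothetical value):
# batcher.generate_all_batch_lines(max_invalid=5000)
```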
Additional context
I'm happy to submit a PR with these fixes and to compare the results against the baseline config for tests; let me know if this would be okay. Cheers!
Hi @zredlined, thanks for taking a look at the Gretel implementation in SDGym! Please feel free to submit a PR with the updates you have proposed. You can link it to this issue and we'll be happy to take a look.