[Gretel] Optimize config and training parameters for gretel-synthetics
Problem Description
Some of the parameters in the gretel-synthetics implementation in SDGym can cause the model to fail during evaluation, and others can be tuned to generate better synthetic data (details below).
Expected behavior
In `sdgym/synthesizers/gretel.py` there are a few updates I'd recommend making (a rough sketch follows the list):
- Add `learning_rate` as a parameter, with a default of `0.001` as per the Gretel docs.
- Add `field_cluster_size` as a tunable parameter.
- On `batcher.generate_all_batch_lines`, set a default `max_invalid`. For larger datasets, the default value of `1000` can cause the model to terminate unnecessarily during sampling.
- `epochs` can be set to a standard value (e.g. 100); there is no need to set epochs based on the number of columns. Early stopping and a validation set can be used to prevent overfitting.
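To make the suggestion concrete, here's a minimal sketch of how those settings could be wired up through the gretel-synthetics `DataFrameBatch` API. This is an illustration under assumptions, not a tested patch: the `fit_sample` wrapper is hypothetical, and the `field_cluster_size` and `max_invalid` values shown are placeholders to tune rather than recommended defaults.

```python
# A sketch of the proposed settings, assuming the gretel-synthetics
# DataFrameBatch API. The wrapper name and the illustrative values
# below are assumptions, not tested SDGym defaults.
from gretel_synthetics.batch import DataFrameBatch


def fit_sample(df, checkpoint_dir, num_rows,
               learning_rate=0.001,    # proposed default, per the Gretel docs
               field_cluster_size=15,  # proposed tunable; 15 is the library default
               max_invalid=5000,       # raised above the default of 1000 (illustrative)
               epochs=100):            # fixed standard value instead of column-based
    config_template = {
        "checkpoint_dir": checkpoint_dir,
        "overwrite": True,
        "epochs": epochs,
        "learning_rate": learning_rate,
    }
    # DataFrameBatch's batch_size is the number of fields trained per
    # batch, i.e. the "field cluster size" suggested above
    batcher = DataFrameBatch(df=df, config=config_template,
                             batch_size=field_cluster_size)
    batcher.create_training_data()
    batcher.train_all_batches()
    # A larger max_invalid keeps sampling on bigger datasets from
    # terminating prematurely.
    batcher.generate_all_batch_lines(num_lines=num_rows, max_invalid=max_invalid)
    return batcher.batches_to_df()
```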
Additional context
I'm happy to submit a PR with these fixes and to compare against the baseline config in tests; let me know if that would be okay. Cheers!
Hi @zredlined, thanks for taking a look at the Gretel implementation in SDGym! Please feel free to submit a PR with the updates you have proposed. You can link it to this issue and we'll be happy to take a look.