Consider initializing the generator with the synthesizer
Not too familiar with PyTorch, so let me know if this makes sense...
It seems like we're instantiating the model each time fit() is called https://github.com/DAI-Lab/CTGAN/blob/7aa29685045ffdba84bd87432354c133e05699e6/ctgan/ctgan_model.py#L458-L465
Would it make sense to do this once when we instantiate CTGANSynthesizer so we can e.g. look at the behavior of generated data as we train for more epochs?
Both have advantages. With the parameters in fit, you can more easily change out the training parameters and try different things. But your point is very valid as well.
I like the idea behind your suggestion, @kevinykuo, of moving the epochs argument to the fit method and allowing one to resume a previous fitting process (see https://github.com/DAI-Lab/CTGAN/issues/5#issuecomment-558707986).
However, doing this is a bit more complex than it looks, because the Generator instance needs to be passed the data_dim argument, which is deduced from the data that is currently only known during fit.
This means that we cannot simply move the creation of this instance to the __init__ method, but rather figure out another way to implement a "warm start" behavior.
One option would be to still create all the model instances inside the fit method, but only if they do not already exist.
However, if this is done, some checks also need to be added to make sure that the data passed to subsequent fit calls is still compatible with the existing model instances (if not the same).
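The lazy-creation pattern described above could look roughly like this. This is only a minimal sketch with plain-Python stand-ins, not the actual CTGAN classes; the names (`Synthesizer`, `_build_generator`) and the dict placeholder for the generator are all hypothetical:

```python
class Synthesizer:
    """Sketch of a warm-start fit(): hypothetical names, not the real CTGAN API."""

    def __init__(self):
        self._generator = None  # created lazily on the first fit() call
        self._data_dim = None

    def fit(self, data, epochs=10):
        data_dim = len(data[0])
        if self._generator is None:
            # First fit: deduce data_dim from the data, then build the model.
            self._data_dim = data_dim
            self._generator = self._build_generator(data_dim)
        elif data_dim != self._data_dim:
            # Subsequent fits must pass compatible data.
            raise ValueError(
                f"data has {data_dim} columns, but the model was "
                f"built for {self._data_dim}"
            )
        # ... run the training loop for `epochs` epochs here ...
        return self

    def _build_generator(self, data_dim):
        # Placeholder for the real Generator(...) construction.
        return {"data_dim": data_dim}
```

With this shape, a second `fit()` call on same-width data resumes training on the cached generator, while incompatible data raises instead of silently rebuilding the model.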
Got it, sounds like there are a couple of ways to proceed, dictated by what you think a "model" represents, i.e. whether it should be identified with the metadata of a dataset:

- (The option you mention above) Instantiate the generator at the first `fit` call and cache it.
- Require the user to pass some sort of metadata or a sample dataset at model instantiation.
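The second option could be sketched like this. Again, hypothetical names and a dict placeholder for the generator, not the actual CTGAN API:

```python
class MetadataSynthesizer:
    """Sketch of passing metadata at instantiation: hypothetical, not real CTGAN."""

    def __init__(self, metadata):
        # The column metadata is known up front, so data_dim can be
        # deduced here and the generator built once in __init__.
        self._columns = list(metadata)
        self._generator = {"data_dim": len(self._columns)}

    def fit(self, data, epochs=10):
        # Every fit() call resumes training on the existing generator,
        # after checking the data matches the declared metadata.
        if any(len(row) != len(self._columns) for row in data):
            raise ValueError("data does not match the metadata passed at init")
        # ... run the training loop for `epochs` epochs here ...
        return self
```

The trade-off is that the user must know the schema before seeing data, but the model identity is then unambiguous across fit calls.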
This is now mostly solved, correct? I think this can be closed.