Consider initiating generator with synthesizer

Open kevinykuo opened this issue 6 years ago • 4 comments

Not too familiar with pytorch so let me know if this makes sense...

It seems like we're instantiating the model each time fit() is called (see https://github.com/DAI-Lab/CTGAN/blob/7aa29685045ffdba84bd87432354c133e05699e6/ctgan/ctgan_model.py#L458-L465). Would it make sense to do this once, when we instantiate CTGANSynthesizer, so we can e.g. look at the behavior of the generated data as we train for more epochs?

kevinykuo, Nov 06 '19 16:11

Both have advantages. With the parameters in fit, you can more easily change out the training parameters and try different things. But your point is very valid as well.

Baukebrenninkmeijer, Nov 26 '19 15:11

I like the idea behind your suggestion, @kevinykuo, of moving the epochs argument to the fit method and allowing one to resume a previous fitting process (see https://github.com/DAI-Lab/CTGAN/issues/5#issuecomment-558707986).

However, doing this is a bit more complex than it looks, because the Generator instance needs to be passed the data_dim argument, which is deduced from the data that is currently only known during fit.

This means that we cannot simply move the creation of this instance to the __init__ method; we need to figure out another way to implement a "warm start" behavior.

One option would be to still create all the model instances inside the fit method, but only if they do not already exist. However, if this is done, some checks also need to be added to make sure that the data passed to subsequent fit calls is still compatible with the existing model instances (even if it is not the same data).
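A rough sketch of that idea, not CTGAN's actual code: `Generator` here is a stand-in that only records the dimension it was built for, `WarmStartSynthesizer` is a hypothetical name, and `data_dim` is simplified to the row length (in the real model it comes from the transformed data). The training loop is replaced by a counter.

```python
class Generator:
    """Stand-in for the real network; only remembers its data_dim."""

    def __init__(self, data_dim):
        self.data_dim = data_dim
        self.epochs_trained = 0


class WarmStartSynthesizer:
    """Hypothetical synthesizer that creates its generator lazily and reuses it."""

    def __init__(self):
        # data_dim is unknown until fit() sees data, so nothing is built yet.
        self.generator = None

    def fit(self, rows, epochs=1):
        if not rows:
            raise ValueError("rows must not be empty")
        data_dim = len(rows[0])  # simplification: one dimension per column

        if self.generator is None:
            # First call: build the model from the data.
            self.generator = Generator(data_dim)
        elif self.generator.data_dim != data_dim:
            # Later calls: refuse data that does not match the cached model.
            raise ValueError(
                f"expected data_dim={self.generator.data_dim}, got {data_dim}"
            )

        # Placeholder for the real training loop.
        self.generator.epochs_trained += epochs
        return self
```

With this shape, calling `fit` twice on compatible data keeps training the same generator, and a call with a different number of columns fails early instead of silently rebuilding the model.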

csala, Nov 26 '19 16:11

Got it, sounds like there are a couple of ways to proceed, dictated by what you think a "model" represents, i.e., whether it should be identified with the metadata of a dataset.

  1. (The option you mention above) Instantiate the generator at the first fit call and cache it.
  2. Require the user to pass some sort of metadata or a sample dataset at model instantiation.
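For comparison, option 2 might look roughly like this. Everything here is illustrative: `EagerSynthesizer`, `from_sample`, and the stand-in `Generator` are hypothetical names, and `data_dim` is again simplified to the number of columns.

```python
class Generator:
    """Stand-in for the real network; only remembers its data_dim."""

    def __init__(self, data_dim):
        self.data_dim = data_dim
        self.epochs_trained = 0


class EagerSynthesizer:
    """Hypothetical synthesizer tied to a dataset shape at construction time."""

    def __init__(self, data_dim):
        # The generator exists as soon as the synthesizer does.
        self.data_dim = data_dim
        self.generator = Generator(data_dim)

    @classmethod
    def from_sample(cls, sample_rows):
        # Derive the shape from a sample dataset instead of explicit metadata.
        if not sample_rows:
            raise ValueError("sample_rows must not be empty")
        return cls(data_dim=len(sample_rows[0]))

    def fit(self, rows, epochs=1):
        # Every row must match the shape the model was built for.
        if any(len(row) != self.data_dim for row in rows):
            raise ValueError(f"every row must have {self.data_dim} values")
        # Placeholder for the real training loop.
        self.generator.epochs_trained += epochs
        return self
```

The trade-off is that the model is identified with one dataset shape from the start, which makes resuming training trivial but means a new synthesizer is needed for differently shaped data.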

kevinykuo, Nov 26 '19 17:11

This is now mostly solved, correct? I think this can be closed.

Baukebrenninkmeijer, Dec 03 '20 12:12