
Refactor experiments / resume crashed trainings

Open msperber opened this issue 6 years ago • 5 comments

It would be nice to do the following:

  • Make config files and saved model files fully compatible. They are currently almost, but not quite, identical: config files are dictionaries of !Experiment objects, whereas saved models are a single !Experiment.
  • Keep a state in every component (especially TrainingRegimen, but possibly also preprocessing and evaluation) that is stored as part of the model whenever the model is saved. If the state is non-zero at initialization time, we fast-forward to it (a sketch follows this list).
  • Write out not only the best DyNet parameters, but also the most recent ones.
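A minimal sketch of what such a per-component state might look like; the names used here (TrainingState, fast_forward_to, TrainingRegimenSketch) are illustrative placeholders, not existing xnmt API:

# Hypothetical sketch: a serializable state kept by a component such as the
# training regimen, saved with the model and restored at initialization time.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingState:
  epoch_num: int = 0          # completed epochs
  steps_since_start: int = 0  # minibatches processed so far
  sent_num: int = 0           # position within the current epoch
  best_dev_score: Optional[float] = None

class TrainingRegimenSketch:
  def __init__(self, training_state: Optional[TrainingState] = None):
    # A fresh (all-zero) state means "start from scratch"; anything else
    # means the experiment is being resumed.
    self.training_state = training_state or TrainingState()

  def run_training(self):
    if self.training_state.steps_since_start > 0:
      self.fast_forward_to(self.training_state)
    # ... regular training loop would continue from here ...

  def fast_forward_to(self, state: TrainingState):
    # Skip ahead in the data iterator (and any schedules) so that training
    # continues exactly where the saved experiment left off.
    pass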

This would allow the following:

  • Via the state, we can resume an experiment from the point it was last saved (in case it crashed, or in case one wants to kill it and resume on a different machine, etc.).
  • Resuming could simply be achieved by running xnmt /location/to/saved_model.mod; if the saved experiment had already completed, XNMT would exit without doing anything (sketched after this list).
  • The ExperimentSeries could be extended with components that perform automatic hyperparameter optimization, such as Bayesian optimization.
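A rough sketch of that resume logic, under the assumption that the saved experiment carries such a state; the Experiment class below is a stand-in, not xnmt's real class:

# Hypothetical sketch of the decision behind "xnmt /location/to/saved_model.mod".
from dataclasses import dataclass

@dataclass
class Experiment:
  state: dict  # e.g. {"finished": True} once all steps have completed

  def run_remaining_steps(self):
    # Each component would fast-forward to its saved state and continue.
    print("fast-forwarding to", self.state, "and continuing")

def resume(experiment: Experiment):
  if experiment.state.get("finished"):
    print("Saved experiment already completed; exiting without doing anything.")
    return
  experiment.run_remaining_steps()

# Example: an experiment that crashed partway through training.
resume(Experiment(state={"finished": False, "epoch": 3, "sent": 12000}))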

For a single experiment, a config file would probably look like this:

!SimpleExperiment
  preproc: ..
    state: ..
  train: ..
    state: .. # the state is usually not given in the config file,
              # but included when the model is written out
  evaluate: ..
  state: ..

Or, for a series of experiments (this is closer to the current config files, which always contain a series of experiments):

!ExperimentSeries
  experiments:
  - !SimpleExperiment
      state: ..
  - !SimpleExperiment ..

I believe this would be relatively easy to do. The main thing I'm not sure how best to handle is that we would no longer have experiment names, so {EXP} might no longer work.

msperber avatar Apr 27 '18 06:04 msperber

I agree that this would be nice. And actually, I don't see why we couldn't have experiment names. I think we could have two options for the syntax:

!SimpleExperiment
  name: my_name
  ...

or

my_name: !SimpleExperiment
  ...

If we choose the latter, the serializer could check that the top-level dictionary has exactly one entry and that this entry is of an experiment type.
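A minimal sketch of that check, assuming the config has already been loaded into a dict; ExperimentBase is a placeholder for whatever common base class the experiment types would share:

class ExperimentBase:
  """Placeholder for a common base class of !SimpleExperiment etc."""

def check_top_level(config: dict):
  # Enforce the "my_name: !SimpleExperiment" syntax: exactly one top-level
  # entry, and its value must be an experiment.
  if len(config) != 1:
    raise ValueError(f"expected exactly one top-level experiment, got {len(config)} entries")
  name, experiment = next(iter(config.items()))
  if not isinstance(experiment, ExperimentBase):
    raise ValueError(f"top-level entry '{name}' is not an experiment")
  return name, experiment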

neubig avatar Apr 27 '18 15:04 neubig

Is this fixed now? I'm not sure...

neubig avatar Jul 17 '18 07:07 neubig

No, I think nothing has been done along these lines yet.

msperber avatar Jul 17 '18 07:07 msperber

Making config files and saved experiments compatible has been implemented by #491.

Some thoughts on what would need to be done to support resuming crashed experiments:

  • Make .mod and .data files parallel. Currently there is only one .mod file but several .data dirs, because the assumption was that .mod files don't change during the experiment while the DyNet weights in .data do. If we keep a state in the experiment, .mod files will also change over the course of the experiment, so each .mod file should be paired with exactly one .data dir.
  • Save the experiment at every checkpoint, whether or not the dev score improved. For this, we should write out best and last versions of the experiment: when resuming, we load last; when loading via !LoadSerialized, we load best. Once the experiment has finished, last could be deleted (see the sketch after this list).
  • Training tasks already keep a training state, but would need to support having this state passed into __init__. For most fields this would be easy, but starting from the correct sentence and in the correct order requires some thought, because checkpoints don't necessarily correspond to epoch boundaries, etc.
  • Crashes during preproc or evaluate steps are probably not critical, but would be easy to handle. The main problem here could be that writing out the model is slow right now and shouldn't be done too often.
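A sketch of what the best/last checkpointing could look like; save_experiment() is a placeholder for whatever routine writes a .mod file paired with its .data dir:

import os

def save_experiment(experiment, path):
  pass  # placeholder for writing the .mod file and its matching .data dir

def checkpoint(experiment, dev_score, best_score, out_prefix):
  # Always refresh the "last" snapshot so a crashed run can be resumed.
  save_experiment(experiment, out_prefix + ".last.mod")
  # Only refresh "best" when the dev score improved; !LoadSerialized would load this one.
  if best_score is None or dev_score > best_score:
    save_experiment(experiment, out_prefix + ".best.mod")
    best_score = dev_score
  return best_score

def finalize(out_prefix):
  # Once the experiment has finished, the "last" snapshot can be deleted.
  last_path = out_prefix + ".last.mod"
  if os.path.exists(last_path):
    os.remove(last_path)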

msperber avatar Aug 03 '18 09:08 msperber

I think having one model per checkpoint is very reasonable; for example, TensorFlow does the same thing. Or, if the concern is disk space, maybe we can add a flag to turn this off, with the consequence that the training can't be resumed.

philip30 avatar Aug 03 '18 10:08 philip30