
How to reproduce the same training process when using "train_from"

lemon234071 opened this issue 4 years ago • 4 comments

Dear,

My model training was forcibly stopped by an accident, so I used the option "train_from" to continue training from the checkpoint. But the result is different from a training run that goes from start to finish without stopping:

  1. The patience counter for "early stop" is not saved into the checkpoint, so its state is lost on resume (see the sketch after this list).
  2. The order of batches provided by train_iter is different when training from a checkpoint. (With train_from, iteration starts over from the beginning of the dataset, so the data is very different from where it stood at the step of the saved checkpoint.)
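A minimal sketch of what fixing point 1 could look like: persist the early-stopping counters alongside the model weights. The function names and the `current_tolerance`/`best_loss` attributes are hypothetical, not OpenNMT-py's actual checkpoint layout or EarlyStopping API.

```python
import torch

# Hypothetical checkpoint helpers; OpenNMT-py's real layout differs.
def save_checkpoint(path, model, optim, step, early_stopper):
    torch.save({
        "model": model.state_dict(),
        "optim": optim.state_dict(),
        "step": step,
        # Persist early-stopping state so patience is not reset on resume.
        "early_stopping": {
            "current_tolerance": early_stopper.current_tolerance,
            "best_loss": early_stopper.best_loss,
        },
    }, path)

def load_checkpoint(path, model, optim, early_stopper):
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model"])
    optim.load_state_dict(ckpt["optim"])
    es = ckpt.get("early_stopping")
    if es is not None:  # checkpoints written before the fix lack this key
        early_stopper.current_tolerance = es["current_tolerance"]
        early_stopper.best_loss = es["best_loss"]
    return ckpt["step"]
```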

Note that I fixed all random seeds.

So it would be very convenient if such a reproduction mechanism could be added to the code base. Any help will be greatly appreciated.

lemon234071 · Feb 02 '21 03:02

This is roughly what I intended with #1826, but it's not compatible with all the changes we introduced in 2.0. It should be possible to introduce such a mechanism though: store some counter to keep track of where we are in each dataset. It would never be perfect, as there is quite a gap between when the data is read and when it is actually seen in a training batch, because of the pooling mechanism.
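A sketch of that counter idea, assuming a hypothetical reader class (not OpenNMT-py's actual iterator code): each corpus tracks how many examples it has yielded, that count goes into the checkpoint, and on resume the reader fast-forwards past it.

```python
import itertools

# Hypothetical resumable reader, not OpenNMT-py's dynamic iterator classes.
class ResumableCorpusReader:
    def __init__(self, examples, name):
        self.examples = examples  # any re-iterable collection of examples
        self.name = name
        self.count = 0            # examples yielded so far (saved in checkpoint)

    def __iter__(self):
        # Fast-forward past the examples already consumed before the crash.
        # (Epoch handling is omitted: reset self.count when a pass ends.)
        for ex in itertools.islice(iter(self.examples), self.count, None):
            self.count += 1
            yield ex

    def state(self):
        return {self.name: self.count}

    def load_state(self, state):
        self.count = state.get(self.name, 0)
```

This is exactly where the imperfection mentioned above shows up: examples that were read into the pooling buffer but never emitted in a training batch still count as consumed, so a resumed run can silently skip them.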

francoishernandez · Feb 02 '21 08:02

Thanks for your efforts.

lemon234071 · Feb 02 '21 11:02

Regarding this issue, I implemented the following: a new option -dryrun_steps xxxxx, which batches for xxxxx steps without actually training, then starts training at step xxxxx+1. That restarts the training at the exact point in the data where it stopped. The only issue is that it is very, very slow to reach xxxxx+1. I'm open to better ideas, other than storing the index in each dataset.
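In pseudocode, the dry-run idea amounts to consuming the iterator without doing the math. A hedged sketch follows; the option name comes from the comment above, while the training loop itself is simplified and hypothetical, not the actual implementation:

```python
def train_loop(opt, train_iter, model, optim):
    for step, batch in enumerate(train_iter, start=1):
        if step <= opt.dryrun_steps:
            # Dry run: build and discard batches so the iterator's shuffling,
            # bucketing, and pooling state advances exactly as in the
            # interrupted run, but skip forward/backward entirely.
            continue
        loss = model(batch)   # hypothetical: forward pass returning the loss
        loss.backward()
        optim.step()
        optim.zero_grad()
```

The slowness follows directly from this design: tokenization, numericalization, and batching still run for all xxxxx skipped steps; only the GPU work is avoided.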

vince62s · Jan 27 '23 14:01

I made an attempt at this feature in PR https://github.com/OpenNMT/OpenNMT-py/pull/2520. The idea is to skip to the saved text line in each corpus when training is resumed.
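The line-skipping idea, in a minimal form (a hypothetical helper, not the code in the PR): remember the line index reached in each corpus file, save it in the checkpoint, and skip that many lines when the file is reopened.

```python
# Hypothetical helper illustrating the approach, not the PR's actual code.
def read_corpus(path, start_line=0):
    """Yield (line_number, text) pairs, skipping lines seen before resume."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f):
            if lineno < start_line:
                continue  # consumed before the interruption
            yield lineno, line.rstrip("\n")

# On resume, the saved index would come from the checkpoint, e.g.:
# for lineno, sent in read_corpus("train.src", start_line=ckpt["corpus_line"]):
#     ...
```

Compared with the dry-run approach, skipping raw lines is fast because the skipped examples are never tokenized or batched, at the cost of only approximately reproducing the batch order (the pooling state is not restored).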

PC91 · May 04 '24 22:05