OpenNMT-py
Fix error to load data at the correct position when resuming from a checkpoint
This PR adds a mechanism to resume training from saved positions in the corpora. The idea is to keep a cursor for each corpus and save its current text line (the batch variable `cid_line_number`) in the checkpoint file.
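The cursor idea can be sketched as follows. This is a minimal illustration, not the actual OpenNMT-py code: the function name and signature are hypothetical, but it shows the core trick of fast-forwarding a corpus file to a saved line number instead of replaying all previously consumed batches.

```python
def open_corpus_at(path, saved_line_number=0):
    """Open a corpus file and skip ahead to the line saved in the checkpoint.

    Hypothetical helper: skipping line by line up to the saved cursor is much
    cheaper than re-batching and discarding hundreds of thousands of batches.
    """
    f = open(path, "r", encoding="utf-8")
    for _ in range(saved_line_number):
        f.readline()  # consume lines already seen before the checkpoint
    return f
```

With `saved_line_number=0` this degrades to the old behavior of reading from the beginning.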
The following features are implemented:
- Add a new parameter `resume_from_corpora`: when `True`, training resumes from the last saved text line of each corpus; otherwise, training resumes from the beginning of all corpora.
- Update the calculation of `cid_line_number` to get the text line number directly from the `exfile_open` function.
- Conditions to resume training from the saved text lines:
  - The last text lines of all corpora must be saved in the checkpoint (for backward compatibility with existing versions).
  - All corpus names in the config and in the saved checkpoint must match.
  - Quick checksum: for each corpus in the config, its saved text line cannot exceed its total number of lines.
- Communication between the trainer and the model saver to handle corpus cursors.
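The three resume conditions above can be sketched as a single predicate. This is an illustrative sketch only: the helper name and the checkpoint/config data layout are assumptions, not the real OpenNMT-py structures.

```python
def can_resume_from_corpora(config_corpora, checkpoint_cursors, corpus_line_counts):
    """Return True only if all conditions for resuming from saved cursors hold.

    Hypothetical signature:
      config_corpora      dict: corpus name -> corpus config
      checkpoint_cursors  dict: corpus name -> saved line number,
                          or None for checkpoints from older versions
      corpus_line_counts  dict: corpus name -> total number of lines
    """
    # 1. The checkpoint must contain saved text lines (backward compatibility:
    #    older checkpoints have none, so we fall back to the beginning).
    if checkpoint_cursors is None:
        return False
    # 2. Corpus names in the config and in the checkpoint must match exactly.
    if set(config_corpora) != set(checkpoint_cursors):
        return False
    # 3. Quick checksum: no saved line may exceed its corpus length.
    return all(checkpoint_cursors[name] <= corpus_line_counts[name]
               for name in config_corpora)
```

If the predicate is false, training falls back to resuming from the beginning of all corpora, matching the behavior described above.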
The following scenarios are tested:
- [x] Backward compatibility test: resume from the beginning when using a checkpoint from an existing version (with no saved text lines).
- [x] Resume from a saved checkpoint with saved text lines:
  - [x] When `resume_from_corpora=True`:
    - [x] Some corpora in the config do not match (resume from the beginning).
    - [x] Some saved text lines exceed the total number of text lines (resume from the beginning).
    - [x] All checks pass (resume from the saved text lines).
  - [x] When `resume_from_corpora=False` (resume from the beginning).
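For reference, usage might look like the following config fragment. This is a hedged sketch: `train_from` is the existing OpenNMT-py option for restarting from a checkpoint, `resume_from_corpora` is the parameter introduced by this PR, and the corpus names, paths, and exact YAML layout here are placeholders.

```yaml
# Hypothetical config sketch; corpus names and paths are placeholders.
data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
train_from: models/model_step_250000.pt
resume_from_corpora: true   # new: restart each corpus at its saved text line
```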
This is doing the same thing as what is described here: https://github.com/OpenNMT/OpenNMT-py/issues/2006#issuecomment-1406570959. The issue is that if a checkpoint is at 250,000 steps and you want to continue, it takes way too long to iterate over those batches. This is the reason why memorizing the index of each dataset and setting the cursor at this index is more efficient.
Thanks @vince62s! The code is updated. Could you have a look and merge it into the main code base?