
Fix error to load data at the correct position when resuming from a checkpoint

Open PC91 opened this issue 1 year ago • 2 comments

This PR adds a mechanism to resume training from saved positions in the corpora. The idea is to keep a cursor for each corpus and save its current text line number (the batch variable `cid_line_number`) in the checkpoint file.

The following features are implemented:

  • Add a new parameter resume_from_corpora: when True, training resumes from the last saved text line of each corpus; otherwise, training resumes from the beginning of all corpora.
  • Update the calculation of cid_line_number to read the line number directly from the exfile_open function.
  • Conditions for resuming training from the saved text lines:
    • The last text lines of all corpora must be present in the checkpoint (for backward compatibility with existing versions).
    • All corpus names in the config and in the saved checkpoint must match.
    • Quick sanity check: for each corpus in the config, the saved line number may not exceed the corpus's total number of lines.
  • Communication between the trainer and the model saver to handle the corpus cursors.
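The save/validation logic above can be sketched roughly as follows. This is a minimal illustration, not the actual code of the PR; the function names (`save_corpus_cursors`, `can_resume_from_corpora`) and the checkpoint layout are assumptions:

```python
def save_corpus_cursors(checkpoint: dict, cursors: dict) -> None:
    """Store the last-read line number of each corpus in the checkpoint dict.

    `cursors` maps corpus name -> line number, e.g. {"corpus_a": 1200}.
    """
    checkpoint["corpus_cursors"] = dict(cursors)


def can_resume_from_corpora(checkpoint: dict, corpora: dict) -> bool:
    """Check the conditions under which training may resume mid-corpus.

    `corpora` maps corpus name -> total number of lines in the file.
    """
    cursors = checkpoint.get("corpus_cursors")
    if cursors is None:
        # Backward compatibility: older checkpoints carry no saved cursors,
        # so we fall back to resuming from the beginning.
        return False
    if set(cursors) != set(corpora):
        # Corpus names in the config and in the checkpoint must match.
        return False
    # Sanity check: a saved line number may not exceed the corpus length.
    return all(cursors[name] <= corpora[name] for name in corpora)
```

If any check fails, the safe fallback described in the PR is simply to restart iteration from the beginning of all corpora.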

The following scenarios are tested:

  • [x] Backward compatibility test: resume from the beginning when loading a checkpoint from an existing version (with no saved text lines).
  • [x] Resume from a checkpoint with saved text lines:
    • [x] When resume_from_corpora=True:
      • [x] Some corpus names in the config do not match (resume from the beginning).
      • [x] Some saved line numbers exceed the total number of text lines (resume from the beginning).
      • [x] All checks pass (resume from the saved text lines).
    • [x] When resume_from_corpora=False (resume from the beginning).

PC91 avatar Nov 19 '23 19:11 PC91

This is doing the same thing as what is described here: https://github.com/OpenNMT/OpenNMT-py/issues/2006#issuecomment-1406570959. The issue is that if the checkpoint is at 250 000 steps and you want to continue, it takes far too long to iterate over those batches. This is why memorizing the index of each dataset and setting the cursor at that index is more efficient.
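The efficiency point can be illustrated with a small sketch (a hypothetical helper, not code from this PR): instead of rebuilding and consuming every batch up to the checkpointed step, the loader can reopen each corpus and skip straight to the saved line number:

```python
from itertools import islice


def open_at_line(path: str, line_number: int):
    """Open a corpus file and position the cursor after `line_number` lines.

    Skipping lines directly is O(lines read once), whereas replaying the
    training pipeline to a 250 000-step checkpoint re-does batching,
    tokenization, etc. for every batch already seen.
    """
    f = open(path, encoding="utf-8")
    # itertools "consume" recipe: advance the iterator by exactly
    # `line_number` items without keeping them in memory.
    next(islice(f, line_number, line_number), None)
    return f
```

For example, `open_at_line(path, 2)` on a three-line file leaves the next `readline()` returning the third line.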

vince62s avatar Nov 20 '23 07:11 vince62s

> This is doing the same thing as what is described here: #2006 (comment). The issue is that if the checkpoint is at 250 000 steps and you want to continue, it takes far too long to iterate over those batches. This is why memorizing the index of each dataset and setting the cursor at that index is more efficient.

Thanks @vince62s! The code is updated. Could you have a look and merge it into the main code base?

PC91 avatar Mar 31 '24 20:03 PC91