
Add support for automatically continuing training from earlier runs of a Task (fixes #270)

Open bhearsum opened this issue 9 months ago • 0 comments

Aside from the included tests, I tested this by hand. What I did was:

  • Start a train-backwards job: https://firefox-ci-tc.services.mozilla.com/tasks/bNP6s4FaRwaU7Bz8gxgJ-Q/runs/0
  • Wait for it to checkpoint
  • Simulate a spot termination, which was handled correctly; the run ended up with 20 artifacts
  • The Task automatically reran: https://firefox-ci-tc.services.mozilla.com/tasks/bNP6s4FaRwaU7Bz8gxgJ-Q/runs/1, and picked up the previous artifacts. From the log:
[task 2024-05-07T20:54:35.749Z] INFO:root:run_id > 0, attempting to resume training from an earlier run...
[task 2024-05-07T20:54:35.880Z] INFO:root:Run 0 appears to have the artifacts we need! Downloading them...
[task 2024-05-07T20:54:35.880Z] INFO:root:Fetching public/build/config.opustrainer.yml...
[task 2024-05-07T20:54:36.123Z] INFO:root:Fetching public/build/config.opustrainer.yml.state...
[task 2024-05-07T20:54:36.319Z] INFO:root:Fetching public/build/devset.out...
[task 2024-05-07T20:54:36.584Z] INFO:root:Fetching public/build/model.npz...
[task 2024-05-07T20:54:38.649Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz...
[task 2024-05-07T20:54:40.588Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz.decoder.yml...
[task 2024-05-07T20:54:40.770Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz...
[task 2024-05-07T20:54:42.583Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz.decoder.yml...
[task 2024-05-07T20:54:42.821Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz...
[task 2024-05-07T20:54:44.517Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz.decoder.yml...
[task 2024-05-07T20:54:44.758Z] INFO:root:Fetching public/build/model.npz.decoder.yml...
[task 2024-05-07T20:54:44.990Z] INFO:root:Fetching public/build/model.npz.optimizer.npz...
[task 2024-05-07T20:54:51.438Z] INFO:root:Fetching public/build/model.npz.progress.yml...
[task 2024-05-07T20:54:51.634Z] INFO:root:Fetching public/build/model.npz.yml...
[task 2024-05-07T20:54:51.834Z] INFO:root:Fetching public/build/opustrainer.log...
[task 2024-05-07T20:54:52.042Z] INFO:root:Fetching public/build/train.log...
[task 2024-05-07T20:54:52.243Z] INFO:root:Fetching public/build/valid.log...
[task 2024-05-07T20:54:52.458Z] INFO:root:Fetching public/build/vocab.spm...
<a bit further down>
[task 2024-05-07T20:57:19.459Z] [2024-05-07 20:57:19] [training] Master parameters and optimizers restored from training checkpoint /home/ubuntu/tasks/task_171511512470744/artifacts/model.npz and /home/ubuntu/tasks/task_171511512470744/artifacts/model.npz.optimizer.npz
  • I simulated a spot termination on that run quite quickly, and it uploaded the artifacts it had just downloaded. (These are downloaded into the artifacts directory when we continue training, so even though this run didn't checkpoint, it essentially re-uploaded run #0's work.)
  • Run #2 also started automatically: https://firefox-ci-tc.services.mozilla.com/tasks/bNP6s4FaRwaU7Bz8gxgJ-Q/runs/2, and picked up the artifacts from run #1. Again from the log:
[task 2024-05-07T21:04:06.641Z] INFO:root:run_id > 0, attempting to resume training from an earlier run...
[task 2024-05-07T21:04:06.718Z] INFO:root:Run 1 appears to have the artifacts we need! Downloading them...
[task 2024-05-07T21:04:06.718Z] INFO:root:Fetching public/build/config.opustrainer.yml...
[task 2024-05-07T21:04:06.955Z] INFO:root:Fetching public/build/config.opustrainer.yml.state...
[task 2024-05-07T21:04:07.181Z] INFO:root:Fetching public/build/devset.out...
[task 2024-05-07T21:04:07.433Z] INFO:root:Fetching public/build/model.npz...
[task 2024-05-07T21:04:10.476Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz...
[task 2024-05-07T21:04:12.597Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz.decoder.yml...
[task 2024-05-07T21:04:12.812Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz...
[task 2024-05-07T21:04:15.629Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz.decoder.yml...
[task 2024-05-07T21:04:15.833Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz...
[task 2024-05-07T21:04:19.250Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz.decoder.yml...
[task 2024-05-07T21:04:19.447Z] INFO:root:Fetching public/build/model.npz.decoder.yml...
[task 2024-05-07T21:04:19.660Z] INFO:root:Fetching public/build/model.npz.optimizer.npz...
[task 2024-05-07T21:04:28.699Z] INFO:root:Fetching public/build/model.npz.progress.yml...
[task 2024-05-07T21:04:28.941Z] INFO:root:Fetching public/build/model.npz.yml...
[task 2024-05-07T21:04:29.166Z] INFO:root:Fetching public/build/opustrainer.log...
[task 2024-05-07T21:04:29.399Z] INFO:root:Fetching public/build/train.log...
[task 2024-05-07T21:04:29.609Z] INFO:root:Fetching public/build/valid.log...
[task 2024-05-07T21:04:29.842Z] INFO:root:Fetching public/build/vocab.spm...
<a bit further down>
[task 2024-05-07T21:06:50.280Z] [2024-05-07 21:06:50] [training] Master parameters and optimizers restored from training checkpoint /home/ubuntu/tasks/task_171511569156440/artifacts/model.npz and /home/ubuntu/tasks/task_171511569156440/artifacts/model.npz.optimizer.npz

(I ended up canceling run 2 to avoid wasting resources.)
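
For readers following the logs above, here is a minimal sketch of the kind of check they suggest: when `run_id > 0`, look back through earlier runs for one that has the artifacts needed to resume. This is an illustration, not the code in this PR. The function name `find_resumable_run`, the `REQUIRED_ARTIFACTS` list, and the newest-first search order are all assumptions; the artifact names themselves are taken from the log output.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical list of files needed to resume training. The names appear in
# the logs above, but which ones are actually required is an assumption.
REQUIRED_ARTIFACTS = (
    "public/build/model.npz",
    "public/build/model.npz.optimizer.npz",
    "public/build/model.npz.progress.yml",
)


def find_resumable_run(
    run_id: int,
    artifacts_by_run: Mapping[int, Sequence[str]],
    required: Sequence[str] = REQUIRED_ARTIFACTS,
) -> Optional[int]:
    """Return the most recent earlier run holding every required artifact.

    Returns None for the first run (run_id == 0) or when no earlier run
    has a complete set of artifacts.
    """
    if run_id == 0:
        return None
    # Walk backwards so the newest usable checkpoint wins (assumed order).
    for prev in range(run_id - 1, -1, -1):
        available = set(artifacts_by_run.get(prev, ()))
        if all(name in available for name in required):
            return prev
    return None
```

With this sketch, run 2 would resume from run 1 when run 1 has a full set of artifacts, and would fall back to run 0 if run 1 had uploaded nothing usable.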

Additional interpretation of the logs and artifacts is welcome; I'm also happy to do more simulated test runs if that's useful.

Landing this depends on an update to the GPU images that includes the latest spot termination handling code. I expect that update to land within the next day.

bhearsum avatar May 08 '24 17:05 bhearsum