firefox-translations-training
Support training continuation for the failed or preempted tasks
This is especially important for preemption of spot instances.
The work that @gabrielBusta is doing in #226 will be a good basis for this. The difference with spot terminations is that the tasks will automatically rerun, and we won't be able to adjust the parameters. We'll need some sort of enhancement to detect that case and automatically find the previous attempt's artifacts. E.g., a check at the start of the job for whether this is run #1 or higher, and if it is, an attempt to pull artifacts from the previous run. (We might want more sanity checking in there, too.)
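To make the run-number check concrete, here is a minimal sketch, not the actual implementation. It assumes the worker exposes `TASK_ID`, `RUN_ID`, and `TASKCLUSTER_ROOT_URL` in the task environment, that run IDs start at 0, and that artifacts are reachable through the queue's per-run artifact endpoint. The helper names and the artifact name `public/build/model.npz` are hypothetical, chosen only for illustration.

```python
import os
from typing import Optional


def previous_run_id(run_id: int) -> Optional[int]:
    """Return the run to resume from, or None if this is the first attempt.

    Assumes run IDs are 0-based, so run 1 or higher means a rerun.
    """
    return run_id - 1 if run_id > 0 else None


def previous_artifact_url(
    root_url: str, task_id: str, run_id: int, artifact_name: str
) -> Optional[str]:
    """Build the URL of an artifact from the previous run of the same task.

    Returns None on the first attempt, when there is nothing to resume from.
    """
    prev = previous_run_id(run_id)
    if prev is None:
        return None
    base = root_url.rstrip("/")
    return f"{base}/api/queue/v1/task/{task_id}/runs/{prev}/artifacts/{artifact_name}"


if __name__ == "__main__":
    # Environment variable names are an assumption about the worker setup.
    url = previous_artifact_url(
        os.environ.get("TASKCLUSTER_ROOT_URL", "https://tc.example.com"),
        os.environ.get("TASK_ID", "hypothetical-task-id"),
        int(os.environ.get("RUN_ID", "0")),
        "public/build/model.npz",  # hypothetical artifact name
    )
    if url is None:
        print("First attempt: starting training from scratch")
    else:
        print(f"Rerun detected: would try to fetch {url}")
```

The extra sanity checking would go between building the URL and starting training, for example confirming that the previous run actually published the artifact before trying to resume from it.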
(copying over my thoughts from #315).
For reference, this is the definition of a preemptible instance.
During the Catalan run, teacher training would often take 2 or 3 times as long to run to completion because the task kept getting preempted. Here is an example profile showing several preemptions: https://share.firefox.dev/3Rw5u5g
Also, I believe @bhearsum said that this was a blocker for this work: https://mozilla-hub.atlassian.net/browse/RELOPS-782
https://mozilla-hub.atlassian.net/browse/RELOPS-782 is in progress (https://github.com/mozilla-platform-ops/monopacker/pull/121).
As of today, all of the instances we use in Taskcluster should handle spot terminations gracefully, and publish whatever artifacts exist at the time of the shutdown.
The next step here is to load in the previous artifacts and restart the training.
This should be possible now because the training artifacts are being published when an instance is preempted. AIUI, as long as the task produces the training artifacts, using pretrained models should work. I wish there was a way we could try it out, but how does one simulate a spot termination?
I learned recently that you can just press the "stop" button in the GCP console to do this :). (I believe you have the necessary access to do so.)