firefox-translations-training

Support training continuation for the failed or preempted tasks

Open eu9ene opened this issue 1 year ago • 7 comments

It's especially important for pre-emption of spot instances

eu9ene avatar Nov 20 '23 18:11 eu9ene

The work that @gabrielBusta is doing in #226 will be a good basis for this. The difference with spot terminations is that the tasks will automatically rerun, and we won't be able to adjust the parameters. We'll need some sort of enhancement to detect that case and automatically find the previous attempt's artifacts. Eg: a check at the start of the job if this is run#1 or higher, and if it is, attempt to pull artifacts from the previous run. (We might want more sanity checking in there, too.)
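The run check described above could be sketched roughly as follows. This is only an illustration, not the project's implementation. It assumes the worker exposes `RUN_ID`, `TASK_ID` and `TASKCLUSTER_ROOT_URL` as environment variables, that runs are numbered from 0, and that artifacts of an earlier run are reachable through the Taskcluster queue's per-run artifact endpoint. The function names and the artifact name in the usage note are hypothetical.

```python
import os
from typing import Mapping, Optional


def previous_run_id(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the run ID of the previous attempt, or None on the first run.

    Assumption: RUN_ID is set by the worker and starts at 0.
    """
    run_id = int(env.get("RUN_ID", "0"))
    return run_id - 1 if run_id > 0 else None


def previous_artifact_url(
    name: str, env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Build the URL of an artifact from the previous run of the same task.

    Returns None on the first run, where there is nothing to resume from.
    Assumption: the URL layout follows the queue's per-run artifact endpoint.
    """
    prev = previous_run_id(env)
    if prev is None:
        return None
    root = env["TASKCLUSTER_ROOT_URL"].rstrip("/")
    task_id = env["TASK_ID"]
    return f"{root}/api/queue/v1/task/{task_id}/runs/{prev}/artifacts/{name}"
```

A training script might call `previous_artifact_url("public/build/model.npz")` at startup (a hypothetical artifact name) and only try to download when the result is not `None`. The extra sanity checking mentioned above would still be needed, for example handling the case where the previous run was terminated before it published that artifact.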

bhearsum avatar Nov 20 '23 19:11 bhearsum

(copying over my thoughts from #315).

For reference, this is the definition of a preemptible instance.

During the Catalan run the teacher training would often take 2 or 3 times as long to run to completion since the task would get preempted. Here is an example profile of several preemption tasks happening: https://share.firefox.dev/3Rw5u5g

Also, I believe @bhearsum said that this was a blocker for this work: https://mozilla-hub.atlassian.net/browse/RELOPS-782

gregtatum avatar Dec 18 '23 20:12 gregtatum

https://mozilla-hub.atlassian.net/browse/RELOPS-782 is in progress (https://github.com/mozilla-platform-ops/monopacker/pull/121).

marco-c avatar Dec 20 '23 14:12 marco-c

As of today, all of the instances we use in Taskcluster should handle spot terminations gracefully, and publish whatever artifacts exist at the time of the shutdown.

bhearsum avatar Jan 11 '24 15:01 bhearsum

The next step here is to load in the previous artifacts and restart the training.

gregtatum avatar Jan 12 '24 20:01 gregtatum

This should be possible now because the training artifacts are being published when an instance is preempted. AIUI, as long as the task produces the training artifacts, using pretrained models should work. I wish there was a way we could try it out, but how does one simulate a spot termination?

gabrielBusta avatar Jan 26 '24 20:01 gabrielBusta

> This should be possible now because the training artifacts are being published when an instance is preempted. AIUI, as long as the task produces the training artifacts, using pretrained models should work. I wish there was a way we could try it out, but how does one simulate a spot termination?

I learned recently that you can just press the "stop" button in the GCP console to do this :). (I believe you have the necessary access to do so.)

bhearsum avatar Feb 14 '24 20:02 bhearsum