firefox-translations-training GPU workers still not always handling preemptions properly

We recently upgraded worker-runner on the GPU workers to a version that is supposed to gracefully handle spot preemptions. Most notably, it should be uploading artifacts before an instance terminates.

I need to do some more digging as to what's going on here.

Jan 26 '24 19:01 bhearsum

We appear to have been running an up-to-date enough worker-runner:

$ start-worker --version
2024/01/26 20:15:43 Error disabling OOM killer for the start-worker process: write /proc/1166/oom_adj: permission denied
start-worker 59.1.3

(This improvement was made in https://github.com/taskcluster/taskcluster/issues/6530, and released in 55.1.1.)

I do see what appears to be the worker noticing the termination notice:

Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q-1 start-worker 2024/01/25 11:17:11 GCP Metadata Service says termination is imminent
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q-1 start-worker 2024/01/25 11:17:11 Got graceful-termination request with finish-tasks=false
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q-1 start-worker 2024/01/25 11:17:11 Killing process tree with parent PID 1637... (0xc000308a20)
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q-1 start-worker 2024/01/25 11:17:11 Process tree with parent PID 1637 killed.
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q-1 start-worker 2024/01/25 11:17:11 WARNING: no such process
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q-1 start-worker 2024/01/25 11:17:11 Killing process tree with parent PID 1638... (0xc000308b70)
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q-1 start-worker 2024/01/25 11:17:11 Process tree with parent PID 1638 killed.
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q start-worker 2024/01/25 11:17:11 polling for termination-time
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q start-worker 2024/01/25 11:17:11 GCP Metadata Service says termination is imminent
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q start-worker 2024/01/25 11:17:11 Got graceful-termination request with finish-tasks=false
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q start-worker 2024/01/25 11:17:11 Killing process tree with parent PID 1637... (0xc000308a20)
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q start-worker 2024/01/25 11:17:11 Process tree with parent PID 1637 killed.
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q start-worker 2024/01/25 11:17:11 WARNING: no such process
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q start-worker 2024/01/25 11:17:11 Killing process tree with parent PID 1638... (0xc000308b70)
Jan 25 11:17:12Z translations-1-b-linux-v100-gpu-4-300g-zlyc9cmhtlqiaqxuzsym0q start-worker 2024/01/25 11:17:11 Process tree with parent PID 1638 killed.

Jan 26 '24 20:01 bhearsum

Curiously, https://firefox-ci-tc.services.mozilla.com/tasks/GYYVwr5RS-61EM8otXtMNg/runs/0 reports CLAIM_EXPIRED. https://github.com/taskcluster/taskcluster/blob/892c07ad0d6a8a5eecbfa704fabef4a17cc11581/workers/generic-worker/main.go#L520-L522 seems to suggest this should be WORKER_SHUTDOWN.

Jan 26 '24 20:01 bhearsum

I've opened https://github.com/taskcluster/taskcluster/issues/6802 on the Taskcluster side for this. It's not clear to me whether it's my expectations that are wrong here, or there's a bug somewhere.

Jan 26 '24 20:01 bhearsum

Another interesting case from the logs: we're polling every 30 seconds, and then 8 seconds after a poll the system starts shutting down:

Jan 30 10:32:00Z translations-1-b-linux-v100-gpu-4-1tb-fqyy7djjszsfazxznsdfna-1 start-worker 2024/01/30 10:32:00 polling for termination-time
Jan 30 10:32:30Z translations-1-b-linux-v100-gpu-4-1tb-fqyy7djjszsfazxznsdfna-1 start-worker 2024/01/30 10:32:30 polling for termination-time
Jan 30 10:33:00Z translations-1-b-linux-v100-gpu-4-1tb-fqyy7djjszsfazxznsdfna-1 start-worker 2024/01/30 10:33:00 polling for termination-time
Jan 30 10:33:08Z translations-1-b-linux-v100-gpu-4-1tb-fqyy7djjszsfazxznsdfna-1 systemd-logind Power key pressed.
Jan 30 10:33:08Z translations-1-b-linux-v100-gpu-4-1tb-fqyy7djjszsfazxznsdfna-1 systemd-logind Powering Off...
Jan 30 10:33:09Z translations-1-b-linux-v100-gpu-4-1tb-fqyy7djjszsfazxznsdfna-1 systemd-logind System is powering down.

That seems to suggest either that GCP didn't actually publish the preemption notice, or that worker-runner isn't finding it correctly.

The GCP docs on spot termination don't mention anything about using the metadata service either - they talk about "ACPI G2 Soft Off" as the mechanism for signaling a preemption.
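For illustration, here is a minimal Python sketch of checking the metadata service for preemption. It is not worker-runner's actual code (which is written in Go). The `instance/preempted` path, the `Metadata-Flavor: Google` header, and the `wait_for_change=true` parameter are part of GCP's metadata server interface; the function names and the 30-second notice window used in `remaining_grace_seconds` are assumptions made for this example. The last function shows why fixed-interval polling is fragile: if the notice is published just after a poll, a 30-second interval can consume the whole window.

```python
import urllib.request
from typing import Optional

# GCP metadata endpoint that reports "TRUE" once the instance is preempted.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def build_request(wait_for_change: bool = False) -> urllib.request.Request:
    """Build the metadata request (hypothetical helper).

    With wait_for_change=true the server holds the request open until the
    value changes, so no polling interval is involved.
    """
    url = METADATA_URL + ("?wait_for_change=true" if wait_for_change else "")
    # The metadata server rejects requests without this header.
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def fetch_preempted(wait_for_change: bool = False,
                    timeout: Optional[float] = None) -> str:
    """Return the raw response body. Only works on a GCE instance."""
    request = build_request(wait_for_change)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode().strip()


def is_preempted(body: str) -> bool:
    """Interpret the metadata value, which is "TRUE" or "FALSE"."""
    return body.strip().upper() == "TRUE"


def remaining_grace_seconds(poll_interval_s: float,
                            notice_window_s: float = 30.0) -> float:
    """Worst-case time left to react when polling at a fixed interval.

    Assumes the notice appears immediately after a poll, so it is only seen
    one full interval later. The 30 s default window is an assumption.
    """
    return max(0.0, notice_window_s - poll_interval_s)
```

On an instance, `is_preempted(fetch_preempted(wait_for_change=True))` would block until the value flips, whereas polling `fetch_preempted()` every 30 seconds leaves `remaining_grace_seconds(30.0)` seconds in the worst case.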

Jan 30 '24 15:01 bhearsum

We've made a number of improvements in the past few months on the worker side. We now notice and respond to spot termination notices immediately, and we upload all artifacts in parallel. In all of the tests that I've done, terminations have been noticed and all present artifacts have been uploaded.
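As a rough illustration of the parallel-upload idea, here is a minimal Python sketch. It is not the worker's actual implementation; `upload_all` and the `upload` callback are hypothetical names, and the callback stands in for whatever actually sends a file to artifact storage. The point is that with a short termination window, total upload time is bounded by the slowest artifact rather than the sum of all of them, and one failed upload does not block the rest.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Optional


def upload_all(
    paths: Iterable[str],
    upload: Callable[[str], None],
    max_workers: int = 8,
) -> Dict[str, Optional[BaseException]]:
    """Upload every path concurrently.

    Returns a mapping of path -> None on success, or the exception raised
    by the (hypothetical) upload callback on failure.
    """
    path_list = list(paths)
    results: Dict[str, Optional[BaseException]] = {}
    if not path_list:
        return results

    workers = min(max_workers, len(path_list))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit everything up front so uploads overlap.
        futures = {pool.submit(upload, path): path for path in path_list}
        for future in as_completed(futures):
            # exception() returns None when the upload succeeded.
            results[futures[future]] = future.exception()
    return results
```

A caller would pass the list of artifact paths present on disk and inspect the returned mapping to log any uploads that failed before the instance went away.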

I'm going to call this fixed, and we can re-open or file a new issue if we have additional problems in the future.

May 09 '24 13:05 bhearsum
