cromshell
Feature request: "preempt" job (as opposed to "abort")
@dalessioluca might have input here as well.
I am looking for the ability to programmatically preempt a running job using a cromshell command. I do not want to abort the job. I want to kill the preemptible machine that the job is running on, and then let the workflow continue.
Why?
I would like to use this command as a way to test what happens to my preemptible jobs if they were to get preempted. The only way I know of to test this in the wild is something like this feature request.
What would this involve?
I don't know exactly how to do this, but it might be some variant of
gcloud compute instances stop VM_NAME
https://cloud.google.com/compute/docs/instances/stop-start-instance#gcloud
It would be great to have cromshell handle figuring out the name of the VM running the job.
What do I currently do to achieve the desired effect?
@dalessioluca taught me: you can use the Google Compute web UI to look for the instance running the Cromwell job based on the Cromwell workflow ID. If you can find the instance, you can use the UI to "stop" it, and this will preempt the job. The problem is, it requires you to have a few windows open, good timing, and a nonzero amount of effort.
This is an interesting idea. @sjfleming How would this work for workflows with many tasks? Do you want to preempt one particular subtask or all subtasks?
@lbergelson What do you think?
It sounds like a niche but potentially useful idea. Not sure how to implement it since I'm assuming there's no hook in cromwell to do it. Cromshell doesn't have access to the underlying google infrastructure without some additional auth.
This really seems like an ideal cromshell 2 plugin rather than a base part of cromshell.
I think Auth should be handled by your environmental settings (we're just calling out to gsutil / gcloud).
Seems like we'd have to parse the metadata to find out what's going on in the workflow and try to identify the machine from there. I'm not sure how to do that with the info in the metadata (or even if we can!) - I haven't looked. Assuming all the info is there it shouldn't really be an issue to do this…
👍 on the Cromshell 2.0 plugin idea, though we may be able to sneak this into this version if someone gets/makes time for it.
Yeah, I guess you're right that gsutil will just handle it. I was thinking that you would need access to the project that cromwell was running under, but in our case we have that.
@lbergelson yes it would be super niche... probably hardly anyone would ever use it
@jonn-smith yeah I think just calling gsutil is the only way to go here. But you bring up a great point about "which task"... I have been living in the simple world where my workflows are usually just one task. I guess what I'd really want is to be able to specify the task. The task is what I actually would want to preempt, since (probably) you're trying to target one thing which you want to resume upon preemption.
Also, I guess this would only work with the google cloud backend at this point? I assume that's fine for now
I am watching a workflow's metadata as it's running, and I do NOT see the machine name (instanceName) until after the job completes. Is this expected?
While the task is running, I see
"jes": {
  "executionBucket": "gs://broad-methods-cromwell-exec-bucket-instance-8",
  "endpointUrl": "https://genomics.googleapis.com/",
  "googleProject": "broad-dsde-methods"
}
and after the task (or maybe workflow?) is complete, I see
"jes": {
  "endpointUrl": "https://genomics.googleapis.com/",
  "machineType": "custom-4-15360",
  "googleProject": "broad-dsde-methods",
  "executionBucket": "gs://broad-methods-cromwell-exec-bucket-instance-8",
  "zone": "us-west1-b",
  "instanceName": "google-pipelines-worker-3288067dcdcd9fb46a6e21bc7cc00311"
}
(If there's no way to get the instanceName of a running task, this ruins the approach I had in mind...)
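One way to check whether a given metadata snapshot already reports a machine name is to scan it for the instanceName key. The following is a minimal sketch, not a cromshell feature: the function name extract_instance_names is hypothetical, and it assumes the metadata JSON has already been fetched by some other means and is piped in on stdin.

```shell
#!/bin/sh
# Hypothetical helper (not part of cromshell): print every instanceName value
# found in Cromwell metadata JSON read from stdin, one per line.
# Prints nothing when no call in the metadata reports an instanceName yet.
extract_instance_names() {
  # Pull out each "instanceName": "..." pair, then keep only the quoted value.
  grep -o '"instanceName"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/.*"\([^"]*\)"$/\1/'
}
```

Fed the "after" snapshot shown above, this would print the google-pipelines-worker-... name; fed the "while running" snapshot, it would print nothing, which matches the observation that the name is missing until the job completes.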
Maybe I would have to resort to something like
gcloud compute instances list --format="table(name,status,tags.list())"
and then comb through the tags, since Cromwell does add a tag with the workflow ID... but that seems less than ideal.
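The "comb through the tags" step could be scripted roughly as follows. This is only a sketch: find_workers_for_workflow is a hypothetical helper name, and it assumes the workflow ID appears somewhere in the row that the listing prints for the worker VM, which depends on how the Cromwell server tags its machines.

```shell
#!/bin/sh
# Hypothetical helper (not part of cromshell): read an instance listing on
# stdin and print the name (first column) of every row that mentions the
# given workflow ID anywhere on the line.
find_workers_for_workflow() {
  awk -v id="$1" 'index($0, id) > 0 { print $1 }'
}
```

It would be used by piping the listing into it, for example `gcloud compute instances list --format="table(name,status,tags.list())" | find_workers_for_workflow "$WORKFLOW_ID"`, where WORKFLOW_ID is a variable you set yourself. A plain substring match like this cannot tell tasks apart, so it only narrows things down to the workflow.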
@sjfleming Are the VMs that PAPI manages actually exposed in any way to us? I hadn't really thought about it, but I don't even know if they're listed in your project?
@lbergelson they are! Yeah it's kinda cool actually, if we run something from the methods Cromwell server for example, then if you go to the Google Cloud Compute Engine Console in a browser, you can see that the Cromwell jobs are running on VMs whose names start with google-pipelines-worker-. And, very helpfully, they have labels with the Cromwell workflow_id and task. I don't know if every Cromwell instance applies those helpful labels to machines it spins up, or if that's some nice feature that somebody added to the methods Cromwell server.
If I can always count on those labels being there, then I can run this
gcloud compute instances list --filter='labels.cromwell-workflow-id:cromwell-{WORKFLOW_ID} labels.wdl-task-name:{TASK}' --format 'table(name)'
to get the name of the instance I want to stop, and then I can stop it with
gcloud compute instances stop {INSTANCE_NAME}
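The two commands above could be chained into a single step along these lines. This is a sketch under the assumptions discussed in the thread, not an existing cromshell command: the function names build_label_filter and preempt_task are hypothetical, the labels cromwell-workflow-id and wdl-task-name are assumed to be present on the worker VM, and gcloud is assumed to be authenticated against the project Cromwell runs in. It uses value(name,zone.basename()) instead of table(name) so the output has no header row and includes the zone needed by the stop command.

```shell
#!/bin/sh
# Sketch only: these function names are hypothetical, not part of cromshell.

# Build the label filter from the thread for one task ($2) of one workflow ($1).
# Assumes the worker VM carries cromwell-workflow-id and wdl-task-name labels.
build_label_filter() {
  printf 'labels.cromwell-workflow-id:cromwell-%s labels.wdl-task-name:%s' "$1" "$2"
}

# Stop every worker VM that matches the workflow ID ($1) and task name ($2).
preempt_task() {
  filter="$(build_label_filter "$1" "$2")"
  tab="$(printf '\t')"
  # value(...) prints tab-separated fields with no header row.
  gcloud compute instances list --filter="$filter" \
      --format='value(name,zone.basename())' |
    while IFS="$tab" read -r name zone; do
      [ -n "$name" ] || continue
      echo "Stopping $name in $zone" >&2
      # Redirect stdin so gcloud cannot consume the remaining listing lines.
      gcloud compute instances stop "$name" --zone="$zone" < /dev/null
    done
}
```

Calling `preempt_task "$WORKFLOW_ID" "$TASK"` would stop every matching machine, so a scattered task with several shards running at once would have all of its workers stopped, not just one.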