cromwell
[Enhancement suggestion] Allow for null gpuCount on Google backend
Dear Cromwell dev team,
This is an enhancement suggestion.
When using the Google backend for resource allocation, one can specify gpuCount and gpuType to request specific resources. I am currently trying to design a task that optionally needs access to a GPU (depending on its inputs/parameters). I tried different approaches to dynamically schedule GPUs, but gpuCount seems constrained to a non-null positive integer.
https://github.com/broadinstitute/cromwell/blob/bfef756ca35b46570dff3fda57f77dd4b2b0d25c/supportedBackends/google/pipelines/common/src/main/scala/cromwell/backend/google/pipelines/common/PipelinesApiRuntimeAttributes.scala#L190
https://github.com/broadinstitute/cromwell/blob/bfef756ca35b46570dff3fda57f77dd4b2b0d25c/supportedBackends/google/pipelines/common/src/main/scala/cromwell/backend/google/pipelines/common/GpuValidation.scala#L28-L40
To allow for dynamic access to GPUs, I propose extending the gpuCount type to allow a null value, and checking for a non-null value before allocating resources.
https://github.com/broadinstitute/cromwell/blob/bfef756ca35b46570dff3fda57f77dd4b2b0d25c/supportedBackends/google/pipelines/common/src/main/scala/cromwell/backend/google/pipelines/common/PipelinesApiRuntimeAttributes.scala#L193
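To make the proposal concrete, here is a minimal standalone sketch of the intended behaviour. It is not Cromwell's actual code: GpuRequest, GpuSketch and resolveGpu are hypothetical names made up for illustration, and treating both a missing gpuCount and a gpuCount of 0 as "no GPU" is an assumption about how the feature could behave.

```scala
// Hypothetical sketch only; none of these names exist in Cromwell.
final case class GpuRequest(gpuType: String, count: Int)

object GpuSketch {
  // Right(None)          -> no GPU requested (gpuCount absent or 0, by assumption)
  // Right(Some(request)) -> a positive number of GPUs requested
  // Left(message)        -> invalid (negative) gpuCount
  def resolveGpu(gpuCount: Option[Int], gpuType: String): Either[String, Option[GpuRequest]] =
    gpuCount match {
      case None | Some(0)   => Right(None)
      case Some(n) if n > 0 => Right(Some(GpuRequest(gpuType, n)))
      case Some(n)          => Left(s"Expecting gpuCount runtime attribute value of 0 or greater, got $n")
    }
}
```

With this shape, the backend would only attach an accelerator to the job when the result is Right(Some(...)), so a task could pass gpuCount through from an input without failing validation.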
Please let me know if such a feature is not desired for any reason.
* I tried accessing the Jira tracker, but I don't have access to Jira on broadworkbench.atlassian.net.
Can you post a minimal example WDL demonstrating the issue?
Sure!
This is a minimal workflow that runs a task with a dynamic number of GPUs:
version 1.0

workflow gpu_example {
  call maybe_gpu {
    input:
      gpu_count = 0
  }
}

task maybe_gpu {
  input {
    Int gpu_count
  }
  command {
    echo 1
  }
  runtime {
    docker: "ubuntu:16.04"
    gpuCount: gpu_count
    gpuType: "nvidia-tesla-t4"
  }
}
When run with gpu_count = 0, the Cromwell runtime attribute validation fails because it expects gpuCount to be greater than 0.
2022-02-14 16:48:34,798 cromwell-system-akka.dispatchers.engine-dispatcher-7 INFO - WorkflowExecutionActor-45f6febb-8625-43ce-8bd5-fe0ab71d3fe7 [UUID(45f6febb)]: Starting gpu_example.maybe_gpu
2022-02-14 16:48:39,643 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO - Assigned new job execution tokens to the following groups: 45f6febb: 1
2022-02-14 16:48:41,244 cromwell-system-akka.dispatchers.backend-dispatcher-31 ERROR - Runtime attribute validation failed:
Expecting gpuCount runtime attribute value greater than 0
cromwell.backend.validation.ValidatedRuntimeAttributesBuilder$$anon$1: Runtime attribute validation failed:
Expecting gpuCount runtime attribute value greater than 0
2022-02-14 16:48:42,011 cromwell-system-akka.dispatchers.engine-dispatcher-26 INFO - WorkflowManagerActor: Workflow 45f6febb-8625-43ce-8bd5-fe0ab71d3fe7 failed (during ExecutingWorkflowState): cromwell.backend.standard.StandardSyncExecutionActor$$anonfun$jobFailingDecider$1$$anon$1: PipelinesApiAsyncBackendJobExecutionActor failed and didn't catch its exception. This condition has been handled and the job will be marked as failed.
Caused by: cromwell.backend.validation.ValidatedRuntimeAttributesBuilder$$anon$1: Runtime attribute validation failed:
Expecting gpuCount runtime attribute value greater than 0
2022-02-14 16:48:44,341 cromwell-system-akka.dispatchers.engine-dispatcher-27 INFO - WorkflowManagerActor: Workflow actor for 45f6febb-8625-43ce-8bd5-fe0ab71d3fe7 completed with status 'Failed'. The workflow will be removed from the workflow store.
ERROR: Status of job is not Submitted, Running, or Succeeded: Failed
If run with gpu_count >= 1, the workflow runs successfully.
Desired behaviour: a run with gpu_count = 0 completes without being assigned a GPU by the backend.