
[Enhancement suggestion] Allow for null gpuCount on Google backend

Open simojoe opened this issue 3 years ago • 2 comments

Dear Cromwell dev team,

This is an enhancement suggestion.

When using the Google backend for resource allocation, one can specify gpuCount and gpuType to request specific resources. I am trying to design a task that optionally needs access to a GPU, depending on its inputs/parameters. I tried different approaches to schedule GPUs dynamically, but gpuCount seems to be constrained to a strictly positive (non-zero) integer.

https://github.com/broadinstitute/cromwell/blob/bfef756ca35b46570dff3fda57f77dd4b2b0d25c/supportedBackends/google/pipelines/common/src/main/scala/cromwell/backend/google/pipelines/common/PipelinesApiRuntimeAttributes.scala#L190

https://github.com/broadinstitute/cromwell/blob/bfef756ca35b46570dff3fda57f77dd4b2b0d25c/supportedBackends/google/pipelines/common/src/main/scala/cromwell/backend/google/pipelines/common/GpuValidation.scala#L28-L40

To allow dynamic access to GPUs, I propose extending the gpuCount type to accept a null (zero) value, and checking for a non-zero value before allocating GPU resources.

https://github.com/broadinstitute/cromwell/blob/bfef756ca35b46570dff3fda57f77dd4b2b0d25c/supportedBackends/google/pipelines/common/src/main/scala/cromwell/backend/google/pipelines/common/PipelinesApiRuntimeAttributes.scala#L193
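
To make the proposal concrete, here is a minimal, language-neutral sketch of the intended logic, written in Python rather than Cromwell's Scala. It is not Cromwell code: `GpuRequest`, `validate_gpu_count` and `build_gpu_request` are hypothetical names used only for illustration, and the error message wording is an assumption modelled on the existing one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuRequest:
    """Hypothetical stand-in for the GPU resources requested from the backend."""
    gpu_type: str
    gpu_count: int


def validate_gpu_count(raw_count: Optional[int]) -> int:
    """Treat a missing or zero count as "no GPU"; reject negative counts."""
    if raw_count is None:
        return 0
    if raw_count < 0:
        # Assumed wording, adapted from the current "greater than 0" message.
        raise ValueError(
            "Expecting gpuCount runtime attribute value greater than or equal to 0"
        )
    return raw_count


def build_gpu_request(gpu_type: str, raw_count: Optional[int]) -> Optional[GpuRequest]:
    """Return a GPU request only when a positive count was asked for."""
    count = validate_gpu_count(raw_count)
    if count == 0:
        return None  # no accelerator should be attached to the job
    return GpuRequest(gpu_type=gpu_type, gpu_count=count)
```

With this shape, a task that sets gpuCount to 0 passes validation and simply produces no GPU request, while any positive value behaves as it does today.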

Please let me know if such a feature is not desired for any reason.

* I tried accessing the Jira tracker, but I don't have access to Jira on broadworkbench.atlassian.net.

simojoe, Feb 14 '22 14:02

Can you post a minimal example WDL demonstrating the issue?

aednichols, Feb 14 '22 16:02

Sure!

This is a minimal workflow that runs a task with a dynamic number of GPUs:


version 1.0

workflow gpu_example {

  call maybe_gpu {
    input:
      gpu_count = 0
  }

}

task maybe_gpu {

  input {
    Int gpu_count
  }

  command {
    echo 1
  }

  runtime {
    docker: "ubuntu:16.04"
    gpuCount: gpu_count
    gpuType: "nvidia-tesla-t4"
  }
}

When run with gpu_count = 0, Cromwell's runtime attribute validation fails because it expects an integer greater than 0.

2022-02-14 16:48:34,798 cromwell-system-akka.dispatchers.engine-dispatcher-7 INFO  - WorkflowExecutionActor-45f6febb-8625-43ce-8bd5-fe0ab71d3fe7 [UUID(45f6febb)]: Starting gpu_example.maybe_gpu
2022-02-14 16:48:39,643 cromwell-system-akka.dispatchers.engine-dispatcher-25 INFO  - Assigned new job execution tokens to the following groups: 45f6febb: 1
2022-02-14 16:48:41,244 cromwell-system-akka.dispatchers.backend-dispatcher-31 ERROR - Runtime attribute validation failed:
Expecting gpuCount runtime attribute value greater than 0
cromwell.backend.validation.ValidatedRuntimeAttributesBuilder$$anon$1: Runtime attribute validation failed:
Expecting gpuCount runtime attribute value greater than 0
2022-02-14 16:48:42,011 cromwell-system-akka.dispatchers.engine-dispatcher-26 INFO  - WorkflowManagerActor: Workflow 45f6febb-8625-43ce-8bd5-fe0ab71d3fe7 failed (during ExecutingWorkflowState): cromwell.backend.standard.StandardSyncExecutionActor$$anonfun$jobFailingDecider$1$$anon$1: PipelinesApiAsyncBackendJobExecutionActor failed and didn't catch its exception. This condition has been handled and the job will be marked as failed.
Caused by: cromwell.backend.validation.ValidatedRuntimeAttributesBuilder$$anon$1: Runtime attribute validation failed:
Expecting gpuCount runtime attribute value greater than 0

2022-02-14 16:48:44,341 cromwell-system-akka.dispatchers.engine-dispatcher-27 INFO  - WorkflowManagerActor: Workflow actor for 45f6febb-8625-43ce-8bd5-fe0ab71d3fe7 completed with status 'Failed'. The workflow will be removed from the workflow store.
ERROR: Status of job is not Submitted, Running, or Succeeded: Failed

When run with gpu_count >= 1, the workflow runs successfully.

Desired behaviour: with gpu_count = 0, the task runs to completion without being assigned a GPU by the backend.

simojoe, Feb 14 '22 18:02