
Enable dynamic GPU scheduling

Open ksatzke opened this issue 5 years ago • 2 comments

Currently, when KNIX components are deployed via the Helm charts, their resource limits are fixed at deployment time, like so:

resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 1
        memory: 1Gi

For each workflow, GPU support should also be configurable at workflow deployment time, so that a workflow's requirement to run on GPUs instead of CPUs can be defined dynamically, and so that KNIX can schedule the workflow onto a node that still has sufficient GPU cores available, like so:

resources:
      limits:
        cpu: 1
        memory: 2Gi
        nvidia.com/gpu: 1 # requesting 1 GPU
  • add the option to define GPU requirements per workflow to the GUI
  • store workflow requirement limits together with the workflow data
  • extend the management service to evaluate workflow GPU requirement limits and handle GPU scheduling
  • add node labelling capabilities to KNIX (see the sketch after this list)
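
As a rough sketch of how the last two items could fit together using standard Kubernetes mechanisms (node labels plus a nodeSelector on the workflow sandbox pod), assuming the NVIDIA device plugin is running on the GPU nodes; the label key knix.io/gpu, the pod name, and the image are only illustrative and not existing KNIX conventions:

# label GPU nodes once (illustrative label key), e.g.:
#   kubectl label node <gpu-node-name> knix.io/gpu=present
apiVersion: v1
kind: Pod
metadata:
  name: example-workflow-sandbox    # illustrative name
spec:
  nodeSelector:
    knix.io/gpu: "present"          # schedule only onto labelled GPU nodes
  containers:
    - name: sandbox
      image: example/knix-sandbox   # illustrative image
      resources:
        limits:
          cpu: 1
          memory: 2Gi
          nvidia.com/gpu: 1         # requesting 1 GPU, as above

With the device plugin installed, the default Kubernetes scheduler already accounts for nvidia.com/gpu capacity per node; the node label and nodeSelector are one simple way the management service could additionally restrict GPU workflows to designated nodes.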

ksatzke avatar Jul 30 '20 10:07 ksatzke

These need to be done in the feature/GPU_support_extended branch, right?

iakkus avatar Jul 30 '20 11:07 iakkus

Right, if we can agree on the issue, we can do the implementation in this branch to extend KNIX GPU support.

ksatzke avatar Jul 30 '20 11:07 ksatzke