
[Core feature] Greater flexibility with GPU-accelerated workloads

Open • Sovietaced opened this issue 3 months ago • 1 comment

Motivation: Why do you think this is important?

Right now the configuration around GPU-accelerated workloads is quite rigid. FlytePropeller only allows a single, global GPU resource name to be configured, which makes it impossible to use Flyte with data planes that have heterogeneous GPU resource vendors/names.

Additionally, it is assumed that all GPU-accelerated compute nodes require a single node selector label or taint key/value. In our environment, compute nodes carry multiple taints.

We have worked around some of these issues using pod templates, but we hit a blocker when we needed to support fractional GPUs exposed as nvidia.com/gpu.shared.
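
For context, a minimal sketch of the pod-template workaround we mean: a default Kubernetes PodTemplate that FlytePropeller merges into task pods (via the `default-pod-template-name` plugin setting). The template name, namespace, second taint key, and image below are illustrative, not our actual values:

```yaml
# Sketch only: a default PodTemplate used as a workaround for multiple
# taints. The "default" container name applies these defaults to all
# task containers in Flyte's pod-template merging.
apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-template        # must match default-pod-template-name
  namespace: flyte
template:
  spec:
    tolerations:
      - key: nvidia.com/gpu   # the toleration propeller would normally inject
        operator: Exists
        effect: NoSchedule
      - key: dedicated        # example of a second taint our nodes carry
        operator: Equal
        value: gpu-workloads
        effect: NoSchedule
    containers:
      - name: default
        image: docker.io/rwgrim/docker-noop
        resources:
          limits:
            # The blocker: propeller only knows the single configured
            # gpu-resource-name, so it cannot request this instead.
            nvidia.com/gpu.shared: "1"
```

This covers extra tolerations, but the fractional resource name in the limits still collides with the globally configured `gpu-resource-name`, which is where the workaround breaks down.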

Goal: What should the final outcome look like, ideally?

A more flexible configuration for GPU accelerated workloads where custom resource names, node selectors, and tolerations can be configured.
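
To make the ask concrete, one possible shape such a configuration could take (hypothetical only; this schema and the `gpu-device-classes` key do not exist in Flyte today):

```yaml
# Hypothetical sketch of per-GPU-type configuration. Each device class
# carries its own resource name, node selector, and toleration set,
# instead of one global value for the whole data plane.
plugins:
  k8s:
    gpu-device-classes:
      nvidia-a100-shared:
        resource-name: nvidia.com/gpu.shared   # fractional GPU resource
        node-selector:
          cloud.google.com/gke-accelerator: nvidia-tesla-a100
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          - key: dedicated                     # multiple taints per node
            operator: Equal
            value: gpu-workloads
            effect: NoSchedule
      amd-mi250:
        resource-name: amd.com/gpu             # different vendor, same cluster
        tolerations:
          - key: amd.com/gpu
            operator: Exists
            effect: NoSchedule
```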

Describe alternatives you've considered

None

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • [x] Yes

Have you read the Code of Conduct?

  • [x] Yes

Sovietaced avatar Nov 15 '25 00:11 Sovietaced

Flyte currently allows the GPU resource name, node selector labels, and tolerations to be configured via Helm values, but these settings are global rather than per-task or per-GPU-type. You can set gpu-resource-name, gpu-device-node-label, gpu-partition-size-node-label, and tolerations in the FlytePropeller K8s plugin config, but they cannot vary across heterogeneous clusters or multiple GPU vendors. Fractional GPUs like nvidia.com/gpu.shared are not explicitly documented as supported, and per-task custom resource names or multiple taints/selectors are not natively configurable. Workarounds with pod templates have known limitations, especially for advanced GPU partitioning and shared resources. This limitation is acknowledged in the community, and more granular, flexible configuration is a known feature gap.
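
To illustrate why this is global-only, a sketch of the relevant FlytePropeller K8s plugin section as it stands today (key names as referenced above; exact nesting and label values vary by Flyte version and Helm chart, and are shown here as an assumption):

```yaml
# Sketch of today's cluster-wide GPU settings in the FlytePropeller
# K8s plugin config: one resource name and one label scheme for the
# entire data plane, with no per-task or per-vendor override.
plugins:
  k8s:
    gpu-resource-name: nvidia.com/gpu          # single global resource name
    gpu-device-node-label: cloud.google.com/gke-accelerator
    gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
    resource-tolerations:
      nvidia.com/gpu:                          # keyed by that same resource name
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```

Because every setting hangs off the single `gpu-resource-name`, there is no place to express a second vendor's resource (e.g. amd.com/gpu) or a shared/fractional variant alongside it.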

If you need to support multiple GPU resource names, custom node selectors, or complex taint/toleration setups per task or per GPU type, this is not fully supported in current Flyte releases. The only partial workaround is to use pod templates, but as you noted, this does not cover all advanced use cases (e.g., fractional GPUs). Would you like more detail on current configuration options or ongoing community discussions about this feature?


runllm[bot] avatar Nov 15 '25 00:11 runllm[bot]