intel-device-plugins-for-kubernetes
Understanding and controlling multi-GPU behavior
Describe the support request
Hi there, this is a bit of a follow-up on my previous issue (https://github.com/intel/intel-device-plugins-for-kubernetes/issues/1769).
What is the behavior of the GPU plugin on a multi-Intel-GPU system when installing with NFD where an app requests a GPU with (assume only i915 driver enabled on host):
```yaml
resources:
  limits:
    gpu.intel.com/i915: 1
```
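For reference, a minimal pod spec along these lines is what I have in mind (image and names are just placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu-test
      image: ubuntu:22.04              # placeholder image, not the actual workload
      command: ["ls", "/dev/dri"]
      resources:
        limits:
          gpu.intel.com/i915: 1
```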
For example:
- Which GPU device will be used for the first app requesting a GPU? Is there any way to control this?
- It appears from the docs that when using sharedDevNum=N, all slots will be filled on one of the GPUs before apps are scheduled on the next GPU? Is that right?
System (please complete the following information if applicable):
- OS version: Ubuntu 22.04, 24.04
- Kernel version: 6.8.0-40-generic
- Device plugins version: v0.30.0
- Hardware info: iGPU and dGPU
Hi @frenchwr, sorry for the delay. I forgot to answer.
What is the behavior of the GPU plugin on a multi-Intel-GPU system when installing with NFD where an app requests a GPU with (assume only i915 driver enabled on host):
We don't really support multi-GPU scenarios where there are different GPU types in one node. They are all registered under the same i915 resource name and it isn't possible to request a specific one of them.
- Which GPU device will be used for the first app requesting a GPU? Is there any way to control this?
The scheduler gives us a list of possible devices, and from that list the first n devices are selected based on how many i915 resources are requested. When used with "shared-dev-num > 1", allocationPolicy changes which GPUs are filled first.
- It appears from the docs that when using sharedDevNum=N, all slots will be filled on one of the GPUs before apps are scheduled on the next GPU? Is that right?
Depends on the policy. With the "packed" policy, the first GPU is filled, then the second, then the third etc. With "balanced", the first container goes to gpu1, then gpu2, gpu3, gpu1, gpu2, gpu3, gpu1 etc., assuming there are three GPUs on the node.
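As a rough sketch, with the operator the sharing count and policy are set in the GpuDevicePlugin CR along these lines (field names written from memory, please double-check against the operator README):

```yaml
apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
metadata:
  name: gpudeviceplugin-sample
spec:
  image: intel/intel-gpu-plugin:0.30.0
  sharedDevNum: 4                      # each GPU exposed as 4 schedulable i915 slots
  preferredAllocationPolicy: packed    # or "balanced" / "none"
  logLevel: 2
  nodeSelector:
    intel.feature.node.kubernetes.io/gpu: "true"
```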
@tkatila - no worries! I appreciate the reply.
Depends on the policy. With the "packed" policy, the first GPU is filled, then the second, then the third etc. With "balanced", the first container goes to gpu1, then gpu2, gpu3, gpu1, gpu2, gpu3, gpu1 etc., assuming there are three GPUs on the node.
The situation I'm imagining is a user has an iGPU and dGPU on a single system. I think in most scenarios a user will prefer the dGPU be used first, but on an example system I see the dGPU listed as card1 while the iGPU is listed as card0:
```
# Intel® Arc™ Pro A60M Graphics
card1 8086:56b2 pci:vendor=8086,device=56B2,card=0
└─renderD129
# Intel® Iris® Xe Graphics (Raptor Lake)
card0 8086:a7a0 pci:vendor=8086,device=A7A0,card=0
└─renderD128
```
Does this mean the plugin would use the iGPU first?
on an example system I see the dGPU listed as card1 while the iGPU is listed as card0
Device name indexes, and device file names in general, come from sysfs & devfs, i.e. from the kernel.
Container runtimes map the whole host sysfs to containers. While it would be possible to map a device file to /dev/dri/ within the container using some other name than what the device has in sysfs, that would only cause problems, because many applications also scan sysfs.
Note: not mapping the device file names also has a problem, but that's limited to legacy media APIs: https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/README.md#issues-with-media-workloads-on-multi-gpu-setups
Does this mean the plugin would use the iGPU first?
The GPU plugin just lists the devices to the k8s scheduler as extended resources. It's the k8s scheduler which then selects one of them. So yes, it may be the first selection.
GAS (with some help from GPU plugin) can provide extra control over that: https://github.com/intel/platform-aware-scheduling/blob/master/gpu-aware-scheduling/docs/usage.md
You could use the GAS denylist container annotation to ignore card0, or ask for a resource missing from iGPUs (VRAM), or just disable the iGPU in BIOS.
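The denylist route would look roughly like this in the workload pod spec (I'm writing the annotation name from memory, so please verify it against the GAS usage doc linked above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dgpu-workload
  annotations:
    # Annotation name from memory; check the GAS usage doc for the exact key.
    gas-deny: "card0"                  # keep this pod off the iGPU (card0 on that system)
spec:
  containers:
    - name: app
      image: my-app:latest             # placeholder
      resources:
        limits:
          gpu.intel.com/i915: 1
```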
PS. @tkatila, I remember you were earlier looking into a GPU plugin option for ignoring iGPUs. Did anything come out of it? I don't see it mentioned in the GPU plugin README.
You could use the GAS denylist container annotation to ignore card0, or ask for a resource missing from iGPUs (VRAM), or just disable the iGPU in BIOS.
GAS assumes a homogeneous cluster. If a node has 2 GPUs and reports 16GB of VRAM, GAS calculates that each GPU has 8GB of VRAM. So using VRAM as a resource doesn't work. The denylist has the issue of cards enumerating in a different order.
PS. @tkatila, I remember you were earlier looking into a GPU plugin option for ignoring iGPUs. Did anything come out of it? I don't see it mentioned in the GPU plugin README.
Yeah, I did wonder about it. But it got buried under other things. The idea was to have two methods:
- Whitelist
  - Only register GPUs of a certain PCI Device ID
- Resource renaming based on GPU type
  - Rename the i915 resource as i915-flex, i915-arc etc., or just i915-0x1234 by the PCI Device ID (a hypothetical workload request against such a renamed resource is sketched below).
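With the renaming idea, a workload would then pick the GPU type purely through the resource name, hypothetically something like this (none of these renamed resources exist today):

```yaml
# Hypothetical: renamed resources are only an idea, nothing like this is implemented.
resources:
  limits:
    gpu.intel.com/i915-arc: 1          # would land only on Arc dGPUs
```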
In general, I don't think one should depend on card0 being the iGPU or the other way around. They can enumerate in a different order on some boots, and then the wrong GPU would be used. Also, I'm not sure the scheduler returns the device list sorted.
The denylist has the issue of cards enumerating in a different order. ... In general, I don't think one should depend on card0 being the iGPU or the other way around. They can enumerate in a different order on some boots, and then the wrong GPU would be used.
I don't think I've ever seen an (Intel) iGPU as anything else than card0, so I thought the driver checks that before dGPUs. Have you seen it under some other name?
Note: if there are other than Intel GPUs and their kernel drivers, then it's possible another GPU driver would be loaded before the Intel one, meaning that the Intel indexes would not start from 0 => denylist is not a general solution, just a potential workaround for this particular ticket (which I thought to be about nodes having only Intel GPUs)...
I don't think I've ever seen an (Intel) iGPU as anything else than card0, so I thought the driver checks that before dGPUs. Have you seen it under some other name? Note: if there are other than Intel GPUs and their kernel drivers, then it's possible another GPU driver would be loaded before the Intel one, meaning that the Intel indexes would not start from 0 => denylist is not a general solution, just a potential workaround for this particular ticket (which I thought to be about nodes having only Intel GPUs)...
If only Intel cards are in the host, yes, then card0 will be Intel. I've seen cases where KVM has occupied card0 and Intel cards have then taken card1 etc.
But the point I was trying to make was that depending on the boot, the iGPU could be card0 or card1. At least with multiple cards, the PCI addresses of the cards vary between boots.
With #2101 one can tell GPU plugin to accept or deny GPUs having specific IDs.
What if it would be possible to run multiple GPU plugin set instances with different allow/deny lists and different resource names specified for them? One set would have allowlist for all dGPUs, and another would have it as denylist (resource names for them could then be e.g. i915_dgpu, i915_igpu or xe_dgpu, xe_igpu).
The admin would then have full control over how things are set up. The downside of that freedom would be configs between such clusters not being compatible due to admin-selected names, nodes (potentially) running multiple plugin instances, and the device operator (currently) being able to manage only one of those sets.
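Purely as a sketch of the idea (per-instance allow/deny lists and resource names do not exist in the current GpuDevicePlugin CRD, so all of these fields are hypothetical):

```yaml
# Hypothetical sketch only: resourceName, allowedIDs and deniedIDs are NOT
# fields in the current GpuDevicePlugin CRD.
apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
metadata:
  name: gpu-dgpu
spec:
  resourceName: i915_dgpu              # hypothetical
  allowedIDs: ["0x56b2"]               # hypothetical: register only these PCI Device IDs
  sharedDevNum: 4
---
apiVersion: deviceplugin.intel.com/v1
kind: GpuDevicePlugin
metadata:
  name: gpu-igpu
spec:
  resourceName: i915_igpu              # hypothetical
  deniedIDs: ["0x56b2"]                # hypothetical: everything except the dGPU IDs
  sharedDevNum: 1
```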
What if it would be possible to run multiple GPU plugin set instances with different allow/deny lists and different resource names specified for them? One set would have allowlist for all dGPUs, and another would have it as denylist (resource names for them could then be e.g. i915_dgpu, i915_igpu or xe_dgpu, xe_igpu).
Yep, that's what I also thought about.
The admin would then have full control over how things are set up. The downside of that freedom would be configs between such clusters not being compatible due to admin-selected names, nodes (potentially) running multiple plugin instances, and the device operator (currently) being able to manage only one of those sets.
Yep again. Workload modifications due to resource name differences and multiple plugin Pods per node would be the main downsides. The device plugin operator can nowadays run multiple plugin instances.
The plugin would need to be modified so that each plugin instance has its own socket.
Going to piggyback here in hopes of some insight.
I have a k8s cluster with 8 nodes. 4 nodes have 2 Arc GPUs. 4 nodes have 6 Arc GPUs.
Using the operator with no shared-dev-num=1, it seems like the operator isn't providing access to GPUs 3,4,5,6 on the larger nodes. Although the operator assigns the device properly to the pod, ffmpeg is not able to access the device.
Are there any tricks for running with a larger number of GPUs?
I have a k8s cluster with 8 nodes. 4 nodes have 2 Arc GPUs. 4 nodes have 6 Arc GPUs.
Nice! :)
Using the operator with no shared-dev-num=1, it seems like the operator isn't providing access to GPUs 3,4,5,6 on the larger nodes. Although the operator assigns the device properly to the pod, ffmpeg is not able to access the device.
Are there any tricks for running with a larger number of GPUs?
Are all the GPUs the same on the 6xGPU nodes, i.e. there are no integrated GPUs or different Arc variants?
When the workload hits GPUs 3, 4, 5 or 6, what's the error?
In general, whether there is 1 GPU or 8, the behavior should be identical.
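If it helps with debugging, a throwaway pod along these lines shows which device file the plugin actually hands to the container (image is just an example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dri-check
spec:
  restartPolicy: Never
  containers:
    - name: dri-check
      image: busybox:1.36              # example image
      command: ["sh", "-c", "ls -l /dev/dri"]
      resources:
        limits:
          gpu.intel.com/i915: 1
```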