How can we distinguish between regular slices and ME slices during pod scheduling?

Open kittywaresz opened this issue 2 months ago • 2 comments

I ran into a problem with how resources are exposed from a GPU with MIG enabled: from the perspective of k8s-device-plugin, there is no difference between 1g.10gb and 1g.10gb+me GPU instances.

Actual behavior

Here is my mig-manager config file:

config19-config19:
- devices: [0]
  mig-devices: &id001
    1g.10gb: 5
    1g.10gb+me: 1
    1g.20gb: 1
  mig-enabled: true
- devices: [1]
  mig-devices: *id001
  mig-enabled: true

When I applied this config to my node, which has two A100 80GB GPUs attached, the device plugin exposed these resources:

kind: Node
status:
  capacity:
    ...
    nvidia.com/gpu: "0"
    nvidia.com/mig-1g.10gb: "12"  # (5 x 1g.10gb + 1 x 1g.10gb+me) * 2
    nvidia.com/mig-1g.20gb: "2"   # 1 x 1g.20gb * 2
    ...

As you can see, the 1g.10gb and 1g.10gb+me GPU instances are counted together under the same resource, nvidia.com/mig-1g.10gb.
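
To make the consequence concrete, here is a minimal sketch of a Pod that requests one of these slices today. The Pod name and image are hypothetical placeholders; only the resource name is taken from the capacity output above. Because both profiles are counted under nvidia.com/mig-1g.10gb, nothing in this request states whether a 1g.10gb or a 1g.10gb+me instance should be allocated.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload                      # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: example.com/cuda-app:latest    # hypothetical image
    resources:
      limits:
        # Resource name as currently exposed on the node. Judging by the
        # capacity above, this pool contains both regular and +me slices,
        # so the request cannot express a preference between them.
        nvidia.com/mig-1g.10gb: 1
```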

Desired behavior

I would like to see this:

kind: Node
status:
  capacity:
    ...
    nvidia.com/gpu: "0"
    nvidia.com/mig-1g.10gb: "10"    # 5 x 1g.10gb * 2
    nvidia.com/mig-1g.10gb.me: "2"  # 1 x 1g.10gb+me * 2
    nvidia.com/mig-1g.20gb: "2"     # 1 x 1g.20gb * 2
    ...

Why I need this

I want to run a Pod that must run on an ME GPU instance only, but I don't think that is currently possible. I cannot request it as a nvidia.com/mig-1g.10gb.me resource, because that resource is not exposed. I cannot use nodeAffinity rules either, because affinity rules only let me target the desired node, not a specific device on that node.
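
As a sketch of the desired outcome, a Pod could then request an ME slice explicitly, as below. This is illustrative only: nvidia.com/mig-1g.10gb.me is the resource name proposed under "Desired behavior", not something the device plugin exposes in the setup described here, and the Pod name and image are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: me-workload                       # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: example.com/me-app:latest      # hypothetical image
    resources:
      limits:
        # Proposed resource name from "Desired behavior"; it does not
        # exist on the node today, so this Pod would stay Pending.
        nvidia.com/mig-1g.10gb.me: 1
```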

How can I achieve this? Maybe there are k8s-device-plugin config flags or something else?

kittywaresz, Oct 20 '25 12:10