v0.15.0 added nodeSelector "nvidia.com/mps.capable": "true"?
Hi!
I have successfully used v0.14.0 with AWS EKS to correctly identify GPUs on AL2 instances. However, with newer versions (starting from v0.15.0), it seems that the daemonset unexpectedly requires "MPS capable" nodes only:
"nodeSelector": {
  "nvidia.com/mps.capable": "true"
},
However, in previous versions, the affinity configuration looks like this:
"affinity": {
  "nodeAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "feature.node.kubernetes.io/pci-10de.present",
              "operator": "In",
              "values": [
                "true"
              ]
            }
          ]
        },
        {
          "matchExpressions": [
            {
              "key": "feature.node.kubernetes.io/cpu-model.vendor_id",
              "operator": "In",
              "values": [
                "NVIDIA"
              ]
            }
          ]
        },
        {
          "matchExpressions": [
            {
              "key": "nvidia.com/gpu.present",
              "operator": "In",
              "values": [
                "true"
              ]
            }
          ]
        }
      ]
    }
  }
},
This allows us to apply the label 'nvidia.com/gpu.present': 'true' to force-run the daemon on instances created by AWS ASG and allows us to scale from zero.
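To make it clearer why a single manually applied label is enough here: in Kubernetes, the entries of nodeSelectorTerms are ORed together, while the matchExpressions inside one term are ANDed. The following is only a rough sketch of that evaluation, not the scheduler's actual code; the names `nodeMatches`, `matchesExpression` and `devicePluginTerms` are made up for illustration, and the Gt/Lt operators and matchFields are left out.

```typescript
type Labels = Record<string, string>;

interface MatchExpression {
  key: string;
  operator: "In" | "NotIn" | "Exists" | "DoesNotExist";
  values?: string[];
}

interface NodeSelectorTerm {
  matchExpressions: MatchExpression[];
}

// Evaluate one expression against a node's labels.
function matchesExpression(labels: Labels, expr: MatchExpression): boolean {
  const value: string | undefined = labels[expr.key];
  const inValues =
    value !== undefined && (expr.values || []).indexOf(value) !== -1;
  switch (expr.operator) {
    case "In":
      return inValues;
    case "NotIn":
      // Also matches when the label is absent.
      return !inValues;
    case "Exists":
      return value !== undefined;
    case "DoesNotExist":
      return value === undefined;
  }
}

// Terms are ORed; expressions within a term are ANDed.
// An empty term matches nothing.
function nodeMatches(labels: Labels, terms: NodeSelectorTerm[]): boolean {
  return terms.some(
    (term) =>
      term.matchExpressions.length > 0 &&
      term.matchExpressions.every((expr) => matchesExpression(labels, expr))
  );
}

// The three terms from the v0.14.0 affinity quoted above.
const devicePluginTerms: NodeSelectorTerm[] = [
  { matchExpressions: [{ key: "feature.node.kubernetes.io/pci-10de.present", operator: "In", values: ["true"] }] },
  { matchExpressions: [{ key: "feature.node.kubernetes.io/cpu-model.vendor_id", operator: "In", values: ["NVIDIA"] }] },
  { matchExpressions: [{ key: "nvidia.com/gpu.present", operator: "In", values: ["true"] }] },
];

// A node carrying only the manually applied label satisfies the third term.
console.log(nodeMatches({ "nvidia.com/gpu.present": "true" }, devicePluginTerms));
// A node with none of the three labels is not selected.
console.log(nodeMatches({}, devicePluginTerms));
```

A plain nodeSelector, by contrast, is a hard AND over all of its key/value pairs, so a nodeSelector on "nvidia.com/mps.capable" cannot be satisfied by the gpu.present label alone.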
Could you please document the recommended way to run the daemonset on the required nodes when using AWS EKS and Cluster Autoscaler, which scales GPU instances from zero?
Best regards, Markus
@markusl the
"nodeSelector": {
  "nvidia.com/mps.capable": "true"
},
should only be defined for the MPS control daemon DaemonSet and only applies if MPS is used to apply space partitioning to existing GPUs.
See: https://github.com/NVIDIA/k8s-device-plugin/blob/b6b81a6126a8f9061e0b0cb67355e5ceec37d254/deployments/helm/nvidia-device-plugin/templates/daemonset-mps-control-daemon.yml#L207-L208
Thanks for the quick answer! I am using AWS CDK for the deployment, which pulls the Helm chart and installs it to the cluster automatically:
cluster.addHelmChart('nvidia-device-plugin', {
  chart: 'nvidia-device-plugin',
  repository: 'https://nvidia.github.io/k8s-device-plugin',
  namespace: 'kube-system',
  version: '0.17.0', // <- causes nodeSelector with "nvidia.com/mps.capable": "true" to appear
});
This has changed between 0.14.0 and 0.15.0 as far as I can tell. Is there something specific that I need to configure to prevent the nodeSelector from appearing?
The device plugin DaemonSet does not have any affinity for the MPS label; only the MPS control DaemonSet has such an affinity.
Here's the current affinity setting of the device plugin's DaemonSet as of 0.17.0:
$ k -n nvidia-device-plugin get ds nvdp-nvidia-device-plugin -o yaml | yq .spec.template.spec.affinity
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: feature.node.kubernetes.io/pci-10de.present
            operator: In
            values:
              - "true"
      - matchExpressions:
          - key: feature.node.kubernetes.io/cpu-model.vendor_id
            operator: In
            values:
              - NVIDIA
      - matchExpressions:
          - key: nvidia.com/gpu.present
            operator: In
            values:
              - "true"
For cross-reference: in the chart's default values, nodeSelector is empty; the device plugin's Helm template shows where these values are used; and the MPS control daemon's template explicitly sets a static label of nvidia.com/mps.capable: "true".
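Since the chart installs more than one DaemonSet, it may help to check which of them actually carries the MPS nodeSelector. Below is a small sketch of such a check over simplified DaemonSet summaries. The `DaemonSetSummary` shape, the function name `daemonSetsRequiringLabel`, and the sample data (including the MPS control daemon's name) are assumptions for illustration; in practice the input would have to be built from something like the output of `kubectl get ds -o json`.

```typescript
// Simplified view of a DaemonSet: its name and the pod template's nodeSelector.
interface DaemonSetSummary {
  name: string;
  nodeSelector?: Record<string, string>;
}

// Return the names of all DaemonSets whose nodeSelector contains the given key.
function daemonSetsRequiringLabel(
  items: DaemonSetSummary[],
  key: string
): string[] {
  return items
    .filter((ds) => Object.keys(ds.nodeSelector || {}).indexOf(key) !== -1)
    .map((ds) => ds.name);
}

// Hypothetical sample data for a release named "nvdp".
const sample: DaemonSetSummary[] = [
  { name: "nvdp-nvidia-device-plugin" },
  {
    name: "nvdp-nvidia-device-plugin-mps-control-daemon",
    nodeSelector: { "nvidia.com/mps.capable": "true" },
  },
];

console.log(daemonSetsRequiringLabel(sample, "nvidia.com/mps.capable"));
```

If only the MPS control daemon shows up in the result, the device plugin DaemonSet itself is still scheduled purely by the affinity shown above.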
It sounds like that label is expected from the GPU feature discovery daemonset, which is optional.
@markusl I'm seeing the same behavior. How were you able to resolve this? The default values file may not assign a value to the nodeSelector, but it's automatically being populated with nvidia.com/mps.capable: "true".
@jicowan it wasn't resolved, unfortunately.
Are you adding a label to the node then?