The MPS container has started running, but GPU resources cannot be accessed inside the container
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
- Kernel Version: 5.15.0-113-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd 1.7.18
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s v1.28.6 (KubeEdge v1.17.0)
2. Issue or feature description
After deploying the NVIDIA device plugin with the commands below, the node successfully advertises 10 schedulable nvidia.com/gpu resources. However, when MPS mode is used, YOLO inside the container cannot access the GPU when it tries to use it; with timeSlicing the same workload works. Do I need to enable something else? I have already started "nvidia-cuda-mps-control -d", and running "nvidia-smi" inside the container does show the GPU.
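The in-container check I use is roughly the following (a minimal sketch; it assumes the image ships Python with PyTorch, which YOLO runs on, and uses the pod name from the node description further down):
# Check whether CUDA is visible to PyTorch from inside the running pod
kubectl exec -n edge video-57f9659fb9-gpttn -- \
  python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"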
3. Information to attach (optional if deemed irrelevant)
root@VM-16-14-ubuntu:/dev/shm# nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-32GB On | 00000000:00:08.0 Off | 0 |
| N/A 35C P0 23W / 300W | 32MiB / 32768MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 93272 C nvidia-cuda-mps-server 30MiB |
+---------------------------------------------------------------------------------------+
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
cat << EOF > /tmp/dp-config.yaml
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
EOF
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set tolerations[0].key=node-role.kubernetes.io/edge \
  --set tolerations[0].operator=Exists \
  --set tolerations[0].effect=NoSchedule \
  --set tolerations[1].key=nvidia.com/gpu \
  --set tolerations[1].operator=Exists \
  --set tolerations[1].effect=NoSchedule \
  --set-file config.map.config=/tmp/dp-config.yaml
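After the install I run a quick sanity check that both daemonsets came up and that the node advertises the replicated resource (a sketch; the node name is the one from the describe output below):
# Expect one device-plugin pod and one mps-control-daemon pod for the GPU node
kubectl get pods -n nvidia-device-plugin
# With replicas: 10, the node should advertise nvidia.com/gpu: 10
kubectl describe node edgenode-test | grep 'nvidia.com/gpu'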
root@master01:/home/ubuntu# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
nvidia-device-plugin nvidia-device-plugin-5cdkb 2/2 Running 0 69m
nvidia-device-plugin nvidia-device-plugin-mps-control-daemon-zr4hj 2/2 Running 0 69m
.......
.....
root@master01:/home/ubuntu# kubectl describe nodes edgenode-test
Name: edgenode-test
Roles: agent,edge
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-model.vendor_id=NVIDIA
feature.node.kubernetes.io/pci-10de.present=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=edgenode-test
kubernetes.io/os=linux
node-role.kubernetes.io/agent=
node-role.kubernetes.io/edge=
nos.nebuly.com/gpu-partitioning=mps
nvidia.com/gpu.present=true
nvidia.com/mps.capable=true
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 05 Jul 2024 18:09:42 +0800
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: edgenode-test
AcquireTime: <unset>
RenewTime: Mon, 08 Jul 2024 17:56:17 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Mon, 08 Jul 2024 17:54:25 +0800 Mon, 08 Jul 2024 15:06:35 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Mon, 08 Jul 2024 17:54:25 +0800 Mon, 08 Jul 2024 15:06:35 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Mon, 08 Jul 2024 17:54:25 +0800 Mon, 08 Jul 2024 15:06:35 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 08 Jul 2024 17:54:25 +0800 Mon, 08 Jul 2024 15:06:35 +0800 EdgeReady edge is posting ready status. AppArmor enabled
Addresses:
InternalIP: 119.45.165.216
Hostname: edgenode-test
Capacity:
cpu: 10
ephemeral-storage: 489087124Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 40284092Ki
nvidia.com/gpu: 10
nvidia.com/gpu.shared: 0
pods: 110
Allocatable:
cpu: 10
ephemeral-storage: 450742692733
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 40181692Ki
nvidia.com/gpu: 10
nvidia.com/gpu.shared: 0
pods: 110
System Info:
Machine ID: 36721b810c324ab782c87a701e59cb09
System UUID: 36721b81-0c32-4ab7-82c8-7a701e59cb09
Boot ID: 86d43f47-cd9f-451a-9d44-fcb9ceb70d73
Kernel Version: 5.15.0-113-generic
OS Image: Ubuntu 22.04 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.18
Kubelet Version: v1.28.6-kubeedge-v1.17.0
Kube-Proxy Version: v0.0.0-master+$Format:%H$
PodCIDR: 192.168.21.0/24
PodCIDRs: 192.168.21.0/24
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
edge video-57f9659fb9-gpttn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 19m
edge video-b58c6685c-gzg2g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 19m
kubeedge cloud-iptables-manager-k9v9c 100m (1%) 200m (2%) 25Mi (0%) 50Mi (0%) 2d23h
kubeedge edge-eclipse-mosquitto-6c2mm 100m (1%) 200m (2%) 50Mi (0%) 100Mi (0%) 2d23h
nvidia-device-plugin nvidia-device-plugin-5cdkb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 70m
nvidia-device-plugin nvidia-device-plugin-mps-control-daemon-zr4hj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 70m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 200m (2%) 400m (4%)
memory 75Mi (0%) 150Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 2 2
nvidia.com/gpu.shared 0 0
Events: <none>
My Deployment:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations: {}
  labels:
    k8s.kuboard.cn/layer: svc
    k8s.kuboard.cn/name: video
  name: video
  namespace: edge
  resourceVersion: '1597692'
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      k8s.kuboard.cn/layer: svc
      k8s.kuboard.cn/name: video
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: '2024-07-08T17:37:11+08:00'
      creationTimestamp: null
      labels:
        k8s.kuboard.cn/layer: svc
        k8s.kuboard.cn/name: video
    spec:
      containers:
        - image: 'harbor.moolink.net/moolink/video-supervision:v1.2'
          imagePullPolicy: IfNotPresent
          name: video
          resources:
            limits:
              nvidia.com/gpu: '1'
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /dev/shm
              name: shm
      dnsPolicy: ClusterFirst
      nodeName: edgenode-test
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - hostPath:
            path: /dev/shm
            type: Directory
          name: shm
I met a similar issue and found a comment at https://github.com/NVIDIA/k8s-device-plugin/issues/467#issuecomment-1974252052: remove /dev/shm from your deployment and try again. Could you report back whether this works?
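If it helps, removing the mount could be done roughly like this (a sketch, not tested here; the JSON-patch indices assume the shm mount and volume are each the first entry, as in the deployment above):
# Drop the hostPath /dev/shm mount and volume so they no longer shadow
# whatever the device plugin injects for MPS
kubectl -n edge patch deployment video --type json -p '[
  {"op": "remove", "path": "/spec/template/spec/containers/0/volumeMounts/0"},
  {"op": "remove", "path": "/spec/template/spec/volumes/0"}
]'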
~Running with MPS requires you to set hostPID: true in the Pod spec, which I don't see you've done. I suspect this would resolve the issue.~ Correction: this is only required on GKE.
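For completeness, on GKE that would look roughly like the following (a sketch only; per the correction above it is not needed outside GKE):
# GKE only: run the workload pod in the host PID namespace
kubectl -n edge patch deployment video --type merge -p '{"spec":{"template":{"spec":{"hostPID":true}}}}'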
Why, after running
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set-file config.map.config=config.yaml
is there no nvidia-device-plugin-mps-control-daemon pod?
kubectl get pod -n nvidia-device-plugin
NAME                              READY   STATUS    RESTARTS   AGE
nvdp-nvidia-device-plugin-fmhlh   2/2     Running   0          21m
And the node reports nvidia.com/gpu: 0:
Capacity:
cpu: 16
ephemeral-storage: 309506092Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131500512Ki
nvidia.com/gpu: 0
pods: 110
Allocatable:
cpu: 16
ephemeral-storage: 285240813915
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131193312Ki
nvidia.com/gpu: 0
pods: 110
Using nvidia-device-plugin 0.16.1.
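In case it helps, this is roughly how I've been trying to narrow it down (a sketch; the pod name comes from the kubectl output above):
# Inspect the device plugin logs on the node that reports nvidia.com/gpu: 0
kubectl logs -n nvidia-device-plugin nvdp-nvidia-device-plugin-fmhlh --all-containers
# Check which sharing config the chart actually mounted
kubectl get configmap -n nvidia-device-plugin -o yaml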
Hi, any update on making the MPS shm size configurable in recent releases? We cannot use it like this; each of our workloads needs a different shm size.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.