All workloads except cuda-sample:vectorAdd hang under MPS mode
1. Quick Debug Information
- OS/Version: Garden Linux 934.11
- Kernel Version: 5.15.135-gardenlinux-amd64
- Container Runtime Type/Version: containerd 1.6.20
- K8s Flavor/Version: K8s v1.26.11
2. Issue or feature description
To confirm that the MPS feature can meet our needs, I'm running a comprehensive test suite for further analysis.
I've been testing the MPS feature offered by the recently released device plugin v0.15.0 on our bare-metal K8s node equipped with 3 V100s. However, only the standard cuda-sample:vectorAdd test executes successfully; the remaining test cases stay in a hang state and never progress.
✅ Passed case
- cuda-samples/vectorAdd
❌ Failed case
- tf-notebook (tensorflow/tensorflow:latest-gpu-jupyter)
3. Information to attach (optional if deemed irrelevant)
I am using the following ConfigMap (applied with `kubectl apply -f ntr-mps-cm.yaml`) to enable MPS on our GPU bare-metal node:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ntr-mps-cm
  namespace: ssdl-vgpu
data:
  any: |
    version: v1
    flags:
      failOnInitError: true
      nvidiaDriverRoot: "/run/nvidia/driver/"
      plugin:
        deviceListStrategy: envvar
        deviceIDStrategy: uuid
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```
And I use the following command to install the latest device plugin:
```sh
helm install --wait k8s-vgpu nvdp/nvidia-device-plugin \
  --namespace ssdl-vgpu \
  --version 0.15.0 \
  --set config.name=ntr-time-cm \
  --set compatWithCPUManager=true
```
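For reference, the test workloads simply request one of the shared replicas; a minimal pod sketch (pod name and image tag here are illustrative, not taken from this issue) would look like this:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vectoradd-test            # illustrative name
  namespace: ssdl-vgpu
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04  # assumed sample image
    resources:
      limits:
        nvidia.com/gpu: 1         # one of the 4 MPS replicas of a physical GPU
```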
The related logs from the MPS control daemon and the device plugin are attached below.
MPS control daemon logs:
```
2024-04-23T06:04:59.093748766Z I0423 06:04:59.092447 55 main.go:78] Starting NVIDIA MPS Control Daemon 435bfb70
2024-04-23T06:04:59.093839331Z commit: 435bfb70a44b74daca23fe957a0f256afaa3c51e
2024-04-23T06:04:59.093853836Z I0423 06:04:59.092736 55 main.go:55] "Starting NVIDIA MPS Control Daemon" version=<
2024-04-23T06:04:59.093867206Z 435bfb70
2024-04-23T06:04:59.093878874Z commit: 435bfb70a44b74daca23fe957a0f256afaa3c51e
2024-04-23T06:04:59.093889630Z >
2024-04-23T06:04:59.093900576Z I0423 06:04:59.092874 55 main.go:107] Starting OS watcher.
2024-04-23T06:04:59.094030621Z I0423 06:04:59.093463 55 main.go:121] Starting Daemons.
2024-04-23T06:04:59.094062157Z I0423 06:04:59.093551 55 main.go:164] Loading configuration.
2024-04-23T06:04:59.094686515Z I0423 06:04:59.094616 55 main.go:172] Updating config with default resource matching patterns.
2024-04-23T06:04:59.094774486Z I0423 06:04:59.094718 55 main.go:183]
2024-04-23T06:04:59.094792480Z Running with config:
2024-04-23T06:04:59.094796229Z {
2024-04-23T06:04:59.094799702Z "version": "v1",
2024-04-23T06:04:59.094802996Z "flags": {
2024-04-23T06:04:59.094806479Z "migStrategy": "none",
2024-04-23T06:04:59.094809741Z "failOnInitError": true,
2024-04-23T06:04:59.094813226Z "nvidiaDriverRoot": "/run/nvidia/driver/",
2024-04-23T06:04:59.094816715Z "gdsEnabled": null,
2024-04-23T06:04:59.094820702Z "mofedEnabled": null,
2024-04-23T06:04:59.094823952Z "useNodeFeatureAPI": null,
2024-04-23T06:04:59.094827235Z "plugin": {
2024-04-23T06:04:59.094830383Z "passDeviceSpecs": null,
2024-04-23T06:04:59.094833529Z "deviceListStrategy": [
2024-04-23T06:04:59.094836708Z "envvar"
2024-04-23T06:04:59.094840338Z ],
2024-04-23T06:04:59.094844148Z "deviceIDStrategy": "uuid",
2024-04-23T06:04:59.094847589Z "cdiAnnotationPrefix": null,
2024-04-23T06:04:59.094850905Z "nvidiaCTKPath": null,
2024-04-23T06:04:59.094854382Z "containerDriverRoot": null
2024-04-23T06:04:59.094857781Z }
2024-04-23T06:04:59.094861393Z },
2024-04-23T06:04:59.094870585Z "resources": {
2024-04-23T06:04:59.094873870Z "gpus": [
2024-04-23T06:04:59.094877070Z {
2024-04-23T06:04:59.094882321Z "pattern": "*",
2024-04-23T06:04:59.094885531Z "name": "nvidia.com/gpu"
2024-04-23T06:04:59.094888738Z }
2024-04-23T06:04:59.094891938Z ]
2024-04-23T06:04:59.094895130Z },
2024-04-23T06:04:59.094898375Z "sharing": {
2024-04-23T06:04:59.094901556Z "timeSlicing": {},
2024-04-23T06:04:59.094904765Z "mps": {
2024-04-23T06:04:59.094908027Z "failRequestsGreaterThanOne": true,
2024-04-23T06:04:59.094911202Z "resources": [
2024-04-23T06:04:59.094914378Z {
2024-04-23T06:04:59.094917668Z "name": "nvidia.com/gpu",
2024-04-23T06:04:59.094920946Z "devices": "all",
2024-04-23T06:04:59.094924462Z "replicas": 4
2024-04-23T06:04:59.094927756Z }
2024-04-23T06:04:59.094931025Z ]
2024-04-23T06:04:59.094934399Z }
2024-04-23T06:04:59.094937659Z }
2024-04-23T06:04:59.094940881Z }
2024-04-23T06:04:59.094944445Z I0423 06:04:59.094737 55 main.go:187] Retrieving MPS daemons.
2024-04-23T06:04:59.367188348Z I0423 06:04:59.366922 55 daemon.go:93] "Staring MPS daemon" resource="nvidia.com/gpu"
2024-04-23T06:04:59.443650558Z I0423 06:04:59.443468 55 daemon.go:131] "Starting log tailer" resource="nvidia.com/gpu"
2024-04-23T06:04:59.445568425Z [2024-04-23 06:04:59.389 Control 72] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
2024-04-23T06:04:59.445618049Z [2024-04-23 06:04:59.389 Control 72] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
2024-04-23T06:04:59.445627888Z [2024-04-23 06:04:59.415 Control 72] Accepting connection...
2024-04-23T06:04:59.445635791Z [2024-04-23 06:04:59.415 Control 72] NEW UI
2024-04-23T06:04:59.445643954Z [2024-04-23 06:04:59.415 Control 72] Cmd:set_default_device_pinned_mem_limit 0 4096M
2024-04-23T06:04:59.445651895Z [2024-04-23 06:04:59.415 Control 72] UI closed
2024-04-23T06:04:59.445659400Z [2024-04-23 06:04:59.418 Control 72] Accepting connection...
2024-04-23T06:04:59.445667180Z [2024-04-23 06:04:59.418 Control 72] NEW UI
2024-04-23T06:04:59.445674877Z [2024-04-23 06:04:59.418 Control 72] Cmd:set_default_device_pinned_mem_limit 1 4096M
2024-04-23T06:04:59.445682790Z [2024-04-23 06:04:59.419 Control 72] UI closed
2024-04-23T06:04:59.445690190Z [2024-04-23 06:04:59.439 Control 72] Accepting connection...
2024-04-23T06:04:59.445697598Z [2024-04-23 06:04:59.439 Control 72] NEW UI
2024-04-23T06:04:59.445705544Z [2024-04-23 06:04:59.439 Control 72] Cmd:set_default_device_pinned_mem_limit 2 4096M
2024-04-23T06:04:59.445713416Z [2024-04-23 06:04:59.439 Control 72] UI closed
2024-04-23T06:04:59.445720849Z [2024-04-23 06:04:59.442 Control 72] Accepting connection...
2024-04-23T06:04:59.445728295Z [2024-04-23 06:04:59.442 Control 72] NEW UI
2024-04-23T06:04:59.445735742Z [2024-04-23 06:04:59.442 Control 72] Cmd:set_default_active_thread_percentage 25
2024-04-23T06:04:59.445749210Z [2024-04-23 06:04:59.442 Control 72] 25.0
2024-04-23T06:04:59.445757584Z [2024-04-23 06:04:59.442 Control 72] UI closed
2024-04-23T06:05:26.660115113Z [2024-04-23 06:05:26.659 Control 72] Accepting connection...
2024-04-23T06:05:26.660166122Z [2024-04-23 06:05:26.659 Control 72] NEW UI
2024-04-23T06:05:26.660179349Z [2024-04-23 06:05:26.659 Control 72] Cmd:get_default_active_thread_percentage
2024-04-23T06:05:26.660189426Z [2024-04-23 06:05:26.659 Control 72] 25.0
```
NVIDIA device plugin logs:
```
I0423 06:04:56.479370 39 main.go:279] Retrieving plugins.
I0423 06:04:56.480927 39 factory.go:104] Detected NVML platform: found NVML library
I0423 06:04:56.480999 39 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0423 06:04:56.558122 39 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: failed to send command to MPS daemon: exit status 1
I0423 06:04:56.558152 39 main.go:208] Failed to start one or more plugins. Retrying in 30s...
I0423 06:05:26.587365 39 main.go:315] Stopping plugins.
I0423 06:05:26.587429 39 server.go:185] Stopping to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0423 06:05:26.587496 39 main.go:200] Starting Plugins.
I0423 06:05:26.587508 39 main.go:257] Loading configuration.
I0423 06:05:26.588327 39 main.go:265] Updating config with default resource matching patterns.
I0423 06:05:26.588460 39 main.go:276]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": true,
"mpsRoot": "/run/nvidia/mps",
"nvidiaDriverRoot": "/run/nvidia/driver/",
"gdsEnabled": false,
"mofedEnabled": false,
"useNodeFeatureAPI": null,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {},
"mps": {
"failRequestsGreaterThanOne": true,
"resources": [
{
"name": "nvidia.com/gpu",
"devices": "all",
"replicas": 4
}
]
}
}
}
I0423 06:05:26.588478 39 main.go:279] Retrieving plugins.
I0423 06:05:26.588515 39 factory.go:104] Detected NVML platform: found NVML library
I0423 06:05:26.588559 39 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0423 06:05:26.660603 39 server.go:176] "MPS daemon is healthy" resource="nvidia.com/gpu"
I0423 06:05:26.661346 39 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I0423 06:05:26.662857 39 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0423 06:05:26.673960 39 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
```
✅ Passed case
The cuda-samples/vectorAdd workload completes successfully; here is its output and the corresponding nvidia-smi log:
```
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: N/A |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-16GB Off | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 25W / 250W | 34MiB / 16384MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE-16GB Off | 00000000:AF:00.0 Off | 0 |
| N/A 31C P0 27W / 250W | 34MiB / 16384MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE-16GB Off | 00000000:D8:00.0 Off | 0 |
| N/A 31C P0 27W / 250W | 34MiB / 16384MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1127738 C nvidia-cuda-mps-server 30MiB |
| 0 N/A N/A 1397691 M+C /cuda-samples/vectorAdd 18MiB |
| 0 N/A N/A 1397696 M+C /cuda-samples/vectorAdd 10MiB |
| 1 N/A N/A 1127738 C nvidia-cuda-mps-server 30MiB |
| 1 N/A N/A 1397689 M+C /cuda-samples/vectorAdd 94MiB |
| 1 N/A N/A 1397714 M+C /cuda-samples/vectorAdd 10MiB |
| 2 N/A N/A 1127738 C nvidia-cuda-mps-server 30MiB |
| 2 N/A N/A 1397712 M+C /cuda-samples/vectorAdd 10MiB |
+---------------------------------------------------------------------------------------+
```
❌ Failed case
I am using this classification.ipynb to run an end-to-end verification of MPS with TensorFlow, but the pod has been hanging for 60 minutes without any response, and the logs do not reveal more details.
classification.ipynb logs:
```
2024-04-25 02:10:19.499819: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[I 2024-04-25 02:10:20.630 ServerApp] Connecting to kernel 562db3f6-327c-4adc-83e9-735fb4d0042c.
[I 2024-04-25 02:10:36.700 ServerApp] Connecting to kernel 562db3f6-327c-4adc-83e9-735fb4d0042c.
```
And here is the current MPS daemon log:
```
[2024-04-25 02:10:31.040 Control 74] NEW CLIENT 0 from user 0: Server already exists
[2024-04-25 02:10:31.153 Control 74] Accepting connection...
[2024-04-25 02:10:31.153 Control 74] NEW CLIENT 0 from user 0: Server already exists
[2024-04-25 02:10:31.335 Control 74] Accepting connection...
[2024-04-25 02:10:31.336 Control 74] User did not send valid credentials
[2024-04-25 02:10:31.336 Control 74] Accepting connection...
[2024-04-25 02:10:31.336 Control 74] NEW CLIENT 0 from user 0: Server already exists
[2024-04-25 02:10:31.344 Control 74] Accepting connection...
[2024-04-25 02:10:31.344 Control 74] NEW CLIENT 0 from user 0: Server already exists
[2024-04-25 02:10:31.389 Control 74] Accepting connection...
[2024-04-25 02:10:31.389 Control 74] User did not send valid credentials
[2024-04-25 02:10:31.389 Control 74] Accepting connection...
[2024-04-25 02:10:31.389 Control 74] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-04-25 02:10:31.535 Control 74] Accepting connection...
[2024-04-25 02:10:31.535 Control 74] User did not send valid credentials
[2024-04-25 02:10:31.536 Control 74] Accepting connection...
[2024-04-25 02:10:31.536 Control 74] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2024-04-25 02:10:31.590 Control 74] Server 91 exited with status 0
[2024-04-25 02:10:31.590 Control 74] Starting new server 7246 for user 0
[2024-04-25 02:10:31.618 Control 74] Accepting connection...
[2024-04-25 02:10:31.884 Control 74] NEW SERVER 7246: Ready
[2024-04-25 02:10:31.893 Control 74] Accepting connection...
[2024-04-25 02:10:31.893 Control 74] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2024-04-25 02:18:33.152 Control 74] Accepting connection...
[2024-04-25 02:18:33.152 Control 74] User did not send valid credentials
[2024-04-25 02:18:33.152 Control 74] Accepting connection...
[2024-04-25 02:18:33.152 Control 74] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
```
@elezar Could you please take a look?
I already tried `echo get_default_active_thread_percentage | nvidia-cuda-mps-control` from https://github.com/NVIDIA/k8s-device-plugin/issues/647; everything looks fine, yet the hang issue still happens.
```
root@gpu-pod:/# echo get_default_active_thread_percentage | nvidia-cuda-mps-control
25.0
```
Hi team, why are my pods getting a multiprocessor count of 4 instead of 40? The issue was observed after upgrading our Amazon EKS cluster to version 1.26. We are using NVIDIA Tesla T4 GPUs, which under normal conditions report a multiprocessor count of 40. However, post-upgrade, the multiprocessor count is reduced to 4 whenever multiple pods are deployed on the same node.
```
kubectl logs nvdp-nvidia-device-plugin-mps-control-daemon-6vzfc -c mps-control-daemon-ctr -n kube-system
I0521 08:58:36.057729 51 main.go:78] Starting NVIDIA MPS Control Daemon 435bfb70
commit: 435bfb70a44b74daca23fe957a0f256afaa3c51e
I0521 08:58:36.057863 51 main.go:55] "Starting NVIDIA MPS Control Daemon" version=<
435bfb70
commit: 435bfb70a44b74daca23fe957a0f256afaa3c51e
>
I0521 08:58:36.057879 51 main.go:107] Starting OS watcher.
I0521 08:58:36.058077 51 main.go:121] Starting Daemons.
I0521 08:58:36.058105 51 main.go:164] Loading configuration.
I0521 08:58:36.058484 51 main.go:172] Updating config with default resource matching patterns.
I0521 08:58:36.058532 51 main.go:183]
Running with config:
{ "version": "v1", "flags": { "migStrategy": "none", "failOnInitError": null, "gdsEnabled": null, "mofedEnabled": null, "useNodeFeatureAPI": null, "plugin": { "passDeviceSpecs": null, "deviceListStrategy": null, "deviceIDStrategy": null, "cdiAnnotationPrefix": null, "nvidiaCTKPath": null, "containerDriverRoot": null } }, "resources": { "gpus": [ { "pattern": "*", "name": "nvidia.com/gpu" } ] }, "sharing": { "timeSlicing": {}, "mps": { "failRequestsGreaterThanOne": true, "resources": [ { "name": "nvidia.com/gpu", "devices": "all", "replicas": 10 } ] } } }
I0521 08:58:36.058544 51 main.go:187] Retrieving MPS daemons.
I0521 08:58:36.089222 51 daemon.go:93] "Staring MPS daemon" resource="nvidia.com/gpu"
I0521 08:58:36.092824 51 daemon.go:131] "Starting log tailer" resource="nvidia.com/gpu"
[2024-05-21 08:58:36.090 Control 66] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
[2024-05-21 08:58:36.090 Control 66] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
[2024-05-21 08:58:36.091 Control 66] Accepting connection...
[2024-05-21 08:58:36.091 Control 66] NEW UI
[2024-05-21 08:58:36.091 Control 66] Cmd:set_default_device_pinned_mem_limit 0 1536M
[2024-05-21 08:58:36.091 Control 66] UI closed
[2024-05-21 08:58:36.092 Control 66] Accepting connection...
[2024-05-21 08:58:36.092 Control 66] NEW UI
[2024-05-21 08:58:36.092 Control 66] Cmd:set_default_active_thread_percentage 10
[2024-05-21 08:58:36.092 Control 66] 10.0
[2024-05-21 08:58:36.092 Control 66] UI closed
[2024-05-21 08:59:03.055 Control 66] Accepting connection...
[2024-05-21 08:59:03.055 Control 66] NEW UI
[2024-05-21 08:59:03.055 Control 66] Cmd:get_default_active_thread_percentage
[2024-05-21 08:59:03.055 Control 66] 10.0
[2024-05-21 08:59:03.055 Control 66] UI closed
[2024-05-21 08:59:21.924 Control 66] Accepting connection...
[2024-05-21 08:59:21.924 Control 66] User did not send valid credentials
[2024-05-21 08:59:21.924 Control 66] Accepting connection...
[2024-05-21 08:59:21.924 Control 66] User did not send valid credentials
[2024-05-21 08:59:21.924 Control 66] Accepting connection...
[2024-05-21 08:59:21.924 Control 66] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2024-05-21 08:59:21.925 Control 66] Starting new server 74 for user 0
[2024-05-21 08:59:21.925 Control 66] Accepting connection...
[2024-05-21 08:59:21.925 Control 66] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2024-05-21 08:59:21.929 Control 66] Accepting
```
@PrakChandra you seem to have MPS configured with a replication factor of 10. This also enforces a maximum active thread percentage of 100 / 10 = 10% for clients consuming the GPU, which is visible as a reduced SM count for CUDA applications: on a 40-SM T4, roughly 40 × 10% = 4 SMs.
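For illustration, here is roughly how the replicas value maps to the per-client SM limit on a 40-SM T4, assuming the plugin simply sets the default active thread percentage to 100 / replicas (as the daemon logs above show for replicas of 4 and 10):
```yaml
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10   # each client capped at ~10% of SMs -> ~4 SMs on a T4
      # replicas: 2  # ~50% of SMs -> ~20 SMs
      # replicas: 1  # ~100% of SMs -> 40 SMs, but the GPU is no longer shared
```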
@elezar
I reduced the replicas to 2:
```yaml
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 2
```
Then I am getting the count as 20.
But I want the SM count to be 40 as before, so when I set the replicas to 1 it throws an error like this.
How can I get the SM count as 40?
@PrakChandra if you want full access to the GPU then remove the MPS sharing config. This will mean that the device is not shared.
@elezar I want to run multiple pods (approx. 8) on 1 GPU, which is why I am using MPS. I understand your answer: removing the sharing config will give me 40 SMs, but then only one pod can be scheduled on my node.
Is there a way to have the SM count be 40 across all pods, for example by setting the replicas to 1? Is that possible?
Are you saying you want to limit the amount of memory each of your 8 workloads can consume, but not limit the compute?
@klueska I want full memory and compute access across all pods. I have a g4dn.2xlarge instance with the following config.
I want my 8 workloads to access memory as per their requirements but have full compute.
I was able to configure my node this way earlier, but with the latest kernel upgrades by AWS, the SM count became 4 for the same config I was using before. Note: I was configuring my EC2 instance by manually installing the NVIDIA drivers and the daemonset as well.
So I switched to the optimized EKS GPU node and enabled MPS in order to schedule my 8 workloads on 1 GPU.
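If the goal is ~8 pods sharing one GPU while each still sees all 40 SMs, a time-slicing configuration (instead of MPS) may be closer to what you want, since time-slicing oversubscribes the GPU in time without capping the active thread percentage or pinned memory. A minimal sketch, assuming the same ConfigMap layout shown earlier in this thread; note that time-slicing provides no memory or fault isolation between workloads:
```yaml
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 8   # 8 pods can each request nvidia.com/gpu: 1; no SM or memory partitioning
```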
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.