Error when using MPS: Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!
Problem description: With MPS configured on an A16 card, setting the number of replicas to 6 causes device allocation to fail ("all CUDA-capable devices are busy or unavailable"); with the replica count reduced to 5, the job runs normally. The same problem occurs on T4 and A30 cards: the same error is reported when the replica count exceeds 21 on a T4 or 30 on an A30. When MPS is enabled on A16, A30, and T4 cards, what is the maximum number of replicas supported?
Config information:
k8s-device-plugin: 0.17.0
GPU Operator version: 24.9.0
CUDA version: 12.4
CUDA driver version: 12.4
Kubernetes version: 1.26
sample: PyTorch 2.4.0+cu124
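The per-client limits that appear in the mps-control-daemon log below are consistent with the device plugin splitting the GPU evenly across the replicas. A minimal sketch of that arithmetic (the even split is an assumption inferred from the observed values, not taken from the plugin source):

```python
# Sketch: how the replica count maps to the per-client MPS limits seen in the
# control daemon log below (assumption: the GPU's memory and SMs are divided
# evenly across replicas).
TOTAL_MEM_MIB = 15356   # A16 memory as reported by nvidia-smi
REPLICAS = 6

pinned_mem_limit_mib = TOTAL_MEM_MIB // REPLICAS  # -> 2559, matches "set_default_device_pinned_mem_limit 0 2559M"
active_thread_pct = 100 // REPLICAS               # -> 16, matches "set_default_active_thread_percentage 16"

print(f"per-client pinned memory limit ~= {pinned_mem_limit_mib} MiB")
print(f"per-client active thread percentage ~= {active_thread_pct}%")
```

Under this reading, 6 replicas give each client roughly 2.5 GiB of device memory while 5 replicas give about 3 GiB, which would explain why the training sample runs with 5 replicas but not 6 if its working set falls between those two limits.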
Logs:
sample output:
using cuda:0 device.
Using 8 dataloader workers every process
using 60000 images for training, 10000 images for validation.
Traceback (most recent call last):
File "/workspace/code/pt-examples/resnet/train.py", line 133, in TORCH_USE_CUDA_DSA to enable device-side assertions.
mps-control-daemon output:
I0107 06:20:18.770820 196 main.go:203] Retrieving MPS daemons.
W0107 06:20:18.781612 196 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0107 06:20:18.803608 196 daemon.go:97] "Staring MPS daemon" resource="nvidia.com/gpu"
I0107 06:20:18.803664 196 daemon.go:156] "SELinux enabled, setting context" path="/mps/nvidia.com/gpu/pipe" context="system_u:object_r:container_file_t:s0"
I0107 06:20:18.809265 196 daemon.go:139] "Starting log tailer" resource="nvidia.com/gpu"
[2025-01-07 06:20:18.806 Control 209] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
[2025-01-07 06:20:18.806 Control 209] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
[2025-01-07 06:20:18.807 Control 209] Accepting connection...
[2025-01-07 06:20:18.807 Control 209] NEW UI
[2025-01-07 06:20:18.807 Control 209] Cmd:set_default_device_pinned_mem_limit 0 2559M
[2025-01-07 06:20:18.807 Control 209] UI closed
[2025-01-07 06:20:18.808 Control 209] Accepting connection...
[2025-01-07 06:20:18.808 Control 209] NEW UI
[2025-01-07 06:20:18.808 Control 209] Cmd:set_default_active_thread_percentage 16
[2025-01-07 06:20:18.809 Control 209] 16.0
[2025-01-07 06:20:18.809 Control 209] UI closed
mps server output:
[2025-01-07 12:57:36.231 Other 82] Startup
[2025-01-07 12:57:36.231 Other 82] Connecting to control daemon on socket: /mps/nvidia.com/gpu/pipe/control
[2025-01-07 12:57:36.231 Other 82] Initializing server process
[2025-01-07 12:57:36.324 Server 82] Creating server context on device 0 (NVIDIA A16)
[2025-01-07 12:57:36.457 Server 82] Created named shared memory region /cuda.shm.0.52.1
[2025-01-07 12:57:36.457 Server 82] Active Threads Percentage set to 12.0
[2025-01-07 12:57:36.457 Server 82] Device pinned memory limit for device 0 set to 0x77f00000 bytes
[2025-01-07 12:57:36.457 Server 82] Server Priority set to 0
[2025-01-07 12:57:36.457 Server 82] Server has started
[2025-01-07 12:57:36.457 Server 82] Received new client request
[2025-01-07 12:57:36.457 Server 82] Worker created
[2025-01-07 12:57:36.457 Server 82] Creating worker thread
[2025-01-07 12:57:36.515 Server 82] Received new client request
[2025-01-07 12:57:36.516 Server 82] Worker created
[2025-01-07 12:57:36.516 Server 82] Creating worker thread
[2025-01-07 12:57:36.516 Server 82] Device NVIDIA A16 (uuid GPU-077f7bb7-c8e8-46db-77e3-e2c1e5665c78) is associated
[2025-01-07 12:57:36.516 Server 82] Status of client {0, 1} is ACTIVE
[2025-01-07 12:57:36.657 Server 82] Client process 0 encountered a fatal GPU error.
[2025-01-07 12:57:36.657 Server 82] Server is handling a fatal GPU error.
[2025-01-07 12:57:36.657 Server 82] Status of client {0, 1} is INACTIVE
[2025-01-07 12:57:36.657 Server 82] The following devices will be reset:
[2025-01-07 12:57:36.657 Server 82] 0
[2025-01-07 12:57:36.657 Server 82] The following clients have a sticky error set:
[2025-01-07 12:57:36.657 Server 82] 0
[2025-01-07 12:57:36.777 Server 82] Receive command failed, assuming client exit
[2025-01-07 12:57:36.777 Server 82] Client {0, 1} exit
[2025-01-07 12:57:36.777 Server 82] Client disconnected. Number of active client contexts is 0.
[2025-01-07 12:57:36.777 Server 82] Destroy server context on device 0
[2025-01-07 12:57:37.237 Server 82] Receive command failed, assuming client exit
[2025-01-07 12:57:37.237 Server 82] Client process disconnected
dmesg output:
[ 268.026103] NVRM: GPU at PCI:0000:00:0c: GPU-077f7bb7-c8e8-46db-77e3-e2c1e5665c78
[ 268.026131] NVRM: GPU Board Serial Number: 1321722074061
[ 268.026144] NVRM: Xid (PCI:0000:00:0c): 69, pid='<unknown>', name=<unknown>, Class Error: ChId 000a, Class 0000c7c0, Offset 000002ec, Data 00000000, ErrorCode 00000004
[ 579.875139] NVRM: Xid (PCI:0000:00:0c): 69, pid='<unknown>', name=<unknown>, Class Error: ChId 000a, Class 0000c7c0, Offset 000002ec, Data 00000000, ErrorCode 00000004
nvidia-smi output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A16 Off | 00000000:00:0C.0 Off | 0 |
| 0% 49C P8 16W / 62W | 1MiB / 15356MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I also encountered a similar problem. Did you find a solution?
The device plugin sets per-MPS-client resource limits for threads and memory. If an application's resource requirements exceed these limits, it can cause errors like this. You can try removing the perDevicePinnedDeviceMemoryLimits and activeThreadPercentage restrictions from the code.
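If you want to confirm from inside the pod what limits the client actually sees before changing the plugin, a minimal PyTorch check along these lines may help (a sketch only; the 256 MiB probe size is arbitrary):

```python
import os
import torch

# Print the MPS-related environment the client process sees. These are the
# documented MPS variable names; whichever are unset simply print "<unset>".
for var in ("CUDA_MPS_PIPE_DIRECTORY",
            "CUDA_MPS_PINNED_DEVICE_MEM_LIMIT",
            "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")

try:
    torch.cuda.init()
    free_b, total_b = torch.cuda.mem_get_info(0)
    print(f"device 0: free = {free_b / 2**20:.0f} MiB, total = {total_b / 2**20:.0f} MiB")
    # Probe allocation of ~256 MiB (arbitrary size, well below the ~2.5 GiB per-client limit from the logs).
    x = torch.empty(256 * 2**20 // 4, dtype=torch.float32, device="cuda:0")
    print("256 MiB probe allocation succeeded")
except RuntimeError as e:
    print(f"CUDA error: {e}")
```

If even the small probe fails, the client cannot establish a usable context under the current limits; if only the real training run fails, the per-client pinned memory limit is the more likely cause.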