Error when using MPS: Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!
Problem description: With MPS configured on an A16 card, setting the number of replicas to 6 causes device allocation to fail ("all CUDA-capable devices are busy or unavailable"); with the replica count reduced to 5, the job runs normally. The same problem occurs on T4 and A30 cards: the same error is reported when the replica count exceeds 21 on a T4 or 30 on an A30. When MPS is enabled on A16, A30, and T4 cards, what is the maximum number of replicas supported?
Config information:
k8s-device-plugin: 0.17.0
GPU Operator version: 24.9.0
CUDA version: 12.4
CUDA driver version: 12.4
Kubernetes version: 1.26
sample: PyTorch 2.4.0+cu124
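The per-client limits that appear in the mps-control-daemon log below are consistent with the device plugin splitting the GPU evenly across the replicas. A minimal sketch of that arithmetic (the even split is an assumption inferred from the observed values, not taken from the plugin source):

```python
# Sketch: how the replica count maps to the per-client MPS limits seen in the
# control daemon log below (assumption: the GPU's memory and SMs are divided
# evenly across replicas).
TOTAL_MEM_MIB = 15356   # A16 memory as reported by nvidia-smi
REPLICAS = 6

pinned_mem_limit_mib = TOTAL_MEM_MIB // REPLICAS  # -> 2559, matches "set_default_device_pinned_mem_limit 0 2559M"
active_thread_pct = 100 // REPLICAS               # -> 16, matches "set_default_active_thread_percentage 16"

print(f"per-client pinned memory limit ~= {pinned_mem_limit_mib} MiB")
print(f"per-client active thread percentage ~= {active_thread_pct}%")
```

Under this reading, 6 replicas give each client roughly 2.5 GiB of device memory while 5 replicas give about 3 GiB, which would explain why the training sample runs with 5 replicas but not 6 if its working set falls between those two limits.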
Logs:
sample output:
using cuda:0 device.
Using 8 dataloader workers every process
using 60000 images for training, 10000 images for validation.
Traceback (most recent call last):
File "/workspace/code/pt-examples/resnet/train.py", line 133, in TORCH_USE_CUDA_DSA to enable device-side assertions.
mps-control-daemon output:
I0107 06:20:18.770820 196 main.go:203] Retrieving MPS daemons.
W0107 06:20:18.781612 196 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0107 06:20:18.803608 196 daemon.go:97] "Staring MPS daemon" resource="nvidia.com/gpu"
I0107 06:20:18.803664 196 daemon.go:156] "SELinux enabled, setting context" path="/mps/nvidia.com/gpu/pipe" context="system_u:object_r:container_file_t:s0"
I0107 06:20:18.809265 196 daemon.go:139] "Starting log tailer" resource="nvidia.com/gpu"
[2025-01-07 06:20:18.806 Control 209] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
[2025-01-07 06:20:18.806 Control 209] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
[2025-01-07 06:20:18.807 Control 209] Accepting connection...
[2025-01-07 06:20:18.807 Control 209] NEW UI
[2025-01-07 06:20:18.807 Control 209] Cmd:set_default_device_pinned_mem_limit 0 2559M
[2025-01-07 06:20:18.807 Control 209] UI closed
[2025-01-07 06:20:18.808 Control 209] Accepting connection...
[2025-01-07 06:20:18.808 Control 209] NEW UI
[2025-01-07 06:20:18.808 Control 209] Cmd:set_default_active_thread_percentage 16
[2025-01-07 06:20:18.809 Control 209] 16.0
[2025-01-07 06:20:18.809 Control 209] UI closed
mps server output:
[2025-01-07 12:57:36.231 Other 82] Startup
[2025-01-07 12:57:36.231 Other 82] Connecting to control daemon on socket: /mps/nvidia.com/gpu/pipe/control
[2025-01-07 12:57:36.231 Other 82] Initializing server process
[2025-01-07 12:57:36.324 Server 82] Creating server context on device 0 (NVIDIA A16)
[2025-01-07 12:57:36.457 Server 82] Created named shared memory region /cuda.shm.0.52.1
[2025-01-07 12:57:36.457 Server 82] Active Threads Percentage set to 12.0
[2025-01-07 12:57:36.457 Server 82] Device pinned memory limit for device 0 set to 0x77f00000 bytes
[2025-01-07 12:57:36.457 Server 82] Server Priority set to 0
[2025-01-07 12:57:36.457 Server 82] Server has started
[2025-01-07 12:57:36.457 Server 82] Received new client request
[2025-01-07 12:57:36.457 Server 82] Worker created
[2025-01-07 12:57:36.457 Server 82] Creating worker thread
[2025-01-07 12:57:36.515 Server 82] Received new client request
[2025-01-07 12:57:36.516 Server 82] Worker created
[2025-01-07 12:57:36.516 Server 82] Creating worker thread
[2025-01-07 12:57:36.516 Server 82] Device NVIDIA A16 (uuid GPU-077f7bb7-c8e8-46db-77e3-e2c1e5665c78) is associated
[2025-01-07 12:57:36.516 Server 82] Status of client {0, 1} is ACTIVE
[2025-01-07 12:57:36.657 Server 82] Client process 0 encountered a fatal GPU error.
[2025-01-07 12:57:36.657 Server 82] Server is handling a fatal GPU error.
[2025-01-07 12:57:36.657 Server 82] Status of client {0, 1} is INACTIVE
[2025-01-07 12:57:36.657 Server 82] The following devices will be reset:
[2025-01-07 12:57:36.657 Server 82] 0
[2025-01-07 12:57:36.657 Server 82] The following clients have a sticky error set:
[2025-01-07 12:57:36.657 Server 82] 0
[2025-01-07 12:57:36.777 Server 82] Receive command failed, assuming client exit
[2025-01-07 12:57:36.777 Server 82] Client {0, 1} exit
[2025-01-07 12:57:36.777 Server 82] Client disconnected. Number of active client contexts is 0.
[2025-01-07 12:57:36.777 Server 82] Destroy server context on device 0
[2025-01-07 12:57:37.237 Server 82] Receive command failed, assuming client exit
[2025-01-07 12:57:37.237 Server 82] Client process disconnected
dmesg output:
[ 268.026103] NVRM: GPU at PCI:0000:00:0c: GPU-077f7bb7-c8e8-46db-77e3-e2c1e5665c78
[ 268.026131] NVRM: GPU Board Serial Number: 1321722074061
[ 268.026144] NVRM: Xid (PCI:0000:00:0c): 69, pid='<unknown>', name=<unknown>, Class Error: ChId 000a, Class 0000c7c0, Offset 000002ec, Data 00000000, ErrorCode 00000004
[ 579.875139] NVRM: Xid (PCI:0000:00:0c): 69, pid='<unknown>', name=<unknown>, Class Error: ChId 000a, Class 0000c7c0, Offset 000002ec, Data 00000000, ErrorCode 00000004
nvidia-smi output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A16 Off | 00000000:00:0C.0 Off | 0 |
| 0% 49C P8 16W / 62W | 1MiB / 15356MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I also encountered a similar problem. Did you find a solution?
The device plugin sets per-MPS-client resource limits for threads and memory. If an application's resource requirements exceed these limits, it can cause errors like this. You can try removing the perDevicePinnedDeviceMemoryLimits and activeThreadPercentage restrictions from the code.
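If you want to confirm from inside the pod what limits the client actually sees before changing the plugin, a minimal PyTorch check along these lines may help (a sketch only; the 256 MiB probe size is arbitrary):

```python
import os
import torch

# Print the MPS-related environment the client process sees. These are the
# documented MPS variable names; whichever are unset simply print "<unset>".
for var in ("CUDA_MPS_PIPE_DIRECTORY",
            "CUDA_MPS_PINNED_DEVICE_MEM_LIMIT",
            "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")

try:
    torch.cuda.init()
    free_b, total_b = torch.cuda.mem_get_info(0)
    print(f"device 0: free = {free_b / 2**20:.0f} MiB, total = {total_b / 2**20:.0f} MiB")
    # Probe allocation of ~256 MiB (arbitrary size, well below the ~2.5 GiB per-client limit from the logs).
    x = torch.empty(256 * 2**20 // 4, dtype=torch.float32, device="cuda:0")
    print("256 MiB probe allocation succeeded")
except RuntimeError as e:
    print(f"CUDA error: {e}")
```

If even the small probe fails, the client cannot establish a usable context under the current limits; if only the real training run fails, the per-client pinned memory limit is the more likely cause.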