MPS Memory limits confusion
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
- Kernel Version: 5.15.0-112-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd://1.6.12
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
2. Issue or feature description
I've configured MPS on an NVIDIA L40S with 10 replicas.
According to the MPS daemon logs, a default device pinned memory limit of ~4.6 GB has been set:
I0612 11:23:33.074061 53 main.go:187] Retrieving MPS daemons.
I0612 11:23:33.153182 53 daemon.go:93] "Staring MPS daemon" resource="nvidia.com/gpu"
I0612 11:23:33.218453 53 daemon.go:131] "Starting log tailer" resource="nvidia.com/gpu"
[2024-06-12 10:28:13.702 Control 69] Starting control daemon using socket /mps/nvidia.com/gpu/pipe/control
[2024-06-12 10:28:13.702 Control 69] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
[2024-06-12 10:28:13.725 Control 69] Accepting connection...
[2024-06-12 10:28:13.725 Control 69] NEW UI
[2024-06-12 10:28:13.725 Control 69] Cmd:set_default_device_pinned_mem_limit 0 4606M
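For what it's worth, the 4606M in that log line looks like the card's framebuffer split evenly across the 10 replicas; here's a minimal sketch of that arithmetic, assuming the plugin simply divides total memory by the replica count:

```python
# Assumption: per-client pinned memory limit = total framebuffer / replicas.
total_fb_mib = 46068   # L40S framebuffer as reported by nvidia-smi (MiB)
replicas = 10          # sharing.mps.resources[].replicas from the plugin config

per_client_limit_mib = total_fb_mib // replicas
print(per_client_limit_mib)  # 4606 -> matches "set_default_device_pinned_mem_limit 0 4606M"
```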
However, if I look at this from the point of view of a client:
import torch
torch.cuda.get_device_properties(torch.device('cuda'))
# _CudaDeviceProperties(name='NVIDIA L40S', major=8, minor=9, total_memory=45589MB, multi_processor_count=14)
Only the set_default_active_thread_percentage of 10 is respected: multi_processor_count drops from 142 to 14, while total_memory still reports the card's full ~45589 MB rather than anything close to the ~4.6 GB limit.
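To check whether the cap is enforced at all even though total_memory shows the full card, I can try to allocate past it; a rough probe, assuming the limit applies at allocation time rather than to what cudaGetDeviceProperties reports:

```python
import torch

dev = torch.device("cuda")
props = torch.cuda.get_device_properties(dev)
free, total = torch.cuda.mem_get_info(dev)  # wraps cudaMemGetInfo

print(f"get_device_properties total_memory: {props.total_memory / 2**20:.0f} MiB")
print(f"mem_get_info free/total:            {free / 2**20:.0f} / {total / 2**20:.0f} MiB")

try:
    # ~8 GiB of float32, well above the 4606M per-client limit.
    x = torch.empty(8 * 1024**3 // 4, dtype=torch.float32, device=dev)
    print("8 GiB allocation succeeded -> the limit does not seem to be enforced")
except torch.cuda.OutOfMemoryError:
    print("8 GiB allocation failed -> the limit is enforced, just not reflected in total_memory")
```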
Here's some additional info from the application pod:
printenv | grep CUDA
CUDA_MPS_PIPE_DIRECTORY=/mps/nvidia.com/gpu/pipe
echo "get_default_device_pinned_mem_limit 0" | nvidia-cuda-mps-control
4G
Why is nvidia-cuda-mps-control reporting one thing for memory while PyTorch says something else? This doesn't look right to me, but maybe I'm missing something. For comparison, when I use MIG on an A100, the total_memory returned reflects the MIG instance rather than the total VRAM of the card.
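In case it helps, this is roughly how I'm gathering both numbers inside the same pod, assuming CUDA_MPS_PIPE_DIRECTORY is set as shown above (it's the same control command as in the shell snippet, just driven from Python):

```python
import subprocess
import torch

# Ask the MPS control daemon for the default pinned memory limit of device 0.
mps_limit = subprocess.run(
    ["nvidia-cuda-mps-control"],
    input="get_default_device_pinned_mem_limit 0\n",
    capture_output=True, text=True, check=True,
).stdout.strip()

props = torch.cuda.get_device_properties(0)

print(f"nvidia-cuda-mps-control : {mps_limit}")                           # 4G
print(f"torch total_memory      : {props.total_memory / 2**20:.0f} MiB")  # ~45589 MiB
```

And here's the values.yaml the plugin is deployed with: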
# values.yaml
nodeSelector: {
  nvidia.com/gpu: "true"
}
gfd:
  enabled: true
  nameOverride: gpu-feature-discovery
  namespaceOverride: {{ nvidia_plugin.namespace }}
  nodeSelector: {
    nvidia.com/gpu: "true"
  }
nfd:
  master:
    nodeSelector: {
      nvidia.com/gpu: "true"
    }
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
  worker:
    nodeSelector: {
      nvidia.com/gpu: "true"
    }
config:
  default: "default"
  map:
    default: |-
    ls400: |-
      version: v1
      sharing:
        mps:
          resources:
            - name: nvidia.com/gpu
              replicas: 10
Additional information that might help better understand your environment and reproduce the bug (nvidia-smi on the node):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S Off | 00000000:BE:00.0 Off | 0 |
| N/A 32C P8 35W / 350W | 35MiB / 46068MiB | 0% E. Process |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 70756 C nvidia-cuda-mps-server 28MiB |
+-----------------------------------------------------------------------------------------+