gpu-manager
It seems that gpu-manager can't limit the usage of GPU of the container of Jupyter
I have run Jupyter in Kubernetes with gpu-manager to limit its total GPU usage. Here is the kubectl describe output of the ReplicaSet:
Name:           konkii-0000000012-jupyter-6f6975d555
Namespace:      konkii-0000000012-user
Selector:       app=konkii-0000000012-jupyter,pod-template-hash=6f6975d555
Labels:         app=konkii-0000000012-jupyter
                pod-template-hash=6f6975d555
Annotations:    deployment.kubernetes.io/desired-replicas: 1
                deployment.kubernetes.io/max-replicas: 2
                deployment.kubernetes.io/revision: 2
Controlled By:  Deployment/konkii-0000000012-jupyter
Replicas:       1 current / 1 desired
Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=konkii-0000000012-jupyter
           pod-template-hash=6f6975d555
  Containers:
   konkii-0000000012-jupyter:
    Image:      registry.konkii.com:8443/jupyter/pytorch1.6_gpu:v1.0
    Port:       8888/TCP
    Host Port:  0/TCP
    Limits:
      cpu:                       14
      memory:                    28Gi
      tencent.com/vcuda-core:    100
      tencent.com/vcuda-memory:  32
    Requests:
      cpu:                       500m
      memory:                    512Mi
      tencent.com/vcuda-core:    100
      tencent.com/vcuda-memory:  32
    Environment:
      BASE_URL:  /notebook/jupyter/konkii-0000000012-jupyter.konkii-0000000012-user/8888/7KQMbq/
    Mounts:
      /home/jovyan from notebook-data (rw)
      /home/jovyan/dataset/LCQMC from dataset-data (ro,path="d8e23966c8c03be396ee38d68906f271")
      /home/jovyan/dataset/place_10 from dataset-data (ro,path="c4105cde45cc11eb9eec52bd5ff39142/10place")
but the pod has actually used the whole GPU:
Every 1.0s: nvidia-smi                                    gpu2: Fri Feb  5 11:32:04 2021

Fri Feb  5 11:32:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000001:00:00.0 Off |                    0 |
| N/A   27C    P0    34W / 250W |  16154MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000002:00:00.0 Off |                    0 |
| N/A   26C    P0    33W / 250W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   1412798      C   /opt/conda/envs/pytorch_1.6/bin/python     15575MiB |
|    0   1432636      C   /opt/conda/envs/pytorch_1.6/bin/python       567MiB |
+-----------------------------------------------------------------------------+
I can't figure out what is causing this issue.
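For reference, the resources section of the pod spec behind this ReplicaSet corresponds roughly to the sketch below. The values are taken from the describe output above; the surrounding YAML structure is the standard Kubernetes container spec, and the per-unit comments assume gpu-manager's usual granularity (vcuda-core in hundredths of a card, vcuda-memory in 256MiB units).

containers:
- name: konkii-0000000012-jupyter
  image: registry.konkii.com:8443/jupyter/pytorch1.6_gpu:v1.0
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
      tencent.com/vcuda-core: 100     # 100/100 of a card = one full physical GPU
      tencent.com/vcuda-memory: 32    # 32 x 256MiB = 8192MiB
    limits:
      cpu: 14
      memory: 28Gi
      tencent.com/vcuda-core: 100
      tencent.com/vcuda-memory: 32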
You requested 100 cores, which means you want a full GPU card; no limiting is applied in that case.
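In other words, the vcuda limiting path only takes effect for a fractional card. A minimal sketch of what a half-card request with the same 8GiB cap might look like, assuming the convention above that vcuda-core is expressed in hundredths of a physical GPU:

resources:
  requests:
    tencent.com/vcuda-core: 50      # half of one physical GPU, so vcuda limiting applies
    tencent.com/vcuda-memory: 32    # capped at 32 x 256MiB = 8192MiB
  limits:
    tencent.com/vcuda-core: 50
    tencent.com/vcuda-memory: 32

With vcuda-core: 100 the container gets a whole exclusive card instead and the vcuda-memory value has no effect, which appears to match the behaviour reported above.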
@mYmNeo I also have this issue. As shown above, “tencent.com/vcuda-memory: 32” means 8192MiB (32 × 256MiB), but 16154MiB is actually in use, so it can't even limit the GPU memory on its own?
update: when I set tencent.com/vcuda-core: 99, it limits GPU memory as expected. Why not with 100?
Why would you want to limit GPU memory if you use the full card? I assume this is intended.
I wonder whether that 1% margin in tencent.com/vcuda-core is what causes such a different result.