
It seems that gpu-manager can't limit the GPU usage of the Jupyter container

Open Thatwho opened this issue 4 years ago • 5 comments

I ran a Jupyter notebook in K8s with gpu-manager to limit the total GPU usage. Here is the `kubectl describe` output of the ReplicaSet:

Name:           konkii-0000000012-jupyter-6f6975d555
Namespace:      konkii-0000000012-user
Selector:       app=konkii-0000000012-jupyter,pod-template-hash=6f6975d555
Labels:         app=konkii-0000000012-jupyter
                pod-template-hash=6f6975d555
Annotations:    deployment.kubernetes.io/desired-replicas: 1
                deployment.kubernetes.io/max-replicas: 2
                deployment.kubernetes.io/revision: 2
Controlled By:  Deployment/konkii-0000000012-jupyter
Replicas:       1 current / 1 desired
Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=konkii-0000000012-jupyter
           pod-template-hash=6f6975d555
  Containers:
   konkii-0000000012-jupyter:
    Image:      registry.konkii.com:8443/jupyter/pytorch1.6_gpu:v1.0
    Port:       8888/TCP
    Host Port:  0/TCP
    Limits:
      cpu:                       14
      memory:                    28Gi
      tencent.com/vcuda-core:    100
      tencent.com/vcuda-memory:  32
    Requests:
      cpu:                       500m
      memory:                    512Mi
      tencent.com/vcuda-core:    100
      tencent.com/vcuda-memory:  32
    Environment:
      BASE_URL:  /notebook/jupyter/konkii-0000000012-jupyter.konkii-0000000012-user/8888/7KQMbq/
    Mounts:
      /home/jovyan from notebook-data (rw)
      /home/jovyan/dataset/LCQMC from dataset-data (ro,path="d8e23966c8c03be396ee38d68906f271")
      /home/jovyan/dataset/place_10 from dataset-data (ro,path="c4105cde45cc11eb9eec52bd5ff39142/10place")

but the pod has actually used the whole GPU:

Every 1.0s: nvidia-smi                                                                                                                                                               gpu2: Fri Feb  5 11:32:04 2021

Fri Feb  5 11:32:05 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82	  Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000001:00:00.0 Off |                    0 |
| N/A   27C    P0    34W / 250W |  16154MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000002:00:00.0 Off |                    0 |
| N/A   26C    P0    33W / 250W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   1412798      C   /opt/conda/envs/pytorch_1.6/bin/python     15575MiB |
|    0   1432636      C   /opt/conda/envs/pytorch_1.6/bin/python       567MiB |
+-----------------------------------------------------------------------------+

I can't figure out what is causing this issue.

Thatwho avatar Feb 05 '21 03:02 Thatwho

You requested 100 cores, which means you want a full GPU card; no limits are applied in that case.

mYmNeo avatar Feb 05 '21 09:02 mYmNeo
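Following mYmNeo's point that 100 cores means a full, unlimited card, a minimal sketch of a pod resource block where gpu-manager's limiting should actually take effect (the values below are illustrative, not from the original manifest):

```yaml
# Hypothetical resources fragment for the Jupyter container:
# a vcuda-core value below 100 requests a fraction of a card,
# which is what activates gpu-manager's vGPU enforcement.
resources:
  requests:
    tencent.com/vcuda-core: 50      # 50% of one GPU's compute
    tencent.com/vcuda-memory: 32    # 32 x 256MiB = 8192MiB of GPU memory
  limits:
    tencent.com/vcuda-core: 50
    tencent.com/vcuda-memory: 32
```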

@mYmNeo I also have this issue. As shown above, `tencent.com/vcuda-memory: 32` means 8192MiB, but 16154MiB is actually used. So it cannot limit GPU memory alone?

qifengz avatar Mar 19 '21 03:03 qifengz
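For reference, `tencent.com/vcuda-memory` is counted in units of 256MiB, which is how the value 32 maps to the 8192MiB mentioned above. A quick sketch of the conversion:

```python
# tencent.com/vcuda-memory is counted in 256 MiB units,
# so a request of 32 should cap GPU memory at 8192 MiB.
VCUDA_MEMORY_UNIT_MIB = 256

def vcuda_memory_to_mib(units: int) -> int:
    """Convert a vcuda-memory request into the MiB it is meant to cap."""
    return units * VCUDA_MEMORY_UNIT_MIB

print(vcuda_memory_to_mib(32))  # → 8192
```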

Update: when I set `tencent.com/vcuda-core: 99`, it limits GPU memory as expected. Why not with 100?

qifengz avatar Mar 19 '21 03:03 qifengz

> Update: when I set `tencent.com/vcuda-core: 99`, it limits GPU memory as expected. Why not with 100?

Why would you want to limit GPU memory if you use the full card? I assume this is intended.

timozerrer avatar Apr 23 '21 08:04 timozerrer

> Update: when I set `tencent.com/vcuda-core: 99`, it limits GPU memory as expected. Why not with 100?
>
> Why would you want to limit GPU memory if you use the full card? I assume this is intended.

I wonder whether the 1% margin in `tencent.com/vcuda-core` is what causes such a different result.

qifengz avatar Apr 30 '21 01:04 qifengz