
"volocano.sh/vgpu-number" is not included in the allocatable resources.

dojoeisuke opened this issue · 21 comments

What happened:

I followed the user guide to set up vGPU, but "volcano.sh/vgpu-number" is not included in the allocatable resources.

user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md

What you expected to happen:

"volcano.sh/vgpu-number: XX" is included by executing the following command.

root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -ojson | jq .status.allocatable
{
  "cpu": "2",
  "ephemeral-storage": "93492209510",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8050764Ki",
  "pods": "110"
}

How to reproduce it (as minimally and precisely as possible):

Prerequisites:

  • Kubernetes cluster v1.24.3 is running
  • Volcano is installed

Reproduce:

  1. Install the NVIDIA driver on the new GPU worker node.
  2. Install nvidia-docker2 on the new GPU worker node.
  3. Install Kubernetes on the new GPU worker node.
  4. Join the new GPU worker node to the Kubernetes cluster.
  5. Install volcano-vgpu-plugin.

Note: I referred to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md.
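
For reference, a quick verification sketch for step 5 (a sketch only; the pod and node names are the ones used elsewhere in this issue, so substitute your own):

# Confirm the volcano device-plugin pod is running on the GPU node
kubectl -n kube-system get pods -o wide | grep volcano-device-plugin

# Check its log for NVML or device-registration errors
kubectl -n kube-system logs volcano-device-plugin-jtfxz

# The node should then advertise the vGPU resource in its allocatable set
kubectl get node k8s-tryvolcano-w004 -o json | jq .status.allocatable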

Anything else we need to know?:

Environment:

  • Volcano Version:

v1.8.0

  • Kubernetes version (use kubectl version):
root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -owide
NAME                  STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k8s-tryvolcano-w004   Ready    <none>   18h   v1.24.3   192.168.100.168   <none>        Ubuntu 20.04.6 LTS   5.4.0-72-generic   containerd://1.7.2
  • Cloud provider or hardware configuration:

Cloud provider: OpenStack

  • OS (e.g. from /etc/os-release):
root@k8s-tryvolcano-w004:~# cat /etc/os-release 
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
  • Kernel (e.g. uname -a):
root@k8s-tryvolcano-w004:~# uname -a
Linux k8s-tryvolcano-w004 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:

kubeadm

  • Others:

Nvidia driver

root@k8s-tryvolcano-w004:~# dpkg -l | grep nvidia-driver
ii  nvidia-driver-535-server-open         535.104.12-0ubuntu0.20.04.1       amd64        NVIDIA driver (open kernel) metapackage

nvidia-docker2

root@k8s-tryvolcano-w004:~# dpkg -l | grep nvidia-docker
ii  nvidia-docker2                        2.13.0-1                          all          nvidia-docker CLI wrapper

GPU

root@k8s-tryvolcano-w004:~# nvidia-smi 
Thu Oct 19 02:24:55 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:00:05.0 Off |                    0 |
| N/A   43C    P0              63W / 300W |      4MiB / 81920MiB |     20%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

"volocano-device-plugin" pod log

I1018 08:42:42.247448       1 main.go:77] Loading NVML
I1018 08:42:42.317422       1 main.go:91] Starting FS watcher.
I1018 08:42:42.317465       1 main.go:98] Starting OS watcher.
I1018 08:42:42.317759       1 main.go:116] Retreiving plugins.
I1018 08:42:42.317770       1 main.go:155] No devices found. Waiting indefinitely.
I1018 08:42:42.317783       1 register.go:101] into WatchAndRegister
I1018 08:42:42.360498       1 register.go:89] Reporting devices  in 2023-10-18 08:42:42.360494312 +0000 UTC m=+0.116513880
I1018 08:43:12.468827       1 register.go:89] Reporting devices  in 2023-10-18 08:43:12.468819399 +0000 UTC m=+30.224838968
I1018 08:43:42.485190       1 register.go:89] Reporting devices  in 2023-10-18 08:43:42.485182962 +0000 UTC m=+60.241202532
I1018 08:44:12.505930       1 register.go:89] Reporting devices  in 2023-10-18 08:44:12.505920612 +0000 UTC m=+90.261940182
I1018 08:44:42.523805       1 register.go:89] Reporting devices  in 2023-10-18 08:44:42.523797163 +0000 UTC m=+120.279816722
I1018 08:45:12.542654       1 register.go:89] Reporting devices  in 2023-10-18 08:45:12.542646375 +0000 UTC m=+150.298665943
I1018 08:45:42.564609       1 register.go:89] Reporting devices  in 2023-10-18 08:45:42.564600701 +0000 UTC m=+180.320620270
I1018 08:46:12.584788       1 register.go:89] Reporting devices  in 2023-10-18 08:46:12.584777812 +0000 UTC m=+210.340797381
I1018 08:46:42.653138       1 register.go:89] Reporting devices  in 2023-10-18 08:46:42.653129051 +0000 UTC m=+240.409148620
I1018 08:47:12.674599       1 register.go:89] Reporting devices  in 2023-10-18 08:47:12.674591614 +0000 UTC m=+270.430611183
I1018 08:47:42.690977       1 register.go:89] Reporting devices  in 2023-10-18 08:47:42.69097107 +0000 UTC m=+300.446990640
I1018 08:48:12.707222       1 register.go:89] Reporting devices  in 2023-10-18 08:48:12.707213231 +0000 UTC m=+330.463232800
I1018 08:48:42.781451       1 register.go:89] Reporting devices  in 2023-10-18 08:48:42.781437965 +0000 UTC m=+360.537457544
I1018 08:49:12.816300       1 register.go:89] Reporting devices  in 2023-10-18 08:49:12.816292362 +0000 UTC m=+390.572311921
I1018 08:49:42.834850       1 register.go:89] Reporting devices  in 2023-10-18 08:49:42.834844163 +0000 UTC m=+420.590863732
I1018 08:50:12.855810       1 register.go:89] Reporting devices  in 2023-10-18 08:50:12.855797817 +0000 UTC m=+450.611817406
I1018 08:50:42.875763       1 register.go:89] Reporting devices  in 2023-10-18 08:50:42.875755678 +0000 UTC m=+480.631775247
I1018 08:51:12.892908       1 register.go:89] Reporting devices  in 2023-10-18 08:51:12.89289625 +0000 UTC m=+510.648915829
I1018 08:51:42.913563       1 register.go:89] Reporting devices  in 2023-10-18 08:51:42.913556355 +0000 UTC m=+540.669575924
I1018 08:52:12.938239       1 register.go:89] Reporting devices  in 2023-10-18 08:52:12.93823072 +0000 UTC m=+570.694250290
I1018 08:52:42.968125       1 register.go:89] Reporting devices  in 2023-10-18 08:52:42.968118172 +0000 UTC m=+600.724137731
I1018 08:53:12.988476       1 register.go:89] Reporting devices  in 2023-10-18 08:53:12.988467434 +0000 UTC m=+630.744487003

... 

volcano-scheduler-configmap

root@k8s-tryvolcano-m001:~# kubectl get cm -n volcano-system volcano-scheduler-configmap -oyaml
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: true
        enableReclaimable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
        enablePreemptable: false
      - name: predicates
        arguments:
          predicate.VGPUEnable: true
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n  - name: priority\n  - name: gang\n    enablePreemptable: false\n  - name: conformance\n- plugins:\n  - name: overcommit\n  - name: drf\n    enablePreemptable: false\n  - name: predicates\n  - name: proportion\n  - name: nodeorder\n  - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
  creationTimestamp: "2023-09-21T04:44:44Z"
  name: volcano-scheduler-configmap
  namespace: volcano-system
  resourceVersion: "4267609"
  uid: 086455c9-7a0e-42b0-a938-4e56a6371207

dojoeisuke · Oct 19 '23

Can you successfully launch a vGPU task?

archlitchi · Oct 25 '23

Can you successfully launch a vGPU task?

No. The status of the vcjob is Pending.

dojoeisuke · Oct 26 '23

Can you successfully launch a vGPU task?

No. The status of the vcjob is Pending.

Thanks for your reply. Can you provide the following information?

  1. GPU node annotations (using kubectl describe node <node-name>)
  2. Can you launch the following example? https://github.com/volcano-sh/devices/blob/master/examples/vgpu-case02.yml (a rough sketch of such a manifest follows below)
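
For reference, a rough sketch of what such a vGPU pod looks like (an assumption pieced together from the resource names that appear later in this thread; the authoritative manifest is the linked vgpu-case02.yml):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  schedulerName: volcano            # assumption: the pod is scheduled by Volcano
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base-ubuntu18.04
    command: ["sleep", "infinity"]  # illustrative only
    resources:
      limits:
        volcano.sh/vgpu-number: 1   # number of vGPUs requested
        volcano.sh/vgpu-memory: 1024
EOF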

archlitchi · Oct 26 '23

Can you successfully launch a vGPU task?

No. The status of the vcjob is Pending.

Thanks for your reply. Can you provide the following information?

  1. GPU node annotations (using kubectl describe node <node-name>)

Below are the GPU node's annotations:

root@k8s-tryvolcano-m001:~# k describe node k8s-tryvolcano-w004 
Name:               k8s-tryvolcano-w004
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=k8s-tryvolcano-w004
                    kubernetes.io/os=linux
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 192.168.100.168/24
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.200.126
                    volumes.kubernetes.io/controller-managed-attach-detach: true
  2. Can you launch the following example? https://github.com/volcano-sh/devices/blob/master/examples/vgpu-case02.yml

Below is the description of the podgroup when the example is launched.

root@k8s-tryvolcano-m001:~# k apply -f https://raw.githubusercontent.com/volcano-sh/devices/master/examples/vgpu-case02.yml
pod/pod1 created
root@k8s-tryvolcano-m001:~# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          3m24s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME                                            STATUS    MINMEMBER   RUNNINGS   AGE
podgroup-3bcd3bc5-f9e8-4600-b110-13eac02fe3d7   Pending   1                      3m34s
root@k8s-tryvolcano-m001:~# k describe podgroup podgroup-3bcd3bc5-f9e8-4600-b110-13eac02fe3d7 

... 

Spec:
  Min Member:  1
  Min Resources:
    count/pods:                       1
    Pods:                             1
    requests.volcano.sh/vgpu-memory:  1024
    requests.volcano.sh/vgpu-number:  1
    volcano.sh/vgpu-memory:           1024
    volcano.sh/vgpu-number:           1
  Queue:                              default
Status:
  Conditions:
    Last Transition Time:  2023-10-26T09:20:01Z
    Message:               1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         24434f25-8ee7-4a06-a929-aa01c49b80a0
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                     From     Message
  ----     ------         ----                    ----     -------
  Warning  Unschedulable  3m48s (x12 over 3m59s)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
  Normal   Unschedulable  3m47s (x13 over 3m59s)  volcano  resource in cluster is overused

dojoeisuke · Oct 26 '23

"resource in cluster is overused" message means job is reject by enqueue action.

lowang-bh · Oct 26 '23

"resource in cluster is overused" message means job is reject by enqueue action.

Upon checking the volcano-scheduler log, it seems that the cause is the absence of "volcano.sh/vgpu-number" in the "realCapability".

I1027 02:07:15.850507       1 proportion.go:230] The attributes of queue <default> in proportion: deserved <cpu 0.00, memory 0.00>, realCapability <cpu 10000.00, memory 24333668352.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, ephemeral-storage 467461047550000.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 0.00, memory 0.00, volcano.sh/vgpu-number 1000.00>, elastic <cpu 0.00, memory 0.00>, share <0.00>
I1027 02:07:15.850531       1 proportion.go:242] Remaining resource is  <cpu 10000.00, memory 24333668352.00, hugepages-2Mi 0.00, ephemeral-storage 467461047550000.00, hugepages-1Gi 0.00>
I1027 02:07:15.850555       1 proportion.go:244] Exiting when remaining is empty or no queue has more resource request:  <cpu 10000.00, memory 24333668352.00, ephemeral-storage 467461047550000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>
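
A quick way to confirm whether the node advertises the resource at all (a sketch using the same jq pattern as above; substitute your node name):

# Prints null if the resource is missing from capacity/allocatable
kubectl get node k8s-tryvolcano-w004 -o json | jq '.status.capacity["volcano.sh/vgpu-number"]'
kubectl get node k8s-tryvolcano-w004 -o json | jq '.status.allocatable["volcano.sh/vgpu-number"]'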

volcano-scheduler.log

Note:

Since the earlier logs were no longer available, pod1 was relaunched.

root@k8s-tryvolcano-m001:~# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          7m1s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME                                            STATUS    MINMEMBER   RUNNINGS   AGE
podgroup-8fe3417c-53b2-4933-bf99-fd4c4298675f   Pending   1                      7m4s

dojoeisuke · Oct 27 '23

Upon checking the volcano-scheduler log, it seems that the cause is the absence of "volcano.sh/vgpu-number" in the "realCapability".

Yes, your node's describe output shows no Volcano GPU information!

lowang-bh · Oct 27 '23

Now the volcano-device-plugin pod on the GPU node outputs "could not load NVML library".

root@k8s-tryvolcano-m001:~# k -n kube-system logs volcano-device-plugin-jtfxz 
I1027 05:40:47.592928       1 main.go:77] Loading NVML
I1027 05:40:47.593106       1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1027 05:40:47.593135       1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1027 05:40:47.593146       1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1027 05:40:47.593169       1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1027 05:40:47.593180       1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1027 05:40:47.593211       1 main.go:44] failed to initialize NVML: could not load NVML library

How to reproduce it (as minimally and precisely as possible):

Prerequisites:

  • Kubernetes cluster v1.24.3 is running
  • Volcano is installed

Reproduce:

  1. Install the NVIDIA driver on the new GPU worker node.
  2. Install nvidia-docker2 on the new GPU worker node.
  3. Install Kubernetes on the new GPU worker node.
  4. Join the new GPU worker node to the Kubernetes cluster.
  5. Install volcano-vgpu-plugin.

Note: I referred to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md.

Unfortunately, the above reproduction steps were not accurate. The following was omitted:

  • Prerequisites
    • GPU Operator is installed using Helm in the kubernetes cluster.
  • Reproduce
    • GPU Operator is uninstalled between steps 4 and 5.

In other words, the fact that the NVML library was successfully loaded in the first log (quoted below) might be due to the GPU Operator.

"volocano-device-plugin" pod log

I1018 08:42:42.247448       1 main.go:77] Loading NVML
I1018 08:42:42.317422       1 main.go:91] Starting FS watcher.
I1018 08:42:42.317465       1 main.go:98] Starting OS watcher.
I1018 08:42:42.317759       1 main.go:116] Retreiving plugins.
I1018 08:42:42.317770       1 main.go:155] No devices found. Waiting indefinitely.
I1018 08:42:42.317783       1 register.go:101] into WatchAndRegister
I1018 08:42:42.360498       1 register.go:89] Reporting devices  in 2023-10-18 08:42:42.360494312 +0000 UTC m=+0.116513880

... 

dojoeisuke · Oct 27 '23

@dojoeisuke, can I see /etc/docker/daemon.json on that GPU node?

archlitchi · Oct 27 '23

@dojoeisuke, can I see /etc/docker/daemon.json on that GPU node?

root@k8s-tryvolcano-w004:~# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
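
Note that /etc/docker/daemon.json only affects pods if the kubelet is actually talking to Docker; the node above reports containerd://1.7.2 as its container runtime. For reference, a quick check of what Docker itself picked up (a sketch):

# Should show "nvidia" among the runtimes and as the default runtime
docker info | grep -i runtime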

dojoeisuke · Oct 27 '23

Can this issue be reproduced without installing the GPU Operator?

archlitchi · Oct 27 '23

Can this issue be reproduced without installing the GPU Operator?

I tried it.

The volcano-device-plugin pod on the GPU node produced the following error output.

I1030 05:12:02.805254       1 main.go:77] Loading NVML
I1030 05:12:02.805419       1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1030 05:12:02.805428       1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1030 05:12:02.805431       1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1030 05:12:02.805467       1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1030 05:12:02.805473       1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1030 05:12:02.805498       1 main.go:44] failed to initialize NVML: could not load NVML library

Also, the example manifest was not scheduled to the GPU node.

root@k8s-tryvolcano-m001:~/gpu-check/devices# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          9m8s

dojoeisuke · Oct 30 '23

Can this issue be reproduced without installing the GPU Operator?

I tried it.

The volcano-device-plugin pod on the GPU node produced the following error output.

I1030 05:12:02.805254       1 main.go:77] Loading NVML
I1030 05:12:02.805419       1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1030 05:12:02.805428       1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1030 05:12:02.805431       1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1030 05:12:02.805467       1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1030 05:12:02.805473       1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1030 05:12:02.805498       1 main.go:44] failed to initialize NVML: could not load NVML library

Also, the example manifest was not scheduled to the GPU node.

root@k8s-tryvolcano-m001:~/gpu-check/devices# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          9m8s

Try the following command on the GPU node: docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=0 --runtime=nvidia ubuntu:18.04 bash. When inside the container, run "nvidia-smi" and see if it works.

archlitchi · Oct 30 '23

There was a gap in preparing the GPU node. In Kubernetes 1.24, it was necessary to install cri-dockerd and specify the cri-dockerd socket as the CRI socket for the kubelet (a rough sketch follows the link below).

  • https://github.com/Mirantis/cri-dockerd
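
A rough sketch of that setup (the socket path is cri-dockerd's default and the join parameters are placeholders; adjust for your cluster):

# On the GPU node, after installing cri-dockerd, join (or re-join) the cluster,
# pointing kubeadm/kubelet at the cri-dockerd socket
kubeadm join <control-plane-endpoint> --token <token> \
    --discovery-token-ca-cert-hash <hash> \
    --cri-socket unix:///var/run/cri-dockerd.sock

# Afterwards, the node's cri-socket annotation should reference cri-dockerd
kubectl get node k8s-tryvolcano-w004 -o json | jq '.metadata.annotations["kubeadm.alpha.kubernetes.io/cri-socket"]'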

As a result, "volcano.sh/vgpu-number" is now included in "allocatable" as expected.

root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -ojson | jq .status.allocatable
{
  "cpu": "2",
  "ephemeral-storage": "93492209510",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "8050772Ki",
  "pods": "110",
  "volcano.sh/vgpu-number": "10"
}

dojoeisuke · Oct 30 '23

Next, I tried to launch the example manifest.

Note: the following fields were changed:

  • image: nvidia/cuda:10.1-base-ubuntu18.04 -> nvidia/cuda:12.1.0-base-ubuntu18.04
  • vgpu-number: 1 -> 2

It failed due to a lack of resources.

root@k8s-tryvolcano-m001:~# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          80s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME                                            STATUS    MINMEMBER   RUNNINGS   AGE
podgroup-d893dffa-4407-4b36-a9e9-3e031b0224f5   Inqueue   1                      47s
root@k8s-tryvolcano-m001:~# k describe podgroup podgroup-d893dffa-4407-4b36-a9e9-3e031b0224f5 

(snip)

Spec:
  Min Member:  1
  Min Resources:
    count/pods:                       1
    Pods:                             1
    requests.volcano.sh/vgpu-memory:  1024
    requests.volcano.sh/vgpu-number:  2
    volcano.sh/vgpu-memory:           1024
    volcano.sh/vgpu-number:           2
  Queue:                              default
Status:
  Conditions:
    Last Transition Time:  2023-10-30T07:59:03Z
    Message:               1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         84edb100-71c5-44d7-8c55-c5dabd7ae74f
    Type:                  Unschedulable
  Phase:                   Inqueue
Events:
  Type     Reason         Age                From     Message
  ----     ------         ----               ----     -------
  Warning  Unschedulable  1s (x13 over 13s)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable

Does this mean there is still a gap in preparing the GPU node?

dojoeisuke · Oct 30 '23

Can this issue be reproduced without installing the GPU Operator?

I tried it. The volcano-device-plugin pod on the GPU node produced the following error output.

I1030 05:12:02.805254       1 main.go:77] Loading NVML
I1030 05:12:02.805419       1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1030 05:12:02.805428       1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1030 05:12:02.805431       1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1030 05:12:02.805467       1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1030 05:12:02.805473       1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1030 05:12:02.805498       1 main.go:44] failed to initialize NVML: could not load NVML library

Also, the example manifest was not scheduled to the GPU node.

root@k8s-tryvolcano-m001:~/gpu-check/devices# k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          9m8s

Try the following command on the GPU node: docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=0 --runtime=nvidia ubuntu:18.04 bash. When inside the container, run "nvidia-smi" and see if it works.

It was successful.

root@k8s-tryvolcano-w004:~# docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=0 --runtime=nvidia ubuntu:18.04 bash
Unable to find image 'ubuntu:18.04' locally
18.04: Pulling from library/ubuntu
7c457f213c76: Pull complete 
Digest: sha256:152dc042452c496007f07ca9127571cb9c29697f42acbfad72324b2bb2e43c98
Status: Downloaded newer image for ubuntu:18.04
root@3b1a7f3abe05:/# nvidia-smi
Mon Oct 30 08:23:46 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:00:05.0 Off |                    0 |
| N/A   35C    P0              44W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@3b1a7f3abe05:/# exit
exit
root@k8s-tryvolcano-w004:~# 

dojoeisuke · Oct 30 '23

@archlitchi

About https://github.com/volcano-sh/volcano/issues/3160#issuecomment-1784644826, since "volcano.sh/vgpu-number" has become part of the allocatable resources, would it be better to close this issue? Also, should I submit a new issue about https://github.com/volcano-sh/volcano/issues/3160#issuecomment-1784664057?

dojoeisuke · Nov 01 '23

Is your kubernetes cluster set up to use docker or containerd as its underlying container runtime? If it’s containerd, you need to follow the instructions under the containerd tab here to set it up: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2
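
A minimal sketch of that containerd-side setup, assuming the NVIDIA Container Toolkit (nvidia-ctk) is installed on the GPU node (flag names per the toolkit's documentation; verify against your installed version):

# Configure containerd to use the nvidia runtime and make it the default,
# then restart containerd
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd

# /etc/containerd/config.toml should end up with something like:
#   [plugins."io.containerd.grpc.v1.cri".containerd]
#     default_runtime_name = "nvidia"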

archlitchi · Nov 01 '23

Is your kubernetes cluster set up to use docker or containerd as its underlying container runtime? If it’s containerd, you need to follow the instructions under the containerd tab here to set it up: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2

The above URL seems to redirect to https://docs.nvidia.com/datacenter/cloud-native/index.html. Is the following URL correct? https://docs.nvidia.com/datacenter/cloud-native/kubernetes/latest/index.html

dojoeisuke · Nov 01 '23

Is your problem fixed, @dojoeisuke? And is it caused by Docker support being removed in Kubernetes v1.24? @archlitchi

Monokaix · Jan 19 '24

@Monokaix

The problem has not been resolved, but I personally find it difficult to continue the investigation, so I will temporarily close this issue. Thank you for your support. @archlitchi

dojoeisuke · Jan 25 '24