"volocano.sh/vgpu-number" is not included in the allocatable resources.
What happened:
I followed the user guide to set up vGPU, but "volcano.sh/vgpu-number" is not included in the allocatable resources.
user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md
What you expected to happen:
"volcano.sh/vgpu-number: XX" is included by executing the following command.
root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -ojson | jq .status.allocatable
{
"cpu": "2",
"ephemeral-storage": "93492209510",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "8050764Ki",
"pods": "110"
}
How to reproduce it (as minimally and precisely as possible):
Prerequisites:
- kubernetes cluster v1.24.3 is running
- Installed Volcano
Reproduce:
- Install nvidia drivers in new GPU worker node.
- Install nvidia-docker2 in new GPU worker node.
- Install kubernetes in new GPU worker node.
- Join new GPU worker node to kubernetes cluster.
- Install volcano-vgpu-plugin.
Note: I referred to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md.
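For reference, the kind of test pod the user guide describes is roughly the following. This is only a sketch: the pod name, image, command, and the assumption that vgpu-memory is in MiB are mine, not the guide's exact example.
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-test
spec:
  schedulerName: volcano              # the pod must go through the Volcano scheduler
  containers:
  - name: cuda
    image: nvidia/cuda:12.1.0-base-ubuntu18.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        volcano.sh/vgpu-number: 1     # number of vGPUs requested
        volcano.sh/vgpu-memory: 1024  # vGPU memory per vGPU (unit assumed to be MiB)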
Anything else we need to know?:
Environment:
- Volcano Version:
v1.8.0
- Kubernetes version (use kubectl version):
root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-tryvolcano-w004 Ready <none> 18h v1.24.3 192.168.100.168 <none> Ubuntu 20.04.6 LTS 5.4.0-72-generic containerd://1.7.2
- Cloud provider or hardware configuration:
Cloud provider: OpenStack
- OS (e.g. from /etc/os-release):
root@k8s-tryvolcano-w004:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
- Kernel (e.g. uname -a):
root@k8s-tryvolcano-w004:~# uname -a
Linux k8s-tryvolcano-w004 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
kubeadm
- Others:
Nvidia driver
root@k8s-tryvolcano-w004:~# dpkg -l | grep nvidia-driver
ii nvidia-driver-535-server-open 535.104.12-0ubuntu0.20.04.1 amd64 NVIDIA driver (open kernel) metapackage
nvidia-docker2
root@k8s-tryvolcano-w004:~# dpkg -l | grep nvidia-docker
ii nvidia-docker2 2.13.0-1 all nvidia-docker CLI wrapper
GPU
root@k8s-tryvolcano-w004:~# nvidia-smi
Thu Oct 19 02:24:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:00:05.0 Off | 0 |
| N/A 43C P0 63W / 300W | 4MiB / 81920MiB | 20% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
"volocano-device-plugin" pod log
I1018 08:42:42.247448 1 main.go:77] Loading NVML
I1018 08:42:42.317422 1 main.go:91] Starting FS watcher.
I1018 08:42:42.317465 1 main.go:98] Starting OS watcher.
I1018 08:42:42.317759 1 main.go:116] Retreiving plugins.
I1018 08:42:42.317770 1 main.go:155] No devices found. Waiting indefinitely.
I1018 08:42:42.317783 1 register.go:101] into WatchAndRegister
I1018 08:42:42.360498 1 register.go:89] Reporting devices in 2023-10-18 08:42:42.360494312 +0000 UTC m=+0.116513880
I1018 08:43:12.468827 1 register.go:89] Reporting devices in 2023-10-18 08:43:12.468819399 +0000 UTC m=+30.224838968
I1018 08:43:42.485190 1 register.go:89] Reporting devices in 2023-10-18 08:43:42.485182962 +0000 UTC m=+60.241202532
I1018 08:44:12.505930 1 register.go:89] Reporting devices in 2023-10-18 08:44:12.505920612 +0000 UTC m=+90.261940182
I1018 08:44:42.523805 1 register.go:89] Reporting devices in 2023-10-18 08:44:42.523797163 +0000 UTC m=+120.279816722
I1018 08:45:12.542654 1 register.go:89] Reporting devices in 2023-10-18 08:45:12.542646375 +0000 UTC m=+150.298665943
I1018 08:45:42.564609 1 register.go:89] Reporting devices in 2023-10-18 08:45:42.564600701 +0000 UTC m=+180.320620270
I1018 08:46:12.584788 1 register.go:89] Reporting devices in 2023-10-18 08:46:12.584777812 +0000 UTC m=+210.340797381
I1018 08:46:42.653138 1 register.go:89] Reporting devices in 2023-10-18 08:46:42.653129051 +0000 UTC m=+240.409148620
I1018 08:47:12.674599 1 register.go:89] Reporting devices in 2023-10-18 08:47:12.674591614 +0000 UTC m=+270.430611183
I1018 08:47:42.690977 1 register.go:89] Reporting devices in 2023-10-18 08:47:42.69097107 +0000 UTC m=+300.446990640
I1018 08:48:12.707222 1 register.go:89] Reporting devices in 2023-10-18 08:48:12.707213231 +0000 UTC m=+330.463232800
I1018 08:48:42.781451 1 register.go:89] Reporting devices in 2023-10-18 08:48:42.781437965 +0000 UTC m=+360.537457544
I1018 08:49:12.816300 1 register.go:89] Reporting devices in 2023-10-18 08:49:12.816292362 +0000 UTC m=+390.572311921
I1018 08:49:42.834850 1 register.go:89] Reporting devices in 2023-10-18 08:49:42.834844163 +0000 UTC m=+420.590863732
I1018 08:50:12.855810 1 register.go:89] Reporting devices in 2023-10-18 08:50:12.855797817 +0000 UTC m=+450.611817406
I1018 08:50:42.875763 1 register.go:89] Reporting devices in 2023-10-18 08:50:42.875755678 +0000 UTC m=+480.631775247
I1018 08:51:12.892908 1 register.go:89] Reporting devices in 2023-10-18 08:51:12.89289625 +0000 UTC m=+510.648915829
I1018 08:51:42.913563 1 register.go:89] Reporting devices in 2023-10-18 08:51:42.913556355 +0000 UTC m=+540.669575924
I1018 08:52:12.938239 1 register.go:89] Reporting devices in 2023-10-18 08:52:12.93823072 +0000 UTC m=+570.694250290
I1018 08:52:42.968125 1 register.go:89] Reporting devices in 2023-10-18 08:52:42.968118172 +0000 UTC m=+600.724137731
I1018 08:53:12.988476 1 register.go:89] Reporting devices in 2023-10-18 08:53:12.988467434 +0000 UTC m=+630.744487003
...
volcano-scheduler-configmap
root@k8s-tryvolcano-m001:~# kubectl get cm -n volcano-system volcano-scheduler-configmap -oyaml
apiVersion: v1
data:
volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: priority
- name: gang
enablePreemptable: true
enableReclaimable: false
- name: conformance
- plugins:
- name: overcommit
- name: drf
enablePreemptable: false
- name: predicates
arguments:
predicate.VGPUEnable: true
- name: proportion
- name: nodeorder
- name: binpack
kind: ConfigMap
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"v1","data":{"volcano-scheduler.conf":"actions: \"enqueue, allocate, backfill\"\ntiers:\n- plugins:\n - name: priority\n - name: gang\n enablePreemptable: false\n - name: conformance\n- plugins:\n - name: overcommit\n - name: drf\n enablePreemptable: false\n - name: predicates\n - name: proportion\n - name: nodeorder\n - name: binpack\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"volcano-scheduler-configmap","namespace":"volcano-system"}}
creationTimestamp: "2023-09-21T04:44:44Z"
name: volcano-scheduler-configmap
namespace: volcano-system
resourceVersion: "4267609"
uid: 086455c9-7a0e-42b0-a938-4e56a6371207
Can you successfully launch a vGPU task?
No. The status of the vcjob is Pending.
Thanks for your reply, can you provide the following information?
- GPU node annotations (using kubectl describe node)
- can you launch the following example? https://github.com/volcano-sh/devices/blob/master/examples/vgpu-case02.yml
Below are the GPU node's annotations:
root@k8s-tryvolcano-m001:~# k describe node k8s-tryvolcano-w004
Name: k8s-tryvolcano-w004
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=k8s-tryvolcano-w004
kubernetes.io/os=linux
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 192.168.100.168/24
projectcalico.org/IPv4IPIPTunnelAddr: 192.168.200.126
volumes.kubernetes.io/controller-managed-attach-detach: true
- can you launch the following example? https://github.com/volcano-sh/devices/blob/master/examples/vgpu-case02.yml
Below is the description of the podgroup after the example is launched.
root@k8s-tryvolcano-m001:~# k apply -f https://raw.githubusercontent.com/volcano-sh/devices/master/examples/vgpu-case02.yml
pod/pod1 created
root@k8s-tryvolcano-m001:~# k get po
NAME READY STATUS RESTARTS AGE
pod1 0/1 Pending 0 3m24s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME STATUS MINMEMBER RUNNINGS AGE
podgroup-3bcd3bc5-f9e8-4600-b110-13eac02fe3d7 Pending 1 3m34s
root@k8s-tryvolcano-m001:~# k describe podgroup podgroup-3bcd3bc5-f9e8-4600-b110-13eac02fe3d7
...
Spec:
Min Member: 1
Min Resources:
count/pods: 1
Pods: 1
requests.volcano.sh/vgpu-memory: 1024
requests.volcano.sh/vgpu-number: 1
volcano.sh/vgpu-memory: 1024
volcano.sh/vgpu-number: 1
Queue: default
Status:
Conditions:
Last Transition Time: 2023-10-26T09:20:01Z
Message: 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
Reason: NotEnoughResources
Status: True
Transition ID: 24434f25-8ee7-4a06-a929-aa01c49b80a0
Type: Unschedulable
Phase: Pending
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unschedulable 3m48s (x12 over 3m59s) volcano 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
Normal Unschedulable 3m47s (x13 over 3m59s) volcano resource in cluster is overused
"resource in cluster is overused" message means job is reject by enqueue action.
"resource in cluster is overused" message means job is reject by enqueue action.
Upon checking the volcano-scheduler log, it seems that the cause is the absence of "volcano.sh/vgpu-number" in the "realCapability".
I1027 02:07:15.850507 1 proportion.go:230] The attributes of queue <default> in proportion: deserved <cpu 0.00, memory 0.00>, realCapability <cpu 10000.00, memory 24333668352.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00, ephemeral-storage 467461047550000.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 0.00, memory 0.00, volcano.sh/vgpu-number 1000.00>, elastic <cpu 0.00, memory 0.00>, share <0.00>
I1027 02:07:15.850531 1 proportion.go:242] Remaining resource is <cpu 10000.00, memory 24333668352.00, hugepages-2Mi 0.00, ephemeral-storage 467461047550000.00, hugepages-1Gi 0.00>
I1027 02:07:15.850555 1 proportion.go:244] Exiting when remaining is empty or no queue has more resource request: <cpu 10000.00, memory 24333668352.00, ephemeral-storage 467461047550000.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>
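(For reference, the scheduler log above can be retrieved with something like the following, assuming the default deployment name from the Volcano install:)
kubectl -n volcano-system logs deploy/volcano-scheduler | grep proportion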
Note:
Since the past logs were no longer visible, pod1 was relaunched.
root@k8s-tryvolcano-m001:~# k get po
NAME READY STATUS RESTARTS AGE
pod1 0/1 Pending 0 7m1s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME STATUS MINMEMBER RUNNINGS AGE
podgroup-8fe3417c-53b2-4933-bf99-fd4c4298675f Pending 1 7m4s
Upon checking the volcano-scheduler log, it seems that the cause is the absence of "volcano.sh/vgpu-number" in the "realCapability".
Yes, your node's describe output shows no Volcano GPU information!
Now the volcano-device-plugin pod on the GPU node outputs "could not load NVML library".
root@k8s-tryvolcano-m001:~# k -n kube-system logs volcano-device-plugin-jtfxz
I1027 05:40:47.592928 1 main.go:77] Loading NVML
I1027 05:40:47.593106 1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1027 05:40:47.593135 1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1027 05:40:47.593146 1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1027 05:40:47.593169 1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1027 05:40:47.593180 1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1027 05:40:47.593211 1 main.go:44] failed to initialize NVML: could not load NVML library
How to reproduce it (as minimally and precisely as possible):
Prerequisites:
- kubernetes cluster v1.24.3 is running
- Installed Volcano
Reproduce:
- Install nvidia drivers in new GPU worker node.
- Install nvidia-docker2 in new GPU worker node.
- Install kubernetes in new GPU worker node.
- Join new GPU worker node to kubernetes cluster.
- Install volcano-vgpu-plugin.
Note: I referred to https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_vgpu.md.
Unfortunately, the above reproduction steps were not accurate. The following was omitted:
- Prerequisites
  - GPU Operator was installed in the Kubernetes cluster using Helm.
- Reproduce
  - GPU Operator was uninstalled between steps 4 and 5.
In other words, the fact that the NVML library was successfully loaded in the first "volcano-device-plugin" log (shown earlier in this issue) might be due to the influence of the GPU Operator.
"volocano-device-plugin" pod log
I1018 08:42:42.247448 1 main.go:77] Loading NVML I1018 08:42:42.317422 1 main.go:91] Starting FS watcher. I1018 08:42:42.317465 1 main.go:98] Starting OS watcher. I1018 08:42:42.317759 1 main.go:116] Retreiving plugins. I1018 08:42:42.317770 1 main.go:155] No devices found. Waiting indefinitely. I1018 08:42:42.317783 1 register.go:101] into WatchAndRegister I1018 08:42:42.360498 1 register.go:89] Reporting devices in 2023-10-18 08:42:42.360494312 +0000 UTC m=+0.116513880 I1018 08:43:12.468827 1 register.go:89] Reporting devices in 2023-10-18 08:43:12.468819399 +0000 UTC m=+30.224838968 I1018 08:43:42.485190 1 register.go:89] Reporting devices in 2023-10-18 08:43:42.485182962 +0000 UTC m=+60.241202532 I1018 08:44:12.505930 1 register.go:89] Reporting devices in 2023-10-18 08:44:12.505920612 +0000 UTC m=+90.261940182 I1018 08:44:42.523805 1 register.go:89] Reporting devices in 2023-10-18 08:44:42.523797163 +0000 UTC m=+120.279816722 I1018 08:45:12.542654 1 register.go:89] Reporting devices in 2023-10-18 08:45:12.542646375 +0000 UTC m=+150.298665943 I1018 08:45:42.564609 1 register.go:89] Reporting devices in 2023-10-18 08:45:42.564600701 +0000 UTC m=+180.320620270 I1018 08:46:12.584788 1 register.go:89] Reporting devices in 2023-10-18 08:46:12.584777812 +0000 UTC m=+210.340797381 I1018 08:46:42.653138 1 register.go:89] Reporting devices in 2023-10-18 08:46:42.653129051 +0000 UTC m=+240.409148620 I1018 08:47:12.674599 1 register.go:89] Reporting devices in 2023-10-18 08:47:12.674591614 +0000 UTC m=+270.430611183 I1018 08:47:42.690977 1 register.go:89] Reporting devices in 2023-10-18 08:47:42.69097107 +0000 UTC m=+300.446990640 I1018 08:48:12.707222 1 register.go:89] Reporting devices in 2023-10-18 08:48:12.707213231 +0000 UTC m=+330.463232800 I1018 08:48:42.781451 1 register.go:89] Reporting devices in 2023-10-18 08:48:42.781437965 +0000 UTC m=+360.537457544 I1018 08:49:12.816300 1 register.go:89] Reporting devices in 2023-10-18 08:49:12.816292362 +0000 UTC m=+390.572311921 I1018 08:49:42.834850 1 register.go:89] Reporting devices in 2023-10-18 08:49:42.834844163 +0000 UTC m=+420.590863732 I1018 08:50:12.855810 1 register.go:89] Reporting devices in 2023-10-18 08:50:12.855797817 +0000 UTC m=+450.611817406 I1018 08:50:42.875763 1 register.go:89] Reporting devices in 2023-10-18 08:50:42.875755678 +0000 UTC m=+480.631775247 I1018 08:51:12.892908 1 register.go:89] Reporting devices in 2023-10-18 08:51:12.89289625 +0000 UTC m=+510.648915829 I1018 08:51:42.913563 1 register.go:89] Reporting devices in 2023-10-18 08:51:42.913556355 +0000 UTC m=+540.669575924 I1018 08:52:12.938239 1 register.go:89] Reporting devices in 2023-10-18 08:52:12.93823072 +0000 UTC m=+570.694250290 I1018 08:52:42.968125 1 register.go:89] Reporting devices in 2023-10-18 08:52:42.968118172 +0000 UTC m=+600.724137731 I1018 08:53:12.988476 1 register.go:89] Reporting devices in 2023-10-18 08:53:12.988467434 +0000 UTC m=+630.744487003 ...
@dojoeisuke can I see /etc/docker/daemon.json on that GPU node?
root@k8s-tryvolcano-w004:~# cat /etc/docker/daemon.json
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
Can this issue be reproduced without installing GPU Operator?
I tried it. The volcano-device-plugin pod on the GPU node produced the following error output.
I1030 05:12:02.805254 1 main.go:77] Loading NVML
I1030 05:12:02.805419 1 main.go:79] Failed to initialize NVML: could not load NVML library.
I1030 05:12:02.805428 1 main.go:80] If this is a GPU node, did you set the docker default runtime to `nvidia`?
I1030 05:12:02.805431 1 main.go:81] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
I1030 05:12:02.805467 1 main.go:82] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
I1030 05:12:02.805473 1 main.go:83] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
F1030 05:12:02.805498 1 main.go:44] failed to initialize NVML: could not load NVML library
Also, the example manifest's pod was not scheduled to the GPU node.
root@k8s-tryvolcano-m001:~/gpu-check/devices# k get po
NAME READY STATUS RESTARTS AGE
pod1 0/1 Pending 0 9m8s
Try the following command on the GPU node:
docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=0 --runtime=nvidia ubuntu:18.04 bash
When inside the container, run "nvidia-smi" and see if it works.
There was an inadequacy in preparing the GPU node. In Kubernetes 1.24, it was necessary to install cri-dockerd and specify it as the cri-socket for kubelet (a sketch of the join step is below).
- https://github.com/Mirantis/cri-dockerd
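A rough sketch of the join step with the cri-dockerd socket; the endpoint, token, and hash are placeholders, and the socket path is the cri-dockerd default:
kubeadm join <control-plane-endpoint>:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --cri-socket unix:///var/run/cri-dockerd.sock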
As a result, "volcano.sh/vgpu-number" is included in "allocatable" as expected.
root@k8s-tryvolcano-m001:~# k get node k8s-tryvolcano-w004 -ojson | jq .status.allocatable
{
"cpu": "2",
"ephemeral-storage": "93492209510",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "8050772Ki",
"pods": "110",
"volcano.sh/vgpu-number": "10"
}
Next, I tried to launch the example manifest, but it failed due to a lack of resources (output below). Note: the following fields were changed; a sketch of where they sit in the manifest follows.
- image: nvidia/cuda:10.1-base-ubuntu18.04 -> nvidia/cuda:12.1.0-base-ubuntu18.04
- vgpu-number: 1 -> 2
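A rough sketch of where those changed fields sit in the container spec (only the relevant part; the layout is an assumption about the example file, not a copy of it):
    image: nvidia/cuda:12.1.0-base-ubuntu18.04    # changed from 10.1-base-ubuntu18.04
    resources:
      limits:
        volcano.sh/vgpu-number: 2                 # changed from 1
        volcano.sh/vgpu-memory: 1024              # unchanged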
root@k8s-tryvolcano-m001:~# k get po
NAME READY STATUS RESTARTS AGE
pod1 0/1 Pending 0 80s
root@k8s-tryvolcano-m001:~# k get podgroup
NAME STATUS MINMEMBER RUNNINGS AGE
podgroup-d893dffa-4407-4b36-a9e9-3e031b0224f5 Inqueue 1 47s
root@k8s-tryvolcano-m001:~# k describe podgroup podgroup-d893dffa-4407-4b36-a9e9-3e031b0224f5
(snip)
Spec:
Min Member: 1
Min Resources:
count/pods: 1
Pods: 1
requests.volcano.sh/vgpu-memory: 1024
requests.volcano.sh/vgpu-number: 2
volcano.sh/vgpu-memory: 1024
volcano.sh/vgpu-number: 2
Queue: default
Status:
Conditions:
Last Transition Time: 2023-10-30T07:59:03Z
Message: 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
Reason: NotEnoughResources
Status: True
Transition ID: 84edb100-71c5-44d7-8c55-c5dabd7ae74f
Type: Unschedulable
Phase: Inqueue
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unschedulable 1s (x13 over 13s) volcano 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
Does this mean there is still an inadequacy in preparing the GPU node?
Running the suggested docker command was successful:
root@k8s-tryvolcano-w004:~# docker run -it --rm -e=NVIDIA_VISIBLE_DEVICES=0 --runtime=nvidia ubuntu:18.04 bash
Unable to find image 'ubuntu:18.04' locally
18.04: Pulling from library/ubuntu
7c457f213c76: Pull complete
Digest: sha256:152dc042452c496007f07ca9127571cb9c29697f42acbfad72324b2bb2e43c98
Status: Downloaded newer image for ubuntu:18.04
root@3b1a7f3abe05:/# nvidia-smi
Mon Oct 30 08:23:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:00:05.0 Off | 0 |
| N/A 35C P0 44W / 300W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
root@3b1a7f3abe05:/# exit
exit
root@k8s-tryvolcano-w004:~#
@archlitchi
About https://github.com/volcano-sh/volcano/issues/3160#issuecomment-1784644826, since "volcano.sh/vgpu-number" has become part of the allocatable resources, would it be better to close this issue? Also, should I submit a new issue about https://github.com/volcano-sh/volcano/issues/3160#issuecomment-1784664057?
Is your kubernetes cluster set up to use docker or containerd as its underlying container runtime? If it’s containerd, you need to follow the instructions under the containerd tab here to set it up: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2
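For reference, the containerd route described there boils down to registering the NVIDIA runtime and making it the default, roughly as below in /etc/containerd/config.toml (a sketch; section names depend on the containerd config version), followed by a containerd restart:
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
systemctl restart containerd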
The above URL seems to redirect to https://docs.nvidia.com/datacenter/cloud-native/index.html. Is the following URL correct? https://docs.nvidia.com/datacenter/cloud-native/kubernetes/latest/index.html
Is your problem fixed? @dojoeisuke And is it caused by Docker being removed in Kubernetes v1.24? @archlitchi
@Monokaix
The problem has not been resolved, but I personally find it difficult to continue the investigation, so I will temporarily close this issue. Thank you for your support. @archlitchi