I'm trying out the GPU share scheduler extender, but the gpushare-schd-extender pod is stuck in Pending state. My environment:

microk8s: v1.23.3
OS: Ubuntu 20.04.3 LTS
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   34C    P8    11W /  70W |      4MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A      1372      G   /usr/lib/xorg/Xorg                  4MiB |

Output of POD describe:

administrator@ubuntu1:~$ kubectl describe pod gpushare-schd-extender-778bf88d65-fklqv -n kube-system
Name:           gpushare-schd-extender-778bf88d65-fklqv
Namespace:      kube-system
Priority:       0
Node:           <none>
Labels:         app=gpushare
Status:         Pending
IPs:            <none>
Controlled By:  ReplicaSet/gpushare-schd-extender-778bf88d65
    Image:      localhost:32000/k8s-gpushare-schd-extender:1.11-d170d8a
    Port:       <none>
    Host Port:  <none>
      LOG_LEVEL:  debug
      PORT:       12345
      /var/run/secrets/ from kube-api-access-5p7cv (ro)
  Type           Status
  PodScheduled   False
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Tolerations:        op=Exists
                    op=Exists for 300s
                    op=Exists for 300s
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  87s (x687 over 11h)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.

Node Labels:

administrator@ubuntu1:~$ kubectl get nodes --show-labels
NAME      STATUS   ROLES    AGE   VERSION                    LABELS
ubuntu1   Ready    <none>   12h   v1.23.3-2+d441060727c463   app=gpushare,,,**gpushare=true**,,,,,
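The `FailedScheduling` event above means the pod carries a node selector or node affinity that no node's labels satisfy. A minimal sketch for printing both sides of that comparison follows. `show_selector_vs_labels` is a hypothetical helper name, not part of gpushare, and the pod name in the usage comment is simply the one from the describe output.

```shell
# Sketch only: print what the pod asks for, then what the nodes offer.
# show_selector_vs_labels is a hypothetical helper, not a gpushare tool.
show_selector_vs_labels() {
  pod="$1"
  ns="${2:-kube-system}"   # assumption: the extender lives in kube-system

  echo "nodeSelector / affinity requested by ${pod}:"
  # Either field may be empty; both can cause the FailedScheduling message.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.affinity}{"\n"}'

  echo "labels present on the nodes:"
  kubectl get nodes --show-labels
}

# Usage:
#   show_selector_vs_labels gpushare-schd-extender-778bf88d65-fklqv kube-system
```

Every key/value pair printed under `nodeSelector` has to appear among the node's labels, otherwise the pod stays Pending.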

My docker configuration:

administrator@ubuntu1:~$ cat /etc/docker/daemon.json
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []

The device plugin pod doesn't seem to be functioning correctly either. When I tried to get its logs, I got the following error:

administrator@ubuntu1:~$ kubectl logs gpushare-device-plugin-ds-b5kfw -n kube-system
Error from server (Forbidden): Forbidden (user=, verb=get, resource=nodes, subresource=proxy) ( pods/log gpushare-device-plugin-ds-b5kfw)

What am I missing? Kindly suggest.

m1nish1208 avatar Feb 08 '22 05:02 m1nish1208

I have the same problem using minikube. The scheduler extender pod remains in Pending state, so nothing works.

mknnj avatar Feb 08 '22 10:02 mknnj

Yeah, can someone suggest what I've missed?

m1nish1208 avatar Feb 08 '22 11:02 m1nish1208

I was able to make some progress on this, and all the previously mentioned problems are fixed. I'm now getting the following error while running the test pod binpack. Here are the logs from the gpushare-schd-extender pod:

[ debug ] 2022/02/09 17:06:31 controller.go:176: begin to sync gpushare pod binpack-1-5fb868d569-v6hp5 in ns default
[ debug ] 2022/02/09 17:06:31 cache.go:90: Add or update pod info: &Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:binpack-1-5fb868d569-v6hp5,GenerateName:binpack-1-5fb868d569-,Namespace:default,SelfLink:/api/v1/namespaces/default/pods/binpack-1-5fb868d569-v6hp5,UID:dd5d948e-a702-4f2c-ac23-7fc89fe1250e,ResourceVersion:11119,Generation:0,CreationTimestamp:2022-02-09 17:06:31 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: binpack-1,pod-template-hash: 5fb868d569,},Annotations:map[string]string{},OwnerReferences:[{apps/v1 ReplicaSet binpack-1-5fb868d569 4bfe63bf-57e2-4d3c-a33c-6d51436fbbfc 0xc42004140a 0xc42004140b}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{kube-api-access-rkhwn {nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil ProjectedVolumeSource{Sources:[{nil nil nil ServiceAccountTokenProjection{Audience:,ExpirationSeconds:*3607,Path:token,}} {nil nil &ConfigMapProjection{LocalObjectReference:LocalObjectReference{Name:kube-root-ca.crt,},Items:[{ca.crt ca.crt <nil>}],Optional:nil,} nil} {nil &DownwardAPIProjection{Items:[{namespace ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,} nil <nil>}],} nil nil}],DefaultMode:*420,} nil nil nil}}],Containers:[{binpack-1 localhost:32000/cheyang/gpu-player:v2 [] []  [] [] [] {map[{{8192 0} {<nil>} 8192 DecimalSI}] map[{{8192 0} {<nil>} 8192 DecimalSI}]} [{kube-api-access-rkhwn true /var/run/secrets/  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{ Exists  NoExecute 0xc4200418b0} { Exists  NoExecute 0xc4200418d0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}
[ debug ] 2022/02/09 17:06:31 cache.go:91: Node map[]
[ debug ] 2022/02/09 17:06:31 cache.go:93: pod binpack-1-5fb868d569-v6hp5 in ns default is not assigned to any node, skip
[  info ] 2022/02/09 17:06:31 controller.go:223: end processNextWorkItem()
[ debug ] 2022/02/09 17:06:31 controller.go:295: No need to update pod name binpack-1-5fb868d569-v6hp5 in ns default and old status is Pending, new status is Pending; its old annotation map[] and new annotation map[]
[  info ] 2022/02/09 17:06:32 controller.go:210: begin processNextWorkItem()
[ debug ] 2022/02/09 17:06:46 controller.go:295: No need to update pod name binpack-1-5fb868d569-v6hp5 in ns default and old status is Pending, new status is Pending; its old annotation map[] and new annotation map[]
[ debug ] 2022/02/09 17:07:16 controller.go:295: No need to update pod name binpack-1-5fb868d569-v6hp5 in ns default and old status is Pending, new status is Pending; its old annotation map[] and new annotation map[]

What am I missing now? Can anyone suggest?
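The extender log above only reports that the pod has not been assigned to a node. The scheduler's own reason is normally recorded in the pod's events rather than in the extender. A small sketch for listing those events follows. `why_pending` is a hypothetical helper name, and the `default` namespace is taken from the log lines.

```shell
# Sketch only: list the events recorded for one pod, oldest first.
# why_pending is a hypothetical helper, not a gpushare tool.
why_pending() {
  pod="$1"
  ns="${2:-default}"   # the binpack pod in the log runs in "default"

  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=${pod}" \
    --sort-by=.lastTimestamp
}

# Usage:
#   why_pending binpack-1-5fb868d569-v6hp5 default
```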

m1nish1208 avatar Feb 09 '22 09:02 m1nish1208

> Yea... Can someone suggest what I've missed.....

How do I fix the Pending state?

631068264 avatar Apr 14 '22 03:04 631068264

Just delete the nodeSelector in gpushare-schd-extender.yaml.
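If the extender is already deployed, the same change can probably be made in place instead of editing the file and re-applying it. The sketch below uses a JSON patch to drop the field. It assumes the Deployment is named `gpushare-schd-extender` (inferred from the ReplicaSet name in the describe output) and lives in `kube-system`. `drop_node_selector` is a hypothetical helper name.

```shell
# Sketch only: remove spec.template.spec.nodeSelector from the running Deployment.
# drop_node_selector is a hypothetical helper; the default names are assumptions.
drop_node_selector() {
  deploy="${1:-gpushare-schd-extender}"
  ns="${2:-kube-system}"

  # A JSON-patch "remove" fails if the path does not exist, which also
  # tells you the selector was already gone.
  kubectl patch deployment "$deploy" -n "$ns" --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/nodeSelector"}]'
}

# Usage:
#   drop_node_selector
#   kubectl get pods -n kube-system -l app=gpushare
```

Removing the block from gpushare-schd-extender.yaml as well keeps the selector from coming back the next time the manifest is applied.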

631068264 avatar Apr 14 '22 08:04 631068264

How did you set this up on microk8s?

Naegionn avatar Jun 01 '22 15:06 Naegionn

> just delete nodeSelector in gpushare-schd-extender.yaml

Does removing nodeSelector affect usage, and on which node should this component run?

db-root avatar Jul 27 '23 03:07 db-root