gpushare-scheduler-extender
gpushare-schd-extender in Pending State
Hi,
I'm trying the GPU share scheduler; however, the pod gpushare-schd-extender is stuck in the Pending state. My environment:
microk8s: v1.23.3
OS: Ubuntu 20.04.3 LTS
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:0B:00.0 Off | 0 |
| N/A 34C P8 11W / 70W | 4MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1372 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
Output of kubectl describe for the pod:
administrator@ubuntu1:~$ kubectl describe pod gpushare-schd-extender-778bf88d65-fklqv -n kube-system
Name: gpushare-schd-extender-778bf88d65-fklqv
Namespace: kube-system
Priority: 0
Node: <none>
Labels: app=gpushare
component=gpushare-schd-extender
pod-template-hash=778bf88d65
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/gpushare-schd-extender-778bf88d65
Containers:
gpushare-schd-extender:
Image: localhost:32000/k8s-gpushare-schd-extender:1.11-d170d8a
Port: <none>
Host Port: <none>
Environment:
LOG_LEVEL: debug
PORT: 12345
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5p7cv (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
kube-api-access-5p7cv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists
node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 87s (x687 over 11h) default-scheduler 0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
administrator@ubuntu1:~$
Node Labels:
administrator@ubuntu1:~$ kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
ubuntu1 Ready <none> 12h v1.23.3-2+d441060727c463 app=gpushare,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,**gpushare=true**,kubernetes.io/arch=amd64,kubernetes.io/hostname=ubuntu1,kubernetes.io/os=linux,microk8s.io/cluster=true,node.kubernetes.io/microk8s-controlplane=microk8s-controlplane
administrator@ubuntu1:~$
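The FailedScheduling event together with the Node-Selectors field shows why it stays Pending: the extender pod asks for a node labeled node-role.kubernetes.io/master=, and ubuntu1 does not carry that label (its ROLES column is <none>, and microk8s does not add node-role labels by default). A minimal sketch of how to confirm the mismatch and, as one option, add the expected label so the existing selector matches (deployment name assumed from the ReplicaSet shown above):

kubectl get deploy gpushare-schd-extender -n kube-system \
  -o jsonpath='{.spec.template.spec.nodeSelector}'   # labels the pod requires
kubectl get node ubuntu1 --show-labels               # labels the node actually has

# one option: give the node the label the selector expects
kubectl label node ubuntu1 node-role.kubernetes.io/master=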
My docker configuration:
administrator@ubuntu1:~$ cat /etc/docker/daemon.json
{
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
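For comparison, gpushare setup guides generally also make nvidia the default Docker runtime so containers pick up the NVIDIA runtime without per-pod configuration. A sketch of that daemon.json, assuming your kubelet really uses Docker (microk8s ships with containerd by default, in which case this file is not consulted at all):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}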
The device plugin pod doesn't seem to be functioning correctly; when I tried to get its logs, I got the following error:
administrator@ubuntu1:~$ kubectl logs gpushare-device-plugin-ds-b5kfw -n kube-system
Error from server (Forbidden): Forbidden (user=127.0.0.1, verb=get, resource=nodes, subresource=proxy) ( pods/log gpushare-device-plugin-ds-b5kfw)
What is it that I'm missing? Kindly suggest.
I have the same problem using minikube. The scheduler remains in the Pending state, so nothing works.
Yeah... can someone suggest what I've missed?
I was able to make some progress on this; all of the previously mentioned problems are fixed. I'm now getting the following error while running a test pod, binpack. Here are the logs from the gpushare-schd-extender pod:
[ debug ] 2022/02/09 17:06:31 controller.go:176: begin to sync gpushare pod binpack-1-5fb868d569-v6hp5 in ns default
[ debug ] 2022/02/09 17:06:31 cache.go:90: Add or update pod info: &Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:binpack-1-5fb868d569-v6hp5,GenerateName:binpack-1-5fb868d569-,Namespace:default,SelfLink:/api/v1/namespaces/default/pods/binpack-1-5fb868d569-v6hp5,UID:dd5d948e-a702-4f2c-ac23-7fc89fe1250e,ResourceVersion:11119,Generation:0,CreationTimestamp:2022-02-09 17:06:31 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: binpack-1,pod-template-hash: 5fb868d569,},Annotations:map[string]string{},OwnerReferences:[{apps/v1 ReplicaSet binpack-1-5fb868d569 4bfe63bf-57e2-4d3c-a33c-6d51436fbbfc 0xc42004140a 0xc42004140b}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{kube-api-access-rkhwn {nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil ProjectedVolumeSource{Sources:[{nil nil nil ServiceAccountTokenProjection{Audience:,ExpirationSeconds:*3607,Path:token,}} {nil nil &ConfigMapProjection{LocalObjectReference:LocalObjectReference{Name:kube-root-ca.crt,},Items:[{ca.crt ca.crt <nil>}],Optional:nil,} nil} {nil &DownwardAPIProjection{Items:[{namespace ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,} nil <nil>}],} nil nil}],DefaultMode:*420,} nil nil nil}}],Containers:[{binpack-1 localhost:32000/cheyang/gpu-player:v2 [] [] [] [] [] {map[aliyun.com/gpu-mem:{{8192 0} {<nil>} 8192 DecimalSI}] map[aliyun.com/gpu-mem:{{8192 0} {<nil>} 8192 DecimalSI}]} [{kube-api-access-rkhwn true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc4200418b0} {node.kubernetes.io/unreachable Exists NoExecute 0xc4200418d0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}
[ debug ] 2022/02/09 17:06:31 cache.go:91: Node map[]
[ debug ] 2022/02/09 17:06:31 cache.go:93: pod binpack-1-5fb868d569-v6hp5 in ns default is not assigned to any node, skip
[ info ] 2022/02/09 17:06:31 controller.go:223: end processNextWorkItem()
[ debug ] 2022/02/09 17:06:31 controller.go:295: No need to update pod name binpack-1-5fb868d569-v6hp5 in ns default and old status is Pending, new status is Pending; its old annotation map[] and new annotation map[]
[ info ] 2022/02/09 17:06:32 controller.go:210: begin processNextWorkItem()
[ debug ] 2022/02/09 17:06:46 controller.go:295: No need to update pod name binpack-1-5fb868d569-v6hp5 in ns default and old status is Pending, new status is Pending; its old annotation map[] and new annotation map[]
[ debug ] 2022/02/09 17:07:16 controller.go:295: No need to update pod name binpack-1-5fb868d569-v6hp5 in ns default and old status is Pending, new status is Pending; its old annotation map[] and new annotation map[]
What is it that I'm missing now? Can anyone suggest?
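The "Node map[]" and "not assigned to any node, skip" lines suggest the extender's cache has no GPU-sharing node to place the pod on, so the request for aliyun.com/gpu-mem can never be satisfied. A few hedged checks, assuming the device plugin is supposed to advertise that resource on ubuntu1:

# is the device plugin pod actually running on the GPU node?
kubectl get pods -n kube-system -o wide | grep gpushare-device-plugin

# does the node advertise the extended resource the binpack pod requests?
kubectl describe node ubuntu1 | grep -i gpu-mem

# is the default scheduler actually calling the extender?
# (verify kube-scheduler was restarted with the extender policy/config the install guide adds;
#  on microk8s its arguments typically live under /var/snap/microk8s/current/args/kube-scheduler)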
Yeah... can someone suggest what I've missed?
How do I fix the Pending state?
Just delete the nodeSelector in gpushare-schd-extender.yaml.
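For anyone hitting the same thing, a sketch of the relevant part of the deployment spec, with values inferred from the describe output earlier in this thread (the file in the repo may differ slightly). Removing the nodeSelector lets the pod land on any schedulable node; alternatively, keep it and label the node you want the extender to run on:

      nodeSelector:                          # delete this block, or label the node to match
        node-role.kubernetes.io/master: ""
      containers:
        - name: gpushare-schd-extender
          image: localhost:32000/k8s-gpushare-schd-extender:1.11-d170d8a
          env:
            - name: LOG_LEVEL
              value: debug
            - name: PORT
              value: "12345"

As far as the design goes, the extender is an HTTP service that the kube-scheduler calls through its Service, so it does not itself need to sit on a GPU node.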
How did you set this up on microk8s?
Just delete the nodeSelector in gpushare-schd-extender.yaml.
Does removing the nodeSelector affect usage, and on which node should this component run?