Pending status for cow-job
I was trying to run the cow-job after setting up the environment with the following commands:
vagrant up && vagrant ssh k8s-master
kubectl apply -f examples/cow.yaml
but when I run kubectl get pods, the cow-job is stuck in "Pending":
NAME READY STATUS RESTARTS AGE
cow-job 0/1 Pending 0 13s
wlm-operator-ffddd8795-lz98t 1/1 Running 0 16m
Have you figured it out? I'm having the same problem of SlurmJobs not starting.
It seems like they aren't being assigned to the virtual kubelets, even though I've made sure the virtual-kubelet nodes have both labels:
k describe pod cow-job
Name: cow-job
Namespace: default
Priority: 0
Node: <none>
Labels: <none>
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: SlurmJob/cow
Containers:
jt1:
Image: no-image
Port: <none>
Host Port: <none>
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-b86xw (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-b86xw:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-b86xw
Optional: false
QoS Class: BestEffort
Node-Selectors: type=virtual-kubelet
wlm.sylabs.io/containers=singularity
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
virtual-kubelet.io/provider=wlm:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/7 nodes are available: 7 node(s) didn't match node selector.
k get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
qpod3-cn01 Ready <none> 10d v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-cn01,kubernetes.io/os=linux
qpod3-cn02 Ready <none> 10d v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-cn02,kubernetes.io/os=linux
qpod3-cn03 Ready <none> 10d v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-cn03,kubernetes.io/os=linux
qpod3-k8s-master Ready master 10d v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
slurm-qpod3-cn01-cpn Ready agent 49m v1.13.1-vk-N/A alpha.service-controller.kubernetes.io/exclude-balancer=true,beta.kubernetes.io/os=linux,kubernetes.io/hostname=slurm-qpod3-cn01-cpn,kubernetes.io/os=linux,kubernetes.io/role=agent,type=virtual-kubelet,wlm.sylabs.io/containers=singularity
slurm-qpod3-cn02-cpn Ready agent 49m v1.13.1-vk-N/A alpha.service-controller.kubernetes.io/exclude-balancer=true,beta.kubernetes.io/os=linux,kubernetes.io/hostname=slurm-qpod3-cn02-cpn,kubernetes.io/os=linux,kubernetes.io/role=agent,type=virtual-kubelet,wlm.sylabs.io/containers=singularity
slurm-qpod3-cn03-cpn Ready agent 49m v1.13.1-vk-N/A alpha.service-controller.kubernetes.io/exclude-balancer=true,beta.kubernetes.io/os=linux,kubernetes.io/hostname=slurm-qpod3-cn03-cpn,kubernetes.io/os=linux,kubernetes.io/role=agent,type=virtual-kubelet,wlm.sylabs.io/containers=singularity
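For a quick sanity check, a label-selector query should list only the nodes carrying both labels from the pod's Node-Selectors (the labels here are copied from the describe output above):
kubectl get nodes -l type=virtual-kubelet,wlm.sylabs.io/containers=singularity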
Hi, yeah, you need to change the nodeSelector in your cow-job config file so that it targets one of the existing nodes (e.g., slurm-qpod3-cn01-cpn).
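Roughly something like this in examples/cow.yaml (a minimal sketch of just the relevant part, assuming the SlurmJob spec layout from the repo's example; keep the rest of your spec as-is):
spec:
  nodeSelector:
    kubernetes.io/hostname: slurm-qpod3-cn01-cpn  # one of the virtual-kubelet node names from kubectl get nodes
Then re-apply with kubectl apply -f examples/cow.yaml and check kubectl get pods again.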
@pisarukv I really appreciate the response. I assume you're referring to the "virtual-kubelet" nodes?
The SlurmJob's pod still isn't being assigned to any node, even after adding kubernetes.io/hostname: slurm-qpod3-cn03-cp to the YAML. No matter how congruent the labels are, the pod still fails scheduling, citing no matching node selectors.
If it's not too much trouble, would you mind showing me the output for the following commands?
kubectl get nodes -o wide --show-labels
kubectl describe pods cow-job
kubectl describe slurmjobs.wlm.sylabs.io cow
kubectl logs wlm-operator.......
Thanks again.
Yes, I'm referring to the virtual-kubelet nodes. I've attached the output of the commands you mentioned: logs.txt, describeSlurmCow.txt, describePods.txt, getNodes.txt
Many thanks! I think my issue may have stemmed from the fact that I was running the k8s master and the Slurm master (with slurmctld) on the same node. I've set up a test environment separate from our dev environment and got it working.
Thanks again!