
Pending status for cow-job

Open pouya-codes opened this issue 5 years ago • 5 comments

I was trying to run the cow-job after setting up the environment with the following commands:

vagrant up && vagrant ssh k8s-master
kubectl apply -f examples/cow.yaml

but when I run kubectl get pods, my cow-job is "Pending":

NAME                           READY   STATUS    RESTARTS   AGE
cow-job                        0/1     Pending   0          13s
wlm-operator-ffddd8795-lz98t   1/1     Running   0          16m

pouya-codes avatar Jul 02 '20 19:07 pouya-codes

Have you figured it out? I'm having the same problem of SlurmJobs not initiating.

It seems like they aren't being assigned to the virtual-kubelet nodes, despite ensuring the virtual kubelets have both labels:

k describe pod cow-job

Name:           cow-job
Namespace:      default
Priority:       0
Node:           <none>
Labels:         <none>
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  SlurmJob/cow
Containers:
  jt1:
    Image:        no-image
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-b86xw (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-b86xw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-b86xw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  type=virtual-kubelet
                 wlm.sylabs.io/containers=singularity
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 virtual-kubelet.io/provider=wlm:NoSchedule
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/7 nodes are available: 7 node(s) didn't match node selector.

k get nodes --show-labels

NAME                   STATUS   ROLES    AGE   VERSION          LABELS
qpod3-cn01             Ready    <none>   10d   v1.17.4          beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-cn01,kubernetes.io/os=linux
qpod3-cn02             Ready    <none>   10d   v1.17.4          beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-cn02,kubernetes.io/os=linux
qpod3-cn03             Ready    <none>   10d   v1.17.4          beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-cn03,kubernetes.io/os=linux
qpod3-k8s-master       Ready    master   10d   v1.17.4          beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=qpod3-k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
slurm-qpod3-cn01-cpn   Ready    agent    49m   v1.13.1-vk-N/A   alpha.service-controller.kubernetes.io/exclude-balancer=true,beta.kubernetes.io/os=linux,kubernetes.io/hostname=slurm-qpod3-cn01-cpn,kubernetes.io/os=linux,kubernetes.io/role=agent,type=virtual-kubelet,wlm.sylabs.io/containers=singularity
slurm-qpod3-cn02-cpn   Ready    agent    49m   v1.13.1-vk-N/A   alpha.service-controller.kubernetes.io/exclude-balancer=true,beta.kubernetes.io/os=linux,kubernetes.io/hostname=slurm-qpod3-cn02-cpn,kubernetes.io/os=linux,kubernetes.io/role=agent,type=virtual-kubelet,wlm.sylabs.io/containers=singularity
slurm-qpod3-cn03-cpn   Ready    agent    49m   v1.13.1-vk-N/A   alpha.service-controller.kubernetes.io/exclude-balancer=true,beta.kubernetes.io/os=linux,kubernetes.io/hostname=slurm-qpod3-cn03-cpn,kubernetes.io/os=linux,kubernetes.io/role=agent,type=virtual-kubelet,wlm.sylabs.io/containers=singularity
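
For reference, the same check can be done with a label-filtered query instead of scanning the full label list; this is plain kubectl, nothing wlm-operator-specific:

kubectl get nodes -l type=virtual-kubelet,wlm.sylabs.io/containers=singularity

If the three slurm-* agent nodes show up in that output, the labels themselves match the pod's Node-Selectors and the scheduling failure is coming from somewhere else.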

adamwoolhether avatar Dec 11 '20 03:12 adamwoolhether

Hi, yeah, you need to change the nodeSelector in your cow-job config file to point at one of the existing virtual-kubelet nodes (e.g., slurm-qpod3-cn01-cpn).
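
For anyone else landing here, the relevant part of the job config looks roughly like this; this is a sketch based on the examples/cow.yaml shipped with the repo, and the exact field layout may differ between wlm-operator versions:

apiVersion: wlm.sylabs.io/v1alpha1
kind: SlurmJob
metadata:
  name: cow
spec:
  batch: |
    #!/bin/sh
    # your Slurm batch script, e.g. srun singularity run ...
  nodeSelector:
    # pin the generated pod to a specific virtual-kubelet node
    kubernetes.io/hostname: slurm-qpod3-cn01-cpn

If I understand the operator correctly, the nodeSelector from the SlurmJob spec is copied onto the pod it generates (in addition to the type=virtual-kubelet selector), so it has to match labels that actually exist on one of the slurm-* agent nodes.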

pouya-codes avatar Dec 16 '20 00:12 pouya-codes

@pisarukv I really appreciate the response. I assume you're referring to the "virtual-kubelet" node?

The SlurmJob's pod still isn't being assigned to any node, even after adding kubernetes.io/hostname: slurm-qpod3-cn03-cp to the YAML. No matter how congruent the selectors are, the pod still fails scheduling, citing no matching node selectors.
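
One quick sanity check (standard kubectl, not specific to wlm-operator) is to print the selectors the generated pod actually carries and compare them against the node labels:

kubectl get pod cow-job -o jsonpath='{.spec.nodeSelector}'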

If it's not too much trouble, would you mind showing me the output of the following commands?

kubectl get nodes -o wide --show-labels
kubectl describe pods cow-job
kubectl describe slurmjobs.wlm.sylabs.io cow
kubectl logs wlm-operator.......

Thanks again.

adamwoolhether avatar Dec 16 '20 04:12 adamwoolhether

Yes, I'm referring to the virtual-kubelet nodes. I've attached the output of the commands you mentioned: logs.txt, describeSlurmCow.txt, describePods.txt, getNodes.txt

pouya-codes avatar Dec 17 '20 20:12 pouya-codes

Many thanks! I think my issue may have stemmed from the fact that I was running the k8s master and the Slurm master (with slurmctld) on the same node. I've set up a separate test environment apart from our dev environment and got it working.

Thanks again!

adamwoolhether avatar Dec 21 '20 00:12 adamwoolhether