
k8s: check and possibly optimise the launch of pending pods

Open · tiborsimko opened this issue 1 year ago • 2 comments

In a cluster running many concurrent workflows, which generated many jobs, a situation arose where many jobs were Pending because the cluster did not have enough memory resources to run them all.

For example, here is one snapshot in time:

$ kgp | grep reana-run-j | grep -c Running
110

$ kgp | grep reana-run-j | grep -c Pending
71

This means that only about 60% of the jobs were running, while the remaining 40% were Pending (some of them for many hours).
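
For reference, kgp here is a local shell alias, presumably for kubectl get pods -o wide (judging from the wide output shown further below). If so, the same counts can be obtained without the alias by filtering directly on the pod phase, for example:

$ kubectl get pods -o wide --field-selector=status.phase=Running | grep -c reana-run-j

$ kubectl get pods -o wide --field-selector=status.phase=Pending | grep -c reana-run-j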

Some nodes were really busy, for example:

$ kubectl top nodes -l reana.io/system=runtimejobs
NAME                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
mycluster-node-12   4175m        52%    10308Mi         73%

$ kgp | grep node-12
reana-run-job-bda3d5d0-23da-494d-9973-5992707fed3f-hg8js          1/1     Running     0          11h     10.100.164.56    mycluster-node-12   <none>           <none>
reana-run-job-cd68d874-5bf5-44ba-a9bf-d4425ed5f466-prslv          1/1     Running     0          4h      10.100.164.12    mycluster-node-12   <none>           <none>
reana-run-job-ea877fa3-e346-4e3b-92d0-d7dabf2ce66b-sm695          1/1     Running     0          15h     10.100.164.63    mycluster-node-12   <none>           <none>
reana-run-job-fdac7a57-7e53-434b-a7e9-0231803bbcfa-8gnd2          1/1     Running     0          11h     10.100.164.14    mycluster-node-12   <none>           <none>

However, other nodes were less so, for example:

$ kubectl top nodes -l reana.io/system=runtimejobs
NAME                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
mycluster-node-33   3156m        39%    8596Mi          61%
mycluster-node-35   2073m        25%    5263Mi          37%

$ kgp | grep node-33
reana-run-job-4f95d722-7a41-456c-9437-4c7842f416aa-fh428          1/1     Running       0          21m     10.100.65.74     mycluster-node-33   <none>           <none>
reana-run-job-751198a7-3596-473a-bb29-7ab9e97456ad-2jscw          1/1     Running       0          63m     10.100.65.66     mycluster-node-33   <none>           <none>
reana-run-job-9ab284e3-2e9e-47f2-be86-37664109fcb4-p2z6p          1/1     Running       0          63m     10.100.65.116    mycluster-node-33   <none>           <none>

$ kgp | grep node-35
reana-run-job-9659c2d7-403a-4392-8b7b-1d1525be3bee-lh9bq          1/1     Running     0          5m13s   10.100.12.239    mycluster-node-35   <none>           <none>
reana-run-job-9c71d885-3e6c-423d-a4ad-d3da2e9df5d8-rb69x          1/1     Running     0          3m42s   10.100.12.244    mycluster-node-35   <none>           <none>

It seems that our Pending pods aren't being consumed as rapidly as they in theory could be (e.g. node-33 and node-35 above had free capacity).
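
One thing to keep in mind when interpreting this (a hedged observation, not a confirmed diagnosis): kubectl top nodes reports live usage, whereas the scheduler places pods according to the containers' memory requests, so a node that looks lightly loaded may still be "full" from the scheduler's point of view. A quick way to check is to compare requested vs allocatable memory on the seemingly idle nodes, e.g.:

$ kubectl describe node mycluster-node-33 | grep -A 8 'Allocated resources'

$ kubectl describe node mycluster-node-35 | grep -A 8 'Allocated resources'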

Here is one such Pending pod described:

$ kubectl describe pod reana-run-job-ea984ce5-6f04-4b92-893e-6d271d2a5454-22gnp | tail -7
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  81m    default-scheduler  0/62 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 14 node(s) didn't match Pod's node affinity/selector, 39 Insufficient memory, 8 node(s) were unschedulable.
  Warning  FailedScheduling  11m    default-scheduler  0/62 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 14 node(s) didn't match Pod's node affinity/selector, 38 Insufficient memory, 8 node(s) were unschedulable.
  Warning  FailedScheduling  5m44s  default-scheduler  0/62 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 14 node(s) didn't match Pod's node affinity/selector, 39 Insufficient memory, 8 node(s) were unschedulable.
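
Since the scheduler's "Insufficient memory" verdict is based on the pod's memory request rather than on actual usage, it may also be worth double-checking what the pending job pods are requesting. A sketch, using the pod described above:

$ kubectl get pod reana-run-job-ea984ce5-6f04-4b92-893e-6d271d2a5454-22gnp -o jsonpath='{.spec.containers[*].resources}'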

Let's verify our Kubernetes cluster settings related to the behaviour of Pending pods and see whether we could make the memory checks and the scheduling of these Pending pods faster.
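
As one concrete starting point for that check (a sketch only, assuming a deployment where kube-scheduler reads a KubeSchedulerConfiguration file; the path below is hypothetical and the values shown are the upstream defaults), the scheduler's pod backoff settings bound how quickly unschedulable pods are retried:

$ cat /etc/kubernetes/scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Backoff applied to pods that failed to schedule (e.g. "Insufficient memory");
# upstream defaults shown for reference.
podInitialBackoffSeconds: 1
podMaxBackoffSeconds: 10

If these are at their defaults, retries are already frequent, which would suggest the bottleneck is the memory requests themselves rather than the scheduling cadence.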

tiborsimko · Dec 04 '22 20:12