
imagelocality.weight does not take effect when using the nodeorder plugin


What happened: imagelocality.weight does not take effect when using the nodeorder plugin.

What you expected to happen: imagelocality.weight takes effect when using the nodeorder plugin.

How to reproduce it (as minimally and precisely as possible):

  1. There are 3 worker nodes in my environment:
[root@host-10-19-37-28 volcano]# kubectl get node
NAME               STATUS   ROLES                  AGE    VERSION
host-10-19-37-27   Ready    <none>                 142d   v1.22.2
host-10-19-37-28   Ready    control-plane,master   147d   v1.22.2
host-10-19-37-29   Ready    <none>                 147d   v1.22.2
host-10-19-37-34   Ready    <none>                 145d   v1.22.2
  2. The job YAML:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vc-gzfjob1
  namespace: test
spec:
  # minAvailable: 0
  schedulerName: volcano
  queue: test
  priorityClassName: high-priority
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 3
      name: gzfjob1
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          priorityClassName: high-priority
          containers:
            - command:
              - sleep
              - 10m
              image: nginx:latest
              name: nginx
              resources:
                requests:
                  cpu: 200m
                limits:
                  cpu: 200m
          restartPolicy: OnFailure
  3. Only host-10-19-37-34 already has the nginx:latest image.
  4. Scheduler config:
      - name: nodeorder
        arguments:
          nodeaffinity.weight: 0
          podaffinity.weight: 0
          leastrequested.weight: 0
          balancedresource.weight: 0
          mostrequested.weight: 0
          tainttoleration.weight: 0
          imagelocality.weight: 100
  5. Deploy the job:
[root@host-10-19-37-28 volcano]# kubectl create -f ./queuejob.yaml
job.batch.volcano.sh/vc-gzfjob1 created
[root@host-10-19-37-28 volcano]# kubectl -n test get po
NAME                   READY   STATUS              RESTARTS   AGE
vc-gzfjob1-gzfjob1-0   0/1     ContainerCreating   0          8s
vc-gzfjob1-gzfjob1-1   0/1     ContainerCreating   0          8s
vc-gzfjob1-gzfjob1-2   0/1     ContainerCreating   0          8s
[root@host-10-19-37-28 volcano]# kubectl -n test get po -o wide
NAME                   READY   STATUS              RESTARTS   AGE   IP       NODE               NOMINATED NODE   READINESS GATES
vc-gzfjob1-gzfjob1-0   0/1     ContainerCreating   0          12s   <none>   host-10-19-37-27   <none>           <none>
vc-gzfjob1-gzfjob1-1   0/1     ContainerCreating   0          12s   <none>   host-10-19-37-27   <none>           <none>
vc-gzfjob1-gzfjob1-2   0/1     ContainerCreating   0          12s   <none>   host-10-19-37-27   <none>           <none>

The scheduler chose node host-10-19-37-27 for all three pods instead.
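
Note: nodeorder combines its priority functions as a weighted sum, so with every other weight set to 0 and imagelocality.weight at 100, image locality alone should decide the ranking, and all three pods should have landed on host-10-19-37-34. Here is a minimal sketch of that kind of weighted combination, with illustrative names rather than Volcano's actual API:

package main

import "fmt"

// scoreFn returns a 0-100 score for one node under one priority function.
type scoreFn func(node string) float64

// weightedPriority pairs a priority function with its configured weight,
// mirroring nodeorder arguments such as imagelocality.weight.
type weightedPriority struct {
	name   string
	weight float64
	score  scoreFn
}

// nodeScore sums weight*score over all priorities. With every weight but
// imagelocality set to 0, only the image-locality score can rank nodes.
func nodeScore(node string, priorities []weightedPriority) float64 {
	total := 0.0
	for _, p := range priorities {
		total += p.weight * p.score(node)
	}
	return total
}

func main() {
	// Hypothetical scores: only host-10-19-37-34 has nginx:latest cached.
	imageLocality := func(node string) float64 {
		if node == "host-10-19-37-34" {
			return 100
		}
		return 0
	}
	priorities := []weightedPriority{
		{name: "leastrequested", weight: 0, score: func(string) float64 { return 50 }},
		{name: "imagelocality", weight: 100, score: imageLocality},
	}
	for _, n := range []string{"host-10-19-37-27", "host-10-19-37-34"} {
		fmt.Printf("%s => %.0f\n", n, nodeScore(n, priorities))
	}
}

Under that weighting, host-10-19-37-34 should win by a wide margin, so landing on host-10-19-37-27 indicates the image-locality score is never being applied.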

Anything else we need to know?:

Environment:

  • Volcano Version: 1.16
  • Kubernetes version (use kubectl version): 1.22
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): CentOS 7.5
  • Kernel (e.g. uname -a): 3.10.0-957.27.2.el7.x86_64
  • Install tools: kubeadm
  • Others:
  • Others:

zhifanggao, Sep 16 '22

Added debug info into session.go:

// in openSession(), log the freshly snapshotted node map
snapshot := cache.Snapshot()
klog.Warningf("3333333333 argument: %v", snapshot.Nodes)

Check the output of the session cache:

W0919 10:47:32.894246       1 session.go:142] 3333333333 argument: map[host-10-19-37-27:Node (host-10-19-37-27): allocatable<cpu 8000.00, memory 33631535104.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00> idle <cpu 6900.00, memory 32348078080.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00>, used <cpu 1100.00, memory 1283457024.00>, releasing <cpu 0.00, memory 0.00>, oversubscribution <cpu 0.00, memory 0.00>, state <phase Ready, reaseon >, oversubscributionNode <false>, offlineJobEvicting <false>,taints <[]>
         0: Task (4b7873f0-d37a-4386-97ab-76daf0a21692:volcano-system/volcano-scheduler-796fbd96b9-pbqmh): job , status Running, pri 2000000000resreq cpu 0.00, memory 0.00, preemptable false, revocableZone , numaInfo { map[]}
         1: Task (8509439a-5694-4c2d-8c39-2549b59e053f:hive-instance-hive1/metastore-server-66f77bd7-6jdkq): job , status Running, pri 1000000resreq cpu 1000.00, memory 1073741824.00, preemptable false, revocableZone , numaInfo { map[]}
         2: Task (42235660-1d07-4065-b9d0-e1cc03d1f517:kube-system/kube-proxy-w4l4h): job , status Running, pri 2000001000resreq cpu 0.00, memory 0.00, preemptable false, revocableZone , numaInfo { map[]}
         3: Task (c848bfca-523c-421a-8c22-42ad6fc46a44:kube-system/weave-net-qfhgt): job , status Running, pri 2000001000resreq cpu 100.00, memory 209715200.00, preemptable false, revocableZone , numaInfo { map[]} host-10-19-37-28:Node (host-10-19-37-28): allocatable<cpu 8000.00, memory 33631531008.00, hugepages-1Gi 0.00, hugepages-2Mi 0.00> idle <cpu 7050.00, memory 33170157568.00, hugepages-2Mi 0.00, hugepages-1Gi 0.00>, used <cpu 950.00, memory 461373440.00>, releasing <cpu 0.00, memory 0.00>, oversubscribution <cpu 0.00, memory 0.00>, state <phase Ready, reaseon >, oversubscributionNode <false>, offlineJobEvicting <false>,taints <[{node-role.kubernetes.io/master  NoSchedule <nil>}]>
         0: Task (5966159e-3b2d-44e2-a87a-5974b70d9cf7:kube-system/kube-apiserver-host-10-19-37-28): job , status Running, pri 2000001000resreq cpu 250.00, memory 0.00, preemptable false, revocableZone , numaInfo { map[]}

It looks like the image information on the nodes is not included in the cache.
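
In other words, the kubelet does publish each node's cached images in node.Status.Images on the *v1.Node object, but nothing copies that field into the scheduler's cached node info. A minimal sketch of the missing copy, assuming a hypothetical NodeInfo struct rather than Volcano's real type:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// NodeInfo stands in for the cached node representation. The real cached
// type at the time had no field for image state at all, which is the bug.
type NodeInfo struct {
	Name   string
	Images []v1.ContainerImage // the data imagelocality needs to score
}

// fromNode builds the cached entry; the buggy code path effectively
// omitted the Images copy, so every snapshot saw zero images per node.
func fromNode(node *v1.Node) *NodeInfo {
	return &NodeInfo{
		Name:   node.Name,
		Images: node.Status.Images,
	}
}

func main() {
	node := &v1.Node{}
	node.Name = "host-10-19-37-34"
	node.Status.Images = []v1.ContainerImage{
		{Names: []string{"docker.io/library/nginx:latest"}, SizeBytes: 142000000},
	}
	fmt.Println(len(fromNode(node).Images), "image(s) carried into the cache")
}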

zhifanggao, Sep 19 '22

@wangyang0616 please take a look at this issue :)

william-wang, Sep 20 '22

Following the reproduction steps provided by @zhifanggao, I confirmed that the imagelocality policy does not take effect.

When the default kube-scheduler is used, the imagelocality policy still does not take effect, so it is suspected that the image-locality scoring logic in Kubernetes itself is incorrect.

An issue has been created in the Kubernetes community: https://github.com/kubernetes/kubernetes/issues/112699

wangyang0616, Sep 24 '22

The cause of the problem has been found: the image name configured in the YAML file does not match the image name recorded on the node.

For example, the image name recorded on the node is docker.io/library/nginx:latest while the image in the YAML file is nginx:latest, and Kubernetes does not do any smart matching of the registry/library prefix. When Kubernetes scores nodes during scheduling, it therefore concludes that the nginx image does not exist on any worker node, so the imagelocality policy has no effect.
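
To make the mismatch concrete, here is a hedged sketch of the literal name lookup described above (illustrative only, not the actual kube-scheduler scoring code):

package main

import "fmt"

// nodeHasImage does a plain string comparison against the image names the
// kubelet reports for a node; no registry or library prefix normalization
// is applied, which is the failure mode described in this thread.
func nodeHasImage(nodeImages []string, want string) bool {
	for _, name := range nodeImages {
		if name == want {
			return true
		}
	}
	return false
}

func main() {
	reported := []string{"docker.io/library/nginx:latest"}
	fmt.Println(nodeHasImage(reported, "nginx:latest"))                   // false: prefix differs
	fmt.Println(nodeHasImage(reported, "docker.io/library/nginx:latest")) // true: exact match
}

Since nginx:latest never string-equals docker.io/library/nginx:latest, the node appears to have no copy of the image and scores 0.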

As a workaround, run docker images to query the local image names, change the image in the YAML file to match the name recorded on the node (e.g. image: docker.io/library/nginx:latest instead of image: nginx:latest), and schedule again.

wangyang0616, Sep 24 '22

Fix PR: https://github.com/volcano-sh/volcano/pull/2512

zhifanggao, Sep 28 '22

The node info arrives from the apiserver as a *v1.Node and is then saved into the scheduler cache -> Snapshot() -> the node map in nodeorder.go. The image information on the nodes exists in the *v1.Node, but it is lost in the scheduler cache, in Snapshot(), and in the node map in nodeorder.go, so the imagelocality score is always 0.

The solution is to save the image information into the scheduler cache, the snapshot, and the node map in nodeorder.go, as sketched below.
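
A minimal sketch of the shape of that fix, with hypothetical Cache and NodeInfo types standing in for Volcano's real ones (the actual change is in the PRs linked in this thread):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// NodeInfo stands in for the cached node type; ImageStates is the data
// that must survive every hop: cache -> Snapshot() -> nodeorder's node map.
type NodeInfo struct {
	Name        string
	ImageStates map[string]int64 // image name -> size in bytes
}

type Cache struct{ nodes map[string]*NodeInfo }

// AddNode copies node.Status.Images into the cache instead of dropping it.
func (c *Cache) AddNode(node *v1.Node) {
	ni := &NodeInfo{Name: node.Name, ImageStates: map[string]int64{}}
	for _, img := range node.Status.Images {
		for _, name := range img.Names {
			ni.ImageStates[name] = img.SizeBytes
		}
	}
	c.nodes[node.Name] = ni
}

// Snapshot deep-copies the node map, keeping ImageStates intact so the
// imagelocality score in nodeorder finally has data to work with.
func (c *Cache) Snapshot() map[string]*NodeInfo {
	out := make(map[string]*NodeInfo, len(c.nodes))
	for name, ni := range c.nodes {
		cp := &NodeInfo{Name: ni.Name, ImageStates: make(map[string]int64, len(ni.ImageStates))}
		for k, v := range ni.ImageStates {
			cp.ImageStates[k] = v
		}
		out[name] = cp
	}
	return out
}

func main() {
	c := &Cache{nodes: map[string]*NodeInfo{}}
	node := &v1.Node{}
	node.Name = "host-10-19-37-34"
	node.Status.Images = []v1.ContainerImage{
		{Names: []string{"docker.io/library/nginx:latest"}, SizeBytes: 142000000},
	}
	c.AddNode(node)
	fmt.Println(c.Snapshot()["host-10-19-37-34"].ImageStates)
}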

zhifanggao, Sep 28 '22

New PR: https://github.com/volcano-sh/volcano/pull/2543

zhifanggao, Oct 19 '22