volcano icon indicating copy to clipboard operation
volcano copied to clipboard

The task-topology plugin cannot handle the tasks whose name contains `-`

Open loheagn opened this issue 2 years ago • 4 comments

What happened:

Assume that we have a kubernetes cluster with two nodes and a simple job with topology annotations:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    volcano.sh/task-topology-affinity: "nginx-worker,ubuntu-worker"
    volcano.sh/task-topology-anti-affinity: "nginx-worker"
  name: example-job-1
spec:
  minAvailable: 5
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 2
      name: nginx-worker
      template:
        spec:
          containers:
            - image: nginx
              name: nginx-main
          restartPolicy: OnFailure
    - replicas: 3
      name: ubuntu-worker
      template:
        spec:
          containers:
            - command:
                - sleep
                - inf
              image: ubuntu
              name: ubuntu-main
          restartPolicy: OnFailure

The names of the tasks in the job all contain char -.

And the scheduler config file:

actions: "enqueue, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: task-topology

After applying the job yaml, the task-topology plugin not works and we get the error log like:

I0627 12:21:06.479367   43679 topology.go:342] start to init task topology plugin, weight[1], defined order map[0:4 1:1 2:2 3:3]
I0627 12:21:06.479469   43679 topology.go:286] Job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3> affinity key invalid: task nginx-worker do not exist in job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3>.
I0627 12:21:06.479489   43679 topology.go:311] Job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3> affinity key invalid: task nginx-worker do not exist in job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3>.
I0627 12:21:06.479501   43679 topology.go:227] Failed to read task topology from job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3> annotations, error: task nginx-worker do not exist in job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3>.

And the pods are not scheduled as expect:

image

What you expected to happen:

The task-topology should work well and the nginx-worker pods should be scheduled into different nodes.

How to reproduce it (as minimally and precisely as possible):

As described above.

Anything else we need to know?:

I check the code, and find the code may cause this bug https://github.com/volcano-sh/volcano/blob/ed5c215d415d98845949d0fde6707cab29621989/pkg/scheduler/plugins/task-topology/topology.go#L243-L274

Environment:

  • Volcano Version: v1.7.0
  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"darwin/amd64"} Kustomize Version: v5.0.1 Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:33:12Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: minikube using hyperkit driver on macOS(Intel)
  • Kernel (e.g. uname -a): Linux volcano-demo 5.10.57 #1 SMP Mon Apr 3 23:35:10 UTC 2023 x86_64 GNU/Linux

loheagn avatar Jun 27 '23 05:06 loheagn

If this is a bug not a feature, I'd like to submit a PR to fix it.

loheagn avatar Jun 27 '23 05:06 loheagn

Yes, this looks like a bug, you are welcome to fix it

hwdef avatar Jun 27 '23 06:06 hwdef

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Oct 15 '23 09:10 stale[bot]

/remove-lifecycle-stale

Monokaix avatar Feb 06 '24 06:02 Monokaix