The task-topology plugin cannot handle tasks whose names contain `-`
What happened:
Assume we have a Kubernetes cluster with two nodes and a simple job with topology annotations:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    volcano.sh/task-topology-affinity: "nginx-worker,ubuntu-worker"
    volcano.sh/task-topology-anti-affinity: "nginx-worker"
  name: example-job-1
spec:
  minAvailable: 5
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 2
      name: nginx-worker
      template:
        spec:
          containers:
            - image: nginx
              name: nginx-main
          restartPolicy: OnFailure
    - replicas: 3
      name: ubuntu-worker
      template:
        spec:
          containers:
            - command:
                - sleep
                - inf
              image: ubuntu
              name: ubuntu-main
          restartPolicy: OnFailure
The names of all tasks in the job contain the character `-`.
The scheduler config file is:
actions: "enqueue, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: task-topology
After applying the job YAML, the task-topology plugin does not work, and the scheduler log shows errors like:
I0627 12:21:06.479367 43679 topology.go:342] start to init task topology plugin, weight[1], defined order map[0:4 1:1 2:2 3:3]
I0627 12:21:06.479469 43679 topology.go:286] Job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3> affinity key invalid: task nginx-worker do not exist in job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3>.
I0627 12:21:06.479489 43679 topology.go:311] Job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3> affinity key invalid: task nginx-worker do not exist in job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3>.
I0627 12:21:06.479501 43679 topology.go:227] Failed to read task topology from job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3> annotations, error: task nginx-worker do not exist in job <default/example-job-1-506fbf7a-e9c1-4ae4-b37a-60b1d67f5ee3>.
And the pods are not scheduled as expected.
What you expected to happen:
The task-topology plugin should work, and the nginx-worker pods should be scheduled onto different nodes.
How to reproduce it (as minimally and precisely as possible):
As described above.
Anything else we need to know?:
I checked the code and found that the following code may cause this bug: https://github.com/volcano-sh/volcano/blob/ed5c215d415d98845949d0fde6707cab29621989/pkg/scheduler/plugins/task-topology/topology.go#L243-L274
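To illustrate the suspected cause, here is a minimal standalone sketch. It is not the plugin's actual code. It assumes that pod names follow the pattern `<job>-<task>-<index>` and that the linked code recovers the task name by splitting the pod name on `-` and taking the second-to-last segment. The pod name and both helper functions (`taskNameBySplit`, `taskNameByTrim`) are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// taskNameBySplit mimics the ASSUMED lookup: split the pod name on "-"
// and take the second-to-last segment as the task name.
func taskNameBySplit(podName string) string {
	parts := strings.Split(podName, "-")
	if len(parts) < 2 {
		return ""
	}
	return parts[len(parts)-2]
}

// taskNameByTrim is one HYPOTHETICAL alternative: strip the known job-name
// prefix and the trailing "-<index>" suffix, keeping any "-" inside the
// task name itself.
func taskNameByTrim(jobName, podName string) string {
	rest := strings.TrimPrefix(podName, jobName+"-")
	idx := strings.LastIndex(rest, "-")
	if idx < 0 {
		return rest
	}
	return rest[:idx]
}

func main() {
	// Assumed pod name for replica 0 of task "nginx-worker" in job "example-job-1".
	pod := "example-job-1-nginx-worker-0"

	fmt.Println(taskNameBySplit(pod))                 // worker
	fmt.Println(taskNameByTrim("example-job-1", pod)) // nginx-worker
}
```

Under these assumptions, the split-based lookup registers the task as `worker`, so the annotation value `nginx-worker` is not found. That would be consistent with the "task nginx-worker do not exist in job" messages in the log. Reading the task name from the task spec, where available, would avoid parsing pod names altogether.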
Environment:
- Volcano Version: v1.7.0
- Kubernetes version (use `kubectl version`): Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"darwin/amd64"} Kustomize Version: v5.0.1 Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:33:12Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: minikube using hyperkit driver on macOS (Intel)
- Kernel (e.g. `uname -a`): Linux volcano-demo 5.10.57 #1 SMP Mon Apr 3 23:35:10 UTC 2023 x86_64 GNU/Linux
If this is a bug and not a feature, I'd like to submit a PR to fix it.
Yes, this looks like a bug. You are welcome to fix it.
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
/remove-lifecycle-stale