volcano
volcano copied to clipboard
vcjob的task名称包含 - 时, task-topolopy 插件会误判而失效。
代码位置: https://github.com/volcano-sh/volcano/blob/f6e0a52847fce282210d9b41655248d4bc17b21a/pkg/scheduler/plugins/task-topology/topology.go#L243-L275
问题点: https://github.com/volcano-sh/volcano/blob/f6e0a52847fce282210d9b41655248d4bc17b21a/pkg/scheduler/plugins/task-topology/topology.go#L251
task 名称包含 - 时, tmpStrings := strings.Split(task.Name, "-") 切割的结果会非期望, 放进 taskRef 的 key 就不对了,下面 check 的时候就会报错 “task %s do not exist in job <%s/%s>”
It has been fixed by https://github.com/volcano-sh/volcano/pull/2940 :)
It has been fixed by #2940 :)
改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。
打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 Job 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。 https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372
It has been fixed by #2940 :)
改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。
打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。
https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372
Is it a vcjob or deplopyemnt?
It has been fixed by #2940 :)
改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。 打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。 https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372
Is it a vcjob or deplopyemnt?
vcjob
It has been fixed by #2940 :)
改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。 打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。 https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372
Is it a vcjob or deplopyemnt?
vcjob
前面写错了, jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,后面的 uuid 是 vcjob 的UID。
It has been fixed by #2940 :)
改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。 打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。 https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372
Is it a vcjob or deplopyemnt?
vcjob
前面写错了, jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,后面的 uuid 是 vcjob 的UID。
can you paste the whole job name and pod name and job yaml?
It has been fixed by #2940 :)
改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。 打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。 https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372
Is it a vcjob or deplopyemnt?
vcjob
前面写错了, jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,后面的 uuid 是 vcjob 的UID。
can you paste the whole job name and pod name and job yaml?
# kubectl get pods
NAME READY STATUS RESTARTS AGE
test-binpack-xytsyh2-task-1-0 0/1 CreateContainerError 0 9m1s
test-binpack-xytsyh2-task-2-0 0/1 CreateContainerError 0 9m1s
# kubectl get podgroup
NAME STATUS MINMEMBER RUNNINGS AGE
test-binpack-xytsyh2-d7e821cc-2e6c-4b5a-9818-f3efca39e5da Running 2 59m
#
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
creationTimestamp: "2024-05-16T03:21:00Z"
generation: 1
name: test-binpack-xytsyh2
namespace: default
resourceVersion: "1025579"
uid: d7e821cc-2e6c-4b5a-9818-f3efca39e5da
spec:
maxRetry: 3
minAvailable: 2
queue: default
schedulerName: volcano
tasks:
- maxRetry: 3
minAvailable: 1
name: task-1
replicas: 1
template:
metadata: {}
spec:
containers:
- image: alpine
imagePullPolicy: IfNotPresent
name: test
restartPolicy: OnFailure
- maxRetry: 3
minAvailable: 1
name: task-2
replicas: 1
template:
metadata: {}
spec:
containers:
- image: alpine
imagePullPolicy: IfNotPresent
name: test
restartPolicy: OnFailure
status:
conditions:
- lastTransitionTime: "2024-05-16T03:21:02Z"
status: Pending
minAvailable: 2
pending: 2
state:
lastTransitionTime: "2024-05-16T03:21:02Z"
phase: Pending
taskStatusCount:
task-1:
phase:
Pending: 1
task-2:
phase:
Pending: 1
/good-first-issue
@Monokaix: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.
In response to this:
/good-first-issue
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@loheagn can you take a look?
@jichangyun Have you set affinity rules in job's annotation?
@jichangyun Have you set affinity rules in job's annotation?
恩,是的,就是 task-1 和 task-2 反亲和。
大概看了下,问题点应该在这里, SetPodGroup 时,会置 ji.Name = pg.Name ,而 pg.Name 是 job.Name + "-" + job.UID . https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372
私有项目就先把 topology.go 文件中的 var jobNamePrefix = job.Name + "-" 改成 var jobNamePrefix = job.Name[:len(job.Name)-36] 了,自测没问题。
但是,个人认为 job_info.go 中的 置 ji.Name = pg.Name 不太合适,但是目前研究还不够,不敢直接改 job_info.go ,所以先不提 PR 过来了。
个人认为 job_info.go 中的 置
ji.Name = pg.Name不太合适
ji.Name = pg.Name没有问题,本来这俩就是相同的。问题在于controller自动生成的podgroup为啥还要带一个UID,有的又没有。这个UID生成我没找到在哪,我认为没必要有。
So the current issue is that the full task name in topology.go is not necessarily formatted as ${job name}-${task name}-${index}, which might lead to the patch in #2940 not being able to retrieve the correct task name?
个人认为 job_info.go 中的 置
ji.Name = pg.Name不太合适ji.Name = pg.Name没有问题,本来这俩就是相同的。问题在于controller自动生成的podgroup为啥还要带一个UID,有的又没有。这个UID生成我没找到在哪,我认为没必要有。
恩,如果这两个名称是一样的,那应该没问题。 我在 1.7.0 版本上自测,似乎都是 pg.name = job.Name + "-" + job.UID
翻了下 master 分支的代码,好像是在这里。 https://github.com/volcano-sh/volcano/blob/3520a0fe8c4976d5db268e0dcec45fe8f0601781/pkg/controllers/job/job_controller_actions.go#L647-L659
It's a bug definitely. As @jichangyun said, the name of podgroup created by vc-controller isjob.Name - job.UID, but the job name in task.Name has no uid. It means that the annotation configured does not work now, but in old code #2940 , if the task does not contain "-", we will get the right task name no matter what the jobNamePrefix is.
/assign