volcano icon indicating copy to clipboard operation
volcano copied to clipboard

vcjob的task名称包含 - 时, task-topolopy 插件会误判而失效。

Open jichangyun opened this issue 1 year ago • 15 comments

代码位置: https://github.com/volcano-sh/volcano/blob/f6e0a52847fce282210d9b41655248d4bc17b21a/pkg/scheduler/plugins/task-topology/topology.go#L243-L275

问题点: https://github.com/volcano-sh/volcano/blob/f6e0a52847fce282210d9b41655248d4bc17b21a/pkg/scheduler/plugins/task-topology/topology.go#L251

task 名称包含 - 时, tmpStrings := strings.Split(task.Name, "-") 切割的结果会非期望, 放进 taskRef 的 key 就不对了,下面 check 的时候就会报错 “task %s do not exist in job <%s/%s>”

jichangyun avatar May 15 '24 08:05 jichangyun

It has been fixed by https://github.com/volcano-sh/volcano/pull/2940 :)

Monokaix avatar May 15 '24 09:05 Monokaix

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。

打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 Job 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。 https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372

jichangyun avatar May 16 '24 02:05 jichangyun

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。

打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。

https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372

Is it a vcjob or deplopyemnt?

Monokaix avatar May 16 '24 03:05 Monokaix

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。 打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。 https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372

Is it a vcjob or deplopyemnt?

vcjob

jichangyun avatar May 16 '24 03:05 jichangyun

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。 打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。 https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372

Is it a vcjob or deplopyemnt?

vcjob

前面写错了, jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,后面的 uuid 是 vcjob 的UID。

jichangyun avatar May 16 '24 03:05 jichangyun

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。 打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。 https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372

Is it a vcjob or deplopyemnt?

vcjob

前面写错了, jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,后面的 uuid 是 vcjob 的UID。

can you paste the whole job name and pod name and job yaml?

Monokaix avatar May 16 '24 04:05 Monokaix

It has been fixed by #2940 :)

改动好像有点问题,把改动合入到 1.7.0 版本后仍有问题。 打日志看到 jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,有 uuid 后缀,这个 uuid 是 podgroup 的 uid 看了 master 分支的代码, SetPodGroup 时,会置 ji.Name = pg.Name 。 https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372

Is it a vcjob or deplopyemnt?

vcjob

前面写错了, jobNamePrefix 为 test-xytsyh2-d6d4a2e5-1ddb-4d1d-98ec-8eb97b054e63- ,后面的 uuid 是 vcjob 的UID。

can you paste the whole job name and pod name and job yaml?

# kubectl get pods
NAME                            READY   STATUS                 RESTARTS         AGE
test-binpack-xytsyh2-task-1-0   0/1     CreateContainerError   0                9m1s
test-binpack-xytsyh2-task-2-0   0/1     CreateContainerError   0                9m1s

# kubectl get podgroup
NAME                                                        STATUS    MINMEMBER   RUNNINGS   AGE
test-binpack-xytsyh2-d7e821cc-2e6c-4b5a-9818-f3efca39e5da   Running   2                      59m

# 
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  creationTimestamp: "2024-05-16T03:21:00Z"
  generation: 1
  name: test-binpack-xytsyh2
  namespace: default
  resourceVersion: "1025579"
  uid: d7e821cc-2e6c-4b5a-9818-f3efca39e5da
spec:
  maxRetry: 3
  minAvailable: 2
  queue: default
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: task-1
    replicas: 1
    template:
      metadata: {}
      spec:
        containers:
        - image: alpine
          imagePullPolicy: IfNotPresent
          name: test
        restartPolicy: OnFailure
  - maxRetry: 3
    minAvailable: 1
    name: task-2
    replicas: 1
    template:
      metadata: {}
      spec:
        containers:
        - image: alpine
          imagePullPolicy: IfNotPresent
          name: test
        restartPolicy: OnFailure
status:
  conditions:
  - lastTransitionTime: "2024-05-16T03:21:02Z"
    status: Pending
  minAvailable: 2
  pending: 2
  state:
    lastTransitionTime: "2024-05-16T03:21:02Z"
    phase: Pending
  taskStatusCount:
    task-1:
      phase:
        Pending: 1
    task-2:
      phase:
        Pending: 1

jichangyun avatar May 16 '24 04:05 jichangyun

/good-first-issue

Monokaix avatar May 16 '24 07:05 Monokaix

@Monokaix: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

volcano-sh-bot avatar May 16 '24 07:05 volcano-sh-bot

@loheagn can you take a look?

Monokaix avatar May 17 '24 04:05 Monokaix

@jichangyun Have you set affinity rules in job's annotation?

Monokaix avatar May 17 '24 04:05 Monokaix

@jichangyun Have you set affinity rules in job's annotation?

恩,是的,就是 task-1 和 task-2 反亲和。

大概看了下,问题点应该在这里, SetPodGroup 时,会置 ji.Name = pg.Name ,而 pg.Name 是 job.Name + "-" + job.UID . https://github.com/volcano-sh/volcano/blob/016d2152ffe1b2f13d69462516ab015baee3c147/pkg/scheduler/api/job_info.go#L366-L372

私有项目就先把 topology.go 文件中的 var jobNamePrefix = job.Name + "-" 改成 var jobNamePrefix = job.Name[:len(job.Name)-36] 了,自测没问题。 但是,个人认为 job_info.go 中的 置 ji.Name = pg.Name 不太合适,但是目前研究还不够,不敢直接改 job_info.go ,所以先不提 PR 过来了。

jichangyun avatar May 17 '24 04:05 jichangyun

个人认为 job_info.go 中的 置 ji.Name = pg.Name 不太合适

ji.Name = pg.Name没有问题,本来这俩就是相同的。问题在于controller自动生成的podgroup为啥还要带一个UID,有的又没有。这个UID生成我没找到在哪,我认为没必要有。

lowang-bh avatar May 17 '24 13:05 lowang-bh

So the current issue is that the full task name in topology.go is not necessarily formatted as ${job name}-${task name}-${index}, which might lead to the patch in #2940 not being able to retrieve the correct task name?

loheagn avatar May 17 '24 13:05 loheagn

个人认为 job_info.go 中的 置 ji.Name = pg.Name 不太合适

ji.Name = pg.Name没有问题,本来这俩就是相同的。问题在于controller自动生成的podgroup为啥还要带一个UID,有的又没有。这个UID生成我没找到在哪,我认为没必要有。

恩,如果这两个名称是一样的,那应该没问题。 我在 1.7.0 版本上自测,似乎都是 pg.name = job.Name + "-" + job.UID

翻了下 master 分支的代码,好像是在这里。 https://github.com/volcano-sh/volcano/blob/3520a0fe8c4976d5db268e0dcec45fe8f0601781/pkg/controllers/job/job_controller_actions.go#L647-L659

jichangyun avatar May 20 '24 08:05 jichangyun

It's a bug definitely. As @jichangyun said, the name of podgroup created by vc-controller isjob.Name - job.UID, but the job name in task.Name has no uid. It means that the annotation configured does not work now, but in old code #2940 , if the task does not contain "-", we will get the right task name no matter what the jobNamePrefix is.

JesseStutler avatar Sep 06 '24 03:09 JesseStutler

/assign

JesseStutler avatar Sep 09 '24 12:09 JesseStutler