training-operator
Unexpected behavior when a job's replicas is set to 0
I tested with the following YAML, in which the replicas of the Worker was set to 0. My expectation is that there should be no Worker at all, but the actual behavior, as shown in the attached picture, is that the operator creates one Worker Pod, then deletes it, then recreates it, then deletes it again, and so on.
Referring to point 4 of https://github.com/kubeflow/training-operator/issues/1703
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-test-replicas
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: alpine:latest
              command: ["sleep", "365d"]
    Worker:
      replicas: 0
      template:
        spec:
          containers:
            - name: pytorch
              image: alpine:latest
              command: ["sleep", "365d"]
```

The reason I found is in the code linked below: the variable `size` is initialized to 0, but it should be initialized to -1.
Of course, there are other solutions. For example, the whole Worker section could be removed when its replicas is 0 (this must not update the job spec stored in etcd, only the controller's internal copy of the spec).
We can discuss which solution is better.
https://github.com/kubeflow/common/blob/21910a93c4ed8d8338d9d7414067f888801dd0bc/pkg/core/pod.go#L51
https://github.com/kubeflow/common/blob/21910a93c4ed8d8338d9d7414067f888801dd0bc/pkg/core/service.go#L53
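For illustration, here is a minimal, self-contained sketch of the kind of slice-size calculation those two lines perform and of the proposed -1 fix. The function names, signatures, and label handling below are paraphrased assumptions for this example, not copied from kubeflow/common:

```go
package main

import (
	"fmt"
	"strconv"
)

// calculateSliceSize paraphrases the logic behind the linked pod.go/service.go
// lines: the expected slice size is derived from the highest replica index seen
// on existing pods and from the declared replica count.
func calculateSliceSize(podIndexLabels []string, replicas int) int {
	size := 0 // current behavior: starts at 0
	for _, label := range podIndexLabels {
		if index, err := strconv.Atoi(label); err == nil && index > size {
			size = index
		}
	}
	// +1 turns the highest index into a count, so with no pods and replicas == 0
	// this returns max(0+1, 0) == 1: one Worker pod gets created anyway, which
	// matches the reported create/delete loop.
	if size+1 > replicas {
		return size + 1
	}
	return replicas
}

// calculateSliceSizeFixed starts size at -1, so an empty pod list with
// replicas == 0 resolves to max(-1+1, 0) == 0 and no Worker pod is created.
func calculateSliceSizeFixed(podIndexLabels []string, replicas int) int {
	size := -1
	for _, label := range podIndexLabels {
		if index, err := strconv.Atoi(label); err == nil && index > size {
			size = index
		}
	}
	if size+1 > replicas {
		return size + 1
	}
	return replicas
}

func main() {
	fmt.Println(calculateSliceSize(nil, 0))      // 1 -> spurious Worker pod
	fmt.Println(calculateSliceSizeFixed(nil, 0)) // 0 -> matches the expectation
}
```

With the alternative approach, the controller would instead drop (or skip reconciling) any replica type whose replicas is 0 from its in-memory copy of the job before these sizes are computed, which would confine the change to the operator instead of the shared kubeflow/common library.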
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen