training-operator
Unexpected behavior when a job's replicas is set to 0
I tested with the following YAML, in which the replicas of the Worker was set to 0. My expectation is that there should be no Worker at all, but the actual behavior, as shown in the attached picture, is that the operator creates one Worker Pod, then deletes it, then recreates it, then deletes it again, and so on.
Referring to point 4 of https://github.com/kubeflow/training-operator/issues/1703
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-test-replicas
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: alpine:latest
              command: ["sleep", "365d"]
    Worker:
      replicas: 0
      template:
        spec:
          containers:
            - name: pytorch
              image: alpine:latest
              command: ["sleep", "365d"]
```

The reason I found is in the code linked below: the variable `size` is initialized to 0, but it should be initialized to -1.
Of course, there are other solutions. For example, the whole Worker section could be removed when its replicas is 0 (this must not update the job spec stored in etcd, only the controller's internal copy of the spec).
We can discuss which solution is better.
https://github.com/kubeflow/common/blob/21910a93c4ed8d8338d9d7414067f888801dd0bc/pkg/core/pod.go#L51
https://github.com/kubeflow/common/blob/21910a93c4ed8d8338d9d7414067f888801dd0bc/pkg/core/service.go#L53
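For illustration, here is a minimal, self-contained sketch of the kind of slice-size calculation those two lines perform and of the proposed -1 fix. The function names, signatures, and label handling below are paraphrased assumptions for this example, not copied from kubeflow/common:

```go
package main

import (
	"fmt"
	"strconv"
)

// calculateSliceSize paraphrases the logic behind the linked pod.go/service.go
// lines: the expected slice size is derived from the highest replica index seen
// on existing pods and from the declared replica count.
func calculateSliceSize(podIndexLabels []string, replicas int) int {
	size := 0 // current behavior: starts at 0
	for _, label := range podIndexLabels {
		if index, err := strconv.Atoi(label); err == nil && index > size {
			size = index
		}
	}
	// +1 turns the highest index into a count, so with no pods and replicas == 0
	// this returns max(0+1, 0) == 1: one Worker pod gets created anyway, which
	// matches the reported create/delete loop.
	if size+1 > replicas {
		return size + 1
	}
	return replicas
}

// calculateSliceSizeFixed starts size at -1, so an empty pod list with
// replicas == 0 resolves to max(-1+1, 0) == 0 and no Worker pod is created.
func calculateSliceSizeFixed(podIndexLabels []string, replicas int) int {
	size := -1
	for _, label := range podIndexLabels {
		if index, err := strconv.Atoi(label); err == nil && index > size {
			size = index
		}
	}
	if size+1 > replicas {
		return size + 1
	}
	return replicas
}

func main() {
	fmt.Println(calculateSliceSize(nil, 0))      // 1 -> spurious Worker pod
	fmt.Println(calculateSliceSizeFixed(nil, 0)) // 0 -> matches the expectation
}
```

With the alternative approach, the controller would instead drop (or skip reconciling) any replica type whose replicas is 0 from its in-memory copy of the job before these sizes are computed, which would confine the change to the operator instead of the shared kubeflow/common library.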
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen