training-operator icon indicating copy to clipboard operation
training-operator copied to clipboard

Inconsistent implementation about when the validation of job's spec failed

Open HeGaoYuan opened this issue 2 years ago • 8 comments

In below codes, when validate the job's spec failed, the process is different. The MPIJob will return an err, so the MPIJob will not continue to creating corresponding pods/services, it will try again after some time. The PytorchJob/TFJob will just print an error log then continue, but it maybe cause unexpected results in the future.

I think we need to discuss what exactly we should do when we validate job's spec failed then we apply it to all Jobs. In my opinion, it should not continue after validating job's spec failed, and we not only to print error log, but also need to record a warning event so that users can know why their Job is blocking through kubectl describe XXJob.

Referring to point4 of https://github.com/kubeflow/training-operator/issues/1703

https://github.com/kubeflow/training-operator/blob/82af677072bfae01099439d587f2791eb41c08d1/pkg/controller.v1/mpi/mpijob_controller.go#L135-L138

https://github.com/kubeflow/training-operator/blob/82af677072bfae01099439d587f2791eb41c08d1/pkg/controller.v1/pytorch/pytorchjob_controller.go#L133-L135

https://github.com/kubeflow/training-operator/blob/82af677072bfae01099439d587f2791eb41c08d1/pkg/controller.v1/tensorflow/tfjob_controller.go#L158-L160

HeGaoYuan avatar Dec 22 '22 14:12 HeGaoYuan

Thanks for reporting.

Yes. we should not continue if validation fails. Also, recording a warning event is a great idea. Can you fix this?

johnugeorge avatar Dec 22 '22 17:12 johnugeorge

/cc @gaocegege @terrytangyuan

johnugeorge avatar Dec 22 '22 17:12 johnugeorge

Yes, I think MPI controller is doing it correctly.

terrytangyuan avatar Dec 23 '22 13:12 terrytangyuan

@terrytangyuan Since error is returned when Validation fails in MPI, reconcile function will be called again. Ref: https://github.com/kubeflow/training-operator/pull/1705#issuecomment-1363648407

johnugeorge avatar Dec 23 '22 14:12 johnugeorge

Maybe this issue does not complete. /reopen

tenzen-y avatar Jan 25 '23 18:01 tenzen-y

@tenzen-y: Reopened this issue.

In response to this:

It looks like not to complete this issue. /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Jan 25 '23 18:01 google-oss-prow[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 24 '23 15:08 github-actions[bot]

/lifecycle frozen

tenzen-y avatar Sep 13 '23 07:09 tenzen-y