training-operator
training-operator copied to clipboard
Inconsistent implementation about when the validation of job's spec failed
In below codes, when validate the job's spec failed, the process is different. The MPIJob will return an err, so the MPIJob will not continue to creating corresponding pods/services, it will try again after some time. The PytorchJob/TFJob will just print an error log then continue, but it maybe cause unexpected results in the future.
I think we need to discuss what exactly we should do when we validate job's spec failed then we apply it to all Jobs. In my opinion, it should not continue after validating job's spec failed, and we not only to print error log, but also need to record a warning event so that users can know why their Job is blocking through kubectl describe XXJob.
Referring to point4 of https://github.com/kubeflow/training-operator/issues/1703
https://github.com/kubeflow/training-operator/blob/82af677072bfae01099439d587f2791eb41c08d1/pkg/controller.v1/mpi/mpijob_controller.go#L135-L138
https://github.com/kubeflow/training-operator/blob/82af677072bfae01099439d587f2791eb41c08d1/pkg/controller.v1/pytorch/pytorchjob_controller.go#L133-L135
https://github.com/kubeflow/training-operator/blob/82af677072bfae01099439d587f2791eb41c08d1/pkg/controller.v1/tensorflow/tfjob_controller.go#L158-L160
Thanks for reporting.
Yes. we should not continue if validation fails. Also, recording a warning event is a great idea. Can you fix this?
/cc @gaocegege @terrytangyuan
Yes, I think MPI controller is doing it correctly.
@terrytangyuan Since error is returned when Validation fails in MPI, reconcile function will be called again. Ref: https://github.com/kubeflow/training-operator/pull/1705#issuecomment-1363648407
Maybe this issue does not complete. /reopen
@tenzen-y: Reopened this issue.
In response to this:
It looks like not to complete this issue. /reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen