Yuki Iwai
/kind feature Currently, the training-operator generates certs for the webhook using its internal cert generation mechanism, regardless of the installation method. However, we know that some administrators prefer to use the...
We have many examples, which help users easily understand how to run TrainingJobs. However, we have no verification that the examples are valid. So, I would propose...
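As a rough illustration of what "verifying that an example is valid" could mean, the sketch below runs a few structural checks on an already-parsed example manifest. It is not the proposal from this issue (that text is truncated above). The helper name `validate_manifest` and the specific checks are assumptions made for illustration only.

```python
from typing import List

# Job kinds served by the training-operator under the kubeflow.org/v1 API group.
KNOWN_KINDS = {"TFJob", "PyTorchJob", "MXJob", "MPIJob", "PaddleJob", "XGBoostJob"}


def validate_manifest(manifest: dict) -> List[str]:
    """Return a list of problems found in a parsed example manifest.

    Hypothetical helper: it only checks basic structure and does not
    submit anything to a cluster.
    """
    errors = []
    if manifest.get("apiVersion") != "kubeflow.org/v1":
        errors.append("apiVersion should be kubeflow.org/v1")
    if manifest.get("kind") not in KNOWN_KINDS:
        errors.append("unknown kind: %r" % manifest.get("kind"))
    metadata = manifest.get("metadata")
    if not isinstance(metadata, dict) or not (
        metadata.get("name") or metadata.get("generateName")
    ):
        errors.append("metadata.name or metadata.generateName is required")
    if not isinstance(manifest.get("spec"), dict):
        errors.append("spec must be a mapping")
    return errors
```

A check like this could run in CI over every file under `examples/`, failing the build when any manifest returns a non-empty list. A stricter variant would apply each example to a test cluster instead.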
Our TFJob examples use TensorFlow v1, but that version is too old.

- https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/tensorflow/tf_sample/Dockerfile#L1
- https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/tensorflow/dist-mnist/Dockerfile#L15
- https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/tensorflow/distribution_strategy/estimator-API/Dockerfile#L1
- https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/tensorflow/mnist_with_summaries/Dockerfile#L15

So, all TFJob examples should be adapted to TensorFlow...
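One way to find Dockerfiles that still pin an old TensorFlow base image is to read the tag on each `FROM` line. The sketch below is a minimal, assumption-laden example: the function name `tensorflow_major_version` is hypothetical, and it assumes the base image name contains `tensorflow` and carries a numeric version tag. The contents of the linked Dockerfiles are not reproduced here.

```python
import re
from typing import Optional

# Capture the image reference on each FROM line, skipping an optional --platform flag.
FROM_RE = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE | re.MULTILINE
)


def tensorflow_major_version(dockerfile_text: str) -> Optional[int]:
    """Return the major version of the first TensorFlow base image, if any."""
    for image in FROM_RE.findall(dockerfile_text):
        name, sep, tag = image.rpartition(":")
        if not sep or "/" in tag:
            continue  # no tag present (the colon, if any, belonged to a registry port)
        if "tensorflow" not in name:
            continue
        match = re.match(r"(\d+)\.", tag)
        if match:
            return int(match.group(1))
    return None
```

Any example whose Dockerfile returns `1` here would be a candidate for migration.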
After merging kubeflow/common into this repo, we have a lot of redundant and duplicated code. So, we should clean up the whole repo.

- [x] Replace dummy...
```shell
        TRAINING_CLIENT.create_mxjob(mxjob, job_namespace)
        logging.info(f"List of created {constants.MXJOB_KIND}s")
        logging.info(TRAINING_CLIENT.list_mxjobs(job_namespace))

>       verify_job_e2e(
            TRAINING_CLIENT,
            JOB_NAME,
            job_namespace,
            constants.MXJOB_KIND,
            CONTAINER_NAME,
        )

sdk/python/test/e2e/test_e2e_mxjob.py:152:
_ _ _ _ _ _ _ _ _ _ _ _ _...
```
Currently, the mpijob-controller implements its own logic similar to kubeflow/common, since it does not depend on kubeflow/common. So when we want to fix or add controller logic, we must...
Currently, all framework job controllers override the `GetPodsForJob` method of the `JobController`:

- JobController (is overridden): https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/common/pod.go#L217
- mpijob-controller: https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/mpi/mpijob_controller.go#L508
- mxnetjob-controller: https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/mxnet/mxjob_controller.go#L289
- paddlejob-controller: https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go#L284
- pytorchjob-controller: https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/pytorch/pytorchjob_controller.go#L284
- tfjob-controller: https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/tensorflow/tfjob_controller.go#L291
-...
Currently, the tfjob-controller has many duplicated tests, and they are flaky. For example, we have duplicated tests for Job:

- https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/tensorflow/job_test.go#L42
- https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/tensorflow/tfjob_controller_test.go#L32

So, we should refactor those tests...
```
------------------------------
• [FAILED] [0.017 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:528

  Timeline >>
  STEP: preparing cases succeeded job...
```
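The failing test above exercises TTL-based cleanup: a finished job should be deleted once its TTL has elapsed. The sketch below shows that expiry rule in isolation. It is an assumption about the general behavior, not the controller's actual implementation, and the function name `ttl_expired` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def ttl_expired(
    completion_time: Optional[datetime],
    ttl_seconds: Optional[int],
    now: datetime,
) -> bool:
    """Return True when a finished job has outlived its TTL.

    A job that has not finished, or that has no TTL configured, never expires.
    """
    if completion_time is None or ttl_seconds is None:
        return False
    return now >= completion_time + timedelta(seconds=ttl_seconds)
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check deterministic in tests. Timing-sensitive checks that depend on the wall clock are a common source of flakiness.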
```shell
------------------------------
• [FAILED] [10.015 seconds]
TFJob controller Test Exit Code [It] should delete designated Pod
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/pod_test.go:219

  Timeline >>
  STEP: Creating TFJob "test-exit-code" with 1 worker only @ 07/03/23 15:49:26.079...
```