Yuki Iwai

Results 116 issues of Yuki Iwai

/kind feature Currently, the training-operator generates certs for the webhook using its internal cert generation mechanism, regardless of the installation method. However, we know that some administrators prefer to use the...

help wanted
good first issue
kind/feature

We have many examples, which help users easily understand how to run TrainingJobs. However, nothing verifies that the examples are valid. So, I would propose...

help wanted
good first issue

Our TFJob examples are using TensorFlow v1, but that version is too old.
- https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/tensorflow/tf_sample/Dockerfile#L1
- https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/tensorflow/dist-mnist/Dockerfile#L15
- https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/tensorflow/distribution_strategy/estimator-API/Dockerfile#L1
- https://github.com/kubeflow/training-operator/blob/e2d6ba41d4d11eb333a80cb95a1c22a29e3da156/examples/tensorflow/mnist_with_summaries/Dockerfile#L15
So, all TFJob examples should be adapted to TensorFlow...

help wanted
good first issue

After merging kubeflow/common into this repo, we have a lot of redundant and duplicated code. So, we should clean up the whole repo.
- [x] Replace dummy...

help wanted
good first issue
kind/cleanup

```shell TRAINING_CLIENT.create_mxjob(mxjob, job_namespace) logging.info(f"List of created {constants.MXJOB_KIND}s") logging.info(TRAINING_CLIENT.list_mxjobs(job_namespace)) > verify_job_e2e( TRAINING_CLIENT, JOB_NAME, job_namespace, constants.MXJOB_KIND, CONTAINER_NAME, ) sdk/python/test/e2e/test_e2e_mxjob.py:152: _ _ _ _ _ _ _ _ _ _ _ _ _...

kind/e2e-test-failure
lifecycle/frozen

Currently, the mpijob-controller implements its own logic similar to kubeflow/common, since it does not depend on kubeflow/common. So when we want to fix or add controller logic, we must...

kind/feature
lifecycle/frozen

Currently, all framework job controllers override the `GetPodsForJob` of the `JobController`:
- JobController (is overridden): https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/common/pod.go#L217
- mpijob-controller: https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/mpi/mpijob_controller.go#L508
- mxnetjob-controller: https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/mxnet/mxjob_controller.go#L289
- paddlejob-controller: https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go#L284
- pytorchjob-controller: https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/pytorch/pytorchjob_controller.go#L284
- tfjob-controller: https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/tensorflow/tfjob_controller.go#L291
-...

lifecycle/frozen

Currently, we have many duplicated tests for the tfjob-controller, which are flaky. For example, we have duplicated tests for Job:
- https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/tensorflow/job_test.go#L42
- https://github.com/kubeflow/training-operator/blob/59cc98cbfc906546b096a29f5d31482fba7cdebf/pkg/controller.v1/tensorflow/tfjob_controller_test.go#L32
So, we should refactor those tests...

lifecycle/frozen

``` ------------------------------ • [FAILED] [0.017 seconds] TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:528 Timeline >> STEP: preparing cases succeeded job...

lifecycle/frozen

```shell ------------------------------ • [FAILED] [10.015 seconds] TFJob controller Test Exit Code [It] should delete designated Pod /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/pod_test.go:219 Timeline >> STEP: Creating TFJob "test-exit-code" with 1 worker only @ 07/03/23 15:49:26.079...

lifecycle/frozen