Richard Liu
I would like to see a more detailed proposal for the migration plan. Specifically: * How do we avoid having two divergent versions of the MpiJob? * Assuming that the...
Thanks for the reply. For the case with 100 workers - suppose that different users created two such clusters in the same Kubernetes cluster. Neither of them has sufficient workers,...
@xyhuang @swiftdiaries Let's try to have this for 0.5. A few things to consider: 1. How should we automate this? I think it makes sense to create a periodic Prow...
Let's split up the work. I can take care of item 1 (set up project, cluster, and Prow workflow).
Sounds great to me. /cc @jlewi
/cc @johnugeorge /cc @terrytangyuan /cc @jian-he
That is the plan. We can add e2e tests with the TestJob as well.
/cc @k82cn /cc @gaocegege /cc @johnugeorge
As a reference, these are the graduation criteria for TFJob 1.0: https://github.com/kubeflow/tf-operator/issues/1076
Can we move the test framework code (test_runner, etc.) into kubeflow/testing? That way we don't need to replicate the code in every repository.