Add workflows to verify if examples are valid

Open tenzen-y opened this issue 1 year ago • 6 comments

We have many examples, and they help users understand how to run TrainingJobs. However, nothing currently verifies that the examples are valid. I propose that we add CI workflows to verify that the examples work.

The Katib workflows would be a good reference for implementing this in the training-operator: https://github.com/kubeflow/katib/blob/master/.github/workflows/e2e-test-pytorch-mnist.yaml

/good-first-issue

tenzen-y • Mar 08 '24 12:03

@tenzen-y: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] • Mar 08 '24 12:03

I'd like to work on this issue and add a GitHub Action for the training operator examples. It matches my difficulty level. Any guidance you can provide would be greatly appreciated and would help me move forward faster.

/assign

shivas1516 • Mar 10 '24 18:03

@tenzen-y Is adding e2e tests to the workflow necessary for verifying the Training Operator examples, as in Katib? Could you provide some additional information on this? It would help me solve this issue.

shivas1516 • Mar 24 '24 00:03

We need to implement the following steps in the script:

  1. Build example and operator images
  2. Start KinD cluster
  3. Load built images into the cluster
  4. Set up the TrainingOperator
  5. Create a Job with built images
  6. Verify that the created Job succeeded
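
The six steps above could be sketched roughly as follows. This is only an illustration, not the script the project uses: `build_plan` and `run_plan` are hypothetical helpers, every image name and path in the usage section is a placeholder, and the default Deployment name, namespace, and the `Succeeded` condition type are assumptions that would need to be checked against the actual manifests and job status. The script only prints the commands unless `dry_run=False` is passed.

```python
import shlex
import subprocess
from typing import List


def build_plan(
    example_image: str,
    example_context: str,
    operator_image: str,
    operator_dockerfile: str,
    operator_manifests: str,
    job_manifest: str,
    job_resource: str,
    cluster: str = "examples-e2e",
    operator_deployment: str = "deployment/training-operator",  # assumed name
    operator_namespace: str = "kubeflow",  # assumed namespace
    timeout: str = "600s",
) -> List[List[str]]:
    """Return the commands for the six steps, in order, without running them."""
    return [
        # 1. Build example and operator images.
        ["docker", "build", "-t", example_image, example_context],
        ["docker", "build", "-t", operator_image, "-f", operator_dockerfile, "."],
        # 2. Start the KinD cluster.
        ["kind", "create", "cluster", "--name", cluster],
        # 3. Load the built images into the cluster.
        ["kind", "load", "docker-image", example_image, "--name", cluster],
        ["kind", "load", "docker-image", operator_image, "--name", cluster],
        # 4. Set up the Training Operator from a kustomize directory and wait
        #    for it. Pointing the manifests at operator_image is not handled here.
        ["kubectl", "apply", "-k", operator_manifests],
        ["kubectl", "wait", "--for=condition=Available", operator_deployment,
         "-n", operator_namespace, f"--timeout={timeout}"],
        # 5. Create a Job that uses the built example image.
        ["kubectl", "apply", "-f", job_manifest],
        # 6. Verify that the Job succeeded (assumes a "Succeeded" condition).
        ["kubectl", "wait", "--for=condition=Succeeded", job_resource,
         f"--timeout={timeout}"],
    ]


def run_plan(plan: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    rendered = []
    for cmd in plan:
        line = shlex.join(cmd)
        rendered.append(line)
        print(line)
        if not dry_run:
            # check=True stops at the first failing step, which fails the CI job.
            subprocess.run(cmd, check=True)
    return rendered


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    run_plan(build_plan(
        example_image="example/pytorch-mnist:e2e",
        example_context="path/to/example",
        operator_image="example/training-operator:e2e",
        operator_dockerfile="path/to/operator/Dockerfile",
        operator_manifests="path/to/operator/manifests",
        job_manifest="path/to/example/job.yaml",
        job_resource="pytorchjob/example-job",
    ))
```

A CI workflow could call a script like this once per example, passing that example's build context and Job manifest. `kubectl wait` exits with a non-zero status when the timeout expires, so a Job that never reaches the expected condition fails the run.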

tenzen-y • May 14 '24 16:05

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] • Aug 12 '24 20:08

/remove-lifecycle stale

andreyvelich • Aug 13 '24 13:08