
[feature] Build and test V2 driver / launcher images against incoming PRs

Open CarterFendley opened this issue 5 months ago • 9 comments

Feature Area

/area backend /area samples

What feature would you like to see?

The V2 backend driver / launcher images being built and tested against incoming PRs through integration tests / etc.

What is the use case or pain point?

Ensure the stability of the driver / launcher.

Is there a workaround currently?

Trust people to test driver / launcher locally.

More details

The kfp-cluster action, which is used by the workflows listed below, currently uses build-images.sh to build a set of images and push them to the kind registry.

  • e2e-test.yml
  • kfp-kubernetes-execution-tests.yml
  • kfp-samples.yml <-- My primary focus at the moment
  • kubeflow-pipelines-integration-v2.yml
  • periodic.yml
  • sdk-execution.yml
  • upgrade-test.yml

The set of images which are built by build-images.sh does not currently include the V2 driver and launcher.
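As a rough illustration of what adding the two images to the build step could look like, here is a minimal sketch. It is not taken from build-images.sh: the registry address, the tag, and the Dockerfile paths (backend/Dockerfile.driver and backend/Dockerfile.launcher) are assumptions, and the script only prints the docker commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: REGISTRY, TAG, and the Dockerfile paths are assumptions,
# not values copied from build-images.sh.
set -eu

REGISTRY="${REGISTRY:-kind-registry:5000}"  # assumed kind registry address
TAG="${TAG:-latest}"                        # assumed image tag
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print commands, 0 = run them

# Compose the full image reference for a component name.
image_ref() {
  printf '%s/%s:%s' "$REGISTRY" "$1" "$TAG"
}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

for component in driver launcher; do
  ref="$(image_ref "$component")"
  # Assumed Dockerfile location: backend/Dockerfile.<component>
  run docker build -t "$ref" -f "backend/Dockerfile.${component}" .
  run docker push "$ref"
done
```

Running it with DRY_RUN=0 from the repository root would perform the actual build and push, assuming the paths above match the repository layout.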

Even if this is changed, additional work would still be required to ensure that the built images are actually used by the backend during testing. Namely, the backend has defaults for which images to use (see here), which normally point to gcr.io locations. These defaults would need to be overridden so that, during PR testing, the built images are used instead of the ones previously published on gcr.io.

Discussion of implementation

  1. Updating build-images.sh would likely be pretty straightforward.
  2. The argo compiler accepts V2_DRIVER_IMAGE / V2_LAUNCHER_IMAGE environment variables to override the gcr.io defaults (configured via the deployment.apps/ml-pipeline deployment). @hbelmiro has suggested maybe using a Kustomize layer for updating these during testing.
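For reference, the override in step 2 could also be applied imperatively rather than through a Kustomize layer. The sketch below uses kubectl set env, which is a different mechanism from the Kustomize suggestion; the namespace, registry address, and tag are assumptions, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: NAMESPACE, REGISTRY, and TAG are assumptions for illustration.
set -eu

NAMESPACE="${NAMESPACE:-kubeflow}"          # assumed namespace of ml-pipeline
REGISTRY="${REGISTRY:-kind-registry:5000}"  # assumed kind registry address
TAG="${TAG:-latest}"                        # assumed image tag
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print commands, 0 = run them

# Compose the full image reference for a component name.
image_ref() {
  printf '%s/%s:%s' "$REGISTRY" "$1" "$TAG"
}

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Point the API server deployment at the locally built images.
run kubectl -n "$NAMESPACE" set env deployment/ml-pipeline \
  "V2_DRIVER_IMAGE=$(image_ref driver)" \
  "V2_LAUNCHER_IMAGE=$(image_ref launcher)"

# Wait for the deployment to roll out with the new environment.
run kubectl -n "$NAMESPACE" rollout status deployment/ml-pipeline
```

A Kustomize layer would keep the override declarative and versioned alongside the test manifests, so this is probably more useful for local experiments than for the workflows themselves.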

What about releases?

Although it makes sense to build the driver / launcher images and test them on PRs, it may make sense NOT to override the V2_DRIVER_IMAGE / V2_LAUNCHER_IMAGE defaults when validating releases, and to test against the images published on gcr.io instead. Since users are unlikely to override these values and will therefore use the gcr.io images, it is reasonable to test in that configuration.

I am not aware of the extent to which kfp-samples.yml (or other workflows consuming the kfp-cluster action) are executed during release processes. Please let me know if others have more info on this :)


Related slack thread: https://cloud-native.slack.com/archives/C073N7BMLB1/p1727104197895549

Love this idea? Give it a 👍.

CarterFendley, Sep 23 '24 20:09