
Migrate from MCAD to AppWrapper v1beta2

dgrove-oss opened this issue 1 year ago • 7 comments

  1. Rename the flag from mcad to appwrapper (see the sketch after this list).
  2. Drop dispatch_priority and its related test (an obsolete MCAD feature).
  3. Simplify the mocked AppWrappers.
  4. Port AppWrappers from v1beta1 to v1beta2.
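
As a rough illustration of item 1, here is a hedged sketch of what the renamed flag looks like from the SDK user's side (the import path and the surrounding parameters are assumptions for illustration, not taken from this PR):

```python
from codeflare_sdk import Cluster, ClusterConfiguration

# Hedged sketch: before this PR the opt-in flag was `mcad=True`; after the
# rename the equivalent opt-in is assumed to be `appwrapper=True`.
cluster = Cluster(ClusterConfiguration(
    name="raytest",          # placeholder cluster name
    namespace="default",     # placeholder namespace
    num_workers=2,
    appwrapper=True,         # was: mcad=True
))
cluster.up()
```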

dgrove-oss avatar Apr 23 '24 01:04 dgrove-oss

I've completed the porting. Ready for review.

dgrove-oss avatar May 03 '24 19:05 dgrove-oss

Rebased yet again.

dgrove-oss avatar May 14 '24 01:05 dgrove-oss

/lgtm

astefanutti avatar May 22 '24 12:05 astefanutti

The queue label is correct (that is not the problem). This is how we get admission of child resources to work with kueue 0.6; see https://project-codeflare.github.io/appwrapper/arch-controller/#workload-controller for an explanation. We've gotten a fix into upstream kueue, so this will no longer be needed once kueue 0.7 is released. See https://github.com/kubernetes-sigs/kueue/pull/2059 for the gory details.
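
For reference, "the queue label" here is kueue's queue-name label on the AppWrapper itself. A minimal hedged sketch of where it sits (all names are placeholders; only the label key and the API group/version are standard):

```python
# Hedged sketch of a v1beta2 AppWrapper carrying the kueue queue label.
appwrapper = {
    "apiVersion": "workload.codeflare.dev/v1beta2",
    "kind": "AppWrapper",
    "metadata": {
        "name": "raycluster-aw",        # placeholder
        "namespace": "default",         # placeholder
        # kueue keys admission off this label:
        "labels": {"kueue.x-k8s.io/queue-name": "default-queue"},
    },
    "spec": {"components": []},  # the wrapped resources (e.g. a RayCluster) go here
}
```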

The problem is that the Ray worker pod doesn't reach the Ready state within the warmup grace period allowed by the appwrapper (5 minutes), so the appwrapper controller decides that the Ray cluster has failed and initiates a retry. I'm debugging locally to figure out why the Ray worker pod isn't passing its readiness probe.
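
One quick way to see why a worker pod never goes Ready is to dump its Ready condition. A hedged sketch with the kubernetes Python client (the namespace is a placeholder; ray.io/node-type is KubeRay's standard node-type label):

```python
from kubernetes import client, config

# Print the Ready condition of every Ray worker pod in the namespace.
config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod("default", label_selector="ray.io/node-type=worker")
for pod in pods.items:
    for cond in pod.status.conditions or []:
        if cond.type == "Ready":
            print(pod.metadata.name, cond.status, cond.reason, cond.message)
```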

dgrove-oss avatar May 23 '24 15:05 dgrove-oss

Fixed the appwrapper e2e test. It turns out it was something quite silly: the sdk_user needs the RBACs to create appwrappers; otherwise the test sits and spins in a permission loop until the timeout expires.
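
For anyone hitting the same wall, a hedged sketch of the kind of Role/RoleBinding the test user needs (namespace, user, and object names are placeholders; the API group is AppWrapper v1beta2's workload.codeflare.dev):

```python
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

# Role allowing the test user to manage AppWrappers in the test namespace.
role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "appwrapper-creator", "namespace": "e2e-test"},
    "rules": [{
        "apiGroups": ["workload.codeflare.dev"],
        "resources": ["appwrappers"],
        "verbs": ["create", "get", "list", "watch", "delete"],
    }],
}
binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "appwrapper-creator-binding", "namespace": "e2e-test"},
    "subjects": [{"kind": "User", "name": "sdk-user",  # placeholder user
                  "apiGroup": "rbac.authorization.k8s.io"}],
    "roleRef": {"kind": "Role", "name": "appwrapper-creator",
                "apiGroup": "rbac.authorization.k8s.io"},
}
rbac.create_namespaced_role("e2e-test", role)
rbac.create_namespaced_role_binding("e2e-test", binding)
```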

dgrove-oss avatar May 23 '24 20:05 dgrove-oss

> The problem is that the Ray worker pod doesn't reach the Ready state within the warmup grace period allowed by the appwrapper (5 minutes), so the appwrapper controller decides that the Ray cluster has failed and initiates a retry. I'm debugging locally to figure out why the Ray worker pod isn't passing its readiness probe.

This turned out not to be the problem in the e2e tests. It was a problem I was having locally when running the e2e tests on a kind cluster without having done all of the ingress setup and DNS bashing that codeflare-common's kind GitHub action does to prepare the test cluster. So, a complete wild goose chase...

dgrove-oss avatar May 23 '24 20:05 dgrove-oss

@ChristianZaccaria -- could you add "--zap-log-level=2" to your codeflare operator's command line? When info-level logs are enabled, the appwrapper controller embedded in the codeflare operator prints a log message on each state transition. That log would give the reason why the appwrapper transitions from Running to Suspending.
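
If editing the Deployment by hand is awkward, here is a hedged sketch of doing it with the kubernetes Python client (the deployment name, container name, and namespace are assumptions; note that a strategic merge replaces the whole args list, so any existing args must be included too):

```python
from kubernetes import client, config

# Patch the operator Deployment to enable info-level (zap level 2) logging.
config.load_kube_config()
apps = client.AppsV1Api()
patch = {"spec": {"template": {"spec": {"containers": [{
    "name": "manager",                    # assumed container name
    "args": ["--zap-log-level=2"],        # replaces existing args; merge by hand
}]}}}}
apps.patch_namespaced_deployment(
    "codeflare-operator-manager",         # assumed deployment name
    "openshift-operators",                # assumed namespace
    patch,
)
```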

dgrove-oss avatar Jun 03 '24 21:06 dgrove-oss

Actually, you must have that already, since you are showing INFO-level logs.
The detailed information is kept in the AppWrapper's status.Conditions array. In particular, the message and reason fields of the conditions will tell you why the controller entered the Resetting or Suspending state (which should be the INFO log message right before the Suspended one).

My usual debugging trick for observing an appwrapper is to run kubectl get appwrappers --watch -o yaml and watch what shows up in the conditions array as the appwrapper runs.
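
The same watch can be done programmatically; a hedged sketch with the kubernetes Python client (the namespace is a placeholder):

```python
from kubernetes import client, config, watch

# Stream AppWrapper updates and print each condition as it changes.
config.load_kube_config()
api = client.CustomObjectsApi()
w = watch.Watch()
for event in w.stream(api.list_namespaced_custom_object,
                      group="workload.codeflare.dev", version="v1beta2",
                      namespace="default", plural="appwrappers"):
    aw = event["object"]
    for cond in aw.get("status", {}).get("conditions", []):
        print(aw["metadata"]["name"], cond.get("type"),
              cond.get("status"), cond.get("reason"), cond.get("message"))
```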

dgrove-oss avatar Jun 03 '24 21:06 dgrove-oss

/lgtm

astefanutti avatar Jun 04 '24 11:06 astefanutti

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti, Srihari1192

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Jun 04 '24 11:06 openshift-ci[bot]