
Migrate from MCAD to AppWrapper v1beta2

dgrove-oss opened this issue 1 year ago • 7 comments

  1. Rename the flag from mcad to appwrapper (see the sketch after this list).
  2. Drop dispatch_priority and its related test (an obsolete MCAD feature).
  3. Simplify the mocked AppWrappers.
  4. Port AppWrappers from v1beta1 to v1beta2.
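
As a rough illustration of item 1, here is a hedged sketch of what the renamed flag looks like from the SDK user's side (the import path and the surrounding parameters are assumptions for illustration, not taken from this PR):

```python
from codeflare_sdk import Cluster, ClusterConfiguration

# Hedged sketch: before this PR the opt-in flag was `mcad=True`; after the
# rename the equivalent opt-in is assumed to be `appwrapper=True`.
cluster = Cluster(ClusterConfiguration(
    name="raytest",          # placeholder cluster name
    namespace="default",     # placeholder namespace
    num_workers=2,
    appwrapper=True,         # was: mcad=True
))
cluster.up()
```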

dgrove-oss avatar Apr 23 '24 01:04 dgrove-oss

I've completed the porting. Ready for review.

dgrove-oss avatar May 03 '24 19:05 dgrove-oss

Rebased yet again.

dgrove-oss avatar May 14 '24 01:05 dgrove-oss

/lgtm

astefanutti avatar May 22 '24 12:05 astefanutti

The queue label is correct (that is not the problem). This is how we get admission of child resources to work with kueue 0.6; see https://project-codeflare.github.io/appwrapper/arch-controller/#workload-controller for an explanation. We've gotten a fix into upstream kueue, so this will no longer be needed once kueue 0.7 is released. See https://github.com/kubernetes-sigs/kueue/pull/2059 for the gory details.
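
For reference, "the queue label" here is kueue's queue-name label on the AppWrapper itself. A minimal hedged sketch of where it sits (all names are placeholders; only the label key and the API group/version are standard):

```python
# Hedged sketch of a v1beta2 AppWrapper carrying the kueue queue label.
appwrapper = {
    "apiVersion": "workload.codeflare.dev/v1beta2",
    "kind": "AppWrapper",
    "metadata": {
        "name": "raycluster-aw",        # placeholder
        "namespace": "default",         # placeholder
        # kueue keys admission off this label:
        "labels": {"kueue.x-k8s.io/queue-name": "default-queue"},
    },
    "spec": {"components": []},  # the wrapped resources (e.g. a RayCluster) go here
}
```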

The problem is that the Ray worker pod doesn't reach the Ready state within the warmup grace period allowed by the appwrapper (5 minutes), so the appwrapper controller decides that the Ray cluster has failed and initiates a retry. I'm debugging locally to figure out why the Ray worker pod isn't passing its readiness probe.
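
One quick way to see why a worker pod never goes Ready is to dump its Ready condition. A hedged sketch with the kubernetes Python client (the namespace is a placeholder; ray.io/node-type is KubeRay's standard node-type label):

```python
from kubernetes import client, config

# Print the Ready condition of every Ray worker pod in the namespace.
config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod("default", label_selector="ray.io/node-type=worker")
for pod in pods.items:
    for cond in pod.status.conditions or []:
        if cond.type == "Ready":
            print(pod.metadata.name, cond.status, cond.reason, cond.message)
```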

dgrove-oss avatar May 23 '24 15:05 dgrove-oss

Fixed the appwrapper e2e test. It turns out it was something quite silly: the sdk_user needs the RBACs to create appwrappers; otherwise the test sits and spins in a permission loop until the timeout expires.
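
For anyone hitting the same wall, a hedged sketch of the kind of Role/RoleBinding the test user needs (namespace, user, and object names are placeholders; the API group is AppWrapper v1beta2's workload.codeflare.dev):

```python
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

# Role allowing the test user to manage AppWrappers in the test namespace.
role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "appwrapper-creator", "namespace": "e2e-test"},
    "rules": [{
        "apiGroups": ["workload.codeflare.dev"],
        "resources": ["appwrappers"],
        "verbs": ["create", "get", "list", "watch", "delete"],
    }],
}
binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "appwrapper-creator-binding", "namespace": "e2e-test"},
    "subjects": [{"kind": "User", "name": "sdk-user",  # placeholder user
                  "apiGroup": "rbac.authorization.k8s.io"}],
    "roleRef": {"kind": "Role", "name": "appwrapper-creator",
                "apiGroup": "rbac.authorization.k8s.io"},
}
rbac.create_namespaced_role("e2e-test", role)
rbac.create_namespaced_role_binding("e2e-test", binding)
```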

dgrove-oss avatar May 23 '24 20:05 dgrove-oss

> The problem is that the Ray worker pod doesn't reach the Ready state within the warmup grace period allowed by the appwrapper (5 minutes), so the appwrapper controller decides that the Ray cluster has failed and initiates a retry. I'm debugging locally to figure out why the Ray worker pod isn't passing its readiness probe.

This turned out not to be the problem in the e2e tests. It was a problem I was having locally when running the e2e tests on a kind cluster without having done all of the ingress setup and DNS bashing that codeflare-common's kind GitHub action does to prepare the test cluster. So, a complete wild goose chase...

dgrove-oss avatar May 23 '24 20:05 dgrove-oss

@ChristianZaccaria -- could you add "--zap-log-level=2" to your codeflare operator's command line? When info-level logs are enabled, the appwrapper controller embedded in the codeflare operator prints a log message on each state transition. That log would give the reason why the appwrapper transitions from Running to Suspending.
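
If editing the Deployment by hand is awkward, here is a hedged sketch of doing it with the kubernetes Python client (the deployment name, container name, and namespace are assumptions; note that a strategic merge replaces the whole args list, so any existing args must be included too):

```python
from kubernetes import client, config

# Patch the operator Deployment to enable info-level (zap level 2) logging.
config.load_kube_config()
apps = client.AppsV1Api()
patch = {"spec": {"template": {"spec": {"containers": [{
    "name": "manager",                    # assumed container name
    "args": ["--zap-log-level=2"],        # replaces existing args; merge by hand
}]}}}}
apps.patch_namespaced_deployment(
    "codeflare-operator-manager",         # assumed deployment name
    "openshift-operators",                # assumed namespace
    patch,
)
```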

dgrove-oss avatar Jun 03 '24 21:06 dgrove-oss

Actually, you must have that already, since you are showing INFO-level logs.
The detailed information is kept in the AppWrapper's status.Conditions array. In particular, the message and reason fields of the conditions will tell you why the controller entered the Resetting or Suspending state (which should be the INFO log message right before the Suspended one).

My usual debugging trick for observing an appwrapper is to run kubectl get appwrappers --watch -o yaml and watch what shows up in the conditions array as the appwrapper runs.
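
The same watch can be done programmatically; a hedged sketch with the kubernetes Python client (the namespace is a placeholder):

```python
from kubernetes import client, config, watch

# Stream AppWrapper updates and print each condition as it changes.
config.load_kube_config()
api = client.CustomObjectsApi()
w = watch.Watch()
for event in w.stream(api.list_namespaced_custom_object,
                      group="workload.codeflare.dev", version="v1beta2",
                      namespace="default", plural="appwrappers"):
    aw = event["object"]
    for cond in aw.get("status", {}).get("conditions", []):
        print(aw["metadata"]["name"], cond.get("type"),
              cond.get("status"), cond.get("reason"), cond.get("message"))
```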

dgrove-oss avatar Jun 03 '24 21:06 dgrove-oss

/lgtm

astefanutti avatar Jun 04 '24 11:06 astefanutti

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti, Srihari1192

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Jun 04 '24 11:06 openshift-ci[bot]