codeflare-sdk
Migrate from MCAD to AppWrapper v1beta2
- Rename the flag from `mcad` to `appwrapper`
- Drop dispatch_priority and related test (obsolete MCAD feature).
- Simplify mocked AppWrappers
- Port AppWrappers from v1beta1 to v1beta2
I've completed the porting. Ready for review.
rebased yet again.
/lgtm
The queue label is correct (that is not the problem). This is how we get admission of child resources to work with kueue 0.6. See https://project-codeflare.github.io/appwrapper/arch-controller/#workload-controller for an explanation. We've gotten a fix into upstream kueue, so this will no longer be needed once kueue 0.7 is released. See https://github.com/kubernetes-sigs/kueue/pull/2059 for the gory details.
The problem is that the Ray Worker pod doesn't reach the ready state within the warmup grace period allowed by the appwrapper (5 minutes). Therefore the appwrapper controller decides that the ray cluster has failed and initiates a retry. I'm debugging locally to figure out why the Ray worker pod isn't successfully passing its readiness probe.
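For what it's worth, a minimal sketch of how one might inspect the worker pods' readiness from Python while debugging (assuming the kubernetes client is installed; the namespace and the KubeRay-style label selector are illustrative assumptions):

```python
# Sketch: check Ray worker pod readiness with the kubernetes Python client.
# The namespace and label selector below are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    namespace="default",
    label_selector="ray.io/node-type=worker",  # assumed KubeRay worker label
)
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        print(f"{pod.metadata.name}/{cs.name} ready={cs.ready}")
        if not cs.ready and cs.state and cs.state.waiting:
            print(f"  waiting: {cs.state.waiting.reason} - {cs.state.waiting.message}")
```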
Fixed the appwrapper e2e test... it turns out it was something quite silly: the sdk_user needs the RBACs to create appwrappers; otherwise the test sits and spins in a permission loop until the timeout expires.
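For illustration, a minimal sketch of the kind of missing permissions, expressed as a namespaced Role/RoleBinding created through the kubernetes Python client (the AppWrapper v1beta2 API group, the namespace, and the user name below are assumptions; the e2e tests may set this up differently):

```python
# Sketch: grant a test user permission to manage AppWrappers in a namespace.
# The API group, namespace, and user name are assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()
namespace = "test-ns"

role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "appwrapper-creator", "namespace": namespace},
    "rules": [{
        "apiGroups": ["workload.codeflare.dev"],  # AppWrapper v1beta2 API group (assumed)
        "resources": ["appwrappers"],
        "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
    }],
}
binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "appwrapper-creator", "namespace": namespace},
    "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "Role", "name": "appwrapper-creator"},
    "subjects": [{"apiGroup": "rbac.authorization.k8s.io", "kind": "User", "name": "sdk-user"}],
}

rbac.create_namespaced_role(namespace=namespace, body=role)
rbac.create_namespaced_role_binding(namespace=namespace, body=binding)
```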
> The problem is that the Ray Worker pod doesn't reach the ready state within the warmup grace period allowed by the appwrapper (5 minutes). Therefore the appwrapper controller decides that the ray cluster has failed and initiates a retry. I'm debugging locally to figure out why the Ray worker pod isn't successfully passing its readiness probe.
This turned out not to be the problem in the e2e tests. It was a problem I was having locally when running the e2e tests on a kind cluster without having done all of the ingress setup and DNS bashing that codeflare-common's kind GitHub action does to prepare the test cluster. So a complete wild goose chase...
@ChristianZaccaria -- could you add `--zap-log-level=2` to your codeflare operator's command line? When running with info-level logs enabled, the appwrapper controller embedded in the codeflare operator prints a log message on each state transition. That log would give the reason why the appwrapper transitions from running to suspending.
Actually, you must have that already since you are showing INFO level logs.
The detailed information is kept in the AppWrapper's status.Conditions array. In particular, the message/reason fields of the conditions will tell you why the controller went into the Resetting or Suspending state (which should be the INFO log message right before the Suspended one).
My usual debugging trick for observing an appwrapper is to do a `kubectl get appwrappers --watch -o yaml` and then see what is in the conditions array as it is running.
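The same conditions can also be read programmatically; here is a minimal sketch with the kubernetes Python client (the group/version/plural for AppWrapper v1beta2 and the name/namespace are assumptions):

```python
# Sketch: read an AppWrapper's status.conditions to see why it changed state.
# The group/version/plural and the name/namespace below are assumptions.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

aw = api.get_namespaced_custom_object(
    group="workload.codeflare.dev",
    version="v1beta2",
    namespace="test-ns",
    plural="appwrappers",
    name="my-appwrapper",
)
for cond in aw.get("status", {}).get("conditions", []):
    print(cond.get("type"), cond.get("status"), cond.get("reason"), cond.get("message"))
```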
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: astefanutti, Srihari1192
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
- ~~OWNERS~~ [astefanutti]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment