ray
ray copied to clipboard
[serve][deploy refactor][3/X] Move deployment state manager into application state manager
Why are these changes needed?
The main goal is to adopt a proper reconciler pattern for the application state manager.
- This requires moving deployment state manager inside application state manager.
- Previously, the controller would take care of telling the deployment state manager to deploy or delete applications.
- Now, 2.0 deployments are deployed through application state manager
- 1.0 code path is:
controller.deploy()
->deployment_state_manager.deploy()
on a duplicated code path.
- When
serve.run
is run or a new config is deployed, this sets the application's target state. - In the application state's
update()
, reconcile the target state. Repeatedly:- Deploy all deployments listed in the application's target state
- Delete all (past) deployments that aren't part of the application's current target state
- Update application status based on deployment statuses
- Checkpoint deployment infos per application. This is now necessary because with a reconciler pattern, deployments won't be checkpointed in the deployment state manager after the call to deploy a config returns; it takes the next iteration of
update()
for it to be checkpointed in the deployment state manager.- This is now redundant with the checkpointing in deployment state manager. We might be able to merge them somehow in the future.
Tests in test_application_state
:
- Test different flows in application status:
- Basic flow:
DEPLOYING -> RUNNING -> DELETING
- Application stuck on deploy failed:
DEPLOYING -> DEPLOY_FAILED -> (redeploy) -> DEPLOYING -> RUNNING
- Application recovers from deploy failed:
DEPLOYING -> DEPLOY_FAILED -> (recovers) -> RUNNING
- Application becomes unhealthy after successfully deploying:
DEPLOYING -> RUNNING -> UNHEALTHY -> RUNNING
- Basic flow:
- Test deploy through config, which requires waiting for the deploy task to finish
- If deploy task succeeds and deployments become healthy, application should transition to
RUNNING
- If deploy task fails, application should transition to
DEPLOY_FAILED
- If deploy task succeeds and deployments become healthy, application should transition to
- Redeploy same application with different deployments
- Test recovery
- Test controller crashes while the application is running normally
DEPLOYING -> RUNNING -> (controller crash) -> DEPLOYING -> RUNNING
- Test controller crashes while the application is deploying
DEPLOYING -> (controller crash) -> DEPLOYING -> RUNNING
. In particular, the deployment versions should be correctly reconciled.
- Test controller crashes while the application is running normally
Related issue number
Checks
- [ ] I've signed off every commit(by using the -s flag, i.e.,
git commit -s
) in this PR. - [ ] I've run
scripts/format.sh
to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
doc/source/tune/api/
under the corresponding.rst
file.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
@zcin LMK when ready for review
@edoakes Have yet to add any new tests, but ready for first round of review!
@ray-project/ray-serve This looks like a humongous PR but a huge chunk of the changes are in test_application_state
. To make that file easier to review I've added a list of the tests to the PR description, and have also tried my best to add comments to the tests to describe the flow of each test. Most of the logic changes are in application_state
, and changes to other files only have small changes to cooperate with that.
@zcin , can you help to add a test redeploy the same app while app is under deletion in application state test? (ignore this message if I miss it)
@zcin , can you help to add a test redeploy the same app while app is under deletion in application state test? (ignore this message if I miss it)
@sihanwang41 Ah this test exists in test_application_state
. The behavior can be improved (tracked in https://github.com/anyscale/product/issues/18753), but planning to make the improvements separately to avoid overloading this PR.
@edoakes I've done a bunch of manual testing by killing the controller at random points, and everything I've tested works as expected! I also found the logs we currently have to be pretty useful for figuring out at what point the controller crashed, and what state was recovered during controller recovery. (I added one log for recovering config, if apps were deployed through a config)
@edoakes Tests look good, the two failing tests are failing on master. The last release test run is here: https://buildkite.com/ray-project/release-tests-pr/builds/42684#_, which also looks good. (Changes since the last run of release tests were purely log statements / adding unit tests)