ray icon indicating copy to clipboard operation
ray copied to clipboard

[serve][deploy refactor][3/X] Move deployment state manager into application state manager

Open zcin opened this issue 1 year ago • 3 comments

Why are these changes needed?

The main goal is to adopt a proper reconciler pattern for the application state manager.

  • This requires moving deployment state manager inside application state manager.
    • Previously, the controller would take care of telling the deployment state manager to deploy or delete applications.
    • Now, 2.0 deployments are deployed through application state manager
    • 1.0 code path is: controller.deploy() -> deployment_state_manager.deploy() on a duplicated code path.
  • When serve.run is run or a new config is deployed, this sets the application's target state.
  • In the application state's update(), reconcile the target state. Repeatedly:
    • Deploy all deployments listed in the application's target state
    • Delete all (past) deployments that aren't part of the application's current target state
    • Update application status based on deployment statuses
  • Checkpoint deployment infos per application. This is now necessary because with a reconciler pattern, deployments won't be checkpointed in the deployment state manager after the call to deploy a config returns; it takes the next iteration of update() for it to be checkpointed in the deployment state manager.
    • This is now redundant with the checkpointing in deployment state manager. We might be able to merge them somehow in the future.

Tests in test_application_state:

  • Test different flows in application status:
    • Basic flow: DEPLOYING -> RUNNING -> DELETING
    • Application stuck on deploy failed: DEPLOYING -> DEPLOY_FAILED -> (redeploy) -> DEPLOYING -> RUNNING
    • Application recovers from deploy failed: DEPLOYING -> DEPLOY_FAILED -> (recovers) -> RUNNING
    • Application becomes unhealthy after successfully deploying: DEPLOYING -> RUNNING -> UNHEALTHY -> RUNNING
  • Test deploy through config, which requires waiting for the deploy task to finish
    • If deploy task succeeds and deployments become healthy, application should transition to RUNNING
    • If deploy task fails, application should transition to DEPLOY_FAILED
  • Redeploy same application with different deployments
  • Test recovery
    • Test controller crashes while the application is running normally DEPLOYING -> RUNNING -> (controller crash) -> DEPLOYING -> RUNNING
    • Test controller crashes while the application is deploying DEPLOYING -> (controller crash) -> DEPLOYING -> RUNNING. In particular, the deployment versions should be correctly reconciled.

Related issue number

Checks

  • [ ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

zcin avatar Apr 12 '23 18:04 zcin

@zcin LMK when ready for review

edoakes avatar May 03 '23 16:05 edoakes

@edoakes Have yet to add any new tests, but ready for first round of review!

zcin avatar May 03 '23 16:05 zcin

@ray-project/ray-serve This looks like a humongous PR but a huge chunk of the changes are in test_application_state. To make that file easier to review I've added a list of the tests to the PR description, and have also tried my best to add comments to the tests to describe the flow of each test. Most of the logic changes are in application_state, and changes to other files only have small changes to cooperate with that.

zcin avatar Jun 12 '23 17:06 zcin

@zcin , can you help to add a test redeploy the same app while app is under deletion in application state test? (ignore this message if I miss it)

sihanwang41 avatar Jun 15 '23 20:06 sihanwang41

@zcin , can you help to add a test redeploy the same app while app is under deletion in application state test? (ignore this message if I miss it)

@sihanwang41 Ah this test exists in test_application_state. The behavior can be improved (tracked in https://github.com/anyscale/product/issues/18753), but planning to make the improvements separately to avoid overloading this PR.

zcin avatar Jun 15 '23 21:06 zcin

@edoakes I've done a bunch of manual testing by killing the controller at random points, and everything I've tested works as expected! I also found the logs we currently have to be pretty useful for figuring out at what point the controller crashed, and what state was recovered during controller recovery. (I added one log for recovering config, if apps were deployed through a config)

zcin avatar Jun 20 '23 14:06 zcin

@edoakes Tests look good, the two failing tests are failing on master. The last release test run is here: https://buildkite.com/ray-project/release-tests-pr/builds/42684#_, which also looks good. (Changes since the last run of release tests were purely log statements / adding unit tests)

zcin avatar Jun 20 '23 23:06 zcin