[serve] Regression of serve DAG API
What happened + What you expected to happen
When testing Serve HA, I tested with the PR before it was merged and it was working well.
Today I merged master (after the PR was merged) and the tests show that latency increased and throughput dropped.
I believe some change in the PR introduced a regression. I copied the serve directory from the good image into the master build and it works again.
Versions / Dependencies
master
Reproduction script
With https://github.com/ray-project/ray/pull/27691 applied, it works.
Issue Severity
High: It blocks me from completing my task.
cc @scv119, this is a release blocker
@simon-mo @edoakes could you take a look at this ASAP? Please check the PR to get some ideas of what might be wrong. Please let me know if there is anything I can help with.
Hi @iycheng, do you mind sharing how you observed the latency? (Is there a test script we can use to reproduce it?)
@sihanwang41 Unfortunately, the nightly test script still hasn't been merged yet: https://github.com/ray-project/ray/pull/27413/files
But you can give it a try by running run_gcs_ft_on_k8s.py in ray/release/k8s_tests. You need a k8s cluster to run this.
RAY_IMAGE=DOCKER_IMAGE python ./run_gcs_ft_on_k8s.py
Please let me know if something seems broken; I can test your PR if you don't know how to reproduce it, or we can pair program on this.
I ran it by killing nodes (using run_gcs_ft_on_k8s.py), and:
With the PR:

Without the PR:

This is weird... the commit you are using is inconsistent and definitely has bugs:
- PyObjScanner's self._objects is initialized as a dict, but we end up calling append on it (see the sketch after this list).
- We were actually defaulting to the sync handle in that particular commit (for debugging purposes).
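For illustration only, here is a minimal sketch of that kind of mismatch; the class and method names are hypothetical stand-ins, not the actual PyObjScanner code:

# Hypothetical sketch of the dict-vs-append inconsistency described above.
class ObjScanner:
    def __init__(self):
        # The container is initialized as a dict (e.g. id -> object)...
        self._objects = {}

    def record_buggy(self, obj):
        # ...but later code treats it as a list, which raises
        # AttributeError: 'dict' object has no attribute 'append'.
        self._objects.append(obj)

    def record_fixed(self, obj):
        # A consistent fix: keep the dict and key by object id.
        self._objects[id(obj)] = obj

scanner = ObjScanner()
scanner.record_fixed("payload")    # works
# scanner.record_buggy("payload")  # would raise AttributeError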
Can you try running it with SERVE_DEPLOYMENT_HANDLE_IS_SYNC=1 (async turned off) and with SERVE_DEPLOYMENT_HANDLE_IS_SYNC=0 (the new default behavior)?
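For example, assuming the same run_gcs_ft_on_k8s.py invocation as above (and assuming the test script propagates the environment variable to the Serve processes), the two runs would look roughly like:

SERVE_DEPLOYMENT_HANDLE_IS_SYNC=1 RAY_IMAGE=DOCKER_IMAGE python ./run_gcs_ft_on_k8s.py
SERVE_DEPLOYMENT_HANDLE_IS_SYNC=0 RAY_IMAGE=DOCKER_IMAGE python ./run_gcs_ft_on_k8s.py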
Sure, let me give it a try!
@simon-mo Here are the results:
- SERVE_DEPLOYMENT_HANDLE_IS_SYNC=1
  - the failure rate is around 5%-10%
- SERVE_DEPLOYMENT_HANDLE_IS_SYNC=0
  - the failure rate is very high (25%) and the QPS drops
- With that PR (the PR is just for reference, as the diff between master and the good image)
  - the failure rate is around 1% (with retries, the success rate should be 99.99%)
Hmm. The PR should be equivalent to the first one because of the sync handle block.
Hmmm, my bad, I ran 1) for a longer time and it looks very similar to 3). Maybe we can just disable it temporarily for now?
But how is the sync one working now?
Let me run it for a longer time.
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  123 ray       35  15   17.1g   3.3g  66448 S  84.7 10.7  13:26.72 ray::ServeRepli
 8479 ray       35  15   16.6g   2.8g  65748 S  85.3  9.1  11:10.96 ray::ServeRepli
   82 ray       35  15   11.3g 452964  66016 S  17.7  1.4   2:35.67 ray::HTTPProxyA
   59 ray       20   0   14.2g 183700  70312 S   7.0  0.6   0:39.92 python
    9 ray       20   0 1028732 160068  65088 S   0.0  0.5   0:01.56 ray
  119 ray       35  15   12.7g 157644  65748 S  22.7  0.5   3:05.19 ray::ServeRepli
  120 ray       35  15   12.7g 156904  65516 S  21.7  0.5   3:07.32 ray::ServeRepli
   38 ray       20   0  605476  96948  43508 S   3.3  0.3   0:28.19 python
A separate issue is that Serve has a memory leak after running for a while.
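For reference, a minimal sketch of how replica memory growth could be confirmed over time, assuming psutil is available and using the ServeReplica PIDs from the top output above (a hypothetical helper, not part of the test script):

# Sample RSS of the Serve replica processes periodically to confirm the leak.
import time
import psutil

REPLICA_PIDS = [123, 8479]  # ray::ServeRepli PIDs taken from the top output above

while True:
    for pid in REPLICA_PIDS:
        try:
            rss_mb = psutil.Process(pid).memory_info().rss / 1024 / 1024
            print(f"pid={pid} rss={rss_mb:.1f} MiB")
        except psutil.NoSuchProcess:
            print(f"pid={pid} has exited")
    time.sleep(60)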
I can reproduce the separate issue locally. Looking into it now.
Removing it as a release blocker given that the sync version is somehow working now.
@iycheng can you post an update here after trying #27718?