[serve] Regression of serve DAG API
What happened + What you expected to happen
When testing Serve HA, I tested with the PR before it was merged and it was working well.
Today I merged master (after the PR was merged) and the tests show that latency increased and throughput dropped.
I believe some change in the PR introduced a regression. I copied the serve directory from the good image into the master build and it works again.
Versions / Dependencies
master
Reproduction script
With https://github.com/ray-project/ray/pull/27691 applied, it works.
Issue Severity
High: It blocks me from completing my task.
cc @scv119, this is a release blocker
@simon-mo @edoakes could you take a look at this ASAP? Please check the PR to get some ideas of what might be wrong. Please let me know if there is anything I can help with.
Hi @iycheng, do you mind sharing how you observed the latency? (Is there a test script we can use to reproduce it?)
@sihanwang41 Unfortunately, the nightly test script still hasn't been merged yet: https://github.com/ray-project/ray/pull/27413/files
But you can give it a try by running run_gcs_ft_on_k8s.py in ray/release/k8s_tests. You need a k8s cluster to run this.
RAY_IMAGE=DOCKER_IMAGE python ./run_gcs_ft_on_k8s.py
Please let me know if something seems broken; I can test your PR if you don't know how to reproduce it, or we can pair program on this.
I ran it by killing nodes (using run_gcs_ft_on_k8s.py), and:
With the PR:

Without the PR:

This is weird... the commit you are using is inconsistent and definitely has bugs:
- PyObjScanner's self._objects is initialized as a dict, but we end up calling append on it (see the sketch after this list).
- We were actually defaulting to the sync handle in that particular commit (for debugging purposes).
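For illustration only, here is a minimal sketch of that kind of mismatch; the class and method names are hypothetical stand-ins, not the actual PyObjScanner code:

# Hypothetical sketch of the dict-vs-append inconsistency described above.
class ObjScanner:
    def __init__(self):
        # The container is initialized as a dict (e.g. id -> object)...
        self._objects = {}

    def record_buggy(self, obj):
        # ...but later code treats it as a list, which raises
        # AttributeError: 'dict' object has no attribute 'append'.
        self._objects.append(obj)

    def record_fixed(self, obj):
        # A consistent fix: keep the dict and key by object id.
        self._objects[id(obj)] = obj

scanner = ObjScanner()
scanner.record_fixed("payload")    # works
# scanner.record_buggy("payload")  # would raise AttributeError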
Can you try running it with SERVE_DEPLOYMENT_HANDLE_IS_SYNC=1 (async turned off) and with SERVE_DEPLOYMENT_HANDLE_IS_SYNC=0 (the new default behavior)?
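For example, assuming the same run_gcs_ft_on_k8s.py invocation as above (and assuming the test script propagates the environment variable to the Serve processes), the two runs would look roughly like:

SERVE_DEPLOYMENT_HANDLE_IS_SYNC=1 RAY_IMAGE=DOCKER_IMAGE python ./run_gcs_ft_on_k8s.py
SERVE_DEPLOYMENT_HANDLE_IS_SYNC=0 RAY_IMAGE=DOCKER_IMAGE python ./run_gcs_ft_on_k8s.py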
Sure, let me give it a try!
@simon-mo Here are the results:
- SERVE_DEPLOYMENT_HANDLE_IS_SYNC=1
  - the failure rate is around 5%-10%
- SERVE_DEPLOYMENT_HANDLE_IS_SYNC=0
  - the failure rate is very high (25%) and the QPS drops
- With that PR (the PR is just for reference, as the diff between master and the good image)
  - the failure rate is around 1% (with retries, the success rate should be 99.99%)
Hmm. The PR should be equivalent to the first one because of the sync handle block.
Hmmm, my bad, I ran 1) for a longer time and it looks very similar to 3). Maybe we can just disable it temporarily for now?
But how is the sync one working now?
Let me run it for a longer time.
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  123 ray       35  15   17.1g   3.3g  66448 S  84.7 10.7  13:26.72 ray::ServeRepli
 8479 ray       35  15   16.6g   2.8g  65748 S  85.3  9.1  11:10.96 ray::ServeRepli
   82 ray       35  15   11.3g 452964  66016 S  17.7  1.4   2:35.67 ray::HTTPProxyA
   59 ray       20   0   14.2g 183700  70312 S   7.0  0.6   0:39.92 python
    9 ray       20   0 1028732 160068  65088 S   0.0  0.5   0:01.56 ray
  119 ray       35  15   12.7g 157644  65748 S  22.7  0.5   3:05.19 ray::ServeRepli
  120 ray       35  15   12.7g 156904  65516 S  21.7  0.5   3:07.32 ray::ServeRepli
   38 ray       20   0  605476  96948  43508 S   3.3  0.3   0:28.19 python
A separate issue is that Serve has a memory leak after running for a while.
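For reference, a minimal sketch of how replica memory growth could be confirmed over time, assuming psutil is available and using the ServeReplica PIDs from the top output above (a hypothetical helper, not part of the test script):

# Sample RSS of the Serve replica processes periodically to confirm the leak.
import time
import psutil

REPLICA_PIDS = [123, 8479]  # ray::ServeRepli PIDs taken from the top output above

while True:
    for pid in REPLICA_PIDS:
        try:
            rss_mb = psutil.Process(pid).memory_info().rss / 1024 / 1024
            print(f"pid={pid} rss={rss_mb:.1f} MiB")
        except psutil.NoSuchProcess:
            print(f"pid={pid} has exited")
    time.sleep(60)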
I can reproduce the separate issue locally. Looking into it now.
Removing it as a release blocker given that the sync version is somehow working now.
@iycheng can you post an update here after trying #27718?