
[serve] Regression of serve DAG API

[Open] fishbone opened this issue 2 years ago • 16 comments

What happened + What you expected to happen

While testing Serve HA, I tested with the PR before it was merged and it worked well.

Today I merged master (after the PR was merged), and testing shows that latency increased and throughput dropped.

I believe some change in the PR introduced a regression. I copied the serve directory from the good image into the master build and it works again.

Versions / Dependencies

master

Reproduction script

With https://github.com/ray-project/ray/pull/27691 it works.

Issue Severity

High: It blocks me from completing my task.

fishbone commented Aug 09 '22 03:08

cc @scv119: this is a release blocker.

fishbone commented Aug 09 '22 03:08

@simon-mo @edoakes could you take a look at this ASAP? Please check the PR for ideas about what might be wrong. Let me know if there is anything I can help with.

fishbone commented Aug 09 '22 03:08

> @simon-mo @edoakes could you take a look at this ASAP? Please check the PR for ideas about what might be wrong. Let me know if there is anything I can help with.

Hi @iycheng, do you mind sharing how you observed the latency? (Is there any test script we can use to reproduce it?)

sihanwang41 commented Aug 09 '22 04:08

@sihanwang41 Unfortunately, the nightly test script hasn't been merged yet: https://github.com/ray-project/ray/pull/27413/files

But you can give it a try by running run_gcs_ft_on_k8s.py in ray/release/k8s_tests. You need a k8s cluster to run it:

```
RAY_IMAGE=DOCKER_IMAGE python ./run_gcs_ft_on_k8s.py
```

Please let me know if something seems broken. I can test your PR if you can't reproduce it, or we can pair-program on this.
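For illustration only, here is a minimal sketch of the kind of client-side probe such a test uses to observe latency and failure rate. The endpoint URL and request count are assumptions; the real measurement lives in run_gcs_ft_on_k8s.py.

```python
# Hypothetical probe, not the actual test script: hit a Serve HTTP endpoint
# and report p50/p99 latency plus failure rate.
import time
import requests

URL = "http://localhost:8000/"  # placeholder endpoint (assumption)
N = 1000
failures = 0
latencies = []

for _ in range(N):
    start = time.time()
    try:
        if requests.get(URL, timeout=1.0).status_code != 200:
            failures += 1
    except requests.RequestException:
        failures += 1
    latencies.append(time.time() - start)

latencies.sort()
print(f"p50={latencies[N // 2]:.4f}s  p99={latencies[int(N * 0.99)]:.4f}s  "
      f"failure_rate={failures / N:.2%}")
```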

fishbone commented Aug 09 '22 04:08

I ran it (run_gcs_ft_on_k8s.py) while killing nodes, and:

With the PR:

[screenshot: test results with the PR]

Without the PR:

[screenshot: test results without the PR]

fishbone commented Aug 09 '22 04:08

This is weird... the commit you are using is inconsistent and definitely has bugs:

  • PyObjScanner's self._objects is initialized as a dict, but we end up calling append on it (see the sketch after this list)
  • We were actually defaulting to the sync handle in that particular commit (for debugging purposes)
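A minimal sketch of the first bug pattern (illustrative only; the class name PyObjScanner is from the real code, but the method below is a hypothetical simplification):

```python
# Illustrative sketch of the dict/append mismatch, not the actual
# PyObjScanner source.
class PyObjScanner:
    def __init__(self):
        # _objects is initialized as a dict...
        self._objects = {}

    def record(self, obj):
        # ...but used as if it were a list, so this raises at runtime.
        self._objects.append(obj)

scanner = PyObjScanner()
try:
    scanner.record(object())
except AttributeError as e:
    print(e)  # 'dict' object has no attribute 'append'
```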

simon-mo commented Aug 09 '22 05:08

Can you try running it with SERVE_DEPLOYMENT_HANDLE_IS_SYNC=1 (async turned off) and SERVE_DEPLOYMENT_HANDLE_IS_SYNC=0 (the new default behavior)?
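A minimal sketch of running both configurations back to back, assuming the run_gcs_ft_on_k8s.py script from above and that the variable is picked up from the environment at startup:

```python
# Hypothetical sketch: run the k8s test once per handle mode by passing the
# flag through the environment. Assumes run_gcs_ft_on_k8s.py from above.
import os
import subprocess

for is_sync in ("1", "0"):  # "1" = sync handle, "0" = new async default
    env = dict(os.environ,
               SERVE_DEPLOYMENT_HANDLE_IS_SYNC=is_sync,
               RAY_IMAGE="DOCKER_IMAGE")  # placeholder image name, as above
    subprocess.run(["python", "./run_gcs_ft_on_k8s.py"], env=env, check=True)
```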

simon-mo commented Aug 09 '22 05:08

Sure, let me give it a try!

fishbone commented Aug 09 '22 05:08

@simon-mo Here are the results:

  • SERVE_DEPLOYMENT_HANDLE_IS_SYNC=1
    • the failure rate is around 5%-10%
  • SERVE_DEPLOYMENT_HANDLE_IS_SYNC=0
    • the failure rate is very high (25%) and the QPS drops
  • With that PR (included only as a reference for the diff between master and the good image)
    • the failure rate is around 1% (with a retry, success should be 99.99%; see the retry sketch below)
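On the retry arithmetic: with a ~1% per-attempt failure rate and independent failures, one retry gives roughly 1 - 0.01² = 99.99% success. A hypothetical sketch of such a client-side retry wrapper (not from the test script):

```python
# Hypothetical retry wrapper. With ~1% failure per attempt and independent
# failures, two attempts succeed with probability 1 - 0.01**2 = 99.99%.
import requests

def get_with_retry(url, attempts=2, timeout=1.0):
    last_err = None
    for _ in range(attempts):
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code == 200:
                return resp
        except requests.RequestException as e:
            last_err = e
    raise RuntimeError(f"failed after {attempts} attempts: {last_err}")
```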

fishbone commented Aug 09 '22 06:08

Hmm, the PR should be equivalent to the first case (SERVE_DEPLOYMENT_HANDLE_IS_SYNC=1) because of the sync-handle block.

simon-mo commented Aug 09 '22 06:08

Hmm, my bad: I ran 1) for a longer time and it looks very similar to 3). Maybe we can just disable the async handle temporarily for now?

But how is the sync one working now?

fishbone commented Aug 09 '22 07:08

Let me run it for a longer time.

fishbone commented Aug 09 '22 07:08

```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  123 ray       35  15   17.1g   3.3g  66448 S  84.7  10.7  13:26.72 ray::ServeRepli
 8479 ray       35  15   16.6g   2.8g  65748 S  85.3   9.1  11:10.96 ray::ServeRepli
   82 ray       35  15   11.3g 452964  66016 S  17.7   1.4   2:35.67 ray::HTTPProxyA
   59 ray       20   0   14.2g 183700  70312 S   7.0   0.6   0:39.92 python
    9 ray       20   0 1028732 160068  65088 S   0.0   0.5   0:01.56 ray
  119 ray       35  15   12.7g 157644  65748 S  22.7   0.5   3:05.19 ray::ServeRepli
  120 ray       35  15   12.7g 156904  65516 S  21.7   0.5   3:07.32 ray::ServeRepli
   38 ray       20   0  605476  96948  43508 S   3.3   0.3   0:28.19 python
```

A separate issue: Serve appears to leak memory after running for a while (the ServeReplica processes in the top output above have grown to ~3 GB RSS).
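A hypothetical way to track the suspected leak, not part of the repro: sample the RSS of the replica processes over time, e.g. with psutil. The process-name match below is an assumption based on the truncated names shown by top.

```python
# Hypothetical monitoring snippet: sample the RSS of Ray Serve replica
# processes every minute to confirm steady growth.
import time
import psutil

for _ in range(60):  # sample for an hour
    for proc in psutil.process_iter(["pid", "name", "memory_info"]):
        name = proc.info["name"] or ""
        if "ServeRepli" in name:  # matches the truncated name shown by top
            rss_gb = proc.info["memory_info"].rss / 1e9
            print(f"pid={proc.info['pid']} {name} rss={rss_gb:.2f} GB")
    time.sleep(60)
```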

fishbone commented Aug 09 '22 18:08

I can reproduce the separate issue locally. Looking into it now.

simon-mo commented Aug 09 '22 18:08

Removing this as a release blocker, given that the sync version is somehow working now.

fishbone commented Aug 09 '22 19:08

@iycheng can you update here after trying #27718?

simon-mo commented Aug 10 '22 02:08