ray icon indicating copy to clipboard operation
ray copied to clipboard

[core][nightly] many_actors regression

Open rickyyx opened this issue 2 years ago • 3 comments

What happened + What you expected to happen

image

~~Pretty sure it's due to https://github.com/ray-project/ray/pull/31207~~

Versions / Dependencies

master

Reproduction script

NA

Issue Severity

None

rickyyx avatar Dec 22 '22 01:12 rickyyx

Ah, actually the regression happens before #31207.

The last non-regressed run: https://buildkite.com/ray-project/release-tests-branch/builds/1257#01851d3e-e5fb-46c6-a471-6c499d6dc70e The first regressed run: https://buildkite.com/ray-project/release-tests-branch/builds/1259#01852c73-f3c3-429c-8a10-ebb90d7dbac7

f17375d3b4 [Datasets] Fix import of Literal for map_batches benchmark w/ Python 3.7 (#31188) c6924372f9 [Datasets] [Hotfix] Remove Arrow 8 upper-bound in ray[data] extra (#31186) 1bfc05a47a Fix typo in a comment (#31187) c91f8615df [Docs/Tune] "getting data in and out of Tune" doc (#30717) 49b1ed180d [RLlib] RLModule: Implement TorchVectorEncoder. (#29875) 9bb62fbe30 [RLlib] Test SampleBatch.get_interceptor functionality. (#31024) d28552b7b6 [RLlib] Fix tf timeline in ray local mode. (#31139) 608276bb96 Simplify logging configuration. (#30863) c03c4c6718 [Tune] Fix AxSearch search space conversion for fixed list hyperparameters (#31088) 24b6a8035a [RLlib] Make ComplexInputNet use ModuleDict to register sub-modules. (#31085) 70b0d039da [RLlib] Fix erroring our when tuning any algo without torch being installed. (#31177) 98fef77328 [runtime env] Support python 3.10 for runtime_env conda (#30970) 69b2506c28 [Dashboard New IA] Add overview cards and collapsible sections to the Dashboard (#31013) 8f499013de [docs] Update use cases to include new blog, cleanup 6334e55c7f [Doc] Reference ray core patterns in docstrings (#30964) d57b582d98 Improving message for access to unknown objects (#28209) 4267ac222b [RLlib] Fix reward collection for OpenSpiel games (#31156) 80d0bc72f7 [Doc] fix the tune's installation instruction in the getting started example (#30280) 66c79f847a Fix tiny typo (#31007) 1829be0a4f [Core] make ray.all work again for static analyzers such as mypy (#31132) f4032d7f84 [Docs] Fix plasma docs related to arrow (#31172) f2e955cfb4 [RLlib] Split torch and tf release learning tests (and use more cost-efficient machines). (#31061) 5d81cf09b1 [Docs] Add newsletter to the Ray documentation landing page (#31164) 20499548e6 [RLlib] Policy mapping fn can not be called with keyword arguments. (#31141)

rickyyx avatar Dec 23 '22 01:12 rickyyx

Bisected, and it's due to this PR: https://github.com/ray-project/ray/commit/d57b582d9823c5af5c9be1d4da8bb52e00230024 (https://github.com/ray-project/ray/pull/28209)

cc @jjyao

Before the PR: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_6zhjii2rza1e7bsry6s87wuq?command-history-section=command_history This PR: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_d41pl1vm6lgbienjn6hivaf9

rickyyx avatar Dec 23 '22 18:12 rickyyx

I think the root cause is that we added some additional checking of owner in ray.wait and ray.get in the PR: https://github.com/ray-project/ray/blob/110cae0ca9e8a4601af7ff9fd41f796117a79620/src/ray/core_worker/core_worker.cc#L1375-L1382

This is showing up as extra cost in the many_actors because we called ray.wait with multiple obj refs in a loop: https://github.com/ray-project/ray/blob/e2da42cf16efebec0612e4e1009ac8f82420765e/release/benchmarks/distributed/test_many_actors.py#L34-L36


The flamegraph also confirms this:

  • Revert this PR image

  • With this PR image

So we have at least 2 options:

  1. Revert this PR change: worse observability as pointed out in the original PR but no regression on ray.wait with large number of obj refs
  2. Change the test to not call ray.wait in a loop

cc @scv119

rickyyx avatar Jan 03 '23 21:01 rickyyx

thanks for detailed investigation. I do feel this regression is tolerable since it's all memory access. Also am I understand correct comparing to the added time query owner information; the majority of the time will be spend actually waiting for object to be ready? cc @rkooo567 @iycheng for thought.

scv119 avatar Jan 05 '23 00:01 scv119

Could we also optimize the query? It seems we spend a lot of time constructing some rpc::Address protobuf, when we just want to check if the owner is present.

Also, do we have to do the lookup for every object here?

ericl avatar Jan 05 '23 00:01 ericl

Should we reopen in this case? I don't think we optimized the query from the closed PR?

rkooo567 avatar Jan 06 '23 05:01 rkooo567