ray
ray copied to clipboard
[core][nightly] many_actors regression
What happened + What you expected to happen
~~Pretty sure it's due to https://github.com/ray-project/ray/pull/31207~~
Versions / Dependencies
master
Reproduction script
NA
Issue Severity
None
Ah, actually the regression happens before #31207.
The last non-regressed run: https://buildkite.com/ray-project/release-tests-branch/builds/1257#01851d3e-e5fb-46c6-a471-6c499d6dc70e The first regressed run: https://buildkite.com/ray-project/release-tests-branch/builds/1259#01852c73-f3c3-429c-8a10-ebb90d7dbac7
f17375d3b4 [Datasets] Fix import of Literal for map_batches benchmark w/ Python 3.7 (#31188)
c6924372f9 [Datasets] [Hotfix] Remove Arrow 8 upper-bound in ray[data] extra (#31186)
1bfc05a47a Fix typo in a comment (#31187)
c91f8615df [Docs/Tune] "getting data in and out of Tune" doc (#30717)
49b1ed180d [RLlib] RLModule: Implement TorchVectorEncoder. (#29875)
9bb62fbe30 [RLlib] Test SampleBatch.get_interceptor functionality. (#31024)
d28552b7b6 [RLlib] Fix tf timeline in ray local mode. (#31139)
608276bb96 Simplify logging configuration. (#30863)
c03c4c6718 [Tune] Fix AxSearch search space conversion for fixed list hyperparameters (#31088)
24b6a8035a [RLlib] Make ComplexInputNet use ModuleDict to register sub-modules. (#31085)
70b0d039da [RLlib] Fix erroring our when tuning any algo without torch being installed. (#31177)
98fef77328 [runtime env] Support python 3.10 for runtime_env conda (#30970)
69b2506c28 [Dashboard New IA] Add overview cards and collapsible sections to the Dashboard (#31013)
8f499013de [docs] Update use cases to include new blog, cleanup
6334e55c7f [Doc] Reference ray core patterns in docstrings (#30964)
d57b582d98 Improving message for access to unknown objects (#28209)
4267ac222b [RLlib] Fix reward collection for OpenSpiel games (#31156)
80d0bc72f7 [Doc] fix the tune's installation instruction in the getting started example (#30280)
66c79f847a Fix tiny typo (#31007)
1829be0a4f [Core] make ray.all work again for static analyzers such as mypy (#31132)
f4032d7f84 [Docs] Fix plasma docs related to arrow (#31172)
f2e955cfb4 [RLlib] Split torch and tf release learning tests (and use more cost-efficient machines). (#31061)
5d81cf09b1 [Docs] Add newsletter to the Ray documentation landing page (#31164)
20499548e6 [RLlib] Policy mapping fn can not be called with keyword arguments. (#31141)
Bisected, and it's due to this PR: https://github.com/ray-project/ray/commit/d57b582d9823c5af5c9be1d4da8bb52e00230024 (https://github.com/ray-project/ray/pull/28209)
cc @jjyao
Before the PR: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_6zhjii2rza1e7bsry6s87wuq?command-history-section=command_history This PR: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_d41pl1vm6lgbienjn6hivaf9
I think the root cause is that we added some additional checking of owner in ray.wait and ray.get in the PR:
https://github.com/ray-project/ray/blob/110cae0ca9e8a4601af7ff9fd41f796117a79620/src/ray/core_worker/core_worker.cc#L1375-L1382
This is showing up as extra cost in the many_actors because we called ray.wait with multiple obj refs in a loop:
https://github.com/ray-project/ray/blob/e2da42cf16efebec0612e4e1009ac8f82420765e/release/benchmarks/distributed/test_many_actors.py#L34-L36
The flamegraph also confirms this:
-
Revert this PR

-
With this PR

So we have at least 2 options:
- Revert this PR change: worse observability as pointed out in the original PR but no regression on
ray.waitwith large number of obj refs - Change the test to not call
ray.waitin a loop
cc @scv119
thanks for detailed investigation. I do feel this regression is tolerable since it's all memory access. Also am I understand correct comparing to the added time query owner information; the majority of the time will be spend actually waiting for object to be ready? cc @rkooo567 @iycheng for thought.
Could we also optimize the query? It seems we spend a lot of time constructing some rpc::Address protobuf, when we just want to check if the owner is present.
Also, do we have to do the lookup for every object here?
Should we reopen in this case? I don't think we optimized the query from the closed PR?