ray [core][nightly] many_actors regression

What happened + What you expected to happen

~~Pretty sure it's due to https://github.com/ray-project/ray/pull/31207~~

Versions / Dependencies

master

Reproduction script

NA

Issue Severity

None

Dec 22 '22 01:12 rickyyx

Ah, actually the regression happens before #31207.

The last non-regressed run: https://buildkite.com/ray-project/release-tests-branch/builds/1257#01851d3e-e5fb-46c6-a471-6c499d6dc70e The first regressed run: https://buildkite.com/ray-project/release-tests-branch/builds/1259#01852c73-f3c3-429c-8a10-ebb90d7dbac7

f17375d3b4 [Datasets] Fix import of Literal for map_batches benchmark w/ Python 3.7 (#31188) c6924372f9 [Datasets] [Hotfix] Remove Arrow 8 upper-bound in ray[data] extra (#31186) 1bfc05a47a Fix typo in a comment (#31187) c91f8615df [Docs/Tune] "getting data in and out of Tune" doc (#30717) 49b1ed180d [RLlib] RLModule: Implement TorchVectorEncoder. (#29875) 9bb62fbe30 [RLlib] Test SampleBatch.get_interceptor functionality. (#31024) d28552b7b6 [RLlib] Fix tf timeline in ray local mode. (#31139) 608276bb96 Simplify logging configuration. (#30863) c03c4c6718 [Tune] Fix AxSearch search space conversion for fixed list hyperparameters (#31088) 24b6a8035a [RLlib] Make ComplexInputNet use ModuleDict to register sub-modules. (#31085) 70b0d039da [RLlib] Fix erroring our when tuning any algo without torch being installed. (#31177) 98fef77328 [runtime env] Support python 3.10 for runtime_env conda (#30970) 69b2506c28 [Dashboard New IA] Add overview cards and collapsible sections to the Dashboard (#31013) 8f499013de [docs] Update use cases to include new blog, cleanup 6334e55c7f [Doc] Reference ray core patterns in docstrings (#30964) d57b582d98 Improving message for access to unknown objects (#28209) 4267ac222b [RLlib] Fix reward collection for OpenSpiel games (#31156) 80d0bc72f7 [Doc] fix the tune's installation instruction in the getting started example (#30280) 66c79f847a Fix tiny typo (#31007) 1829be0a4f [Core] make ray.all work again for static analyzers such as mypy (#31132) f4032d7f84 [Docs] Fix plasma docs related to arrow (#31172) f2e955cfb4 [RLlib] Split torch and tf release learning tests (and use more cost-efficient machines). (#31061) 5d81cf09b1 [Docs] Add newsletter to the Ray documentation landing page (#31164) 20499548e6 [RLlib] Policy mapping fn can not be called with keyword arguments. (#31141)

Dec 23 '22 01:12 rickyyx

Bisected, and it's due to this PR: https://github.com/ray-project/ray/commit/d57b582d9823c5af5c9be1d4da8bb52e00230024 (https://github.com/ray-project/ray/pull/28209)

cc @jjyao

Before the PR: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_6zhjii2rza1e7bsry6s87wuq?command-history-section=command_history This PR: https://console.anyscale.com/o/anyscale-internal/projects/prj_FKRmeV5pA6X72aVscFALNC32/clusters/ses_d41pl1vm6lgbienjn6hivaf9

Dec 23 '22 18:12 rickyyx

I think the root cause is that we added some additional checking of owner in ray.wait and ray.get in the PR: https://github.com/ray-project/ray/blob/110cae0ca9e8a4601af7ff9fd41f796117a79620/src/ray/core_worker/core_worker.cc#L1375-L1382

This is showing up as extra cost in the many_actors because we called ray.wait with multiple obj refs in a loop: https://github.com/ray-project/ray/blob/e2da42cf16efebec0612e4e1009ac8f82420765e/release/benchmarks/distributed/test_many_actors.py#L34-L36

The flamegraph also confirms this:

Revert this PR
With this PR

So we have at least 2 options:

Revert this PR change: worse observability as pointed out in the original PR but no regression on ray.wait with large number of obj refs
Change the test to not call ray.wait in a loop

cc @scv119

Jan 03 '23 21:01 rickyyx

thanks for detailed investigation. I do feel this regression is tolerable since it's all memory access. Also am I understand correct comparing to the added time query owner information; the majority of the time will be spend actually waiting for object to be ready? cc @rkooo567 @iycheng for thought.

Jan 05 '23 00:01 scv119

Could we also optimize the query? It seems we spend a lot of time constructing some rpc::Address protobuf, when we just want to check if the owner is present.

Also, do we have to do the lookup for every object here?

Jan 05 '23 00:01 ericl

Should we reopen in this case? I don't think we optimized the query from the closed PR?

Jan 06 '23 05:01 rkooo567

ray ray copied to clipboard

[core][nightly] many_actors regression

What happened + What you expected to happen

Versions / Dependencies

Reproduction script

Issue Severity

ray
ray copied to clipboard