ray icon indicating copy to clipboard operation
ray copied to clipboard

[GCS] Add `total_resources` to reduce communication overhead for aggregating alive resources

Open liuxsh9 opened this issue 1 year ago • 0 comments

Why are these changes needed?

Currently, when Ray need to collect total resources of all alive nodes, such as selecting executable operators in Ray data, it first retrieve the node_table which includes dead nodes, then filter out the alive resources. Due to the redundant information in the node_table, this process incurs significant overhead. When using Ray 2.9, @Bye-legumes @nemo9cby and I observed close to 80MB/s of communication overhead while running multiple Ray data jobs in a 100-node cluster.

This PR addresses this problem by adding GLOBAL_LIMITS_UPDATE_INTERVAL_S to reduce the refresh frequency of cluster resources in Ray data. However, reducing redundant information can further decrease communication overhead and improve the scalability.

To achieve this, we introduce a new TotalResources object for aggregating total alive resources to reduce the network overhead of cluster_resources API. The reduction is shown below.

image

Related issue number

Checks

  • [ ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

liuxsh9 avatar Apr 28 '24 09:04 liuxsh9