[GCS] Add `total_resources` to reduce communication overhead for aggregating alive resources
Why are these changes needed?
Currently, when Ray needs to collect the total resources of all alive nodes, for example when selecting executable operators in Ray Data, it first retrieves the node_table, which includes dead nodes, and then filters it down to the resources of the alive nodes. Because the node_table carries a lot of redundant information, this process incurs significant overhead. When using Ray 2.9, @Bye-legumes @nemo9cby and I observed close to 80MB/s of communication overhead while running multiple Ray Data jobs in a 100-node cluster.
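To make the filter-then-sum pattern described above concrete, here is a minimal sketch in plain Python. It does not call Ray; the node records and the field names `node_id`, `alive`, and `resources` are assumptions chosen for illustration and do not reflect the exact node_table schema.

```python
from collections import defaultdict

# Hypothetical stand-in for the node table: one record per node,
# including nodes that are already dead.
node_table = [
    {"node_id": "a", "alive": True, "resources": {"CPU": 8.0, "GPU": 1.0}},
    {"node_id": "b", "alive": False, "resources": {"CPU": 8.0}},
    {"node_id": "c", "alive": True, "resources": {"CPU": 16.0}},
]


def total_alive_resources(nodes):
    """Sum resources over alive nodes only, skipping dead entries."""
    totals = defaultdict(float)
    for node in nodes:
        if not node["alive"]:
            # Dead nodes were still transferred, only to be discarded here.
            continue
        for name, amount in node["resources"].items():
            totals[name] += amount
    return dict(totals)


totals = total_alive_resources(node_table)
```

In this sketch the caller receives every record, including the dead node `b`, just to compute one small dictionary. Returning only the aggregated totals would avoid shipping the per-node records at all.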
This PR addresses this problem by adding GLOBAL_LIMITS_UPDATE_INTERVAL_S to reduce the refresh frequency of cluster resources in Ray Data. However, removing the redundant information can further decrease communication overhead and improve scalability.
To achieve this, we introduce a new TotalResources object that aggregates the total alive resources, reducing the network overhead of the cluster_resources API. The reduction is shown below.
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(