feat(streaming): introduce experimental multiplexed exchange
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Currently, our model of remote exchange is quite simple: we issue a gRPC streaming request for each upstream/downstream actor pair located on different compute nodes, relying on gRPC's multiplexing mechanism to reuse the same HTTP connection for multiple actor pairs.
We've recently discovered that when there are many jobs with high parallelism, messages (especially Barrier) are duplicated N * N times on the wire. This causes performance issues even when the cluster is idle and only barriers are in flight.
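To make the N * N growth concrete, here is a back-of-the-envelope sketch. It assumes that every upstream actor is paired with every downstream actor and that all pairs are remote; the function name and the parallelism value are hypothetical and only serve as illustration.

```rust
/// Number of barrier copies on the wire per barrier for one exchange edge,
/// under the assumption that each upstream actor sends to each downstream
/// actor over its own gRPC stream.
fn barrier_copies_per_barrier(upstream_actors: u64, downstream_actors: u64) -> u64 {
    upstream_actors * downstream_actors
}

fn main() {
    // Hypothetical parallelism of 64 on both sides of the exchange.
    let copies = barrier_copies_per_barrier(64, 64);
    println!("barrier copies per barrier: {copies}");
}
```

The count grows quadratically with parallelism and is paid on every barrier, regardless of whether any data is flowing.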
Some ideas for optimizing this have been proposed (see https://github.com/risingwavelabs/risingwave/issues/22726). However, implementing them appears to require processing messages from different actor pairs together, rather than letting them run independently as they do today. Thus, this PR introduces a new experimental implementation of multiplexed exchange for all actor pairs between a (fragment, compute-node) pair. Basically, we tag each message with its actor pair as the key, use a single gRPC streaming call to deliver both data and control messages, and then demultiplex the messages to the individual actors on the receiver side.
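The tag-and-demultiplex idea can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: `ActorPair`, `Message`, `Tagged`, and `demultiplex` are hypothetical names, and a plain iterator stands in for the single gRPC stream.

```rust
use std::collections::HashMap;

/// Hypothetical key identifying one upstream/downstream actor pair.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ActorPair {
    upstream: u32,
    downstream: u32,
}

/// Hypothetical stand-in for the data and control messages on the stream.
#[derive(Debug, Clone, PartialEq)]
enum Message {
    Chunk(Vec<i64>),
    Barrier(u64),
}

/// A message tagged with the actor pair it belongs to.
#[derive(Debug, Clone, PartialEq)]
struct Tagged {
    pair: ActorPair,
    message: Message,
}

/// Receiver side: route each tagged message from the shared stream into a
/// per-pair queue. Messages of the same pair keep their relative order.
fn demultiplex(stream: impl IntoIterator<Item = Tagged>) -> HashMap<ActorPair, Vec<Message>> {
    let mut queues: HashMap<ActorPair, Vec<Message>> = HashMap::new();
    for Tagged { pair, message } in stream {
        queues.entry(pair).or_default().push(message);
    }
    queues
}

fn main() {
    let a = ActorPair { upstream: 1, downstream: 10 };
    let b = ActorPair { upstream: 2, downstream: 10 };
    // Messages of both pairs interleaved on one shared stream.
    let shared_stream = vec![
        Tagged { pair: a, message: Message::Chunk(vec![1, 2]) },
        Tagged { pair: b, message: Message::Chunk(vec![3]) },
        Tagged { pair: a, message: Message::Barrier(1) },
        Tagged { pair: b, message: Message::Barrier(1) },
    ];
    let queues = demultiplex(shared_stream);
    assert_eq!(queues[&a], vec![Message::Chunk(vec![1, 2]), Message::Barrier(1)]);
    assert_eq!(queues[&b], vec![Message::Chunk(vec![3]), Message::Barrier(1)]);
}
```

In this sketch each pair still receives its own copy of the barrier; the point is only that all pairs now pass through one place where messages can be processed together.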
In this PR, we focus on functionality and correctness, and avoid introducing any regression in remote exchange. Some config options are introduced so that this can live alongside the original, simple, and validated implementation. We can try optimizations in future PRs.
Checklist
- [x] I have written necessary rustdoc comments.
- [x] I have added necessary unit tests and integration tests.
- [ ] I have added test labels as necessary.
- [ ] I have added fuzzing tests or opened an issue to track them.
- [ ] My PR contains breaking changes.
- [ ] My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
- [ ] I have checked the Release Timeline and Currently Supported Versions to determine which release branches I need to cherry-pick this PR into.
Documentation
- [ ] My PR needs documentation updates.
Release note