skywalking
[Feature] Implement pre-aggregation on data nodes
Search before asking
- [x] I have searched the issues and found no similar feature request.
Description
Problem
Currently, raw data points from multiple replicas are transported to the liaison node, which then deduplicates and aggregates them. This creates a performance bottleneck: all raw data must cross the network before any processing occurs, which increases latency and network overhead.
Proposed Solution
Implement a pre-aggregation mechanism in which the data nodes holding the replicas perform an initial aggregation before sending results to the liaison node. Only partial aggregates, rather than raw data points, are then transferred, which should reduce network traffic and improve query performance.
Implementation Requirements
All Replica Selection:
- Ensure the same replica is consistently chosen as the default result
- Handle replica availability and failover scenarios gracefully
Pre-aggregation on Data Nodes:
- Implement aggregation logic on the selected primary replica
- Support common aggregation operations (sum, count, mean, min, max, etc.)
- Ensure partial aggregation results can be properly combined at the liaison node
- Maintain compatibility with existing deduplication mechanisms
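As a rough illustration of the "partial results can be properly combined" requirement, here is a minimal Python sketch of a mergeable partial aggregate. The `Partial` type and its field names are hypothetical and are not taken from the project's code. The idea is only that each data node ships sum, count, min and max, and the liaison derives the mean from the merged sum and count.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Partial:
    """Hypothetical mergeable partial aggregate produced by one data node."""
    total: float = 0.0
    count: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf

    @classmethod
    def from_points(cls, values):
        # Aggregate raw data points locally on the data node.
        values = list(values)
        if not values:
            return cls()
        return cls(sum(values), len(values), min(values), max(values))

    def merge(self, other):
        # Combining two partials is associative and commutative,
        # so the liaison can merge them in any order.
        return Partial(
            self.total + other.total,
            self.count + other.count,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    def mean(self):
        # The mean is derived only at the end, from the merged sum and count.
        return self.total / self.count if self.count else None


node_a = Partial.from_points([1.0, 2.0, 3.0])
node_b = Partial.from_points([10.0])
combined = node_a.merge(node_b)
```

An empty `Partial()` acts as the identity for `merge`, so a node with no matching data can still return a valid result.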
Use case
No response
Related issues
No response
Are you willing to submit a pull request to implement this on your own?
- [ ] Yes I am willing to submit a pull request on my own!
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Please assign to me
As a result of the design review:
- Each data node will generate a preliminary aggregated result to send to the liaison node, which will handle deduplication and perform the final aggregation.
- A new distributed query plan strategy will be introduced to support semi-aggregated results.
- The mean/average function presents extra challenges and should be prioritized for implementation.
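To make the mean/average concern concrete, here is a small Python sketch of what the liaison-side step could look like. It rests on assumptions that are not stated in this issue: partials arrive as `(shard_id, total, count)` tuples, replicas of the same shard report identical partials, and deduplication keeps the first partial seen per shard. The function name `final_mean` is hypothetical.

```python
def final_mean(partials):
    """Deduplicate per-shard partials, then compute the overall mean.

    partials: iterable of (shard_id, total, count) tuples (assumed shape).
    """
    by_shard = {}
    for shard_id, total, count in partials:
        # Assumption: replicas of one shard agree, so keep the first one seen.
        by_shard.setdefault(shard_id, (total, count))
    total = sum(t for t, _ in by_shard.values())
    count = sum(c for _, c in by_shard.values())
    return total / count if count else None


received = [
    ("shard-0", 6.0, 3),   # replica 1 of shard-0
    ("shard-0", 6.0, 3),   # replica 2 of shard-0, a duplicate
    ("shard-1", 10.0, 1),
]
result = final_mean(received)

# Averaging the per-shard means instead gives a different, incorrect answer.
naive = (6.0 / 3 + 10.0 / 1) / 2
```

Here `result` is 16 / 4 = 4.0, while the average of per-shard averages is 6.0. This is why the data nodes would need to send sum and count rather than a finished mean.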
I think this would affect queries without a limit, right?