risingwave
risingwave copied to clipboard
feat(batch):add metrics for batch exchange executor
I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.
What's changed and what's your intention?
add 'receive row number' metrics in batch exchange executor.
As below, it reflects accumulation of received row number between tasks.
The label of this metrics is upstream_id(queryID_stageID_taskID)->downstream_id(queryID_stageID_taskID).
Checklist
- [ ] I have written necessary rustdoc comments
- [ ] I have added necessary unit tests and integration tests
- [ ] All checks passed in
./risedev check
(or alias,./risedev c
)
Refer to a related PR or issue link (optional)
#3832
Codecov Report
Merging #4577 (e51b546) into main (e88dd9a) will increase coverage by
0.10%
. The diff coverage is69.83%
.
@@ Coverage Diff @@
## main #4577 +/- ##
==========================================
+ Coverage 74.08% 74.18% +0.10%
==========================================
Files 862 862
Lines 129157 130209 +1052
==========================================
+ Hits 95685 96596 +911
- Misses 33472 33613 +141
Flag | Coverage Δ | |
---|---|---|
rust | 74.18% <69.83%> (+0.10%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
Impacted Files | Coverage Δ | |
---|---|---|
src/batch/src/execution/grpc_exchange.rs | 59.52% <0.00%> (-4.58%) |
:arrow_down: |
src/batch/src/execution/local_exchange.rs | 63.88% <0.00%> (-1.83%) |
:arrow_down: |
src/batch/src/executor/row_seq_scan.rs | 17.36% <0.00%> (ø) |
|
src/common/src/types/chrono_wrapper.rs | 87.57% <0.00%> (ø) |
|
src/common/src/util/value_encoding/error.rs | 0.00% <ø> (ø) |
|
...c/compute/src/compute_observer/observer_manager.rs | 0.00% <0.00%> (ø) |
|
src/compute/src/server.rs | 0.00% <0.00%> (ø) |
|
src/ctl/src/cmd_impl/hummock/list_version.rs | 0.00% <0.00%> (ø) |
|
src/ctl/src/cmd_impl/hummock/sst_dump.rs | 0.00% <0.00%> (ø) |
|
src/expr/src/expr/expr_binary_nonnull.rs | 84.00% <ø> (ø) |
|
... and 103 more |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
Following is the code collecting the metrics. The metrics with label "0->0" is the sum of other metrics.
let downstream_id = {
format!(
"{}_{}_{}",
downstream_id.query_id, downstream_id.stage_id, downstream_id.task_id
)
};
let upstream_id = {
format!(
"{}_{}_{}",
upstream_id.query_id, upstream_id.stage_id, upstream_id.task_id
)
};
metrics
.as_ref()
.unwrap()
.exchange_recv_row_number
.with_label_values(&[&upstream_id, &downstream_id])
.inc_by(res.cardinality().try_into().unwrap());
metrics
.as_ref()
.unwrap()
.exchange_recv_row_number
.with_label_values(&["0", "0"])
.inc_by(res.cardinality().try_into().unwrap());
When I run "./risedev slt -p 4566 -d dev './e2e_test/batch/**/*.slt' " , I find the metrics in dashboard is weird. Except the metrics with label "0->0", other metrics are always 0. Is observe way I set is wrong? @BowenXiao1999
Already solved
It's the wrong way I write the prometheus expression...Now the expression has been modify and it will reflect the accumulation of the receive row number of task as below. The label is "upstream_id(queryID-stageID-taskID)->downstream_id(queryID-stageID-taskID)". @BowenXiao1999
Post some pics that describe the metrics?
- Now we collect in ExchangeSource. But for some fragments, the source is Table Scan, we do not collect the metrics.
We can collect this metric in Table Scan or collect as a new metric.
Most LGTM.
I think the design remains some questions to be answered:
- Do we need task level metrics or fragment level? The stream metrics have add fragment level. I'm afraid that too many tasks in future may makes the diagram hard to observe.
- Now we collect in ExchangeSource. But for some fragments, the source is Table Scan, we do not collect the metrics.
@liurenjie1024 PTAL.
I think we need task level metrics, otherwise how to identify data skew?
Generally LGTM. We need other PR to manage metrics so that we can clean up metrics per query after query finished.