I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

add 'receive row number' metrics in batch exchange executor. As below, it reflects accumulation of received row number between tasks. The label of this metrics is upstream_id(queryID_stageID_taskID)->downstream_id(queryID_stageID_taskID).

Checklist

[ ] I have written necessary rustdoc comments
[ ] I have added necessary unit tests and integration tests
[ ] All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

#3832

Aug 11 '22 02:08 ZENOTME

Codecov Report

Merging #4577 (e51b546) into main (e88dd9a) will increase coverage by 0.10%. The diff coverage is 69.83%.

@@            Coverage Diff             @@
##             main    #4577      +/-   ##
==========================================
+ Coverage   74.08%   74.18%   +0.10%     
==========================================
  Files         862      862              
  Lines      129157   130209    +1052     
==========================================
+ Hits        95685    96596     +911     
- Misses      33472    33613     +141

Flag	Coverage Δ
rust	`74.18% <69.83%> (+0.10%)`	:arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/batch/src/execution/grpc_exchange.rs	`59.52% <0.00%> (-4.58%)`	:arrow_down:
src/batch/src/execution/local_exchange.rs	`63.88% <0.00%> (-1.83%)`	:arrow_down:
src/batch/src/executor/row_seq_scan.rs	`17.36% <0.00%> (ø)`
src/common/src/types/chrono_wrapper.rs	`87.57% <0.00%> (ø)`
src/common/src/util/value_encoding/error.rs	`0.00% <ø> (ø)`
...c/compute/src/compute_observer/observer_manager.rs	`0.00% <0.00%> (ø)`
src/compute/src/server.rs	`0.00% <0.00%> (ø)`
src/ctl/src/cmd_impl/hummock/list_version.rs	`0.00% <0.00%> (ø)`
src/ctl/src/cmd_impl/hummock/sst_dump.rs	`0.00% <0.00%> (ø)`
src/expr/src/expr/expr_binary_nonnull.rs	`84.00% <ø> (ø)`
... and 103 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

Aug 11 '22 03:08 codecov[bot]

Following is the code collecting the metrics. The metrics with label "0->0" is the sum of other metrics.

  let downstream_id = {
      format!(
          "{}_{}_{}",
          downstream_id.query_id, downstream_id.stage_id, downstream_id.task_id
      )
  };
  let upstream_id = {
      format!(
          "{}_{}_{}",
          upstream_id.query_id, upstream_id.stage_id, upstream_id.task_id
      )
  };
  metrics
      .as_ref()
      .unwrap()
      .exchange_recv_row_number
      .with_label_values(&[&upstream_id, &downstream_id])
      .inc_by(res.cardinality().try_into().unwrap());
  metrics
      .as_ref()
      .unwrap()
      .exchange_recv_row_number
      .with_label_values(&["0", "0"])
      .inc_by(res.cardinality().try_into().unwrap());

When I run "./risedev slt -p 4566 -d dev './e2e_test/batch/**/*.slt' " , I find the metrics in dashboard is weird. Except the metrics with label "0->0", other metrics are always 0. Is observe way I set is wrong? @BowenXiao1999

Already solved

It's the wrong way I write the prometheus expression...Now the expression has been modify and it will reflect the accumulation of the receive row number of task as below. The label is "upstream_id(queryID-stageID-taskID)->downstream_id(queryID-stageID-taskID)". @BowenXiao1999

Aug 11 '22 03:08 ZENOTME

Post some pics that describe the metrics?

Aug 12 '22 08:08 BowenXiao1999

Now we collect in ExchangeSource. But for some fragments, the source is Table Scan, we do not collect the metrics.

We can collect this metric in Table Scan or collect as a new metric.

Aug 16 '22 08:08 ZENOTME

Most LGTM.

I think the design remains some questions to be answered:

Do we need task level metrics or fragment level? The stream metrics have add fragment level. I'm afraid that too many tasks in future may makes the diagram hard to observe.

Now we collect in ExchangeSource. But for some fragments, the source is Table Scan, we do not collect the metrics.

@liurenjie1024 PTAL.

I think we need task level metrics, otherwise how to identify data skew?

Aug 16 '22 08:08 liurenjie1024

Generally LGTM. We need other PR to manage metrics so that we can clean up metrics per query after query finished.

Aug 16 '22 08:08 liurenjie1024

risingwave
risingwave copied to clipboard

feat(batch):add metrics for batch exchange executor

What's changed and what's your intention?

Checklist

Refer to a related PR or issue link (optional)

Codecov Report

Already solved

risingwave risingwave copied to clipboard

feat(batch):add metrics for batch exchange executor

What's changed and what's your intention?

Checklist

Refer to a related PR or issue link (optional)

Codecov Report

Already solved

risingwave
risingwave copied to clipboard