risingwave icon indicating copy to clipboard operation
risingwave copied to clipboard

feat(batch):add metrics for batch exchange executor

Open ZENOTME opened this issue 2 years ago • 3 comments

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

add 'receive row number' metrics in batch exchange executor. As below, it reflects accumulation of received row number between tasks. The label of this metrics is upstream_id(queryID_stageID_taskID)->downstream_id(queryID_stageID_taskID). image

Checklist

  • [ ] I have written necessary rustdoc comments
  • [ ] I have added necessary unit tests and integration tests
  • [ ] All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

#3832

ZENOTME avatar Aug 11 '22 02:08 ZENOTME

Codecov Report

Merging #4577 (e51b546) into main (e88dd9a) will increase coverage by 0.10%. The diff coverage is 69.83%.

@@            Coverage Diff             @@
##             main    #4577      +/-   ##
==========================================
+ Coverage   74.08%   74.18%   +0.10%     
==========================================
  Files         862      862              
  Lines      129157   130209    +1052     
==========================================
+ Hits        95685    96596     +911     
- Misses      33472    33613     +141     
Flag Coverage Δ
rust 74.18% <69.83%> (+0.10%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/batch/src/execution/grpc_exchange.rs 59.52% <0.00%> (-4.58%) :arrow_down:
src/batch/src/execution/local_exchange.rs 63.88% <0.00%> (-1.83%) :arrow_down:
src/batch/src/executor/row_seq_scan.rs 17.36% <0.00%> (ø)
src/common/src/types/chrono_wrapper.rs 87.57% <0.00%> (ø)
src/common/src/util/value_encoding/error.rs 0.00% <ø> (ø)
...c/compute/src/compute_observer/observer_manager.rs 0.00% <0.00%> (ø)
src/compute/src/server.rs 0.00% <0.00%> (ø)
src/ctl/src/cmd_impl/hummock/list_version.rs 0.00% <0.00%> (ø)
src/ctl/src/cmd_impl/hummock/sst_dump.rs 0.00% <0.00%> (ø)
src/expr/src/expr/expr_binary_nonnull.rs 84.00% <ø> (ø)
... and 103 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov[bot] avatar Aug 11 '22 03:08 codecov[bot]

Following is the code collecting the metrics. The metrics with label "0->0" is the sum of other metrics.

  let downstream_id = {
      format!(
          "{}_{}_{}",
          downstream_id.query_id, downstream_id.stage_id, downstream_id.task_id
      )
  };
  let upstream_id = {
      format!(
          "{}_{}_{}",
          upstream_id.query_id, upstream_id.stage_id, upstream_id.task_id
      )
  };
  metrics
      .as_ref()
      .unwrap()
      .exchange_recv_row_number
      .with_label_values(&[&upstream_id, &downstream_id])
      .inc_by(res.cardinality().try_into().unwrap());
  metrics
      .as_ref()
      .unwrap()
      .exchange_recv_row_number
      .with_label_values(&["0", "0"])
      .inc_by(res.cardinality().try_into().unwrap());

When I run "./risedev slt -p 4566 -d dev './e2e_test/batch/**/*.slt' " , I find the metrics in dashboard is weird. Except the metrics with label "0->0", other metrics are always 0. Is observe way I set is wrong? @BowenXiao1999 image


Already solved

It's the wrong way I write the prometheus expression...Now the expression has been modify and it will reflect the accumulation of the receive row number of task as below. The label is "upstream_id(queryID-stageID-taskID)->downstream_id(queryID-stageID-taskID)". @BowenXiao1999 image

ZENOTME avatar Aug 11 '22 03:08 ZENOTME

Post some pics that describe the metrics?

BowenXiao1999 avatar Aug 12 '22 08:08 BowenXiao1999

  • Now we collect in ExchangeSource. But for some fragments, the source is Table Scan, we do not collect the metrics.

We can collect this metric in Table Scan or collect as a new metric.

ZENOTME avatar Aug 16 '22 08:08 ZENOTME

Most LGTM.

I think the design remains some questions to be answered:

  • Do we need task level metrics or fragment level? The stream metrics have add fragment level. I'm afraid that too many tasks in future may makes the diagram hard to observe.
  • Now we collect in ExchangeSource. But for some fragments, the source is Table Scan, we do not collect the metrics.

@liurenjie1024 PTAL.

I think we need task level metrics, otherwise how to identify data skew?

liurenjie1024 avatar Aug 16 '22 08:08 liurenjie1024

Generally LGTM. We need other PR to manage metrics so that we can clean up metrics per query after query finished.

liurenjie1024 avatar Aug 16 '22 08:08 liurenjie1024