batch: Add metrics for tasks.
Add task level metrics, including incoming number of rows and data size of each exchange source, outgoing number of rows and data size of each output.
My notes for this part:
We can collect metrics for batch executor like StreamingMetrics
in streaming_stats.rs.
The metrics "exchange_recv_size" and "exchange_frag_recv_size" can be a good reference for collecting the accumulated input and output row counts of a task. Their PR: https://github.com/singularity-data/risingwave/pull/3696
But it seems like we cannot get the input recv size for fragments containing a table scan: they do not have an exchange source as input.
cc @ZENOTME
The problem with this registry is that we need to clean it up after the query finishes. There are two places we should clean up:
- metrics in cn
- metrics in prometheus server
For the cleanup job, I think we can implement it using a 'range search', e.g. batch_exchange_recv_row_number{query_id="aaa"} gets all items whose query_id equals "aaa". The Prometheus server supports such a 'range search', but the metrics in the CN do not.
So I think we can clean up the metrics in Prometheus first. Cleaning up the metrics in the CN seems complicated to implement: we need to record all of {queryID, source_stage_id, target_stage_id, source_task_id, target_task_id}. @BowenXiao1999 @liurenjie1024
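To illustrate what a query_id-based 'range search' cleanup would have to do on the CN side, here is a stdlib-only sketch. The `MetricStore` type is a hypothetical stand-in, not the actual prometheus client, which keys series by a hash of the label values and so cannot match on one label alone:

```rust
use std::collections::HashMap;

// Hypothetical in-process metric store: each series is keyed by its
// full label set, e.g. {query_id, task_id}.
struct MetricStore {
    counters: HashMap<Vec<(String, String)>, u64>,
}

impl MetricStore {
    fn new() -> Self {
        Self { counters: HashMap::new() }
    }

    fn inc(&mut self, labels: Vec<(String, String)>, delta: u64) {
        *self.counters.entry(labels).or_insert(0) += delta;
    }

    // The "range search" cleanup: drop every series whose `query_id`
    // label matches, whatever the other labels are. Without this, the
    // caller must remember every exact label combination it created.
    fn delete_by_query_id(&mut self, query_id: &str) -> usize {
        let before = self.counters.len();
        self.counters
            .retain(|labels, _| !labels.iter().any(|(k, v)| k == "query_id" && v == query_id));
        before - self.counters.len()
    }
}
```

The `retain` scan over all series is exactly the bookkeeping the hash-keyed client avoids, which is why deleting by query_id alone is not supported there.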
WDYM by metrics in cn? I think they are all metrics in prometheus
I think metrics in cn is a local data structure like this:
```rust
pub struct BatchMetrics {
    pub row_seq_scan_next_duration: Histogram,
    pub exchange_recv_row_number: GenericCounterVec<AtomicU64>,
}
```
And we will send these metrics to the Prometheus server, which then stores them there. (Is that the "metrics in prometheus server"?)
Metrics are stored in Prometheus server, but they are not sent. Prometheus will pull metrics from each compute node.
What's the difference between metrics in CN and metrics in Prometheus?
> WDYM by metrics in cn? I think they are all metrics in prometheus

> I think metrics in cn is a local data structure like this:
> `pub struct BatchMetrics { pub row_seq_scan_next_duration: Histogram, pub exchange_recv_row_number: GenericCounterVec<AtomicU64>, }`
> And we will send these metrics to the Prometheus server, which then stores them there.
I think calling .delete_label_values will delete all? Metrics in the CN and in Prometheus should both be taken care of by the lib, and the user should not need to care about the details/cache.
> What's the difference between metrics in CN and metrics in Prometheus?
For example, there is a metric in the CN like this (or maybe I should call it a value in the Prometheus client):
```rust
#[derive(Debug)]
pub(crate) struct Metric {
    value: u64,
}
```
> Metrics are stored in Prometheus server, but they are not sent. Prometheus will pull metrics from each compute node.
And as you say, this metric (value) will be pulled by Prometheus and stored in the Prometheus server. The metrics in Prometheus are pulled from the metrics in the CN.
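To make the pull model concrete, here is a small sketch of how a CN could render one of the counters discussed above into the Prometheus text exposition format that the server scrapes. The `render` helper is illustrative, not the actual client code:

```rust
// The CN only renders its local values into the Prometheus text
// exposition format; the server scrapes this output periodically and
// stores the samples itself -- nothing is pushed.
fn render(metric: &str, labels: &[(&str, &str)], value: u64) -> String {
    let pairs: Vec<String> = labels
        .iter()
        .map(|(k, v)| format!("{}=\"{}\"", k, v))
        .collect();
    format!("{}{{{}}} {}", metric, pairs.join(","), value)
}
```

So "metrics in Prometheus" are just scraped copies of these rendered samples; deleting the CN-side value only stops future scrapes from seeing it.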
> I think calling .delete_label_values will delete all? Metrics in the CN and in Prometheus should both be taken care of by the lib, and the user should not need to care about the details/cache.
I looked up the implementation and found that delete_label_values will only delete the metrics in the CN (children.remove(&h)). I'm not sure it will sync with the metrics in Prometheus.
```rust
pub fn delete(&self, labels: &HashMap<&str, &str>) -> Result<()> {
    let h = self.hash_labels(labels)?;
    let mut children = self.children.write();
    // `children` is a HashMap<u64, T>, keyed by the hash of the label values.
    if children.remove(&h).is_none() {
        return Err(Error::Msg(format!("missing labels {:?}", labels)));
    }
    Ok(())
}
```
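A stdlib-only sketch of why this exact-hash lookup forces the caller to remember every full label combination. `hash_labels` is a stand-in for the client's label hashing, not its actual implementation:

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// The children map is addressed by a hash of the complete label-value
// list, so a partial label set hashes to a different key and misses.
fn hash_labels(labels: &[&str]) -> u64 {
    let mut h = DefaultHasher::new();
    labels.hash(&mut h);
    h.finish()
}
```

This is why deleting all series of a finished query requires recording every {queryID, stage, task} tuple that was ever used as labels.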
- We don't need to care about deleting metrics in Prometheus; they will be deleted by the Prometheus server after some preconfigured interval.
- For deleting task-level metrics, here are the changes:
  a. Maintain one `BatchMetrics` for each `BatchTaskExecution`.
  b. When a batch task finishes/aborts, move it to a deletion queue, which executes the deletion of metrics after several minutes. It's important not to delete it immediately, since Prometheus pulls data periodically.
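The deletion queue in step b could be sketched like this. `DeletionQueue`, `PendingDeletion`, and the delay value are assumptions for illustration, not the actual implementation; the caller would periodically drain expired entries and pass their label sets to delete_label_values:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// One pending cleanup: the label set identifying a finished task's series
// (e.g. {query_id, stage_id, task_id}) plus the earliest deletion time.
struct PendingDeletion {
    labels: Vec<String>,
    delete_at: Instant,
}

struct DeletionQueue {
    delay: Duration,
    queue: VecDeque<PendingDeletion>,
}

impl DeletionQueue {
    fn new(delay: Duration) -> Self {
        Self { delay, queue: VecDeque::new() }
    }

    // Called when a batch task finishes or aborts: defer the deletion so
    // Prometheus still gets at least one last scrape of the series.
    fn enqueue(&mut self, labels: Vec<String>, now: Instant) {
        self.queue.push_back(PendingDeletion { labels, delete_at: now + self.delay });
    }

    // Called periodically; returns the label sets whose grace period has
    // expired. Entries are queued in time order, so we can stop at the
    // first one that is not yet due.
    fn drain_expired(&mut self, now: Instant) -> Vec<Vec<String>> {
        let mut expired = Vec::new();
        while let Some(front) = self.queue.front() {
            if front.delete_at <= now {
                expired.push(self.queue.pop_front().unwrap().labels);
            } else {
                break;
            }
        }
        expired
    }
}
```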
Closed