
batch: Add metrics for tasks.

Open liurenjie1024 opened this issue 1 year ago • 1 comments

Add task level metrics, including incoming number of rows and data size of each exchange source, outgoing number of rows and data size of each output.
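As a rough illustration of the counters requested here, the following std-only sketch keeps row and byte totals per exchange source and per output. `TaskMetrics`, `FlowStats`, and the `u32` ids are hypothetical names chosen for this example; they are not RisingWave types.

```rust
use std::collections::HashMap;

/// Row and byte totals for one direction of data flow.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct FlowStats {
    pub rows: u64,
    pub bytes: u64,
}

/// Hypothetical per-task metrics holder (an assumption, not a RisingWave type).
#[derive(Debug, Default)]
pub struct TaskMetrics {
    /// Totals keyed by exchange source id.
    incoming: HashMap<u32, FlowStats>,
    /// Totals keyed by output id.
    outgoing: HashMap<u32, FlowStats>,
}

impl TaskMetrics {
    /// Accumulates rows and bytes received from one exchange source.
    pub fn record_incoming(&mut self, source_id: u32, rows: u64, bytes: u64) {
        let stats = self.incoming.entry(source_id).or_default();
        stats.rows += rows;
        stats.bytes += bytes;
    }

    /// Accumulates rows and bytes written to one output.
    pub fn record_outgoing(&mut self, output_id: u32, rows: u64, bytes: u64) {
        let stats = self.outgoing.entry(output_id).or_default();
        stats.rows += rows;
        stats.bytes += bytes;
    }

    /// Returns the totals for a source, or zeros if nothing was recorded.
    pub fn incoming(&self, source_id: u32) -> FlowStats {
        self.incoming.get(&source_id).copied().unwrap_or_default()
    }

    /// Returns the totals for an output, or zeros if nothing was recorded.
    pub fn outgoing(&self, output_id: u32) -> FlowStats {
        self.outgoing.get(&output_id).copied().unwrap_or_default()
    }
}
```

A real implementation would more likely use labeled Prometheus counters, as discussed later in the thread; the plain map here only shows the shape of the data.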

liurenjie1024 avatar Jul 13 '22 06:07 liurenjie1024

My notes for this part:

We can collect metrics for batch executors the same way StreamingMetrics does in streaming_stats.rs.

The metrics "exchange_recv_size" and "exchange_frag_recv_size" can be a good reference for collecting the accumulated input rows and output rows of a task. Their PR: https://github.com/singularity-data/risingwave/pull/3696

But it seems we cannot get the input recv size for a fragment that contains a table scan: such fragments do not have an exchange source as input.

cc @ZENOTME

BowenXiao1999 avatar Aug 08 '22 09:08 BowenXiao1999

The problem with this registry is that we need to clean it up after a query finishes. There are two places we should clean up:

  • metrics in cn
  • metrics in prometheus server

For the cleanup job, I think we can implement it using a 'range search'. For example, batch_exchange_recv_row_number{query_id="aaa"} matches every item whose query_id equals "aaa". The Prometheus server supports this kind of 'range search', but the metrics in CN do not.

So I think we can clean up the metrics in Prometheus first. Cleaning up the metrics in CN seems complicated to implement: we would need to record every {queryID, source_stage_id, target_stage_id, source_task_id, target_task_id} tuple. @BowenXiao1999 @liurenjie1024
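One way to picture the bookkeeping described above is an index from query id to the label tuples created for that query, so that everything belonging to a finished query can be looked up at once. This is only a sketch: `LabelIndex` and `ExchangeLabels` are hypothetical names, and the `u32` id types are an assumption.

```rust
use std::collections::HashMap;

/// One label tuple, with fields named after the tuple in the comment above.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ExchangeLabels {
    pub query_id: String,
    pub source_stage_id: u32,
    pub target_stage_id: u32,
    pub source_task_id: u32,
    pub target_task_id: u32,
}

/// Hypothetical index from query id to the label tuples recorded for it.
#[derive(Default)]
pub struct LabelIndex {
    by_query: HashMap<String, Vec<ExchangeLabels>>,
}

impl LabelIndex {
    /// Remembers a label tuple the first time it is used.
    pub fn record(&mut self, labels: ExchangeLabels) {
        let entry = self.by_query.entry(labels.query_id.clone()).or_default();
        if !entry.contains(&labels) {
            entry.push(labels);
        }
    }

    /// Removes and returns every label tuple recorded for `query_id`.
    pub fn take_query(&mut self, query_id: &str) -> Vec<ExchangeLabels> {
        self.by_query.remove(query_id).unwrap_or_default()
    }
}
```

When a query finishes, the caller would pass each tuple returned by `take_query` to the metric vector's delete call, which emulates on the compute node the 'range search' by query_id that the Prometheus server already offers.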

ZENOTME avatar Aug 17 '22 05:08 ZENOTME

WDYM by metrics in cn? I think they are all metrics in prometheus

BowenXiao1999 avatar Aug 17 '22 05:08 BowenXiao1999

WDYM by metrics in cn? I think they are all metrics in prometheus

I think metrics in cn is a local data structure like this:

pub struct BatchMetrics {
    pub row_seq_scan_next_duration: Histogram,
    pub exchange_recv_row_number: GenericCounterVec<AtomicU64>,
}

And we will send these metrics to the Prometheus server, which then stores them. (Is that what "metrics in Prometheus server" means?)

ZENOTME avatar Aug 17 '22 05:08 ZENOTME

Metrics are stored in Prometheus server, but they are not sent. Prometheus will pull metrics from each compute node.
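A minimal sketch of the pull model, using only the standard library: the compute node just updates a value in memory, and text is produced only when a scraper asks for it. `LocalCounter` is a hypothetical stand-in for a Prometheus client counter, and `render` emits a single unlabeled sample line in the Prometheus text exposition style.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for a client-side counter held by a compute node.
pub struct LocalCounter {
    name: &'static str,
    value: AtomicU64,
}

impl LocalCounter {
    pub fn new(name: &'static str) -> Self {
        Self { name, value: AtomicU64::new(0) }
    }

    /// Updates the in-memory value only; nothing is sent anywhere.
    pub fn inc_by(&self, n: u64) {
        self.value.fetch_add(n, Ordering::Relaxed);
    }

    /// Called when the scraper requests metrics: renders the current value
    /// as one `name value` line.
    pub fn render(&self) -> String {
        format!("{} {}\n", self.name, self.value.load(Ordering::Relaxed))
    }
}
```

Because the server only sees what `render` returns at scrape time, a value that is deleted locally before the next scrape never reaches the server.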

skyzh avatar Aug 17 '22 05:08 skyzh

What's the difference between metrics in CN and metrics in Prometheus?

skyzh avatar Aug 17 '22 05:08 skyzh

WDYM by metrics in cn? I think they are all metrics in prometheus

I think metrics in cn is a local data structure like this:

pub struct BatchMetrics {
    pub row_seq_scan_next_duration: Histogram,
    pub exchange_recv_row_number: GenericCounterVec<AtomicU64>,
}

And we will send these metrics to the Prometheus server, which then stores them. (Is that what "metrics in Prometheus server" means?)

I think calling .delete_label_values will delete everything? Metrics in CN and in Prometheus should both be taken care of by the library, and users should not need to care about the details or the cache.

BowenXiao1999 avatar Aug 17 '22 06:08 BowenXiao1999

What's the difference between metrics in CN and metrics in Prometheus?

For example, there is a metric in CN like this (or maybe I should call it a value in the Prometheus client):

#[derive(Debug)]
pub(crate) struct Metric {
    value: u64,
}

Metrics are stored in Prometheus server, but they are not sent. Prometheus will pull metrics from each compute node.

And as you say, this value is pulled by Prometheus and stored in the Prometheus server.

The metrics in Prometheus are pulled from the metrics in CN.


I think calling .delete_label_values will delete everything? Metrics in CN and in Prometheus should both be taken care of by the library, and users should not need to care about the details or the cache.

I looked up the implementation and found that delete_label_values only deletes the metrics in CN (children.remove(&h)). I'm not sure whether it syncs with the metrics in Prometheus.

pub fn delete(&self, labels: &HashMap<&str, &str>) -> Result<()> {
    let h = self.hash_labels(labels)?;

    // `children` is a HashMap<u64, T> held in the compute node's memory,
    // so the removal below only affects the local copy.
    let mut children = self.children.write();
    if children.remove(&h).is_none() {
        return Err(Error::Msg(format!("missing labels {:?}", labels)));
    }

    Ok(())
}

ZENOTME avatar Aug 17 '22 06:08 ZENOTME

  1. We don't need to care about deleting metrics in Prometheus; the Prometheus server will delete them after a preconfigured interval.
  2. For deleting task-level metrics, here are the changes:
     a. Maintain one BatchMetrics for each BatchTaskExecution.
     b. When a batch task finishes or is aborted, move it to a deletion queue, which deletes its metrics after several minutes. It's important not to delete them immediately, since Prometheus pulls data periodically.
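The deletion queue in step 2 could look roughly like the sketch below. It is an assumption-laden illustration, not the actual RisingWave change: `DeletionQueue` is a hypothetical name, the delay is whatever the caller picks, and items are assumed to be pushed in the order their tasks finish, so the oldest entry is always at the front.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical queue that holds finished tasks until their metrics may be deleted.
pub struct DeletionQueue<T> {
    /// How long to keep metrics after a task finishes.
    delay: Duration,
    /// (finished_at, item) pairs, oldest first.
    pending: VecDeque<(Instant, T)>,
}

impl<T> DeletionQueue<T> {
    pub fn new(delay: Duration) -> Self {
        Self { delay, pending: VecDeque::new() }
    }

    /// Enqueues an item for a task that finished or aborted at `finished_at`.
    pub fn push(&mut self, item: T, finished_at: Instant) {
        self.pending.push_back((finished_at, item));
    }

    /// Removes and returns every item whose delay has elapsed as of `now`.
    pub fn drain_expired(&mut self, now: Instant) -> Vec<T> {
        let mut expired = Vec::new();
        while let Some((finished_at, _)) = self.pending.front() {
            if now.saturating_duration_since(*finished_at) < self.delay {
                break; // The front is the oldest entry, so nothing later is due.
            }
            if let Some((_, item)) = self.pending.pop_front() {
                expired.push(item);
            }
        }
        expired
    }

    /// Number of items still waiting for deletion.
    pub fn len(&self) -> usize {
        self.pending.len()
    }
}
```

A background loop would call `drain_expired(Instant::now())` periodically and delete the metrics of each returned task. The delay should be comfortably longer than the scrape interval so that the final values are pulled at least once before they disappear.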

liurenjie1024 avatar Aug 17 '22 06:08 liurenjie1024

Closed

BowenXiao1999 avatar Sep 22 '22 05:09 BowenXiao1999