batch: Add metrics for tasks.
Add task level metrics, including incoming number of rows and data size of each exchange source, outgoing number of rows and data size of each output.
My notes for this part:
We can collect metrics for batch executor like StreamingMetrics
in streaming_stats.rs.
The metrics "exchange_recv_size" and "exchange_frag_recv_size" can be a good reference for collecting the accumulated input and output row counts of a task. Their PR: https://github.com/singularity-data/risingwave/pull/3696
But it seems like we cannot get the input recv size for fragments containing a table scan: they do not have an exchange source as input.
cc @ZENOTME
The problem with this registry is that we need to clean it up after the query finishes. There are two places we should clean up:
- metrics in cn
- metrics in prometheus server
For the cleanup job, I think we can implement it using a 'range search', e.g. batch_exchange_recv_row_number{query_id="aaa"} gets all items whose query_id equals "aaa". The Prometheus server supports such a 'range search', but the metrics in the CN do not.
So I think we can clean up the metrics in Prometheus first. Cleaning up the metrics in the CN seems complicated to implement: we need to record all of {queryID, source_stage_id, target_stage_id, source_task_id, target_task_id}. @BowenXiao1999 @liurenjie1024
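To illustrate what a query_id-based 'range search' cleanup would have to do on the CN side, here is a stdlib-only sketch. The `MetricStore` type is a hypothetical stand-in, not the actual prometheus client, which keys series by a hash of the label values and so cannot match on one label alone:

```rust
use std::collections::HashMap;

// Hypothetical in-process metric store: each series is keyed by its
// full label set, e.g. {query_id, task_id}.
struct MetricStore {
    counters: HashMap<Vec<(String, String)>, u64>,
}

impl MetricStore {
    fn new() -> Self {
        Self { counters: HashMap::new() }
    }

    fn inc(&mut self, labels: Vec<(String, String)>, delta: u64) {
        *self.counters.entry(labels).or_insert(0) += delta;
    }

    // The "range search" cleanup: drop every series whose `query_id`
    // label matches, whatever the other labels are. Without this, the
    // caller must remember every exact label combination it created.
    fn delete_by_query_id(&mut self, query_id: &str) -> usize {
        let before = self.counters.len();
        self.counters
            .retain(|labels, _| !labels.iter().any(|(k, v)| k == "query_id" && v == query_id));
        before - self.counters.len()
    }
}
```

The `retain` scan over all series is exactly the bookkeeping the hash-keyed client avoids, which is why deleting by query_id alone is not supported there.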
WDYM by metrics in cn? I think they are all metrics in prometheus
I think metrics in cn is a local data structure like this:
```rust
pub struct BatchMetrics {
    pub row_seq_scan_next_duration: Histogram,
    pub exchange_recv_row_number: GenericCounterVec<AtomicU64>,
}
```
And we will send these metrics to the Prometheus server, which then stores them there. (Is that the "metrics in prometheus server"?)
Metrics are stored in Prometheus server, but they are not sent. Prometheus will pull metrics from each compute node.
What's the difference between metrics in CN and metrics in Prometheus?
> WDYM by metrics in cn? I think they are all metrics in prometheus

> I think metrics in cn is a local data structure like this:
> `pub struct BatchMetrics { pub row_seq_scan_next_duration: Histogram, pub exchange_recv_row_number: GenericCounterVec<AtomicU64>, }`
> And we will send these metrics to the Prometheus server, which then stores them there.
I think calling .delete_label_values will delete all? Metrics in the CN and in Prometheus should both be taken care of by the lib, and the user should not need to care about the details/cache.
> What's the difference between metrics in CN and metrics in Prometheus?
For example, there is a metric in the CN like this (or maybe I should call it a value in the Prometheus client):
```rust
#[derive(Debug)]
pub(crate) struct Metric {
    value: u64,
}
```
> Metrics are stored in Prometheus server, but they are not sent. Prometheus will pull metrics from each compute node.
And as you say, this metric (value) will be pulled by Prometheus and stored in the Prometheus server. The metrics in Prometheus are pulled from the metrics in the CN.
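To make the pull model concrete, here is a small sketch of how a CN could render one of the counters discussed above into the Prometheus text exposition format that the server scrapes. The `render` helper is illustrative, not the actual client code:

```rust
// The CN only renders its local values into the Prometheus text
// exposition format; the server scrapes this output periodically and
// stores the samples itself -- nothing is pushed.
fn render(metric: &str, labels: &[(&str, &str)], value: u64) -> String {
    let pairs: Vec<String> = labels
        .iter()
        .map(|(k, v)| format!("{}=\"{}\"", k, v))
        .collect();
    format!("{}{{{}}} {}", metric, pairs.join(","), value)
}
```

So "metrics in Prometheus" are just scraped copies of these rendered samples; deleting the CN-side value only stops future scrapes from seeing it.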
> I think calling .delete_label_values will delete all? Metrics in the CN and in Prometheus should both be taken care of by the lib, and the user should not need to care about the details/cache.
I looked up the implementation and found that delete_label_values will only delete the metrics in the CN (children.remove(&h)). I'm not sure it will sync with the metrics in Prometheus.
```rust
pub fn delete(&self, labels: &HashMap<&str, &str>) -> Result<()> {
    let h = self.hash_labels(labels)?;
    let mut children = self.children.write();
    // `children` is a HashMap<u64, T>, keyed by the hash of the label values.
    if children.remove(&h).is_none() {
        return Err(Error::Msg(format!("missing labels {:?}", labels)));
    }
    Ok(())
}
```
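A stdlib-only sketch of why this exact-hash lookup forces the caller to remember every full label combination. `hash_labels` is a stand-in for the client's label hashing, not its actual implementation:

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// The children map is addressed by a hash of the complete label-value
// list, so a partial label set hashes to a different key and misses.
fn hash_labels(labels: &[&str]) -> u64 {
    let mut h = DefaultHasher::new();
    labels.hash(&mut h);
    h.finish()
}
```

This is why deleting all series of a finished query requires recording every {queryID, stage, task} tuple that was ever used as labels.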
- We don't need to care about deleting metrics in Prometheus; they will be deleted by the Prometheus server after some preconfigured interval.
- For deleting task-level metrics, here are the changes:
  a. Maintain one `BatchMetrics` for each `BatchTaskExecution`.
  b. When a batch task finishes/aborts, move it to a deletion queue, which executes the deletion of metrics after several minutes. It's important not to delete it immediately, since Prometheus pulls data periodically.
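The deletion queue in step b could be sketched like this. `DeletionQueue`, `PendingDeletion`, and the delay value are assumptions for illustration, not the actual implementation; the caller would periodically drain expired entries and pass their label sets to delete_label_values:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// One pending cleanup: the label set identifying a finished task's series
// (e.g. {query_id, stage_id, task_id}) plus the earliest deletion time.
struct PendingDeletion {
    labels: Vec<String>,
    delete_at: Instant,
}

struct DeletionQueue {
    delay: Duration,
    queue: VecDeque<PendingDeletion>,
}

impl DeletionQueue {
    fn new(delay: Duration) -> Self {
        Self { delay, queue: VecDeque::new() }
    }

    // Called when a batch task finishes or aborts: defer the deletion so
    // Prometheus still gets at least one last scrape of the series.
    fn enqueue(&mut self, labels: Vec<String>, now: Instant) {
        self.queue.push_back(PendingDeletion { labels, delete_at: now + self.delay });
    }

    // Called periodically; returns the label sets whose grace period has
    // expired. Entries are queued in time order, so we can stop at the
    // first one that is not yet due.
    fn drain_expired(&mut self, now: Instant) -> Vec<Vec<String>> {
        let mut expired = Vec::new();
        while let Some(front) = self.queue.front() {
            if front.delete_at <= now {
                expired.push(self.queue.pop_front().unwrap().labels);
            } else {
                break;
            }
        }
        expired
    }
}
```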
Closed