Export Common Duration Metrics

Open ldb opened this issue 1 year ago • 4 comments
Title: Export Common Duration Metrics

Description: With https://github.com/envoyproxy/envoy/pull/33240 we got the ability to export various commonly used durations via access logs (thank you!!!). However, it would be great if there was a way to also export these as metrics so they can be ingested by Prometheus.

I don't have a specific design in mind right now, but anything that would pre-aggregate these durations would help immensely in ensuring we can easily alert on the performance of downstream, upstream and envoy itself.

ldb avatar May 15 '24 10:05 ldb

@wbpcode about the specific data, @jmarantz as a stats expert

ravenblackx avatar May 16 '24 15:05 ravenblackx

It's not easy to provide a feature like this in our core stats system. Dynamic and flexible stats mean additional memory and additional complexity. (And I think it's complex enough.)

But the good news is that our stats system is extensible. I am okay if we do it in an optional filter (or logger? @kyessenov )

wbpcode avatar May 17 '24 13:05 wbpcode

I filed #30619, which replicates the Istio design with high-cardinality metrics so you can easily do breakdowns by upstream/downstream paths. The general problem is that doing all of this in Envoy would push its stats subsystem beyond its design capabilities, so you still need to run a collector or some stats engine to hold the aggregate data. I'd also recommend using delta aggregation temporality to flush metrics, which Envoy doesn't directly support.

kyessenov avatar May 20 '24 20:05 kyessenov

To be clear, what I am mostly looking for is to have specific metrics available for the kind of deltas that #33240 enables, for example:

ds_rx_duration: '%COMMON_DURATION(DS_RX_BEG:DS_RX_END:ms)%',  // Total duration in milliseconds of the request from the start time to the last byte of the request received from the downstream.
routing_duration: '%COMMON_DURATION(DS_RX_END:US_TX_BEG:ms)%',  // Total duration in milliseconds of the request from the last byte of the request received from the downstream to the first byte of the request sent to the upstream.
us_tx_duration: '%COMMON_DURATION(US_TX_BEG:US_TX_END:ms)%',  // Total duration in milliseconds of the request from the first byte of the request sent to the upstream to the last byte of the request sent to the upstream.
us_rx_duration: '%COMMON_DURATION(US_RX_BEG:US_RX_END:ms)%',  // Total duration in milliseconds of the request from the first byte of the response received from the upstream to the last byte of the response received from the upstream.
ds_tx_duration: '%COMMON_DURATION(DS_TX_BEG:DS_TX_END:ms)%',  // Total duration in milliseconds of the request from the first byte of the response sent to the downstream to the last byte of the response sent to the downstream.

We currently expose these in access logs, but turning the logs into metrics is quite an expensive process if all we are after is some aggregates per cluster / method / status.

These metrics do not need the same granularity (read: cardinality) as the access logs; an aggregation by upstream cluster, HTTP method and HTTP status would already be a very useful start.

I do like the idea of this being added as an optional filter, too. The metrics could be created dynamically and if the set of potential attributes is limited, cardinality should not be a big problem.
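
To illustrate the kind of pre-aggregation meant here, below is a rough sketch in Python of what has to happen outside Envoy today: parsed access log entries are folded into fixed-bucket histograms keyed by cluster, method and status. This is not an Envoy API. The entry keys `cluster`, `method` and `status`, the bucket bounds, and the `DurationAggregator` name are all assumptions made up for the example, as is the handling of `-` for a duration whose time points were not recorded.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds in milliseconds (not taken from Envoy).
BUCKETS_MS = (1, 5, 10, 25, 50, 100, 250, 500, 1000, float("inf"))

# Duration fields as named in the access log format above.
DURATION_FIELDS = (
    "ds_rx_duration",
    "routing_duration",
    "us_tx_duration",
    "us_rx_duration",
    "ds_tx_duration",
)


class DurationAggregator:
    """Folds parsed access log entries into low-cardinality histograms."""

    def __init__(self):
        # (field, cluster, method, status) -> per-bucket observation counts
        self.counts = defaultdict(lambda: [0] * len(BUCKETS_MS))
        # (field, cluster, method, status) -> sum of observed values in ms
        self.sums = defaultdict(float)

    def observe(self, entry):
        # Assumed keys: "cluster", "method", "status" (hypothetical names).
        labels = (entry["cluster"], entry["method"], entry["status"])
        for field in DURATION_FIELDS:
            raw = entry.get(field)
            # Skip durations that are absent or logged as "-" (assumption).
            if raw is None or raw == "-":
                continue
            value = float(raw)
            key = (field,) + labels
            # First bucket whose upper bound is >= value.
            self.counts[key][bisect_left(BUCKETS_MS, value)] += 1
            self.sums[key] += value


agg = DurationAggregator()
agg.observe({
    "cluster": "backend",
    "method": "GET",
    "status": 200,
    "ds_rx_duration": 3,
    "routing_duration": "-",
})
```

With only three label dimensions, the number of series grows as clusters x methods x statuses x duration fields, which is the bounded cardinality argued for above.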

ldb avatar May 21 '24 07:05 ldb

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 20 '24 08:06 github-actions[bot]

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions[bot] avatar Jun 27 '24 16:06 github-actions[bot]