substrate RPC response time distribution metrics

RPC response time distribution metrics

Open bakhtin opened this issue 1 year ago • 2 comments

I'm running multiple Westend RPC nodes. To get more insight into the runtime, I'm building a Grafana dashboard. Substrate exposes the substrate_rpc_calls_time_bucket metric, which shows the response time distribution of RPC calls. The response time is distributed over 12 buckets, but most of them are empty.

The following query returns calls per second for each bucket, averaged over the last day: sum(rate(substrate_rpc_calls_time_bucket{node=~"westend-rpc-.+"}[1d])) by (le)

More than 99% of the observations exceed the largest finite bucket (le="10") and land only in the +Inf bucket, rendering the metric unusable. For example, a query for the 50th or 95th percentile of the response time (histogram_quantile(0.5, sum(rate(substrate_rpc_calls_time_bucket{node=~"westend-rpc-.+"}[1d])) by (le))) always returns the upper bound of the second-largest bucket (le="10").
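To illustrate why the percentile query gets stuck at le="10", here is a minimal Python sketch of how a quantile is estimated from cumulative histogram buckets. It is a simplified approximation of the behaviour of Prometheus' histogram_quantile, not its actual implementation. The bucket bounds other than 10 and +Inf, and all counts, are made up for illustration.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Simplified sketch: assumes 0 < q <= 1 and that the last bucket is +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (le, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(le):
        # The quantile lies beyond the largest finite bound, so the best
        # available answer is that bound itself.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    prev = buckets[i - 1][1] if i > 0 else 0.0
    if count == prev:
        return le
    # Linear interpolation inside the selected bucket.
    return lower + (le - lower) * (rank - prev) / (count - prev)

# Hypothetical counts: almost everything is slower than the largest finite bound.
skewed = [(1, 2), (5, 5), (10, 8), (math.inf, 1000)]
p50 = histogram_quantile(0.5, skewed)
p95 = histogram_quantile(0.95, skewed)
print(p50, p95)
```

With these hypothetical counts, both the 50th and the 95th percentile come out as 10, which matches the behaviour described above: once the rank falls into the +Inf bucket, every quantile collapses to the largest finite bound.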

I suggest changing the bucket boundaries to: 5, 25, 100, 500, 1000, 2500, 10000, 25000, 100000, 1000000, 10000000, +Inf. This would make the metric more usable in production environments.
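As a rough sketch of how the suggested boundaries would spread observations out, the Python snippet below sorts a handful of response times into cumulative buckets. The sample values are hypothetical, and the unit is whatever the metric records (the issue does not state it). The helper name cumulative_buckets is made up for this example.

```python
import bisect
import math

# Boundaries suggested in this issue; the last bucket is +Inf.
PROPOSED_BOUNDS = [5, 25, 100, 500, 1000, 2500, 10000, 25000,
                   100000, 1000000, 10000000, math.inf]

def cumulative_buckets(samples, bounds):
    """Return (upper_bound, cumulative_count) pairs, Prometheus-style."""
    ordered = sorted(samples)
    # bisect_right gives the number of samples <= the bound.
    return [(le, bisect.bisect_right(ordered, le)) for le in bounds]

# Hypothetical response times, in the metric's own (unspecified) unit.
samples = [3, 40, 800, 1200, 9000, 30000, 250000]

result = cumulative_buckets(samples, PROPOSED_BOUNDS)
for le, count in result:
    print(f"le={le}: {count}")
```

Because the boundaries span several orders of magnitude, values of very different sizes land in different finite buckets instead of all piling up in +Inf.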

Jul 21 '22 13:07 bakhtin

CC @niklasad1

Jul 26 '22 15:07 bkchr

Sure, sounds reasonable to me.

Aug 01 '22 08:08 niklasad1

Closed by #11950

//cc @bakhtin I did what you suggested. It would be good if you could test it in a burn-in or something on Westend :)

Aug 12 '22 15:08 niklasad1
