
RPC response time distribution metrics

Open bakhtin opened this issue 1 year ago • 2 comments

I'm running multiple Westend RPC nodes. To get more insight into the runtime, I'm building a Grafana dashboard. Substrate exposes the substrate_rpc_calls_time_bucket metric, which shows the response time distribution of RPC calls. The response time is distributed over 12 buckets, but most of them are empty.

Querying calls per second over the last day, grouped by bucket (sum(rate(substrate_rpc_calls_time_bucket{node=~"westend-rpc-.+"}[1d])) by (le)), returns the following: [image]

>99% of the data falls into the +Inf bucket, rendering the metric unusable. For example, a query for the 50th or 95th percentile of the response time (histogram_quantile(0.5, sum(rate(substrate_rpc_calls_time_bucket{node=~"westend-rpc-.+"}[1d])) by (le))) always returns the upper bound of the second-largest bucket (le="10"): [image]

I'd suggest changing the distribution buckets to: 5,25,100,500,1000,2500,10000,25000,100000,1000000,10000000,+Inf. That would make the metric more usable in production environments.
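As a rough sketch of how observations would spread over the suggested bounds, the Python snippet below counts samples per bucket the way a cumulative histogram does. The sample values are hypothetical and only there to show the mechanics; I'm not asserting anything about real Westend latencies or the unit of the metric.

```python
import math
from bisect import bisect_left

# Bucket upper bounds as suggested above.
PROPOSED_BOUNDS = [5, 25, 100, 500, 1000, 2500, 10000, 25000,
                   100000, 1000000, 10000000, math.inf]


def cumulative_counts(samples, bounds):
    """Return one cumulative count per `le` bound."""
    per_bucket = [0] * len(bounds)
    for value in samples:
        # bisect_left finds the first bound >= value, i.e. `le` semantics.
        per_bucket[bisect_left(bounds, value)] += 1
    running, out = 0, []
    for n in per_bucket:
        running += n
        out.append(running)
    return out


# Hypothetical response times, in the same unit as the bounds.
samples = [3, 40, 180, 750, 750, 2200, 9000, 60000]
counts = cumulative_counts(samples, PROPOSED_BOUNDS)
for bound, count in zip(PROPOSED_BOUNDS, counts):
    print(f'le="{bound}": {count}')
```

With bounds spanning several orders of magnitude, samples like these land in finite buckets rather than in +Inf, so histogram_quantile has something to interpolate over.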

bakhtin avatar Jul 21 '22 13:07 bakhtin

CC @niklasad1

bkchr avatar Jul 26 '22 15:07 bkchr

Sure, sounds reasonable to me.

niklasad1 avatar Aug 01 '22 08:08 niklasad1

Closed by #11950

//cc @bakhtin I did what you suggested. It would be good if you could test it in a burn-in or something on Westend :)

niklasad1 avatar Aug 12 '22 15:08 niklasad1