RPC response time distribution metrics
I'm running multiple Westend RPC nodes. To get more insight into the runtime, I'm building a Grafana dashboard. Substrate exposes the `substrate_rpc_calls_time_bucket` metric, which shows the response time distribution of RPC calls. The response time is distributed over 12 buckets, but most of them are empty.
Querying calls per second over the last day with `sum(rate(substrate_rpc_calls_time_bucket{node=~"westend-rpc-.+"}[1d])) by (le)` returns the following:
More than 99% of the data falls into the `+Inf` bucket, rendering the metric unusable. For example, a query for the 50th or 95th percentile of the response time (`histogram_quantile(0.5, sum(rate(substrate_rpc_calls_time_bucket{node=~"westend-rpc-.+"}[1d])) by (le))`) always returns the upper bound of the second-largest bucket (`le="10"`).
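To illustrate why the quantile gets stuck at `le="10"`, here is a minimal Python sketch of how `histogram_quantile` behaves on cumulative bucket counts. It is a simplified model for illustration, not the Prometheus implementation, and all bucket bounds and counts in the usage lines are hypothetical, not measurements from these nodes. The key point is the documented rule that a quantile landing in the highest (`+Inf`) bucket is reported as the upper bound of the second-highest bucket.

```python
import math


def histogram_quantile(q, buckets):
    """Simplified model of PromQL's histogram_quantile.

    `buckets` is a list of (upper_bound, cumulative_count) pairs that
    includes the +Inf bucket. Assumes 0 < q <= 1 and positive bounds.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total

    # Find the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]

    if math.isinf(upper):
        # Quantile lies in the +Inf bucket: nothing to interpolate against,
        # so the upper bound of the second-highest bucket is returned.
        return buckets[-2][0]

    # Linear interpolation inside the bucket; the first bucket starts at 0.
    lower, prev_count = buckets[idx - 1] if idx > 0 else (0.0, 0)
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical counts: almost everything lands in +Inf.
skewed = [(1, 0), (5, 2), (10, 5), (math.inf, 1000)]
p50_skewed = histogram_quantile(0.5, skewed)
p95_skewed = histogram_quantile(0.95, skewed)

# Hypothetical counts: observations spread across finite buckets.
spread = [(5, 100), (25, 600), (100, 950), (math.inf, 1000)]
p50_spread = histogram_quantile(0.5, spread)
```

With the skewed counts, both the 50th and 95th percentile collapse to the same value, the upper bound of the second-highest bucket. With the spread counts, the 50th percentile is interpolated inside a finite bucket and carries real information.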
I'd suggest changing the distribution buckets to: 5, 25, 100, 500, 1000, 2500, 10000, 25000, 100000, 1000000, 10000000, +Inf. This would make the metric more usable in production environments.
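As a rough sketch of how observations would be sorted into the proposed bounds, the Python snippet below builds cumulative `le` counts the way a Prometheus histogram does (each observation counts toward every bucket whose upper bound is greater than or equal to it). The sample values are hypothetical and unitless, and `cumulative_counts` is an illustrative helper, not part of Substrate.

```python
import math
from bisect import bisect_left

# Bucket upper bounds as proposed above.
PROPOSED_BOUNDS = [5, 25, 100, 500, 1000, 2500, 10000, 25000,
                   100000, 1000000, 10000000, math.inf]


def cumulative_counts(samples, bounds=PROPOSED_BOUNDS):
    """Return (upper_bound, cumulative_count) pairs for the given samples."""
    per_bucket = [0] * len(bounds)
    for value in samples:
        # Index of the first bound >= value, i.e. the smallest matching `le`.
        per_bucket[bisect_left(bounds, value)] += 1

    result = []
    running = 0
    for bound, count in zip(bounds, per_bucket):
        running += count
        result.append((bound, running))
    return result


# Hypothetical response times, chosen only to exercise several buckets.
counts = dict(cumulative_counts([3, 40, 40, 700, 3000, 50000]))
```

If real response times spread across the finite bounds like this, the `+Inf` bucket adds nothing beyond the largest finite bucket and quantile queries can interpolate meaningfully.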
CC @niklasad1
Sure, sounds reasonable to me.
Closed by #11950
//cc @bakhtin I did what you suggested; it would be good if you could test it in a burn-in or something on Westend :)