nearcore
Add 'heartbeat' metric for near client_actor
We should have a metric that tracks the 'health' of the client actor (this would have been helpful in debugging yesterday's issue).
What I'd suggest - is having a metric that measures the time between our 'log_summary' runs.
Normally these should happen every 10 seconds. Any change to this period means that the client_actor thread is getting stuck on some longer operation, which would be a good alert to have for our canary nodes.
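A minimal sketch of the idea, using only the standard library: remember when log_summary last ran and report the gap on each new run. The names `Heartbeat`, `record`, and `is_delayed` are hypothetical and not taken from nearcore; the 10 second period comes from the description above, and the tolerance is an assumption.

```rust
use std::time::{Duration, Instant};

/// Expected period between `log_summary` runs (10 seconds, per the issue).
const EXPECTED_PERIOD: Duration = Duration::from_secs(10);

/// Hypothetical helper (not a nearcore type): remembers when `log_summary` last ran.
pub struct Heartbeat {
    last_run: Option<Instant>,
}

impl Heartbeat {
    pub fn new() -> Self {
        Heartbeat { last_run: None }
    }

    /// Call at the start of each `log_summary` run.
    /// Returns the gap since the previous run, or `None` on the first call.
    pub fn record(&mut self, now: Instant) -> Option<Duration> {
        let gap = self.last_run.map(|prev| now.saturating_duration_since(prev));
        self.last_run = Some(now);
        gap
    }
}

/// True when the gap exceeds the expected period by more than `tolerance`.
/// The tolerance is an assumed knob for absorbing normal scheduling jitter.
pub fn is_delayed(gap: Duration, tolerance: Duration) -> bool {
    gap > EXPECTED_PERIOD + tolerance
}
```

In a real node the returned gap would presumably be exported as a gauge or histogram, with the alert threshold living in the dashboard rather than in code.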
> What I'd suggest - is having a metric that measures the time between our 'log_summary' runs.
Sounds like a counter incremented at the start of log_summary.
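Roughly what the counter variant could look like, as a standard-library sketch. `LOG_SUMMARY_RUNS` and `runs_between` are hypothetical names, and the plain atomic stands in for whatever metrics type the node actually registers; this only illustrates the shape of the idea.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical heartbeat counter. A real node would register a counter
/// with its metrics exporter instead of using a bare atomic.
pub static LOG_SUMMARY_RUNS: AtomicU64 = AtomicU64::new(0);

/// Stand-in for the client actor's `log_summary`.
pub fn log_summary() {
    // Increment first, so the heartbeat is recorded even if the rest is slow.
    LOG_SUMMARY_RUNS.fetch_add(1, Ordering::Relaxed);
    // ... the existing summary logging would follow here ...
}

/// Number of runs observed between two samples of the counter.
pub fn runs_between(earlier: u64, later: u64) -> u64 {
    later.saturating_sub(earlier)
}
```

With a 10 second period, two samples taken 30 seconds apart should differ by about 3. An alert could fire when the difference over such a window drops to zero.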
A plausible alternative is to have a `span!` around each "event loop turn", so that we can see directly when the loop is blocked. `actix` doesn't make it particularly easy to instrument the event loop, but it seems doable: https://github.com/near/nearcore/pull/7398. Could try to cook up something prod-ready out of it.
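To illustrate the per-turn measurement without depending on the tracing or actix APIs, here is a standard-library stand-in: wrap one turn in a timer and flag it when it runs too long. This is not the approach from the linked PR; `timed_turn` and `is_blocked` are hypothetical names and the threshold is an assumption.

```rust
use std::time::{Duration, Instant};

/// Hypothetical wrapper: runs one "event loop turn" and returns its result
/// together with how long the turn took.
pub fn timed_turn<T>(turn: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = turn();
    (result, start.elapsed())
}

/// True when a turn took at least `threshold`, i.e. the loop was blocked
/// for that long and could not process other messages.
pub fn is_blocked(elapsed: Duration, threshold: Duration) -> bool {
    elapsed >= threshold
}
```

A real span would additionally carry the handler name, which is what makes it possible to see which operation blocked the loop.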
@mina86 - yep, a counter + a proper alert in Grafana should do it.
@matklad - here I was thinking about something 'very simple' that could alert us automatically, so we catch issues like the recent one in canary.