Add 'heartbeat' metric for near client_actor

Open mm-near opened this issue 2 years ago • 3 comments

We should have a metric that tracks the 'health' of the client actor (this would have been helpful in debugging yesterday's issue).

What I'd suggest - is having a metric that measures the time between our 'log_summary' runs.

Normally these should happen every 10 seconds - and if there is any change to this period, it means that the client_actor thread is getting stuck on some longer operations - which would be a good alert to have for our canary nodes.

mm-near avatar Aug 12 '22 10:08 mm-near

> What I'd suggest - is having a metric that measures the time between our 'log_summary' runs.

Sounds like a counter incremented at the start of log_summary.

mina86 avatar Aug 12 '22 14:08 mina86

A plausible alternative is to have a span! around each "event loop turn", so that we can see directly when the loop is blocked. actix doesn't make it particularly easy to instrument the event loop, but it seems doable: https://github.com/near/nearcore/pull/7398. Could try to cook up something prod-ready out of it.

matklad avatar Aug 12 '22 16:08 matklad

@mina86 - yep, a counter + a proper alert in Grafana should do it.

@matklad - here I was thinking about something 'very simple' that could alert us automatically and catch issues like the recent one in canary.

mm-near avatar Aug 16 '22 16:08 mm-near
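
For illustration, here is a minimal sketch of the counter idea discussed above, using only the standard library. Everything in it is an assumption rather than nearcore code: the `Heartbeat` type, its `beat` and `is_late` methods, and the `EXPECTED_PERIOD` constant are hypothetical names, and a real node would presumably export the counter through its existing metrics machinery instead of a plain integer. The timestamp is passed in explicitly so the behaviour can be checked without sleeping.

```rust
use std::time::{Duration, Instant};

/// Expected period between `log_summary` runs (10 seconds, per the issue text).
const EXPECTED_PERIOD: Duration = Duration::from_secs(10);

/// Hypothetical heartbeat tracker; not part of nearcore.
#[derive(Debug, Default)]
pub struct Heartbeat {
    /// Monotonic counter; stands in for an exported counter metric.
    pub beats: u64,
    /// Gap between the two most recent beats, once at least two have happened.
    pub last_gap: Option<Duration>,
    /// Timestamp of the most recent beat.
    last_beat: Option<Instant>,
}

impl Heartbeat {
    /// Intended to be called at the start of each `log_summary` run.
    pub fn beat(&mut self, now: Instant) {
        self.beats += 1;
        if let Some(prev) = self.last_beat {
            // Saturates to zero if `now` is earlier than `prev`.
            self.last_gap = Some(now.saturating_duration_since(prev));
        }
        self.last_beat = Some(now);
    }

    /// True if the most recent gap exceeded the expected period by more
    /// than `tolerance`, i.e. the thread was likely busy with something long.
    pub fn is_late(&self, tolerance: Duration) -> bool {
        match self.last_gap {
            Some(gap) => gap > EXPECTED_PERIOD + tolerance,
            None => false,
        }
    }
}
```

Note that a check like `is_late` only fires once the next beat finally arrives. A thread that is stuck for good never beats again, which is why the alert itself is better placed on the counter's rate in the dashboard, as suggested in the thread.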