smartnode High Latency Metrics Collection on oDAO node

A performance issue was observed on an oDAO node: the metrics endpoint takes an excessive amount of time to respond, which suggests that metrics are collected on demand during each query rather than maintained continuously.

Evidence:

Metric endpoint response times:

from localhost:

time curl -s 0:9102/metrics  0.00s user 0.01s system 0% cpu 19.347 total

from Prometheus slave:

time curl http://10.13.0.58:9102/metrics  0.00s user 0.01s system 0% cpu 44.452 total

Impact visible in monitoring:
- Significant increase in TCP sockets in the TIME_WAIT state
- Elevated number of open file descriptors for the rocketpool process
- No corresponding increase in system load

Suggested improvement: Consider implementing continuous metric collection instead of on-demand gathering during scrape requests to reduce response latency.
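
To illustrate the suggestion, here is a minimal sketch of a background-refreshed metrics cache, written in Go with only the standard library. It is not taken from the smartnode code base: `CachedCollector`, `NewCachedCollector`, `Run`, `Refresh` and `Snapshot` are hypothetical names, and the `collect` function stands in for whatever slow gathering currently runs during a scrape.

```go
package main

import (
	"sync"
	"time"
)

// CachedCollector keeps the most recent metrics snapshot in memory so that
// a scrape only reads cached data instead of triggering a slow collection.
// All names here are hypothetical and not part of smartnode.
type CachedCollector struct {
	mu       sync.RWMutex
	snapshot string
	collect  func() string // assumed slow collection function
}

// NewCachedCollector runs one initial collection so that the first scrape
// already has data to return.
func NewCachedCollector(collect func() string) *CachedCollector {
	return &CachedCollector{snapshot: collect(), collect: collect}
}

// Refresh performs the slow collection outside the lock and then swaps
// the cached snapshot, so readers are blocked only for the assignment.
func (c *CachedCollector) Refresh() {
	s := c.collect()
	c.mu.Lock()
	c.snapshot = s
	c.mu.Unlock()
}

// Run refreshes the snapshot every interval until stop is closed.
// It is meant to be started in its own goroutine.
func (c *CachedCollector) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.Refresh()
		case <-stop:
			return
		}
	}
}

// Snapshot returns the cached metrics without calling collect.
func (c *CachedCollector) Snapshot() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.snapshot
}
```

With this shape, the process would start `go c.Run(interval, stop)` once, and the /metrics handler would return `c.Snapshot()`. Scrape latency then no longer depends on collection time, at the cost of serving values that can be up to one interval old.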

Jan 08 '25 20:01 mendelskiv93

It is worth mentioning this is happening on an oDAO node.

Jan 09 '25 13:01 jakubgs

Thanks for the report.

The metrics collection code is quite old and has always had some less-than-ideal qualities (e.g. https://github.com/rocket-pool/smartnode/issues/186).

I think we should probably rewrite a lot of it. I'll take a look into the performance regression.

Unfortunately it might have to wait a bit as we're in the middle of merging a very large refactor.

Jan 09 '25 13:01 jshufro

No worries, we managed to work around this. Thanks for looking into it.

Jan 09 '25 14:01 mendelskiv93