High Latency Metrics Collection on oDAO node
A performance issue was observed on an oDAO node: the metrics endpoint takes an excessive amount of time to respond, which suggests metrics are collected on demand at scrape time rather than maintained continuously.
Evidence:
- Metric endpoint response times:
  - from localhost:
    time curl -s 0:9102/metrics
    0.00s user 0.01s system 0% cpu 19.347 total
  - from prometheus slave:
    time curl http://10.13.0.58:9102/metrics
    0.00s user 0.01s system 0% cpu 44.452 total
Impact visible in monitoring:
- Significant increase in TCP sockets in the TIME_WAIT state
- Elevated file descriptor count for the rocketpool process
- No corresponding increase in system load
Suggested improvement: collect metrics continuously in the background instead of gathering them on demand during scrape requests, so the endpoint only serves cached values and response latency stays low.
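For illustration, a minimal sketch of the continuous approach, assuming the exporter uses the Prometheus Go client; the metric name and the fetchExpensiveValue helper are hypothetical stand-ins for whatever the smartnode actually collects. A background loop refreshes a cached gauge on an interval, so the /metrics handler just serializes already-computed values.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// exampleGauge is a hypothetical stand-in for any metric that is expensive
// to compute (e.g. one that requires on-chain RPC calls).
var exampleGauge = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "example_expensive_metric",
	Help: "Refreshed by a background loop rather than on each scrape.",
})

// fetchExpensiveValue simulates the slow work currently done per scrape.
func fetchExpensiveValue() float64 {
	time.Sleep(2 * time.Second) // placeholder for slow RPC / chain queries
	return 42.0
}

func main() {
	prometheus.MustRegister(exampleGauge)

	// Refresh the cached value on a fixed interval in the background.
	go func() {
		ticker := time.NewTicker(30 * time.Second)
		defer ticker.Stop()
		for range ticker.C {
			exampleGauge.Set(fetchExpensiveValue())
		}
	}()

	// The scrape handler only serves the cached values, so it responds
	// quickly regardless of how slow the underlying collection is.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9102", nil))
}
```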
It is worth mentioning this is happening on an oDAO node.
Thanks for the report.
The metrics collection code is quite old and has always had some less-than-ideal qualities (e.g. https://github.com/rocket-pool/smartnode/issues/186).
I think we should probably rewrite a lot of it. I'll take a look into the performance regression.
Unfortunately it might have to wait a bit as we're in the middle of merging a very large refactor.
No worries, we managed to work around this. Thanks for looking into it.