High Latency Metrics Collection on oDAO node

Open mendelskiv93 opened this issue 11 months ago • 3 comments

Metrics collection on an oDAO node is very slow: the metrics endpoint takes roughly 19 to 44 seconds to respond. This suggests that metrics are gathered on demand during each query rather than maintained continuously.

Evidence:

  • Metric endpoint response times:

    • from localhost:
      time curl -s 0:9102/metrics
      0.00s user 0.01s system 0% cpu 19.347 total

    • from prometheus slave:
      time curl http://10.13.0.58:9102/metrics
      0.00s user 0.01s system 0% cpu 44.452 total

  • Impact visible in monitoring:

    • Significant increase in TCP sockets in the TIME_WAIT state
    • File descriptor counts for the rocketpool process are elevated
    • No corresponding increase in system load
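
For anyone wanting to reproduce the timings above without `curl`, here is a minimal Go sketch of a scrape timer. It assumes the endpoint is reachable at `http://localhost:9102/metrics` (taken from the commands above). The `timeScrape` helper and the two-minute client timeout are illustrative choices and are not part of smartnode.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// client uses a generous timeout (an assumption) so that slow scrapes
// like the ones reported here are measured rather than cut short.
var client = &http.Client{Timeout: 2 * time.Minute}

// timeScrape performs one GET against url and returns how long it took to
// receive the complete response, along with the body size in bytes.
func timeScrape(url string) (time.Duration, int64, error) {
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		return 0, 0, err
	}
	defer resp.Body.Close()

	// Drain the body so the measurement covers the whole transfer.
	size, err := io.Copy(io.Discard, resp.Body)
	if err != nil {
		return 0, size, err
	}
	return time.Since(start), size, nil
}

func main() {
	url := "http://localhost:9102/metrics"
	if len(os.Args) > 1 {
		url = os.Args[1]
	}

	// Take a few samples, since a single scrape can be an outlier.
	for i := 0; i < 3; i++ {
		elapsed, size, err := timeScrape(url)
		if err != nil {
			fmt.Fprintln(os.Stderr, "scrape failed:", err)
			os.Exit(1)
		}
		fmt.Printf("scrape %d: %d bytes in %s\n", i+1, size, elapsed.Round(time.Millisecond))
	}
}
```

Passing a different URL as the first argument (for example the node's address as seen from the Prometheus host) allows comparing local and remote latency in the same way as the two `curl` runs.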

Suggested improvement: Consider implementing continuous metric collection instead of on-demand gathering during scrape requests to reduce response latency.
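
To illustrate the suggestion, here is a minimal Go sketch of the general pattern: a background goroutine refreshes a cached snapshot on a timer, and the HTTP handler only serves that snapshot. This is not smartnode's actual code. `snapshotCache`, `gatherFunc`, the 15-second interval, and the simulated 2-second collection are all hypothetical names and values chosen for the example, and it uses only the standard library rather than any particular Prometheus client.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"
)

// gatherFunc stands in for the expensive collection step (hypothetical).
type gatherFunc func() (string, error)

// snapshotCache holds the most recent successfully gathered metrics text.
type snapshotCache struct {
	mu      sync.RWMutex
	body    string
	updated time.Time
	gather  gatherFunc
}

func newSnapshotCache(g gatherFunc) *snapshotCache {
	return &snapshotCache{gather: g}
}

// refresh runs the slow gather once and, on success, swaps in the result.
// On failure the previous snapshot is kept.
func (c *snapshotCache) refresh() error {
	body, err := c.gather() // slow work happens outside the lock
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.body, c.updated = body, time.Now()
	c.mu.Unlock()
	return nil
}

// run refreshes immediately, then once per interval until stop is closed.
func (c *snapshotCache) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := c.refresh(); err != nil {
			log.Printf("metrics refresh failed: %v", err)
		}
		select {
		case <-ticker.C:
		case <-stop:
			return
		}
	}
}

// get returns the cached text, and false if no snapshot exists yet.
func (c *snapshotCache) get() (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.body, !c.updated.IsZero()
}

// ServeHTTP answers scrapes from the cache without triggering collection.
func (c *snapshotCache) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	body, ok := c.get()
	if !ok {
		http.Error(w, "metrics not collected yet", http.StatusServiceUnavailable)
		return
	}
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, body)
}

func main() {
	cache := newSnapshotCache(func() (string, error) {
		time.Sleep(2 * time.Second) // simulated slow collection (assumption)
		return fmt.Sprintf("demo_last_refresh_unix %d\n", time.Now().Unix()), nil
	})
	stop := make(chan struct{})
	go cache.run(15*time.Second, stop)

	http.Handle("/metrics", cache)
	log.Fatal(http.ListenAndServe("127.0.0.1:9102", nil))
}
```

With this shape, scrape latency no longer depends on how long collection takes. The trade-off is that served values can be up to one refresh interval old, so the interval would need to be chosen relative to the Prometheus scrape interval.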

mendelskiv93, Jan 08 '25 20:01

It is worth mentioning this is happening on an oDAO node.

jakubgs, Jan 09 '25 13:01

Thanks for the report.

The metrics collection code is quite old and has always had some less-than-ideal qualities (e.g. https://github.com/rocket-pool/smartnode/issues/186).

I think we should probably rewrite a lot of it. I'll take a look into the performance regression.

Unfortunately it might have to wait a bit as we're in the middle of merging a very large refactor.

jshufro, Jan 09 '25 13:01

No worries, we managed to work around this. Thanks for looking into it.

mendelskiv93, Jan 09 '25 14:01