nimbus-eth2

Metrics endpoint on unstable very slow on Windows

Open · jakubgs opened this issue 3 years ago · 5 comments

It appears something is wrong with the metrics endpoint on the unstable branch on Windows. I've been seeing slower and slower responses, which eventually result in failed scrapes and healthcheck alerts:

admin@windows-01 MINGW64 ~
$ time curl -sSfm5 localhost:9200/metrics | grep '^version{' # stable
version{version="v1.7.0-5b13b7-stateofus",commit="5b13b7"} 1.0

real    0m0.346s
user    0m0.015s
sys     0m0.000s

admin@windows-01 MINGW64 ~
$ time curl -sSfm5 localhost:9201/metrics | grep '^version{' # testing
version{version="v1.7.0-5b13b7-stateofus",commit="5b13b7"} 1.0

real    0m0.344s
user    0m0.000s
sys     0m0.046s

admin@windows-01 MINGW64 ~
$ time curl -sSfm5 localhost:9202/metrics | grep '^version{' # unstable
version{version="v1.7.0-12ed53-stateofus",commit="12ed53"} 1.0

real    0m4.673s
user    0m0.000s
sys     0m0.030s

admin@windows-01 MINGW64 ~
$ time curl -sSfm5 localhost:9202/metrics | grep '^version{' # unstable
curl: (28) Operation timed out after 5012 milliseconds with 0 bytes received

real    0m5.036s
user    0m0.000s
sys     0m0.030s

Something is definitely up.
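
For reference, one way to see how long the handler actually takes (rather than just where the 5-second curl budget cuts it off) is to retry with a much larger timeout and discard the body; a rough sketch, exact timings will obviously vary:

$ time curl -sSfm60 -o /dev/null localhost:9202/metrics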

jakubgs · Mar 01 '22 09:03

Worth noting that the unstable node has --validator-monitor-auto enabled, while the other nodes do not.
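
Roughly, the difference in how the nodes are launched looks like this (a sketch only; the data dir and the rest of the command line are placeholders, not copied from this host):

$ ./build/nimbus_beacon_node \
    --network=prater \
    --data-dir=... \
    --metrics --metrics-port=9202 \
    --validator-monitor-auto   # only passed to the unstable node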

jakubgs · Mar 01 '22 12:03

Now it has even stopped accepting connections:

admin@windows-01 MINGW64 /c/Users/nimbus
$ curl -sS localhost:9200/metrics | wc -l
1348

admin@windows-01 MINGW64 /c/Users/nimbus
$ curl -sS localhost:9201/metrics | wc -l
1366

admin@windows-01 MINGW64 /c/Users/nimbus
$ curl -sS localhost:9202/metrics | wc -l
curl: (7) Failed to connect to localhost port 9202: Connection refused
0
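
A quick (hedged) way to check whether anything is still bound to the metrics port from the same MINGW64 shell:

$ netstat -an | grep 9202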

jakubgs · Mar 01 '22 13:03

The node is still running though:

admin@windows-01 MINGW64 .../nimbus/beacon-node-prater-unstable
$ date
Tue, Mar  1, 2022  2:02:38 PM

admin@windows-01 MINGW64 .../nimbus/beacon-node-prater-unstable
$ tail -n1 logs/beacon-node-prater-unstable.out.log
{"lvl":"DBG","ts":"2022-03-01 14:02:39.890+01:00","msg":"Attestation resolved","topics":"attpool","singles":73,"aggregates":2,"attestation":{"aggregation_bits":"0x0000800000000000000000000000000080","data":{"slot":2469311,"index":4,"beacon_block_root":"49376cbf","source":"77164:0050b4fa","target":"77165:1fc7b1c6"},"signature":"836ef443"}}
admin@windows-01 MINGW64 ~
$ for port in $(seq 9300 9302); do curl -sS "localhost:$port/eth/v1/node/syncing" | jq -c; done
{"data":{"head_slot":"2469315","sync_distance":"0","is_syncing":false}}
{"data":{"head_slot":"2469315","sync_distance":"0","is_syncing":false}}
{"data":{"head_slot":"2469315","sync_distance":"0","is_syncing":false}}

jakubgs · Mar 01 '22 13:03

Adding --validator-monitor-totals definitely helps:

admin@windows-01 MINGW64 .../nimbus/beacon-node-prater-unstable
$ time curl -sSf localhost:9202/metrics | wc -l
1645

real    0m0.369s
user    0m0.031s
sys     0m0.000s

So it's clearly related to validator monitoring.
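
One way to see where the extra output goes would be to count the per-validator series; this assumes the per-validator metrics are exported under a validator_monitor_ prefix, which I haven't verified here:

$ curl -sSf localhost:9202/metrics | grep -c '^validator_monitor_'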

jakubgs · Mar 01 '22 14:03

On Linux the metrics without totals were also slow, but not 5-seconds slow; more like 1 second at most.
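
For a rough apples-to-apples comparison between hosts, curl's built-in timer can be used instead of the shell's time; a sketch:

$ for i in 1 2 3; do curl -sSf -o /dev/null -w '%{time_total}\n' localhost:9202/metrics; done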

jakubgs · Mar 01 '22 14:03

No longer relevant.

jakubgs · Mar 09 '23 23:03