nimbus-eth2
Metrics endpoint on unstable very slow on Windows
Something appears to be wrong with the metrics endpoint on the unstable branch on Windows. Responses have been getting slower and slower, eventually causing metrics scrapes to fail and healthcheck alerts to fire:
admin@windows-01 MINGW64 ~
$ time curl -sSfm5 localhost:9200/metrics | grep '^version{' # stable
version{version="v1.7.0-5b13b7-stateofus",commit="5b13b7"} 1.0
real 0m0.346s
user 0m0.015s
sys 0m0.000s
admin@windows-01 MINGW64 ~
$ time curl -sSfm5 localhost:9201/metrics | grep '^version{' # testing
version{version="v1.7.0-5b13b7-stateofus",commit="5b13b7"} 1.0
real 0m0.344s
user 0m0.000s
sys 0m0.046s
admin@windows-01 MINGW64 ~
$ time curl -sSfm5 localhost:9202/metrics | grep '^version{' # unstable
version{version="v1.7.0-12ed53-stateofus",commit="12ed53"} 1.0
real 0m4.673s
user 0m0.000s
sys 0m0.030s
admin@windows-01 MINGW64 ~
$ time curl -sSfm5 localhost:9202/metrics | grep '^version{' # unstable
curl: (28) Operation timed out after 5012 milliseconds with 0 bytes received
real 0m5.036s
user 0m0.000s
sys 0m0.030s
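For anyone who wants to repeat the comparison, the timings above can be collected in one pass. This is only a rough sketch: the helper name `scrape_time` is made up for illustration, and the ports and 5-second limit are taken from the commands above. It relies on curl's `-w '%{time_total}'` output rather than the shell's `time`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<port> <seconds>" for one scrape of /metrics,
# or "<port> failed" if curl errors out or hits the 5-second limit (-m5).
scrape_time() {
  local port="$1" t
  # -o /dev/null discards the body; -w prints only the total transfer time.
  if t=$(curl -sSf -m5 -o /dev/null -w '%{time_total}' "localhost:${port}/metrics" 2>/dev/null); then
    echo "${port} ${t}"
  else
    echo "${port} failed"
  fi
}
```

Usage: `for port in 9200 9201 9202; do scrape_time "$port"; done`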
Something is definitely up.
Worth noting that the unstable node runs with --validator-monitor-auto, while the other nodes do not.
Now it has even stopped accepting connections:
admin@windows-01 MINGW64 /c/Users/nimbus
$ curl -sS localhost:9200/metrics | wc -l
1348
admin@windows-01 MINGW64 /c/Users/nimbus
$ curl -sS localhost:9201/metrics | wc -l
1366
admin@windows-01 MINGW64 /c/Users/nimbus
$ curl -sS localhost:9202/metrics | wc -l
curl: (7) Failed to connect to localhost port 9202: Connection refused
0
The node is still running though:
admin@windows-01 MINGW64 .../nimbus/beacon-node-prater-unstable
$ date
Tue, Mar 1, 2022 2:02:38 PM
admin@windows-01 MINGW64 .../nimbus/beacon-node-prater-unstable
$ tail -n1 logs/beacon-node-prater-unstable.out.log
{"lvl":"DBG","ts":"2022-03-01 14:02:39.890+01:00","msg":"Attestation resolved","topics":"attpool","singles":73,"aggregates":2,"attestation":{"aggregation_bits":"0x0000800000000000000000000000000080","data":{"slot":2469311,"index":4,"beacon_block_root":"49376cbf","source":"77164:0050b4fa","target":"77165:1fc7b1c6"},"signature":"836ef443"}}
admin@windows-01 MINGW64 ~
$ for port in $(seq 9300 9302); do curl -sS "localhost:$port/eth/v1/node/syncing" | jq -c; done
{"data":{"head_slot":"2469315","sync_distance":"0","is_syncing":false}}
{"data":{"head_slot":"2469315","sync_distance":"0","is_syncing":false}}
{"data":{"head_slot":"2469315","sync_distance":"0","is_syncing":false}}
Adding --validator-monitor-totals definitely helps:
admin@windows-01 MINGW64 .../nimbus/beacon-node-prater-unstable
$ time curl -sSf localhost:9202/metrics | wc -l
1645
real 0m0.369s
user 0m0.031s
sys 0m0.000s
So it's clearly related to validator monitoring.
On Linux, the metrics without totals were also slow, but not 5 seconds slow; more like 1 second at most.
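To see which metric families account for most of the response, the output can be grouped by metric name. A minimal sketch, assuming the endpoint returns the standard Prometheus text format; `count_series` is a hypothetical helper, and which families actually dominate was not measured here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# "<count> <metric_name>" per metric name, largest count first.
# Comment lines (# HELP / # TYPE) and blank lines are skipped.
count_series() {
  awk '!/^#/ && NF { sub(/[{ ].*/, "", $0); n[$0]++ }
       END { for (m in n) print n[m], m }' | sort -k1,1nr -k2,2
}
```

Usage: `curl -sS localhost:9202/metrics | count_series | head`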
No longer relevant.