lodestar
lodestar copied to clipboard
v1.21.0-rc.0: unstable peers
Describe the bug
Peer count is unstable on the beta mainnet node, I used to see it dropped to 35 while inbound peer is 0 or 1
it's likely due to Event Loop Lag on Network worker (which is only ~15ms on the stable mainnet node)
Expected behavior
- Peer count should be ~100 stably
- Inbound peers should be ~90 as in the stable mainnet node (or around that value)
- Event loop lag Network Worker should be ~15ms
Steps to reproduce
No response
Additional context
No response
Operating system
Linux
Lodestar version or commit hash
v1.21.0-rc.0
This is on our unstable mainnet node
the issue started happening from around Jul 27 14:00:00 UTC so #5886 could be the root cause
It does not look that bad on my mainnet nodes
mainnet-1
mainnet-2
mainnet-3
On all of the nodes I can also see an improved event loop lag metric
Let me know if there are other metrics to share that might be useful
@nflaig it only happens on a test mainnet node because it subscribes to all subnets, so it has a lot of I/O load
I don't see the issue on our cip nodes too
@nflaig it only happens on a test mainnet node because it subscribes to all subnets, so it has a lot of I/O load I don't see the issue on our
cipnodes too
Right, that's likely it. Will enable it on one of my mainnet nodes as well now
After putting in rc.1 to beta, you can see the stabilization in some peer metrics here:
![Screenshot 2024-08-02
attaching a network profile of cayman/yamux branch cc @wemeetagain
mainnet_yamux_network_thread_2024-08-06.cpuprofile.zip
there are so many dial errors in that branch
from the log:
grep -e "unexpected end of input" -rn beacon-2024-08-06.log | wc -l
19724
grep -e "REQUEST_ERROR_DIAL_ERROR" -rn beacon-2024-08-06.log | wc -l
14302
sample error stack trace does not seem relate to yamux
Aug-06 00:02:30.296[network] ^[[36mverbose^[[39m: Req error method=ping, version=1, encoding=ssz_snappy, client=NA, peer=16...kfMnnz, requestId=272488, code=REQUEST_ERROR_DIAL_ERROR, error=unexpected end of input
Error: unexpected end of input
at Object.read (file:///usr/src/lodestar/node_modules/it-byte-stream/dist/src/index.js:86:31)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async Object.read (file:///usr/src/lodestar/node_modules/it-length-prefixed-stream/dist/src/index.js:51:37)
at async read (file:///usr/src/lodestar/node_modules/@libp2p/multistream-select/dist/src/multistream.js:22:17)
at async Module.readString (file:///usr/src/lodestar/node_modules/@libp2p/multistream-select/dist/src/multistream.js:33:17)
at async Module.select (file:///usr/src/lodestar/node_modules/@libp2p/multistream-select/dist/src/select.js:72:20)
at async ConnectionImpl.newStream [as _newStream] (file:///usr/src/lodestar/node_modules/libp2p/dist/src/upgrader.js:330:50)
at async ConnectionImpl.newStream (file:///usr/src/lodestar/node_modules/libp2p/dist/src/connection/index.js:94:24)
at async file:///usr/src/lodestar/packages/reqresp/lib/request/index.js:57:26
at async withTimeout (file:///usr/src/lodestar/packages/utils/lib/timeout.js:12:16)
Error: REQUEST_ERROR_DIAL_ERROR
at file:///usr/src/lodestar/packages/reqresp/lib/request/index.js:67:23
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async sendRequest (file:///usr/src/lodestar/packages/reqresp/lib/request/index.js:55:24)
at async ReqRespBeaconNode.sendRequest (file:///usr/src/lodestar/packages/reqresp/lib/ReqResp.js:109:13)
at async collectExactOneTyped (file:///usr/src/lodestar/packages/beacon-node/lib/network/reqresp/utils/collect.js:10:22)
at async PeerManager.requestPing (file:///usr/src/lodestar/packages/beacon-node/lib/network/peers/peerManager.js:340:31)
issue was resolved because we reverted yamux enhancement