lodestar icon indicating copy to clipboard operation
lodestar copied to clipboard

v1.21.0-rc.0: unstable peers

Open twoeths opened this issue 1 year ago • 6 comments

Describe the bug

Peer count is unstable on the beta mainnet node, I used to see it dropped to 35 while inbound peer is 0 or 1

Screenshot 2024-07-31 at 13 35 08

it's likely due to Event Loop Lag on Network worker (which is only ~15ms on the stable mainnet node)

Screenshot 2024-07-31 at 13 36 06

Expected behavior

  • Peer count should be ~100 stably
  • Inbound peers should be ~90 as in the stable mainnet node (or around that value)
  • Event loop lag Network Worker should be ~15ms

Steps to reproduce

No response

Additional context

No response

Operating system

Linux

Lodestar version or commit hash

v1.21.0-rc.0

twoeths avatar Jul 31 '24 06:07 twoeths

This is on our unstable mainnet node

Screenshot 2024-07-31 at 13 39 17

the issue started happening from around Jul 27 14:00:00 UTC so #5886 could be the root cause

twoeths avatar Jul 31 '24 06:07 twoeths

It does not look that bad on my mainnet nodes

mainnet-1

image

mainnet-2

image

mainnet-3

image

On all of the nodes I can also see an improved event loop lag metric

image

Let me know if there are other metrics to share that might be useful

nflaig avatar Jul 31 '24 08:07 nflaig

@nflaig it only happens on a test mainnet node because it subscribes to all subnets, so it has a lot of I/O load I don't see the issue on our cip nodes too

twoeths avatar Jul 31 '24 13:07 twoeths

@nflaig it only happens on a test mainnet node because it subscribes to all subnets, so it has a lot of I/O load I don't see the issue on our cip nodes too

Right, that's likely it. Will enable it on one of my mainnet nodes as well now

nflaig avatar Jul 31 '24 13:07 nflaig

After putting in rc.1 to beta, you can see the stabilization in some peer metrics here:

Screenshot 2024-08-02 at 11 54 08 AM Screenshot 2024-08-02 at 11 58 20 AM ![Screenshot 2024-08-02 Screenshot 2024-08-02 at 12 00 47 PM Screenshot 2024-08-02 at 12 00 52 PM Screenshot 2024-08-02 at 11 58 07 AM

philknows avatar Aug 02 '24 16:08 philknows

attaching a network profile of cayman/yamux branch cc @wemeetagain

mainnet_yamux_network_thread_2024-08-06.cpuprofile.zip

there are so many dial errors in that branch

Screenshot 2024-08-06 at 15 04 41

from the log:

grep -e "unexpected end of input" -rn beacon-2024-08-06.log | wc -l
19724
grep -e "REQUEST_ERROR_DIAL_ERROR" -rn beacon-2024-08-06.log | wc -l
14302

sample error stack trace does not seem relate to yamux

Aug-06 00:02:30.296[network]       ^[[36mverbose^[[39m: Req  error method=ping, version=1, encoding=ssz_snappy, client=NA, peer=16...kfMnnz, requestId=272488, code=REQUEST_ERROR_DIAL_ERROR, error=unexpected end of input
Error: unexpected end of input
    at Object.read (file:///usr/src/lodestar/node_modules/it-byte-stream/dist/src/index.js:86:31)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.read (file:///usr/src/lodestar/node_modules/it-length-prefixed-stream/dist/src/index.js:51:37)
    at async read (file:///usr/src/lodestar/node_modules/@libp2p/multistream-select/dist/src/multistream.js:22:17)
    at async Module.readString (file:///usr/src/lodestar/node_modules/@libp2p/multistream-select/dist/src/multistream.js:33:17)
    at async Module.select (file:///usr/src/lodestar/node_modules/@libp2p/multistream-select/dist/src/select.js:72:20)
    at async ConnectionImpl.newStream [as _newStream] (file:///usr/src/lodestar/node_modules/libp2p/dist/src/upgrader.js:330:50)
    at async ConnectionImpl.newStream (file:///usr/src/lodestar/node_modules/libp2p/dist/src/connection/index.js:94:24)
    at async file:///usr/src/lodestar/packages/reqresp/lib/request/index.js:57:26
    at async withTimeout (file:///usr/src/lodestar/packages/utils/lib/timeout.js:12:16)
Error: REQUEST_ERROR_DIAL_ERROR
    at file:///usr/src/lodestar/packages/reqresp/lib/request/index.js:67:23
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async sendRequest (file:///usr/src/lodestar/packages/reqresp/lib/request/index.js:55:24)
    at async ReqRespBeaconNode.sendRequest (file:///usr/src/lodestar/packages/reqresp/lib/ReqResp.js:109:13)
    at async collectExactOneTyped (file:///usr/src/lodestar/packages/beacon-node/lib/network/reqresp/utils/collect.js:10:22)
    at async PeerManager.requestPing (file:///usr/src/lodestar/packages/beacon-node/lib/network/peers/peerManager.js:340:31)

twoeths avatar Aug 06 '24 09:08 twoeths

issue was resolved because we reverted yamux enhancement

twoeths avatar Oct 01 '24 03:10 twoeths