
error: "rtnetlink socket closed"

Open wjmelements opened this issue 1 year ago • 11 comments

Description

Lighthouse is not connecting to geth (v1.13.12) and floods the logs with error messages about the rtnetlink socket being closed

Version

stable v5.1.1

Present Behaviour

The log contains only this error, repeated:

WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p

It does not recover; Lighthouse just spams this log repeatedly.

Geth config:

--authrpc.addr localhost --authrpc.port 8551 --authrpc.jwtsecret /shared_ebs/geth_data/geth/jwtsecret

Lighthouse config:

--execution-endpoint http://localhost:8551 --execution-jwt /shared_ebs/geth_data/geth/jwtsecret

Expected Behaviour

Healthy node

Steps to resolve

Unknown

wjmelements avatar Mar 18 '24 20:03 wjmelements

Try killing Lighthouse with kill -9 or similar, then restart it.

I think this is a bug in libp2p where the listener gets interrupted by a shutdown signal during startup and then goes into an infinite loop. See:

  • https://github.com/libp2p/rust-libp2p/issues/3390

Related:

  • https://github.com/paritytech/polkadot-sdk/issues/671
  • https://github.com/libp2p/rust-libp2p/issues/4287

If Lighthouse doesn't recover after the kill, there might be something else that triggered the initial shutdown, e.g. low disk space.

michaelsproul avatar Mar 21 '24 22:03 michaelsproul

then restart it

The issue happens again on restart. I am using systemd to manage the service, which sends SIGKILL if SIGINT doesn't stop it within a timeout.

low disk space

The disk has 3TB of extra space.

I'm going to try to get the logs from when it starts happening next time, but it is challenging because the log-flood eventually cycles those logs.

wjmelements avatar Mar 21 '24 22:03 wjmelements

If you could DM some debug logs from $datadir/beacon/logs then I'll have a look at why it might be shutting down

michaelsproul avatar Mar 21 '24 23:03 michaelsproul

I can't glean much from the debug logs, it just shows Lighthouse eventually shutting down due to the SIGTERM and then the issue being resolved after the restart:

Mar 18 20:28:26.757 INFO Shutting down.., reason: Success("Received SIGTERM"), module: lighthouse:696

Mar 18 20:28:42.133 INFO Lighthouse started, version: Lighthouse/v5.1.1-2a3c709, module: lighthouse:621

The debug logs filled up with warnings pretty quickly, so the 5 files only covered:

  • lots of Listener error logs starting Mar 18 20:27:20.336
  • shutdown due to SIGTERM: Mar 18 20:28:26.757
  • errors continuing after SIGTERM and prior to Lighthouse starting again
  • Lighthouse starts, without errors: Mar 18 20:28:42.133

So there's only ~1 minute of history covered by the logs.

@AgeManning It looks like we expect some listener errors:

https://github.com/sigp/lighthouse/blob/5ce16192c769e4b593d7f953edcfb6e548c84167/beacon_node/lighthouse_network/src/service/mod.rs#L1793-L1801

I guess we can't change that event handler to shut down immediately because some listener errors are non-fatal? E.g. listening on 0.0.0.0 causes LH to try listening on some random interface that doesn't work; if we have other interfaces, we can safely ignore this sort of failure.
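To illustrate the distinction, here is a minimal, self-contained sketch (not Lighthouse's actual handler; `is_recoverable` is a hypothetical name) of how listener I/O errors might be classified: an address failure on one interface is routine, while a BrokenPipe from the interface watcher, like the "rtnetlink socket closed" error in the logs above, will never clear on its own:

```rust
use std::io::{Error, ErrorKind};

// Hypothetical classifier: which listener errors are safe to ignore?
fn is_recoverable(err: &Error) -> bool {
    match err.kind() {
        // e.g. the interface doesn't exist or the port is taken;
        // other listeners may still succeed.
        ErrorKind::AddrNotAvailable | ErrorKind::AddrInUse => true,
        // e.g. "rtnetlink socket closed": the watcher behind the
        // listener is gone and will keep failing forever.
        ErrorKind::BrokenPipe => false,
        _ => true,
    }
}

fn main() {
    let rtnetlink = Error::new(ErrorKind::BrokenPipe, "rtnetlink socket closed");
    assert!(!is_recoverable(&rtnetlink));

    let no_addr = Error::new(ErrorKind::AddrNotAvailable, "cannot assign requested address");
    assert!(is_recoverable(&no_addr));
    println!("ok");
}
```

Under that (assumed) classification, the handler could escalate only the unrecoverable case instead of warning on everything.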

michaelsproul avatar Mar 21 '24 23:03 michaelsproul

It happened again, but I don't have logs from when it started because it produces a million lines of error logs every 2 minutes, so my logs cycle out before I even notice it went down.

wjmelements avatar Mar 22 '24 19:03 wjmelements

@wjmelements Do you want to try modifying Lighthouse so it just crashes on the lines I linked above? If you replace the current warning with panic!(), it will at least kill things quickly without spamming the logs, and we might be able to see what's going on a bit better.

michaelsproul avatar Mar 22 '24 21:03 michaelsproul

Yeah. So the reason we tolerate the listen errors is that libp2p will generally try to listen on a number of interfaces, i.e.:

  • ip4 tcp
  • ip6 tcp
  • ip4 udp (quic)
  • ip6 udp (quic)

The host might not have ip6 support, or a tcp port might fail to bind. Each individual listen error should not be a catastrophic failure, as we can fail on ip6 but succeed on ip4 (a common case).
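A minimal sketch of that tolerance policy (illustrative only, not libp2p's real bookkeeping): track how many listeners are still healthy, and only treat a listener failure as fatal once none remain:

```rust
// Hypothetical tracker, not Lighthouse/libp2p code: names are illustrative.
#[derive(Debug, PartialEq)]
enum Action {
    Ignore,   // other listeners still work; log and move on
    Shutdown, // no listeners left: the node can't accept connections
}

struct ListenerTracker {
    active: usize,
}

impl ListenerTracker {
    fn new(active: usize) -> Self {
        Self { active }
    }

    // Called each time a listener reports a fatal error or closes.
    fn on_listener_closed(&mut self) -> Action {
        self.active = self.active.saturating_sub(1);
        if self.active == 0 {
            Action::Shutdown
        } else {
            Action::Ignore
        }
    }
}

fn main() {
    // ip4/tcp, ip6/tcp, ip4/quic, ip6/quic
    let mut tracker = ListenerTracker::new(4);
    assert_eq!(tracker.on_listener_closed(), Action::Ignore); // e.g. ip6/tcp fails
    assert_eq!(tracker.on_listener_closed(), Action::Ignore);
    assert_eq!(tracker.on_listener_closed(), Action::Ignore);
    assert_eq!(tracker.on_listener_closed(), Action::Shutdown); // all gone
    println!("ok");
}
```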

I'll reach out to Michael and try to have a look at the logs. As of v5.1 we store libp2p logs in the .lighthouse//beacon_node/logs directory. There might be useful information there.

AgeManning avatar Mar 24 '24 23:03 AgeManning

Also including @jxs who may be interested in this

AgeManning avatar Mar 24 '24 23:03 AgeManning

@wjmelements It might be helpful to check your system logs (e.g. /var/log/messages) for any errors related to network interfaces.

The rtnetlink socket closed error is emitted by if-watch (linux.rs), a component that watches network interface events, such as interfaces coming up or going down. rtnetlink serves as a mechanism for retrieving network information from the kernel, via a dedicated socket communication channel between the kernel and user programs. This error could indicate that an issue has arisen within that socket communication with the kernel.
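A toy model (purely illustrative, not if-watch's actual code) consistent with the observed log flood: once the netlink socket is closed, every poll of the watcher yields the same BrokenPipe error, and nothing ever re-opens the socket, so the caller logs an identical line on every poll:

```rust
use std::io::{Error, ErrorKind};

// Stand-in for the interface watcher; the real one wraps a netlink socket.
struct FakeIfWatcher {
    socket_open: bool,
}

impl FakeIfWatcher {
    // Polling a watcher whose socket has closed fails the same way forever.
    fn poll_event(&mut self) -> Result<&'static str, Error> {
        if self.socket_open {
            Ok("interface event")
        } else {
            Err(Error::new(ErrorKind::BrokenPipe, "rtnetlink socket closed"))
        }
    }
}

fn main() {
    let mut watcher = FakeIfWatcher { socket_open: false };
    // Each poll reproduces the error: hence the repeated Listener error lines.
    for _ in 0..3 {
        let err = watcher.poll_event().unwrap_err();
        assert_eq!(err.kind(), ErrorKind::BrokenPipe);
    }
    println!("ok");
}
```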

ackintosh avatar Mar 25 '24 05:03 ackintosh

I have logs from where it starts, on holesky:

Apr 08 18:17:06.001 INFO Synced                                  slot: 1391485, block:    …  empty, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x89e3…f847 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:14.353 INFO New block received                      root: 0x09d30077b578042c8ce3a2f1a39e5a191544dcf25f1aa45dfd719f2b1aab0236, slot: 1391486
Apr 08 18:17:18.001 INFO Synced                                  slot: 1391486, block: 0x09d3…0236, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:30.004 INFO Synced                                  slot: 1391487, block:    …  empty, epoch: 43483, finalized_epoch: 43481, finalized_root: 0x2b42…9fd0, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:42.316 INFO Synced                                  slot: 1391488, block:    …  empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:53.266 WARN Snapshot cache miss for blob verification, index: 4, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.348 WARN Snapshot cache miss for blob verification, index: 5, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.733 WARN Snapshot cache miss for blob verification, index: 1, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:53.823 WARN Snapshot cache miss for blob verification, index: 2, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:17:54.000 INFO Synced                                  slot: 1391489, block:    …  empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:17:55.251 WARN Snapshot cache miss for blob verification, index: 0, block_root: 0x08b5…65b1, service: beacon
Apr 08 18:18:51.433 INFO New block received                      root: 0x08b5a63f9c883498a8fc00628bb5bd0390d2715cbe068ffeea9c26fdabf365b1, slot: 1391489
Apr 08 18:19:26.923 INFO Synced                                  slot: 1391497, block:    …  empty, epoch: 43484, finalized_epoch: 43482, finalized_root: 0x5c57…7c7d, exec_hash: 0x330b…c8c4 (verified), peers: 19, service: slot_notifier
Apr 08 18:19:26.928 WARN Execution engine call failed            error: HttpClient(url: http://localhost:6551/, kind: timeout, detail: operation timed out), service: exec
Apr 08 18:19:26.932 WARN Error whilst processing payload status  error: Api { error: HttpClient(url: http://localhost:6551/, kind: timeout, detail: operation timed out) }, service: exec
Apr 08 18:19:27.820 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Left(Other(Left(Left(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })))) }) })) }, service: libp2p
Apr 08 18:19:27.825 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p

After that, the same log repeats, mostly Right with the occasional Left. Below, a Left is shown between two Rights:

Apr 08 18:19:27.872 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p
Apr 08 18:19:27.924 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Left(Other(Left(Left(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })))) }) })) }, service: libp2p
Apr 08 18:19:27.934 WARN Listener error                          error: Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Transport(Custom { kind: Other, error: Right(Io(Custom { kind: BrokenPipe, error: "rtnetlink socket closed" })) }) })) }, service: libp2p

wjmelements avatar Apr 08 '24 19:04 wjmelements

FYI: I have submitted a PR to if-watch to improve the rtnetlink socket closed error message by including the specific underlying cause. This should help us understand why the error occurs.

https://github.com/mxinden/if-watch/pull/39

ackintosh avatar Apr 10 '24 03:04 ackintosh