
Lighthouse reporting corruption from reth

Open 0xAlcibiades opened this issue 1 year ago • 11 comments

Describe the bug

[Screenshots attached: 2024-03-27 12:49:53 and 2024-03-27 12:50:09]

Every few days, lighthouse and reth seem to stop communicating with each other. They are deployed in containers on the same local network, with a shared filesystem for the jwt secret. Restarting both containers resolves the issue, which then comes back after 24-48 hours.

Steps to reproduce

Run lighthouse v5.1.2 with reth v0.2.0-beta.3 and the jwt secret as a shared file

Node logs

No response

Platform(s)

Linux (ARM)

What version/commit are you on?

0.2.0-beta.3

What database version are you on?

2

What type of node are you running?

Archive (default)

What prune config do you use, if any?

No response

If you've built Reth from source, provide the full command you used

No response

Code of Conduct

  • [X] I agree to follow the Code of Conduct

0xAlcibiades avatar Mar 27 '24 16:03 0xAlcibiades

Do you still have the logs from when this first occurred?

It looks like reth (incorrectly) classified a block as invalid and now refuses to accept blocks that build on top of it.

mattsse avatar Mar 27 '24 17:03 mattsse

No, I'll get around to setting up log retention later this week though. It did resolve on restart and is syncing back up to the head now, but this seems to happen relatively frequently.

0xAlcibiades avatar Mar 27 '24 17:03 0xAlcibiades

> No, I'll get around to setting up log retention later this week though. It did resolve on restart and is syncing back up to the head now, but this seems to happen relatively frequently.

When reth runs into an error (which can be due to either a logic bug or hardware), it usually reports an invalid block to the CL, which should also show up in the logs. Once the CL receives this response, it usually cannot continue. The result is that, minutes later, the CL logs look like:

Failed to sync chain built on an invalid parent

and because it stops sending requests to reth, the reth logs will look like:

Beacon client online, but no consensus updates received for a while....

BTW, reth also writes debug logs to ~/.cache/reth; I would check there for any invalid block or other errors.
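For example, a quick pass over those logs could look like the following sketch (assuming the logs sit under ~/.cache/reth and are named *.log; adjust the path and patterns to your setup, e.g. a mounted volume inside the container):

```python
#!/usr/bin/env python3
"""Scan reth debug logs for invalid-block or error lines (sketch)."""
from pathlib import Path

LOG_DIR = Path.home() / ".cache" / "reth"   # assumption: default log location
PATTERNS = ("invalid", "error", "warn")     # strings worth a closer look

# Assumption: log files are named something like reth.log, reth.log.1, ...
for log_file in LOG_DIR.rglob("*.log*"):
    try:
        with log_file.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                if any(p in line.lower() for p in PATTERNS):
                    print(f"{log_file}:{lineno}: {line.rstrip()}")
    except OSError:
        # Skip files we cannot read (rotated/compressed logs, permissions)
        continue
```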

Rjected avatar Mar 27 '24 18:03 Rjected

So I'm running ECC RAM, all the data is replicated three ways with scrubbing for bit flips via Ceph, and there is no packet loss on the networks. This seems internal to reth.

0xAlcibiades avatar Mar 27 '24 18:03 0xAlcibiades

I wonder: 1) is there a liveness probe I could put in place to detect this, and 2) is there a linked/open/known issue on why the errors are occurring?

0xAlcibiades avatar Mar 27 '24 18:03 0xAlcibiades

> So I'm running ECC RAM, all the data is replicated three ways with scrubbing for bit flips via Ceph, and there is no packet loss on the networks. This seems internal to reth.

Yes, it's likely this is caused by corruption or a logic error in reth.

> 1) is there a liveness probe I could put in place to detect this

What would your requirements be for a liveness probe? Often, if the latest block (obtainable via RPC) is not incrementing, that is a good indication that reth has run into an error.

> 2) is there a linked/open/known issue on why the errors are occurring?

We don't know what the issue is yet because we don't have logs, so I would suggest checking or uploading the reth debug logs (please check ~/.cache for this) so we can investigate further. The information provided so far is not enough.

Rjected avatar Mar 27 '24 18:03 Rjected

A liveness probe would detect an outage (like this) and could be used to restart both containers and get things rolling again in an automated fashion if things get wedged. It's very much a kludge, but it could be as simple as a web method on the RPC reflecting whether the node is healthy.
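Until something like that exists, a minimal external probe along the lines suggested above could look like the sketch below. Assumptions (not confirmed in this thread): reth's HTTP RPC is enabled and reachable at http://localhost:8545, and a head that does not advance within a minute counts as unhealthy.

```python
#!/usr/bin/env python3
"""Minimal liveness probe sketch: fail if the head block stops advancing.

Exit code 0 = healthy, 1 = unhealthy, so an orchestrator can restart the
node on failure. All endpoints and thresholds below are assumptions.
"""
import json
import sys
import time
import urllib.request

RPC_URL = "http://localhost:8545"  # assumption: HTTP RPC enabled on this port
STALL_SECONDS = 60                 # assumption: tolerated time without a new block
POLL_INTERVAL = 15

def block_number() -> int:
    """Return the current head block number via eth_blockNumber."""
    payload = json.dumps({
        "jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1,
    }).encode()
    req = urllib.request.Request(
        RPC_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return int(json.load(resp)["result"], 16)

def main() -> int:
    try:
        last = block_number()
    except Exception as exc:
        print(f"unhealthy: RPC unreachable ({exc})")
        return 1
    deadline = time.time() + STALL_SECONDS
    while time.time() < deadline:
        time.sleep(POLL_INTERVAL)
        try:
            current = block_number()
        except Exception as exc:
            print(f"unhealthy: RPC unreachable ({exc})")
            return 1
        if current > last:
            print(f"healthy: head advanced to {current}")
            return 0
    print(f"unhealthy: head stuck at {last} for {STALL_SECONDS}s")
    return 1

if __name__ == "__main__":
    sys.exit(main())
```

An orchestrator (e.g. a Kubernetes liveness probe or Docker healthcheck) could run this periodically and restart both containers on a non-zero exit code.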

0xAlcibiades avatar Mar 27 '24 18:03 0xAlcibiades

As for the log data, I'll set up a centralized logging solution later, but I will grab the debug info the next time this occurs before killing the container (~/.cache is on ephemeral container storage/tmpfs).

0xAlcibiades avatar Mar 27 '24 19:03 0xAlcibiades

Bumped to beta 4 and will see if it's still something I can reproduce.

0xAlcibiades avatar Mar 28 '24 00:03 0xAlcibiades

@0xAlcibiades Any luck in reproducing?

onbjerg avatar Apr 22 '24 13:04 onbjerg

This issue is stale because it has been open for 21 days with no activity.

github-actions[bot] avatar May 14 '24 01:05 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Jun 02 '24 01:06 github-actions[bot]