Lighthouse reporting corruption from reth
Describe the bug
It seems like every few days Lighthouse and reth get out of communication with each other. They are deployed in containers on the same local network, with a shared filesystem for the JWT secret. A restart of both containers resolves the issue, which then comes back after 24-48 hours.
Steps to reproduce
Run lighthouse v5.1.2 with reth v0.2.0-beta.3 and the jwt secret as a shared file
Node logs
No response
Platform(s)
Linux (ARM)
What version/commit are you on?
0.2.0-beta.3
What database version are you on?
2
What type of node are you running?
Archive (default)
What prune config do you use, if any?
No response
If you've built Reth from source, provide the full command you used
No response
Code of Conduct
- [X] I agree to follow the Code of Conduct
Do you still have the logs from when this first occurred?
It looks like reth (incorrectly) classified a block as invalid and then refuses to accept blocks that build on top of that block.
No, I'll get around to setting up log retention later this week though. It did resolve on restart and is syncing back up to the head now, but this seems to happen relatively frequently.
> No, I'll get around to setting up log retention later this week though. It did resolve on restart and is syncing back up to the head now, but this seems to happen relatively frequently.
When reth runs into an error (which can be due to either a logic bug or hardware), it usually reports an invalid block to the CL, which should also show up in the logs. When the CL receives this response, it usually cannot continue. The result is that, minutes later, the CL logs look like:
`Failed to sync chain built on an invalid parent`
and because it stops sending requests to reth, the reth logs will look like:
`Beacon client online, but no consensus updates received for a while....`
BTW, reth also outputs debug logs in ~/.cache/reth; I would check for any invalid block or other errors in there.
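If it helps, here is a minimal sketch for scanning those debug logs for suspicious lines. It assumes the logs are plain-text files somewhere under `~/.cache/reth/`; the filename glob and keywords are assumptions, and the exact message wording may differ between reth versions.

```python
#!/usr/bin/env python3
"""Rough scan of reth debug logs for invalid-block or error lines.

Sketch only: the log directory, file glob, and keywords below are
assumptions, not reth-specified values.
"""
from pathlib import Path

LOG_DIR = Path.home() / ".cache" / "reth"      # assumed debug-log location
KEYWORDS = ("invalid block", "error")          # assumed patterns of interest


def scan_logs(log_dir: Path = LOG_DIR) -> None:
    # Walk every *.log* file under the log directory and print matching lines.
    for log_file in sorted(log_dir.rglob("*.log*")):
        with log_file.open(errors="replace") as fh:
            for line_no, line in enumerate(fh, start=1):
                if any(kw in line.lower() for kw in KEYWORDS):
                    print(f"{log_file}:{line_no}: {line.rstrip()}")


if __name__ == "__main__":
    scan_logs()
```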
So I'm running ECC RAM, and all the data is thrice-replicated with scrubbing for bit flips via Ceph; no packet loss on the networks. This seems reth-internal.
I wonder: 1) is there a liveness probe I could put in place to detect this? 2) is there a linked/open/known issue here on why the errors are occurring?
> So I'm running ECC RAM, and all the data is thrice-replicated with scrubbing for bit flips via Ceph; no packet loss on the networks. This seems reth-internal.
Yes, it's likely this is caused by corruption or a logic error in reth.
> 1) is there a liveness probe I could put in place to detect this?
What would be your requirements for a liveness probe? Often, if the latest block (obtainable via RPC) is not incrementing, that is a good indication that reth has run into an error.
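A minimal sketch of that latest-block check, for illustration: it polls the standard Ethereum JSON-RPC method `eth_blockNumber` twice and exits non-zero if the height has not advanced. The RPC URL, polling interval, and exit-code convention are assumptions, not anything reth-specific.

```python
#!/usr/bin/env python3
"""Crude liveness check: fail if the latest block number stops advancing."""
import json
import sys
import time
import urllib.request

RPC_URL = "http://localhost:8545"   # assumed reth HTTP-RPC endpoint
POLL_SECONDS = 60                   # assumed: a healthy node should advance within this


def latest_block(url: str = RPC_URL) -> int:
    # eth_blockNumber returns the latest block height as a hex string.
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": "eth_blockNumber", "params": []}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return int(json.load(resp)["result"], 16)


if __name__ == "__main__":
    first = latest_block()
    time.sleep(POLL_SECONDS)
    second = latest_block()
    # Non-zero exit lets a container orchestrator treat the node as unhealthy
    # and restart it.
    sys.exit(0 if second > first else 1)
```

Something like this could be wired into a Docker HEALTHCHECK or a Kubernetes liveness/exec probe, with the interval tuned to how quickly you want to detect a wedged node.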
> 2) is there a linked/open/known issue here on why the errors are occurring?
We don't know what the issue is yet because we don't have logs, so I would suggest checking or uploading the reth debug logs (please check ~/.cache for this) so we can investigate further. The information provided currently is not enough.
A liveness probe would detect an outage (like this) and could be used to restart both containers and get things rolling again in an automated fashion if things get wedged. It's very much a kludge, but it could be as simple as an RPC method reflecting whether the node is healthy or unhealthy.
Around getting the log data: I'll set up a centralized log solution later, but I will grab the debug info the next time this occurs before killing the container (~/.cache is on ephemeral container storage/tmpfs).
Bumped to beta 4 and will see if it's still something I can reproduce.
@0xAlcibiades Any luck in reproducing?
This issue is stale because it has been open for 21 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.