
Lodestar stopped keeping up with head on mainnet

Open concentricspheres opened this issue 1 year ago • 32 comments

Describe the bug

Lodestar stopped keeping up with head on mainnet after block ~ 19972650

Expected behavior

To keep up with head

Steps to reproduce

No response

Additional context

consensus.log.gz

Operating system

Linux

Lodestar version or commit hash

v1.18.1/eefb711

concentricspheres avatar Jun 06 '24 16:06 concentricspheres

Two observations from the logs

  1. Seems like your EL was still syncing before Lodestar started to fall behind:
Jun-06 15:41:41.000[]                 info: Synced - slot: 9238706 - head: 0x4901…b19e - exec-block: syncing(20033746 0x9174…) - finalized: 0x6724…7027:288707 - peers: 66
Jun-06 15:41:29.000[]                 info: Synced - slot: 9238705 - head: 0xf8f2…6ddd - exec-block: syncing(20033745 0x538c…) - finalized: 0x6724…7027:288707 - peers: 62
Jun-06 15:41:17.000[]                 info: Synced - slot: 9238704 - head: 0x165e…6111 - exec-block: syncing(20033744 0x40c5…) - finalized: 0x6724…7027:288707 - peers: 67
Jun-06 15:41:05.000[]                 info: Synced - slot: 9238703 - head: 0x1a85…16df - exec-block: syncing(20033743 0x52c6…) - finalized: 0x6724…7027:288707 - peers: 68
  2. Somehow the validator client is still submitting attestations even though your node was no longer synced. Any chance you have multiple beacon nodes running?

From the logs, it's hard to tell why it started to fall behind; it's not a general issue for this block, as my node is still synced. It might require debug logs to figure out what's going on.

If your node is still not back to head, you might wanna force a checkpoint sync by passing

--checkpointSyncUrl <url> --forceCheckpointSync
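
For example, with an otherwise unchanged beacon command, that would look something like this (the URL here is just an illustration, use any checkpoint provider you trust):

lodestar beacon --network mainnet --checkpointSyncUrl https://beaconstate.info --forceCheckpointSync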

nflaig avatar Jun 06 '24 17:06 nflaig

Hey @nflaig. We've had this happen on two nodes in our Lido clusters. Yes, we run multiple CL:EL pairs; the VC is Vouch.

Lodestar started syncing after we restarted it. We didn't want to wait so we checkpoint synced to get back on track.

So, this is definitely a thing, it happened on different blocks on two nodes.

yorickdowne avatar Jun 06 '24 20:06 yorickdowne

We are running Lodestar 1.16.0 and having the exact same issue. We will upgrade soon, but it looks like this is still an issue on newer versions. This issue is fairly devastating for our use case as it increases bridging latency. It also occurs silently, with no error or warning in the logs.

For now, we have a script that compares Lodestar's finalized slot with the finalized slot on Beaconcha.in and restarts the node if it falls behind too far, but this obviously needs to be resolved at the root cause.

claravanstaden avatar Jun 17 '24 09:06 claravanstaden

Hey all, when this happens next time, would you be able to capture and send the relevant debug logs for us to take a look at? Please also send us the hardware that you're running on to help augment those logs. We need to compile more information here on where to start digging since it seems intermittent and we don't have a clear indicator yet on what might be causing this. Any data would be helpful! Thanks.

philknows avatar Jun 18 '24 15:06 philknows

We have one node, lido3-c, with debug logs enabled. We're waiting for that one in particular to show the issue so we can get you debug logs.

We have since updated Lodestar to 1.19.0 as well, so it's possible we won't see the issue again.

yorickdowne avatar Jun 18 '24 15:06 yorickdowne

We have one node, lido3-c, with debug logs enabled.

The default log level for file output is debug; unless you have explicitly changed that, all nodes will have those logs available.

nflaig avatar Jun 18 '24 15:06 nflaig

Ah, OK. Thanks. I see beacon-timestamp.log in the datadir. We'll get you those when next we see a failure

yorickdowne avatar Jun 18 '24 15:06 yorickdowne

Thanks @philknows, I'll do the same. My file logging was set to info but I have changed it to debug.

claravanstaden avatar Jun 18 '24 17:06 claravanstaden

Possibly adjacent issue - one of our Lodestars would not catch up; it was Syncing but kept falling further behind. A restart fixed it.

Logs attached.

beacon-2024-06-21.log.gz

yorickdowne avatar Jun 21 '24 13:06 yorickdowne

@yorickdowne it looks like your EL may have been down during that time? During the Syncing part of the logs I'm seeing lots of this error:

Batch process error id=Finalized, startEpoch=292046, status=Processing, code=BLOCK_ERROR_EXECUTION_ERROR, execStatus=UNAVAILABLE, errorMessage=Request to http://execution:8551 failed, reason: connect ECONNREFUSED 172.18.0.7:8551

which happens when we try to verify execution payloads of blocks that we're syncing.

Somewhat unrelated, one thing we're doing that could be improved: we throw away the batches of blocks we've downloaded when that error happens, and then we have to re-download them again and again. This isn't helpful, as we keep getting rate-limited, and that causes us to really churn through peers.

wemeetagain avatar Jun 21 '24 14:06 wemeetagain

Yes, I believe the EL had an issue. When restarting the stack, the EL did not shut down cleanly and was reaped by Docker after the 5m timeout.

yorickdowne avatar Jun 21 '24 15:06 yorickdowne

Hey all, when this happens next time, would you be able to capture and send the relevant debug logs for us to take a look at? Please also send us the hardware that you're running on to help augment those logs. We need to compile more information here on where to start digging since it seems intermittent and we don't have a clear indicator yet on what might be causing this. Any data would be helpful! Thanks.

@nflaig It happened again a few times in the last few days. I am attaching today's log. We have a script that checks if Lodestar falls more than an epoch behind and restarts the node if so. The node was restarted at 06:21, so I would assume the issue is just before that.

beacon-2024-06-26.log.gz

claravanstaden avatar Jun 26 '24 08:06 claravanstaden

I have upgraded to Lodestar 1.19.0. Will see if the issue still occurs.

claravanstaden avatar Jun 27 '24 08:06 claravanstaden

@claravanstaden Thanks for the logs

The node was restarted at 06:21, so I would assume the issue is just before that.

Looking at the logs around the time the node was restarted, I don't see any sync issues; the last log right before the restart is:

Synced - slot: 9379860 - head: 0x8c74…355d - exec-block: syncing(20173993 0x7ddd…) - finalized: 0xdf2b…200b:293118 - peers: 12

The imported chain head seems to be the canonical block as well

New chain head slot=9379861, root=0xa301b49fd3c17df04f098b633b475d95a01d92a9906a1288d634e3bd73dc6b8a, delaySec=2.385999917984009

how is your script determining that the node is behind by one epoch?

Another observation from the logs is that the node is really struggling with keeping a high peer count.

nflaig avatar Jun 29 '24 09:06 nflaig

@yorickdowne The EL was down for quite some time, and Lodestar is not importing blocks in that case, causing it to fall behind. This behavior seems consistent with other clients. What would be concerning is if the node can't get back to head after the EL is online again. I tried this a few times, but the node is always able to get back to head quite quickly.

The logs are a bit confusing though, as they print -45000000000 seconds left - -8.25e-9 slots/s, which does not make much sense; what we rather want to log is something like Syncing - ? left - 0.00 slots/s, indicating that the node is not syncing.

And we are pretty silent about the EL being offline after the first error; we might wanna warn about this at least every epoch, which would make it clearer why it's not keeping up with head.

nflaig avatar Jun 29 '24 10:06 nflaig

I think you have it: The issue here was that Lodestar did not catch back up when the EL was back online.

Because it’s so intermittent, it’s hard to track down. We haven’t seen it in a while - it’s entirely possible that “whatever it was” is now either resolved or much less likely since 1.19

yorickdowne avatar Jun 29 '24 10:06 yorickdowne

@nflaig I have seen the same issue with Lodestar 1.19.0. Adding some logs here, with restart timestamps:

Jun 28 12:48 beacon-2024-06-28.log.gz Jun 29 16:12 beacon-2024-06-29.log.gz Jun 30 19:24 beacon-2024-06-30.log.gz

how is your script determining that the node is behind by one epoch?

I compare it with the finalized epoch on BeaconCha.in. I have changed the script to restart if it is more than 2 epochs behind. Maybe that will give a clearer indication. Perhaps we'll see fewer restarts then too. I can't increase the latency more, otherwise it will influence bridge transfer times.

Another observation from the logs is that the node is really struggling with keeping a high peer count.

Interesting, maybe that could be it?

claravanstaden avatar Jul 01 '24 09:07 claravanstaden

Adding some logs here, with restart timestamps:

I have looked at all three logs and right before the restart the node was synced; the only time I can see the node syncing is right after the restart, which is expected.

How do you determine what the finalized epoch is in Lodestar? Maybe there is a bug somewhere else, but based on the logs the node is definitely not falling behind; the head is correct before the restart.

Might be useful to track which epoch is flagged by your tooling to restart Lodestar, then we can look at that epoch specifically.

Interesting, maybe that could be it?

No, the sync is still fine and the node is keeping up. I'm just not seeing that on any of my nodes; peer count is usually between 100-105.

I am not sure how much impact the EL reporting syncing has on peering. But if you don't run the node with attached validators, it's not that important to have stable peers, as long as the node is running fine and keeping up.

nflaig avatar Jul 01 '24 09:07 nflaig

@nflaig thanks for the feedback. I'll keep an eye on it and let you know if I see it restarting after falling behind 2 epochs. Maybe the 1 epoch was too aggressive and it was restarting without there being a problem.

How do you determine what the finalized epoch is in Lodestar? Maybe there is a bug somewhere else, but based on the logs the node is definitely not falling behind; the head is correct before the restart.

I am using this endpoint: curl http://localhost:9596/eth/v1/beacon/light_client/finality_update - using data.finalized_header.beacon.slot

Might be useful to track which epoch is flagged by your tooling to restart Lodestar, then we can look at that epoch specifically.

Will note this down if it happens again.

I am not sure how much impact the EL reporting syncing has on peering.

I am running without an EL node: --execution.engineMock --eth1 false

claravanstaden avatar Jul 01 '24 10:07 claravanstaden

@nflaig it happened again last night, with the changed setting to allow 2 epochs difference. Please let me know if you can see any issue.

1 July 2024 22:48 beacon-2024-07-01.log.gz

claravanstaden avatar Jul 02 '24 12:07 claravanstaden

@claravanstaden I still need to review our code more but my guess right now is that it's an issue with the light client server.

Looking at the timestamps of your restarts, this shows an interesting pattern:

Jun 28 12:48 beacon-2024-06-28.log.gz Jun 29 16:12 beacon-2024-06-29.log.gz Jun 30 19:24 beacon-2024-06-30.log.gz Jul 1 22:48 beacon-2024-07-01.log.gz

Looks like the restart happens every ~27 hours, which is the sync committee period. My suspicion is that we are not updating the finality update in the light client server when the sync committee changes.
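
(For reference: one sync committee period is 256 epochs, i.e. 256 × 32 slots × 12 s per slot = 98,304 s ≈ 27.3 hours, which matches the spacing of the restarts above.)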

Possibly related to that, there are a lot of attestedData not available logs, which are from the light client server as well.

with the changed setting to allow 2 epochs difference

If you set it to 3 epochs it would not happen, I think, but anyway, restarting Lodestar is pointless here; if you wait 3 epochs it will report the correct finality update again from the new sync committee period.

nflaig avatar Jul 03 '24 10:07 nflaig

@nflaig without the script, the actual block syncing of Lodestar would fall behind until it was restarted again. Sometimes it would be days behind. So from what I have seen, it's not just the light client server, though. I could also change the script to use the standard beacon API instead of the light client endpoints. Let me do that.

claravanstaden avatar Jul 03 '24 12:07 claravanstaden

without the script, the actual block syncing of Lodestar would fall behind until it was restarted again

This is not clear from the logs; the node was synced to head when it was restarted. It would be best to use the head slot from /eth/v1/node/syncing to determine if the node is syncing or not, and you could check head_slot against an external source like beaconcha.in.

Looking at https://lodestar-mainnet.chainsafe.io/eth/v1/beacon/light_client/finality_update from our mainnet node, this seems to be correct and the node is running for weeks.

Sometimes it would be days behind. So from what I have seen, it's not just the light client server, though.

I am a bit skeptical about this. I have been running multiple Lodestar nodes myself for years, and the only time so far any of my nodes fell behind was when my EL database got corrupted and the EL went offline. Not saying it's impossible, but from my experience it's unlikely. Maybe it's related to the engine mock (although I don't see the correlation), but for use cases outside of development or testing, I would recommend running an execution client.

I have opened an issue to track the light client server problem https://github.com/ChainSafe/lodestar/issues/6932

nflaig avatar Jul 03 '24 12:07 nflaig

@nflaig I have changed the script to use the main beacon API endpoint. It hasn't restarted the service in a week now, so I'd say it is safe to close this issue, thank you. It looks like the issue was resolved in Lodestar 1.19.0. Thanks again for the assistance!

claravanstaden avatar Jul 29 '24 06:07 claravanstaden

Thanks for testing, let's close this and reopen another issue if needed

nflaig avatar Jul 29 '24 07:07 nflaig

Hey @nflaig, I came across this issue again yesterday. I changed our restart script to compare https://beaconcha.in/api/v1/epoch/finalized with http://localhost:9596/eth/v1/beacon/states/finalized/finality_checkpoints (using data.finalized.epoch). The script only restarts the node if the diff is more than 4 epochs.
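
Roughly, the check boils down to something like this (simplified sketch; the beaconcha.in response field and the restart command are assumptions, adjust for your own setup):

#!/usr/bin/env bash
# Watchdog sketch: restart Lodestar if its finalized epoch falls too far behind beaconcha.in.
set -euo pipefail

MAX_EPOCH_DIFF=4

# Finalized epoch as reported by the local Lodestar node
lodestar_epoch=$(curl -s http://localhost:9596/eth/v1/beacon/states/finalized/finality_checkpoints | jq -r '.data.finalized.epoch')

# Finalized epoch as reported by beaconcha.in (assumed response field: .data.epoch)
remote_epoch=$(curl -s https://beaconcha.in/api/v1/epoch/finalized | jq -r '.data.epoch')

if (( remote_epoch - lodestar_epoch > MAX_EPOCH_DIFF )); then
  echo "Lodestar finalized epoch ${lodestar_epoch} is behind beaconcha.in (${remote_epoch}), restarting"
  systemctl restart lodestar-beacon   # assumed service name; could also be a docker restart
fi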

When the script restarted Lodestar, this was the state:
Beaconcha.in Finalized Epoch: 300835
Lodestar Finalized Epoch: 300829

Please let me know if you see anything in the logs. Lodestar was restarted on Jul 30 13:24.

beacon-2024-07-30.log.gz

claravanstaden avatar Jul 31 '24 07:07 claravanstaden

I haven't looked into the logs yet, but if you wanna use GET /eth/v1/beacon/states/{state_id}/finality_checkpoints to determine the currently finalized epoch, you need to query by head and not by the finalized state id, or rather check for a 2 epoch diff.

Regarding the 4 diff, it depends on when you queried it and when Lodestar vs. Beaconcha.in updated the finalized epoch. It should happen roughly at the same time but might not be exactly the same.

I don't think it's a good approach in general to compare against external data; Lodestar itself already knows if it's synced or not, so rather use GET /eth/v1/node/syncing by checking is_syncing=true, or you can use the head_slot. For example, we have a grafana alert that fires if the head slot hasn't advanced for 5 mins.
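
A rough sketch of that kind of check (just an illustration; the state file path, interval, and what you do when it triggers are up to you):

# Query the node's own sync status (standard beacon API)
sync_status=$(curl -s http://localhost:9596/eth/v1/node/syncing)
head_slot=$(echo "$sync_status" | jq -r '.data.head_slot')
is_syncing=$(echo "$sync_status" | jq -r '.data.is_syncing')

# Compare against the head slot seen on the previous run (persisted in a file)
prev_slot=$(cat /tmp/lodestar_head_slot 2>/dev/null || echo 0)
echo "$head_slot" > /tmp/lodestar_head_slot

# Run this periodically (e.g. from cron); if it keeps firing, the node is stuck
if [[ "$is_syncing" == "true" || "$head_slot" -le "$prev_slot" ]]; then
  echo "node looks stuck (head_slot=$head_slot, is_syncing=$is_syncing)"
fi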

nflaig avatar Jul 31 '24 09:07 nflaig

I haven't looked into the logs yet, but if you wanna use GET /eth/v1/beacon/states/{state_id}/finality_checkpoints to determine the currently finalized epoch, you need to query by head and not by the finalized state id, or rather check for a 2 epoch diff.

This might be true, but 99% of the time these two values are the same or the difference is 1 epoch.

Regarding the 4 diff, it depends on when you queried it and when Lodestar vs. Beaconcha.in updated the finalized epoch. It should happen roughly at the same time but might not be exactly the same.

6 epochs difference indicates a major gap, though, not just a timing issue.

I don't think it's a good approach in general to compare against external data; Lodestar itself already knows if it's synced or not, so rather use GET /eth/v1/node/syncing by checking is_syncing=true, or you can use the head_slot. For example, we have a grafana alert that fires if the head slot hasn't advanced for 5 mins.

I disagree with this because the network should have the same view of what is the current finalized block. I can change this script to keep track of the head slot, but please let me know if you see anything in the logs that could have caused this.

claravanstaden avatar Jul 31 '24 09:07 claravanstaden

6 epochs difference indicates a major gap, though, not just a timing issue.

Ah right, it's 6 epochs. Based on the logs, your node had 2-3 peers for most of the time and then at some point dropped down to 0 and was no longer able to sync, so that explains the disparity.

I disagree with this because the network should have the same view of what is the current finalized block

A node usually has ~100 peers so it has a pretty good view on the network state based on data from other peers.

It's a bit hard to tell why your node is struggling so much to keep stable peers; it could be many reasons, like low bandwidth. What's notable from the logs is that it has 0 inbound peers. Did you open port 9000 (udp/tcp)? Usually a healthy node will have mostly inbound peers after a while.
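
In case it helps, on a ufw-based host that would be something like this (ufw is just an assumption; adjust for your firewall or router, and publish the ports if running in Docker):

sudo ufw allow 9000/tcp
sudo ufw allow 9000/udp
# with Docker, also publish the ports on the container, e.g. -p 9000:9000/tcp -p 9000:9000/udp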

It would be good to collect some metrics to get a broader view of how the node is performing over time.

nflaig avatar Jul 31 '24 10:07 nflaig

@nflaig Thanks for getting back to me. OK, yeah, I can look at port 9000, but I think what concerns me is why Lodestar does not find new peers to connect to, and why a restart fixes it.

claravanstaden avatar Jul 31 '24 12:07 claravanstaden