
Validator Client intermittently freezes on Linux kernel `6.14.4` -> `6.14.7`

Open michaelsproul opened this issue 8 months ago • 14 comments

Summary

The Validator Client on Linux with kernel versions 6.14.4 -> 6.14.7 will intermittently freeze, preventing it from performing its duties.

If you're running a distro which closely follows Linux mainline (such as Arch Linux or Fedora) you may be affected.

Run uname -r to check if your kernel version is in the affected range.
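
For convenience, here is a quick shell check along these lines (a rough sketch; the version patterns assume typical Arch/Fedora kernel release names):

# Warn if the running kernel is one of the affected 6.14.4 - 6.14.7 releases
case "$(uname -r)" in
  6.14.[4-7]|6.14.[4-7]-*|6.14.[4-7].*) echo "WARNING: kernel $(uname -r) is in the affected range" ;;
  *) echo "Kernel $(uname -r) is not in the affected range" ;;
esac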

Note: Ubuntu 24.04 and older use older kernels and are therefore not affected. Ubuntu 25.04 is currently running the 6.14.0 kernel, which is unaffected, but it is possible that a future package upgrade will include one of the affected kernels, so upgrade with caution.

Solutions and Workarounds

This bug, caused by changes to the eventpoll code, has already been patched in the Linux mainline kernel and will be fixed in 6.14.8+.

Once your distro allows you to update the kernel to 6.14.8, you can safely do so.

In the meantime, if you are running an affected kernel version you have a few options:

Install an LTS kernel

The procedure for this will differ depending on your distro, but the example below shows the instructions for Arch Linux:

sudo pacman -S linux-lts

If you are using systemd-boot, it should automatically generate the corresponding bootloader entries.

If you are using grub you will need to regenerate them:

sudo grub-mkconfig -o /boot/grub/grub.cfg

Reboot and you should see linux-lts included in the grub menu.

Downgrade your kernel

This will vary depending on your distro and for some distros it is very involved. Here are the instructions for Fedora 41:

# Find available kernels
sudo dnf list kernel --showduplicates

# Install a specific kernel. For example:
sudo dnf install kernel-6.11.4-301.fc41

GRUB entries will be added automatically, so reboot and select the new kernel from the list.
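
If you do not want to pick the older kernel manually at every boot, you can make it the default with grubby (a sketch; adjust the kernel path to match the file actually installed on your system):

# List installed kernels and their boot entries
sudo grubby --info=ALL | grep -E '^(index|kernel)'

# Set the downgraded kernel as the default boot entry (path corresponds to the example package above)
sudo grubby --set-default /boot/vmlinuz-6.11.4-301.fc41.x86_64

# Confirm the default
sudo grubby --default-kernel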

Run an API polling script

If you do not want to touch your kernel in case you break something, there is a simple bash script you can run instead.

Due to the internals of eventpoll, when the VC receives an API call, it will wake from its freeze.

Because of this, we can run a script in the background which continuously polls the VC. Here is an example of such a script:

while sleep 5; do curl -s --fail "http://localhost:5062/lighthouse/auth" > /dev/null && echo "polled at $(date)"; done

This will keep the VC awake. Note that running a full VC metrics server with Grafana polling the VC will also keep it awake for the same reason.
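
If you want the poller to survive closing your terminal, one minimal option (a sketch; running it as a systemd service, like the unit shared in the comments below, is more robust) is nohup:

# Run the polling loop in the background, detached from the terminal, logging to a file
nohup bash -c 'while sleep 5; do curl -s --fail "http://localhost:5062/lighthouse/auth" > /dev/null && echo "polled at $(date)"; done' >> "$HOME/vc-poller.log" 2>&1 &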

Acknowledgments

A huge thank you to the users on Discord who discovered this issue and assisted in diagnosis and testing, particularly @smooth.ninja and @ChosunOne.


See https://github.com/tokio-rs/tokio/issues/7335 and #7403 for more details.

michaelsproul avatar May 06 '25 06:05 michaelsproul

Can I work on this?

0xriazaka avatar May 06 '25 15:05 0xriazaka

@0xriazaka If you can work out the root cause, please try. We won't assign it exclusively to you because we need to fix this ASAP.

michaelsproul avatar May 06 '25 23:05 michaelsproul

So far 3 of 3 confirmed cases occurred on Arch Linux.

I suspect it's something to do with the new kernel version.

michaelsproul avatar May 07 '25 01:05 michaelsproul

I, too, am experiencing this on archlinux. Running strace during the hang yields: futex(0x7928d3c71910, FUTEX_WAIT_BITSET_PRIVATE, 0, NULL, FUTEX_BITSET_MATCH_ANY

An interesting observation is maybe the following: I am running two validator processes, one with one key and another one with two active validators. Only the one with two validators hangs every couple of hours.

Let me know if I can be of any help reproducing the issue.

j4cko avatar May 10 '25 07:05 j4cko

@j4cko Please try the workaround of polling the HTTP API of the VC.

We might have a build to share soon

michaelsproul avatar May 10 '25 09:05 michaelsproul

Something else we could try:

  • https://docs.rs/parking_lot/latest/parking_lot/deadlock/fn.check_deadlock.html

We could run a background thread that periodically calls parking_lot's deadlock checker. I think all we can do if we detect one is print out the thread IDs and the backtraces. We should probably run with debug symbols in order to get the best backtraces.

michaelsproul avatar May 13 '25 05:05 michaelsproul

Can confirm this issue began for me when updating to Linux 6.14.4-arch1-1 x86_64. No issue with v7.0.0 on the old kernel, but I updated both the kernel and Lighthouse to v7.0.1 and the issue started.

Can't guarantee this is related, but thought I'd mention it in case it helps narrow the issue down. I'm currently participating in the Aztec Public Testnet and running a Sepolia node using Lighthouse for my beacon node, and the Aztec node occasionally hangs in a similar way to the VC. It just runs for a bit and then freezes after a few hours. If related, makes me suspect the beacon node.

keccakk avatar May 14 '25 13:05 keccakk

I'm having this issue as well with Gnosis mainnet. It only started happening after v7.0.x with Nethermind as the EL client. I'm also on Linux 6.14.4-arch1-2 x86_64. Probing the VC with an HTTP request wakes it up.

emilbayes avatar May 15 '25 10:05 emilbayes

@emilbayes Please try updating the kernel to 6.14.6. We've got a program smaller than the Lighthouse VC which reproduces the hang; it just uses some mutexes and some sleeps, and it hangs in the same way on 6.14.{4,5} but not (yet) on 6.14.6. The underlying issue seems to be an incompatibility between Tokio and the kernel, or a kernel bug, nothing Lighthouse-specific.

michaelsproul avatar May 15 '25 13:05 michaelsproul

@michaelsproul I just got another hang on 6.14.6; it started working again when I started the curl command up again.

keccakk avatar May 15 '25 19:05 keccakk

@keccakk Thanks for the info! We hadn't been running very long on 6.14.6.

We will keep trying to isolate the conditions under which it happens. We're a bit out of our depth when it comes to fixing it, so it might be a while until this is resolved. Planning a bug report to the tokio devs and then maybe trying to get someone with kernel knowledge interested.

In the meantime please keep running the workaround (polling the API with a script), or try downgrading the kernel to LTS. We have some Arch machines on the LTS kernel that haven't had issues at all (over several days).

michaelsproul avatar May 15 '25 23:05 michaelsproul

Hello everyone, I've raised an issue on the Tokio repo: https://github.com/tokio-rs/tokio/issues/7335

I managed to reproduce the issue with a very simple program: https://github.com/macladson/tokio-lock-example. This is helpful because you won't need any validator keys to test it.

This confirms it is not a Lighthouse specific issue.

In the meantime, either use the API polling trick, OR use the Arch linux-lts kernel. Both will prevent the issue.

Another thing you can try is limiting the VC to only 2 threads. I have yet to reproduce the issue when running that way. You can either use cgroups in systemd or run the VC like taskset -c 0,1 lighthouse vc .... If you still get a freeze like this, please let us know.
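
For the cgroups-in-systemd route, here is a minimal sketch (assuming your VC runs as a unit named lighthouse-vc.service; adjust the unit name and CPU list to your setup, and note this needs cgroup v2 and a reasonably recent systemd):

# Pin the VC service to CPUs 0 and 1 via a drop-in
sudo mkdir -p /etc/systemd/system/lighthouse-vc.service.d
printf '[Service]\nAllowedCPUs=0-1\n' | sudo tee /etc/systemd/system/lighthouse-vc.service.d/limit-cpus.conf
sudo systemctl daemon-reload
sudo systemctl restart lighthouse-vc.service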

If anyone finds anything else, please post here

macladson avatar May 16 '25 10:05 macladson

I've just pinned this issue and updated the description to explain our current findings and solutions. 6.14.8 will likely be released for Arch Linux in the next couple of days, so this issue should become less of a problem soon. We will have to keep monitoring to see if it ends up affecting any of the major Ubuntu releases, but my hunch is that they will likely skip these versions.

macladson avatar May 26 '25 09:05 macladson

Just a PSA that 6.14.9 was released on Arch Linux today which should fix the issue (I guess they skipped 6.14.8)

macladson avatar May 31 '25 12:05 macladson

I've unpinned this, as most users are no longer affected (have upgraded to a new kernel), and this is unlikely to be an issue for stable distros

michaelsproul avatar Jun 23 '25 01:06 michaelsproul

I'm having this issue with kernel 6.5.11-6-pve (from the Debian host, lighthouse runs in a container)

Added the poller via systemd:

[Unit]
Description=Poll Lighthouse auth endpoint every 5 s
# Start *after* the validator and stop with it
After    = lighthouse-vc.service
Requires = lighthouse-vc.service

[Service]
User    = ubuntu
Type    = simple
Restart = always
RestartSec = 5

# Poll every 5 s; log a heartbeat when the endpoint is reachable
ExecStart = /bin/sh -c 'while true; do \
    curl -s --fail http://localhost:5062/lighthouse/auth >/dev/null && \
    echo "polled at $(date)"; \
    sleep 5; \
done'

SyslogIdentifier = lighthouse-poller

[Install]
WantedBy = multi-user.target

gaia avatar Jul 21 '25 07:07 gaia

Let us know if the sleep/freeze is still happening with the script. If it happens, you will usually see it within 24 hours.

FWIW, all the reported cases are occurring in the VC; the BN is OK because the VC calls the BN frequently, thus "waking it up". Your case is the opposite, which might suggest something else.

chong-he avatar Jul 21 '25 08:07 chong-he

It is the BN that hangs. Look at the logs:

Jul 20 05:40:06 ethval lighthouse-vc[314]: Jul 20 04:15:01.509 INFO  Connected to beacon node(s)                   primary: "http://localhost:5052/", total: 1, available: 1, synced: 1
Jul 20 05:40:06 ethval lighthouse-vc[314]: Jul 20 04:25:54.813 ERROR Unable to read spec from beacon node          error: HttpClient(url: http://localhost:5052/, kind: timeout, detail: operation timed out), endpoint: http://localhost:5052/
Jul 20 06:06:39 ethval lighthouse-vc[314]: Jul 20 05:38:02.478 WARN  A connected beacon node errored during routine health check  error: Offline, endpoint: http://localhost:5052/
Jul 20 06:06:39 ethval lighthouse-vc[314]: Jul 20 05:40:06.327 INFO  Awaiting activation                           validators: X, epoch: 380640, slot: 12180498
Jul 20 06:06:39 ethval lighthouse-vc[314]: Jul 20 05:40:06.369 ERROR Failed to download proposer duties            err: Some endpoints failed, num_failed: 2 http://localhost:5052/ => RequestFailed(HttpClient(url: http://localhost:5052/, kind: timeout, detail: operation timed out>
Jul 20 06:06:39 ethval lighthouse-vc[314]: Jul 20 05:40:06.375 ERROR Failed to download attester duties            current_epoch: 380616, request_epoch: 380616, err: FailedToDownloadAttesters("Some endpoints failed, num_failed: 2 http://localhost:5052/ => RequestFailed(HttpClien>
Jul 20 06:06:39 ethval lighthouse-vc[314]: Jul 20 05:42:14.231 ERROR No synced beacon nodes                        total: 1, available: 0, synced: 0
Jul 20 06:06:43 ethval systemd[1]: lighthouse-vc.service: Main process exited, code=killed, status=9/KILL
Jul 20 06:06:43 ethval systemd[1]: lighthouse-vc.service: Failed with result 'signal'.
Jul 20 06:06:43 ethval systemd[1]: lighthouse-vc.service: Consumed 3h 41min 25.084s CPU time.
Jul 20 06:06:53 ethval systemd[1]: lighthouse-vc.service: Scheduled restart job, restart counter is at 1.

gaia avatar Jul 21 '25 15:07 gaia

Yeah, just saying that all the cases reported here involve the VC freezing, so your case is a bit different. Do you still see the freeze after running the helper script?

chong-he avatar Jul 21 '25 23:07 chong-he

I think the helper script might need to be tweaked a little to poll http://localhost:5052/eth/v1/node/syncing rather than http://localhost:5062/lighthouse/auth.
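
For reference, that tweak would look something like this (the same loop as in the description, pointed at the standard BN API on port 5052):

while sleep 5; do curl -s --fail "http://localhost:5052/eth/v1/node/syncing" > /dev/null && echo "polled BN at $(date)"; done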

michaelsproul avatar Jul 22 '25 00:07 michaelsproul

Just wanted to add that this is very likely a fundamentally different issue from the one we saw on the VC, since that only affected kernels from 6.14.4 onwards, and you're running 6.5.11, which will not have the epoll regression that caused the VC bug. Also, I doubt the polling script will even help in this case: even if it is the same class of bug as before, as @chong-he mentioned, the VC already polls the BN frequently.

@gaia if you could open a new issue detailing your system and setup we can investigate further.

macladson avatar Jul 22 '25 08:07 macladson