"Error: failed to open the database: unknown error code: 11 (11)" after restart and zombie process
Describe the bug
op-reth ended up with this in the logs. When I tried to restart it via Docker, it reported the process as a zombie and I was unable to restart it. So I tried a redeploy and ended up with a corrupted database and a failed deployment - see logs.
op-reth: 1.1.2
Steps to reproduce
No idea how to reproduce. The node might have been under heavy load, so some thread could have been lost/stuck, and then it ended up like this.
Node logs
Logs prior to restart:
```
2024-11-25T08:29:33.115935Z INFO Canonical chain committed number=22851179 hash=0x6aa9553aa5dfb1f0063dfec4dcbf0f3e647ec45785fdc73059e66e7d39f6a762 elapsed=193.247µs
2024-11-25T08:29:35.065377Z INFO Status connected_peers=59 latest_block=22851179
2024-11-25T08:29:58.786640Z INFO Block added to canonical chain number=22851180 hash=0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b peers=59 txs=291 gas=44.07 Mgas gas_throughput=1.72 Mgas/second full=24.5% base_fee=0.04gwei blobs=0 excess_blobs=0 elapsed=25.663442107s
2024-11-25T08:29:58.867764Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b) })
2024-11-25T08:29:58.870594Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b) })
2024-11-25T08:29:58.873429Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b) })
2024-11-25T08:29:58.873449Z ERROR Failed to send event: Ok(OnForkChoiceUpdated { forkchoice_status: Valid, fut: Left(Ready(Some(Ok(ForkchoiceUpdated { payload_status: PayloadStatus { status: Valid, latest_valid_hash: Some(0x6aa9553aa5dfb1f0063dfec4dcbf0f3e647ec45785fdc73059e66e7d39f6a762) }, payload_id: None })))) })
2024-11-25T08:29:58.873460Z ERROR Failed to send event: Ok(OnForkChoiceUpdated { forkchoice_status: Valid, fut: Left(Ready(Some(Ok(ForkchoiceUpdated { payload_status: PayloadStatus { status: Valid, latest_valid_hash: Some(0x6aa9553aa5dfb1f0063dfec4dcbf0f3e647ec45785fdc73059e66e7d39f6a762) }, payload_id: None })))) })
2024-11-25T08:29:58.874358Z INFO New payload job created id=0x038f68f431015618 parent=0x6aa9553aa5dfb1f0063dfec4dcbf0f3e647ec45785fdc73059e66e7d39f6a762
2024-11-25T08:29:58.881917Z INFO Canonical chain committed number=22851180 hash=0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b elapsed=212.653µs
2024-11-25T08:30:00.065854Z INFO Status connected_peers=59 latest_block=22851180
2024-11-25T08:30:25.065833Z INFO Status connected_peers=59 latest_block=22851180
2024-11-25T08:30:35.652512Z INFO Block added to canonical chain number=22851181 hash=0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1 peers=59 txs=329 gas=31.60 Mgas gas_throughput=1.19 Mgas/second full=17.6% base_fee=0.04gwei blobs=0 excess_blobs=0 elapsed=26.55651455s
2024-11-25T08:30:35.729705Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1) })
2024-11-25T08:30:35.733282Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1) })
2024-11-25T08:30:35.736579Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1) })
2024-11-25T08:30:35.738976Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1) })
2024-11-25T08:30:35.741338Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1) })
2024-11-25T08:30:35.742247Z INFO New payload job created id=0x034a6bd27fd3097a parent=0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b
2024-11-25T08:30:35.747152Z INFO Canonical chain committed number=22851181 hash=0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1 elapsed=249.103µs
2024-11-25T08:30:50.065957Z INFO Status connected_peers=59 latest_block=22851181
```
after redeployment:
```
2024-11-25T09:13:07.146101Z INFO Initialized tracing, debug log directory: /root/.cache/reth/logs/base
2024-11-25T09:13:07.148752Z INFO Starting reth version="1.1.2 (496bf0bf)"
2024-11-25T09:13:07.148775Z INFO Opening database path="/data/db"
2024-11-25T09:13:07.149477Z ERROR shutting down due to error
Error: failed to open the database: unknown error code: 11 (11)
Location:
/project/crates/storage/db/src/mdbx.rs:28:8
```
Platform(s)
Linux (x86)
What version/commit are you on?
reth-optimism-cli Version: 1.1.2
Commit SHA: 496bf0bf715f0a1fafc198f8d72ccd71913d1a40
Build Timestamp: 2024-11-19T10:54:10.868121467Z
Build Features: asm_keccak,jemalloc,optimism
Build Profile: maxperf
What database version are you on?
Current database version: 2
Local database version: 2
Which chain / network are you on?
base mainnet archive
What type of node are you running?
Archive (default)
What prune config do you use, if any?
No response
If you've built Reth from source, provide the full command you used
No response
Code of Conduct
- [x] I agree to follow the Code of Conduct
I have a similar issue and traced it down to `self.txn.txn_execute(|_| unsafe { ffi::mdbx_cursor_close(self.cursor) }).unwrap()` in libmdbx-rs. It might not be the same root cause though, I'm not sure.
Another similar occurrence I found in the reth Telegram group: https://t.me/paradigm_reth/27096
Same here on v0.11.0-reth; this occurs after reth gets stuck for a while and is then restarted.
Can't find this line of code in the latest version @tch1001. Is this still an issue on the latest version @tmeinlschmidt?
I was running a reth node, stopped it, started a manual prune, interrupted the prune midway, and restarting the prune fails with the same error:
$ docker run -v rethdata:/root/.local/share/reth/mainnet ghcr.io/paradigmxyz/reth prune -vvv
2025-02-25T10:26:02.175337Z INFO Initialized tracing, debug log directory: /root/.cache/reth/logs/mainnet
2025-02-25T10:26:02.178849Z INFO Opening storage db_path="/root/.local/share/reth/mainnet/db" sf_path="/root/.local/share/reth/mainnet/static_files"
2025-02-25T10:26:02.225099Z INFO Verifying storage consistency.
2025-02-25T10:26:02.299224Z INFO Copying data from database to static files...
2025-02-25T10:26:02.301585Z INFO Copied data from database to static files lowest_static_file_height=Some(21456689)
2025-02-25T10:26:02.301597Z INFO Pruning data from database... prune_tip=21456689 prune_config=PruneConfig { block_interval: 5, segments: PruneModes { sender_recovery: Some(Full), transaction_lookup: Some(Full), receipts: Some(Distance(10064)), account_history: Some(Distance(10064)), storage_history: Some(Distance(10064)), receipts_log_filter: ReceiptsLogPruneConfig({}) } }
^C
$ docker run -v rethdata:/root/.local/share/reth/mainnet ghcr.io/paradigmxyz/reth prune -vvvvv
2025-02-25T10:28:35.345733Z INFO reth::cli: Initialized tracing, debug log directory: /root/.cache/reth/logs/mainnet
2025-02-25T10:28:35.349731Z INFO reth::cli: Opening storage db_path="/root/.local/share/reth/mainnet/db" sf_path="/root/.local/share/reth/mainnet/static_files"
Error: failed to open the database: unknown error code: 11 (11)
Location:
/project/crates/storage/db/src/mdbx.rs:28:8
$ docker run -v rethdata:/root/.local/share/reth/mainnet ghcr.io/paradigmxyz/reth --version
reth Version: 1.2.0
Commit SHA: 1e0b0d897b372226b3f0ebf911b5176132c322d7
Build Timestamp: 2025-02-12T16:48:04.304586958Z
Build Features: asm_keccak,jemalloc
Build Profile: maxperf
Uh, I just restarted the manual prune and now it works fine(?)
$ docker run -v rethdata:/root/.local/share/reth/mainnet ghcr.io/paradigmxyz/reth prune -vvvvvv
2025-02-25T11:14:04.354832Z INFO reth::cli: Initialized tracing, debug log directory: /root/.cache/reth/logs/mainnet
2025-02-25T11:14:04.368149Z INFO reth::cli: Opening storage db_path="/root/.local/share/reth/mainnet/db" sf_path="/root/.local/share/reth/mainnet/static_files"
2025-02-25T11:14:04.412020Z INFO reth::cli: Verifying storage consistency.
2025-02-25T11:14:04.490016Z DEBUG reth::cli: Initializing genesis chain=mainnet genesis=0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3
2025-02-25T11:14:04.492738Z DEBUG reth_db_common::init: Genesis already written, skipping.
2025-02-25T11:14:04.492744Z INFO reth::cli: Copying data from database to static files...
2025-02-25T11:14:04.493881Z DEBUG static_file: StaticFileProducer started targets=StaticFileTargets { headers: None, receipts: None, transactions: None, block_meta: Some(0..=21456689) }
2025-02-25T11:14:04.496052Z DEBUG static_file: StaticFileProducer finished targets=StaticFileTargets { headers: None, receipts: None, transactions: None, block_meta: Some(0..=21456689) } elapsed=2.133375ms
2025-02-25T11:14:04.496096Z INFO reth::cli: Copied data from database to static files lowest_static_file_height=Some(21456689)
2025-02-25T11:14:04.496100Z INFO reth::cli: Pruning data from database... prune_tip=21456689 prune_config=PruneConfig { block_interval: 5, segments: PruneModes { sender_recovery: Some(Full), transaction_lookup: Some(Full), receipts: Some(Distance(10064)), account_history: Some(Distance(10064)), storage_history: Some(Distance(10064)), receipts_log_filter: ReceiptsLogPruneConfig({}) } }
2025-02-25T11:14:04.497451Z DEBUG pruner: Pruner started tip_block_number=21456689
2025-02-25T11:14:04.498218Z DEBUG pruner: Segment pruning started segment=Headers purpose=StaticFile to_block=21456689 prune_mode=Before(21456690)
2025-02-25T11:14:04.499734Z DEBUG pruner: Segment pruning finished segment=Headers purpose=StaticFile to_block=21456689 prune_mode=Before(21456690) segment_output.pruned=0
2025-02-25T11:14:04.500239Z DEBUG pruner: Segment pruning started segment=Transactions purpose=StaticFile to_block=21456689 prune_mode=Before(21456690)
2025-02-25T11:14:04.500248Z DEBUG pruner: Segment pruning finished segment=Transactions purpose=StaticFile to_block=21456689 prune_mode=Before(21456690) segment_output.pruned=0
2025-02-25T11:14:04.500624Z DEBUG pruner: Segment pruning started segment=Receipts purpose=StaticFile to_block=0 prune_mode=Before(1)
2025-02-25T11:14:04.501081Z DEBUG pruner: Segment pruning finished segment=Receipts purpose=StaticFile to_block=0 prune_mode=Before(1) segment_output.pruned=0
2025-02-25T11:14:04.501490Z DEBUG pruner: Segment pruning started segment=AccountHistory purpose=User to_block=21446625 prune_mode=Distance(10064)
2025-02-25T11:14:20.657722Z DEBUG pruner: Segment pruning finished segment=AccountHistory purpose=User to_block=21446625 prune_mode=Distance(10064) segment_output.pruned=1686478
2025-02-25T11:14:20.657742Z DEBUG pruner: Segment pruning started segment=StorageHistory purpose=User to_block=21446625 prune_mode=Distance(10064)
Hm, it seems that really poor and/or careless Rust bindings are being used for libmdbx (the key-value storage engine); otherwise you would not get an "unknown error code: 11".
This is actually the EAGAIN (on Linux) system error, and libmdbx provides mdbx_strerror_r() for getting the corresponding string/message for a given error code.
Recently I commented on the Isar issue https://github.com/isar/isar/issues/1068#issuecomment-2726769493. Please refer to it for a description of the reasons and a workaround.
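For illustration, a minimal C sketch (assuming the declarations from mdbx.h; the buffer size and the hard-coded error value are just example choices) of how a binding could turn the raw code into a readable message instead of "unknown error code":

```c
#include <stdio.h>
#include "mdbx.h"

/* Sketch: print a readable message for a raw MDBX/system error code.
 * 11 (EAGAIN) is the code reported in the reth logs above. */
int main(void) {
  char buf[256];
  int err = 11;
  const char *msg = mdbx_strerror_r(err, buf, sizeof(buf));
  printf("error %d: %s\n", err, msg);
  return 0;
}
```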
@erthink This problem is common in Docker environments, while it's not easy to encounter in the host environment. I've also tried vorot93/libmdbx-rs, but there are similar problems, except it's not 'unknown error code 11' but 'Resource temporarily unavailable'. Can retrying solve the problem? It seems not to work in my case
In my current understanding, the cause is within Reth, not MDBX, since Erigon doesn't have such issues (AFAIK). However, I don't see the whole picture, and I have neither the time nor the desire to delve into it, as Reth tends to refuse help/advice/cooperation rather than accept it (at least my earlier appeals and advice were mostly ignored).
Nonetheless, there are two completely different reasons for an EAGAIN error:
- Lack of resources or exceeding quotas.
- Some issue with locking and/or shared access when opening the database.
So my advice is as follows:
- Read the Containers notes and check/adjust your case to conform to the requirements.
- Provide an `strace` log using a logging-enabled (debug) build of libmdbx (see the sketch after this list):
  - build libmdbx with logging/debug enabled via the `MDBX_DEBUG=1` option;
  - set up a logger function and the `MDBX_LOG_DEBUG` log level via `mdbx_setup_debug()`;
  - run Reth under `strace` and reproduce the issue;
  - show the log and/or send it to me (Telegram is preferred).
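As a rough sketch of the logger-setup step, assuming the `mdbx_setup_debug()` signature and the `MDBX_LOG_DEBUG` / `MDBX_DBG_DONTCHANGE` constants from the C headers (debug-level messages only appear in a build made with `MDBX_DEBUG=1`):

```c
#include <stdarg.h>
#include <stdio.h>
#include "mdbx.h"

/* Sketch: route libmdbx debug logging to stderr.
 * Assumes a libmdbx build configured with MDBX_DEBUG=1. */
static void logger(MDBX_log_level_t level, const char *function, int line,
                   const char *fmt, va_list args) {
  fprintf(stderr, "mdbx[%d] %s:%d: ", (int)level, function ? function : "?", line);
  vfprintf(stderr, fmt, args);
}

static void enable_mdbx_debug_logging(void) {
  /* MDBX_DBG_DONTCHANGE keeps the debug flags as-is; only the log level
   * and the logger callback are changed here. */
  mdbx_setup_debug(MDBX_LOG_DEBUG, MDBX_DBG_DONTCHANGE, logger);
}
```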
FYI, I believe disk performance > single-core CPU performance might be your bottleneck; then it's easier to run into that corruption after some IO/call causes deadlocks etc. I run reth archive nodes for the most demanding L2 EVM chains like Base via docker-compose, struggled for weeks but now all good.
You need the best disks (NVMe RAID 0) on a filesystem without compression, and may also consider seccomp:unconfined for a performance boost.
This is all entirely irrelevant.
However, it is obvious that in case of a failure or shutdown, it is necessary to wait for the container with Reth to fully terminate/finalize before launching a new one (especially if Reth uses the exclusive MDBX mode, but I have no information about this).
@erthink FYI, I tried switching libmdbx to v0.13.6 but this error still happens while creating an env. Probably it's just an issue in Docker.
reth version
osal_lck_seize:34987 lock-against-without-lck, err 11
v0.13.6
lck_seize:24674 lock-against-without-lck, err 11
mdbx_env_open:9815 error 11 (Resource temporarily unavailable)
ref: https://github.com/erigontech/erigon/issues/8552
Although this is not a problem that frequently occurs on host machines, it significantly impacts the use of libmdbx in Docker, especially in test environments where we need to start numerous services, which makes this issue particularly challenging. Libmdbx is the only key-value store I know that offers good support for multiple processes. We are temporarily unable to switch to other key-value stores for testing.
@0x8f701, thanks for the information.
However, it very much seems that it's not about transients during restarts, etc., but that Reth uses MDBX_EXCLUSIVE mode for the DB and thus simply prevents the DB from being opened by other process(es)/container(s).
Nonetheless, I've prepared a rough patch that just does a few retries of the corresponding lock-taking step. Please check out the release-engineering branch on Gitflic.
If this helps solve the problem, then I will finalize this fix and add it to the stable branch (on which libmdbx v0.13.7 will be based, released at the end of May). Otherwise, we should dig deeper.
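For illustration only (this is not the patch itself, which retries inside libmdbx's lock-seizing step), the same retry idea sketched on the caller side in C; the helper name, attempt count, and one-second back-off are assumptions:

```c
#include <errno.h>
#include <unistd.h>
#include "mdbx.h"

/* Sketch: retry opening an MDBX environment when mdbx_env_open()
 * transiently fails with the system EAGAIN (11), e.g. while a previous
 * container is still releasing the lock during shutdown. */
static int open_env_with_retry(const char *path, MDBX_env_flags_t flags,
                               unsigned attempts, MDBX_env **out) {
  int rc = EAGAIN;
  for (unsigned i = 0; i < attempts; ++i) {
    MDBX_env *env = NULL;
    rc = mdbx_env_create(&env);
    if (rc != MDBX_SUCCESS)
      return rc;
    rc = mdbx_env_open(env, path, flags, 0644);
    if (rc == MDBX_SUCCESS) {
      *out = env;
      return rc;
    }
    mdbx_env_close(env);  /* a failed open leaves the handle unusable */
    if (rc != EAGAIN)
      return rc;          /* non-transient error: give up immediately */
    sleep(1);             /* wait for the previous holder to release the lock */
  }
  return rc;
}
```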
@erthink thanks! BTW, I didn't use libmdbx in reth. I use it in our own multi-process case where there is one writer and multiple readers. So definitely not EXCLUSIVE.
I'll try that branch soon and let you know.
I use it in our own multi-process case where there is one writer and multiple readers. So definitely not EXCLUSIVE.
@0x8f701,
There may be additional difficulties in scenarios that spawn processes while using libmdbx. Basically, either mdbx_env_resurrect_after_fork() should be used, or the DB should be opened after the fork().
However, basic/general multi-process cases are well tested, since they are used as the basis for libmdbx's test framework. Thus, if you encounter an EAGAIN error in such cases, most likely there is some special, particular reason for it.
Therefore, I doubt that my trial/experimental patch will help. I think it will be necessary to look at the whole picture and analyze everything that happens with the DB, including the activity of all containers and processes working with it.
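A minimal C sketch of the two options mentioned above, assuming a libmdbx version (v0.13+) that provides mdbx_env_resurrect_after_fork() with the signature `int mdbx_env_resurrect_after_fork(MDBX_env *)`; the helper name is made up for illustration:

```c
#include <sys/types.h>
#include <unistd.h>
#include "mdbx.h"

/* Sketch: forking a process that already has the environment open. */
static pid_t fork_with_env(MDBX_env *env) {
  pid_t pid = fork();
  if (pid == 0) {
    /* Child: the inherited environment handle must not be used as-is;
     * either resurrect it ... */
    int rc = mdbx_env_resurrect_after_fork(env);
    if (rc != MDBX_SUCCESS)
      _exit(1);
    /* ... or, alternatively, close it and open the DB anew here. */
  }
  return pid;
}
```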
hi @erthink I tried the new patch, I didn't see unknown error code 11 anymore but I got another error: Another write transaction
I'm encountering a similar error when trying to launch Reth. I am attempting to launch Reth in Docker on Ubuntu 24.04 (as part of an Ansible Molecule test).
Docker image: jrei/systemd-ubuntu:24.04
Host OS: macOS Sequoia 15.1.1 (ARM)
Steps to reproduce:
docker run -it jrei/systemd-ubuntu:24.04 /bin/bash
wget https://github.com/paradigmxyz/reth/releases/download/v1.3.12/reth-v1.3.12-x86_64-unknown-linux-gnu.tar.gz
tar -xvzf reth-v1.3.12-x86_64-unknown-linux-gnu.tar.gz
./reth node
Logs:
root@reth-ubuntu24:~# reth node
<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
2025-05-10T11:40:59.510019Z INFO Initialized tracing, debug log directory: /root/.cache/reth/logs/mainnet
2025-05-10T11:40:59.524539Z INFO Starting reth version="1.3.12 (6f8e725)"
2025-05-10T11:40:59.525333Z INFO Opening database path="/root/.local/share/reth/mainnet/db"
2025-05-10T11:40:59.546317Z ERROR shutting down due to error
Error: failed to open the database: unknown error code: 95 (95)
Location:
/home/runner/work/reth/reth/crates/storage/db/src/mdbx.rs:28:8
From the libmdbx developer's point of view:
- Bindings should use mdbx_strerror_r() to get readable error-code information.
- The 95 is the EOPNOTSUPP ("Operation not supported on transport endpoint") error from the system/kernel (not from libmdbx). Most likely the reason is different and has nothing to do with the EAGAIN=11 discussed in this issue.
- To dig into the reason, debug logging from libmdbx and the strace utility should be used.
- Preliminarily, I can suppose that the reason is that the libmdbx build was made for glibc and/or kernel version(s) that differ from the one(s) actually available in the container.
@No0key, especially for your case:
Docker image: jrei/systemd-ubuntu:24.04
Host OS: macOS Sequoia 15.1.1 (ARM)
docker run -it jrei/systemd-ubuntu:24.04 /bin/bash
wget https://github.com/paradigmxyz/reth/releases/download/v1.3.12/reth-v1.3.12-x86_64-unknown-linux-gnu.tar.gz
You are trying to run AMD64 binary code on an ARM64 system. In general this is not possible with libmdbx, since low-level structures (both locking and lock-free) are used and placed within shared memory.
So, for such a possibility, for instance, the kernel futex and shared POSIX mutex structures and implementations would have to match bit-to-bit between the host ARM64 kernel and the AMD64 guest user code -- but this is not possible, unless such full cross-architecture compatibility/interoperability was requested/required and provided initially.
Further, docker/qemu/etc. lack support for a lot of syscalls, their options/flags, etc. So Docker just returns the 95.
hi @erthink I tried the new patch, I didn't see unknown error code 11 anymore but I got another error: Another write transaction
I suggest you prepare a docker image with a ready-made script inside that will reproduce the problem in a self-sufficient and confident manner. I think this is the easiest way to provide me with the information I need to investigate the problem.
checking in here to see if this is still an issue with the latest reth versions? if so, could anyone provide more logs for us to track down the issue?
checking in here to see if this is still an issue with the latest reth versions? if so, could anyone provide more logs for us to track down the issue?
There were no changes in libmdbx that could have affected the situation (except for the experiment, which did not yield results and was purged long ago).
This issue is stale because it has been open for 21 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Hi guys, latest update:
libmdbx uses the PID in its file lock, so in Docker you need to add some extra parameters to make it work; for the details please check libmdbx's README.