"Error: failed to open the database: unknown error code: 11 (11)" after restart and zombie process
Describe the bug
op-reth ended up with this in the logs. When I tried to restart it via Docker, it reported the process as a zombie and I was unable to restart it. So I tried a redeploy and ended up with a corrupted database and a failed deployment - see logs.
op-reth: 1.1.2
Steps to reproduce
No idea how to reproduce. The node might have been under heavy load, so some thread could have been lost/stuck, and then it ended up like this.
Node logs
Logs prior to restart:
```
2024-11-25T08:29:33.115935Z INFO Canonical chain committed number=22851179 hash=0x6aa9553aa5dfb1f0063dfec4dcbf0f3e647ec45785fdc73059e66e7d39f6a762 elapsed=193.247µs
2024-11-25T08:29:35.065377Z INFO Status connected_peers=59 latest_block=22851179
2024-11-25T08:29:58.786640Z INFO Block added to canonical chain number=22851180 hash=0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b peers=59 txs=291 gas=44.07 Mgas gas_throughput=1.72 Mgas/second full=24.5% base_fee=0.04gwei blobs=0 excess_blobs=0 elapsed=25.663442107s
2024-11-25T08:29:58.867764Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b) })
2024-11-25T08:29:58.870594Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b) })
2024-11-25T08:29:58.873429Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b) })
2024-11-25T08:29:58.873449Z ERROR Failed to send event: Ok(OnForkChoiceUpdated { forkchoice_status: Valid, fut: Left(Ready(Some(Ok(ForkchoiceUpdated { payload_status: PayloadStatus { status: Valid, latest_valid_hash: Some(0x6aa9553aa5dfb1f0063dfec4dcbf0f3e647ec45785fdc73059e66e7d39f6a762) }, payload_id: None })))) })
2024-11-25T08:29:58.873460Z ERROR Failed to send event: Ok(OnForkChoiceUpdated { forkchoice_status: Valid, fut: Left(Ready(Some(Ok(ForkchoiceUpdated { payload_status: PayloadStatus { status: Valid, latest_valid_hash: Some(0x6aa9553aa5dfb1f0063dfec4dcbf0f3e647ec45785fdc73059e66e7d39f6a762) }, payload_id: None })))) })
2024-11-25T08:29:58.874358Z INFO New payload job created id=0x038f68f431015618 parent=0x6aa9553aa5dfb1f0063dfec4dcbf0f3e647ec45785fdc73059e66e7d39f6a762
2024-11-25T08:29:58.881917Z INFO Canonical chain committed number=22851180 hash=0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b elapsed=212.653µs
2024-11-25T08:30:00.065854Z INFO Status connected_peers=59 latest_block=22851180
2024-11-25T08:30:25.065833Z INFO Status connected_peers=59 latest_block=22851180
2024-11-25T08:30:35.652512Z INFO Block added to canonical chain number=22851181 hash=0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1 peers=59 txs=329 gas=31.60 Mgas gas_throughput=1.19 Mgas/second full=17.6% base_fee=0.04gwei blobs=0 excess_blobs=0 elapsed=26.55651455s
2024-11-25T08:30:35.729705Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1) })
2024-11-25T08:30:35.733282Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1) })
2024-11-25T08:30:35.736579Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1) })
2024-11-25T08:30:35.738976Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1) })
2024-11-25T08:30:35.741338Z ERROR Failed to send event: Ok(PayloadStatus { status: Valid, latest_valid_hash: Some(0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1) })
2024-11-25T08:30:35.742247Z INFO New payload job created id=0x034a6bd27fd3097a parent=0x1f03f982daba7a37a07010d7f0b9b8e5480a933a1950114ec05832d435ae916b
2024-11-25T08:30:35.747152Z INFO Canonical chain committed number=22851181 hash=0x002b31571a9688b9ea16c3fe41a8e9409c82a9b6143d7a7aa4a5ca6b7d7765e1 elapsed=249.103µs
2024-11-25T08:30:50.065957Z INFO Status connected_peers=59 latest_block=22851181
```
after redeployment:
```
2024-11-25T09:13:07.146101Z INFO Initialized tracing, debug log directory: /root/.cache/reth/logs/base
2024-11-25T09:13:07.148752Z INFO Starting reth version="1.1.2 (496bf0bf)"
2024-11-25T09:13:07.148775Z INFO Opening database path="/data/db"
2024-11-25T09:13:07.149477Z ERROR shutting down due to error
Error: failed to open the database: unknown error code: 11 (11)
Location:
/project/crates/storage/db/src/mdbx.rs:28:8
```
Platform(s)
Linux (x86)
What version/commit are you on?
reth-optimism-cli Version: 1.1.2
Commit SHA: 496bf0bf715f0a1fafc198f8d72ccd71913d1a40
Build Timestamp: 2024-11-19T10:54:10.868121467Z
Build Features: asm_keccak,jemalloc,optimism
Build Profile: maxperf
What database version are you on?
Current database version: 2
Local database version: 2
Which chain / network are you on?
base mainnet archive
What type of node are you running?
Archive (default)
What prune config do you use, if any?
No response
If you've built Reth from source, provide the full command you used
No response
Code of Conduct
- [x] I agree to follow the Code of Conduct
I have a similar issue and traced it down to `self.txn.txn_execute(|_| unsafe { ffi::mdbx_cursor_close(self.cursor) }).unwrap()` in libmdbx-rs. It might not be the same root cause though, I'm not sure.
Another similar occurrence I found in the reth Telegram group: https://t.me/paradigm_reth/27096
Same here on v0.11.0-reth; this occurs after reth gets stuck for a while and is then restarted.
Can't find this line of code in the latest version @tch1001. Is this still an issue on the latest version @tmeinlschmidt?
I was running a reth node, stopped it, started a manual prune, interrupted the prune midway, and restarting the prune fails with the same error:
$ docker run -v rethdata:/root/.local/share/reth/mainnet ghcr.io/paradigmxyz/reth prune -vvv
2025-02-25T10:26:02.175337Z INFO Initialized tracing, debug log directory: /root/.cache/reth/logs/mainnet
2025-02-25T10:26:02.178849Z INFO Opening storage db_path="/root/.local/share/reth/mainnet/db" sf_path="/root/.local/share/reth/mainnet/static_files"
2025-02-25T10:26:02.225099Z INFO Verifying storage consistency.
2025-02-25T10:26:02.299224Z INFO Copying data from database to static files...
2025-02-25T10:26:02.301585Z INFO Copied data from database to static files lowest_static_file_height=Some(21456689)
2025-02-25T10:26:02.301597Z INFO Pruning data from database... prune_tip=21456689 prune_config=PruneConfig { block_interval: 5, segments: PruneModes { sender_recovery: Some(Full), transaction_lookup: Some(Full), receipts: Some(Distance(10064)), account_history: Some(Distance(10064)), storage_history: Some(Distance(10064)), receipts_log_filter: ReceiptsLogPruneConfig({}) } }
^C
$ docker run -v rethdata:/root/.local/share/reth/mainnet ghcr.io/paradigmxyz/reth prune -vvvvv
2025-02-25T10:28:35.345733Z INFO reth::cli: Initialized tracing, debug log directory: /root/.cache/reth/logs/mainnet
2025-02-25T10:28:35.349731Z INFO reth::cli: Opening storage db_path="/root/.local/share/reth/mainnet/db" sf_path="/root/.local/share/reth/mainnet/static_files"
Error: failed to open the database: unknown error code: 11 (11)
Location:
/project/crates/storage/db/src/mdbx.rs:28:8
$ docker run -v rethdata:/root/.local/share/reth/mainnet ghcr.io/paradigmxyz/reth --version
reth Version: 1.2.0
Commit SHA: 1e0b0d897b372226b3f0ebf911b5176132c322d7
Build Timestamp: 2025-02-12T16:48:04.304586958Z
Build Features: asm_keccak,jemalloc
Build Profile: maxperf
Uh, I just restarted the manual prune and now it works fine(?)
$ docker run -v rethdata:/root/.local/share/reth/mainnet ghcr.io/paradigmxyz/reth prune -vvvvvv
2025-02-25T11:14:04.354832Z INFO reth::cli: Initialized tracing, debug log directory: /root/.cache/reth/logs/mainnet
2025-02-25T11:14:04.368149Z INFO reth::cli: Opening storage db_path="/root/.local/share/reth/mainnet/db" sf_path="/root/.local/share/reth/mainnet/static_files"
2025-02-25T11:14:04.412020Z INFO reth::cli: Verifying storage consistency.
2025-02-25T11:14:04.490016Z DEBUG reth::cli: Initializing genesis chain=mainnet genesis=0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3
2025-02-25T11:14:04.492738Z DEBUG reth_db_common::init: Genesis already written, skipping.
2025-02-25T11:14:04.492744Z INFO reth::cli: Copying data from database to static files...
2025-02-25T11:14:04.493881Z DEBUG static_file: StaticFileProducer started targets=StaticFileTargets { headers: None, receipts: None, transactions: None, block_meta: Some(0..=21456689) }
2025-02-25T11:14:04.496052Z DEBUG static_file: StaticFileProducer finished targets=StaticFileTargets { headers: None, receipts: None, transactions: None, block_meta: Some(0..=21456689) } elapsed=2.133375ms
2025-02-25T11:14:04.496096Z INFO reth::cli: Copied data from database to static files lowest_static_file_height=Some(21456689)
2025-02-25T11:14:04.496100Z INFO reth::cli: Pruning data from database... prune_tip=21456689 prune_config=PruneConfig { block_interval: 5, segments: PruneModes { sender_recovery: Some(Full), transaction_lookup: Some(Full), receipts: Some(Distance(10064)), account_history: Some(Distance(10064)), storage_history: Some(Distance(10064)), receipts_log_filter: ReceiptsLogPruneConfig({}) } }
2025-02-25T11:14:04.497451Z DEBUG pruner: Pruner started tip_block_number=21456689
2025-02-25T11:14:04.498218Z DEBUG pruner: Segment pruning started segment=Headers purpose=StaticFile to_block=21456689 prune_mode=Before(21456690)
2025-02-25T11:14:04.499734Z DEBUG pruner: Segment pruning finished segment=Headers purpose=StaticFile to_block=21456689 prune_mode=Before(21456690) segment_output.pruned=0
2025-02-25T11:14:04.500239Z DEBUG pruner: Segment pruning started segment=Transactions purpose=StaticFile to_block=21456689 prune_mode=Before(21456690)
2025-02-25T11:14:04.500248Z DEBUG pruner: Segment pruning finished segment=Transactions purpose=StaticFile to_block=21456689 prune_mode=Before(21456690) segment_output.pruned=0
2025-02-25T11:14:04.500624Z DEBUG pruner: Segment pruning started segment=Receipts purpose=StaticFile to_block=0 prune_mode=Before(1)
2025-02-25T11:14:04.501081Z DEBUG pruner: Segment pruning finished segment=Receipts purpose=StaticFile to_block=0 prune_mode=Before(1) segment_output.pruned=0
2025-02-25T11:14:04.501490Z DEBUG pruner: Segment pruning started segment=AccountHistory purpose=User to_block=21446625 prune_mode=Distance(10064)
2025-02-25T11:14:20.657722Z DEBUG pruner: Segment pruning finished segment=AccountHistory purpose=User to_block=21446625 prune_mode=Distance(10064) segment_output.pruned=1686478
2025-02-25T11:14:20.657742Z DEBUG pruner: Segment pruning started segment=StorageHistory purpose=User to_block=21446625 prune_mode=Distance(10064)
Hm, it seems that really poor and/or careless Rust bindings are being used for libmdbx (the key-value storage engine); otherwise you would not get an "unknown error code: 11".
This is actually the EAGAIN (on Linux) system error, and libmdbx provides mdbx_strerror_r() for getting the corresponding string/message for a given error code.
Recently I commented on the Isar issue https://github.com/isar/isar/issues/1068#issuecomment-2726769493. Please refer to it for a description of the reasons and a workaround.
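For illustration, a minimal C sketch (assuming the declarations from mdbx.h; the buffer size and the hard-coded error value are just example choices) of how a binding could turn the raw code into a readable message instead of "unknown error code":

```c
#include <stdio.h>
#include "mdbx.h"

/* Sketch: print a readable message for a raw MDBX/system error code.
 * 11 (EAGAIN) is the code reported in the reth logs above. */
int main(void) {
  char buf[256];
  int err = 11;
  const char *msg = mdbx_strerror_r(err, buf, sizeof(buf));
  printf("error %d: %s\n", err, msg);
  return 0;
}
```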
@erthink This problem is common in Docker environments, while it's not easy to encounter in the host environment. I've also tried vorot93/libmdbx-rs, but there are similar problems, except it's not 'unknown error code 11' but 'Resource temporarily unavailable'. Can retrying solve the problem? It seems not to work in my case
In my current understanding, the cause is within Reth, not MDBX, since Erigon doesn't have such issues (AFAIK). However, I don't see the whole picture, and I have neither the time nor the desire to delve into it, as Reth tends to refuse help/advice/cooperation rather than accept it (at least my earlier appeals and advice were mostly ignored).
Nonetheless, there are two completely different reasons for an EAGAIN error:
- Lack of resources or exceeding quotas.
- Some issue with locking and/or shared access when opening the database.
So my advice is as follows:
- Read the Containers notes and check/adjust your case to conform to the requirements.
- Provide an `strace` log using a logging-enabled (debug) build of libmdbx (see the sketch after this list):
  - build libmdbx with logging/debug enabled via the `MDBX_DEBUG=1` option;
  - set up a logger function and the `MDBX_LOG_DEBUG` log level via `mdbx_setup_debug()`;
  - run Reth under `strace` and reproduce the issue;
  - show the log and/or send it to me (Telegram is preferred).
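As a rough sketch of the logger-setup step, assuming the `mdbx_setup_debug()` signature and the `MDBX_LOG_DEBUG` / `MDBX_DBG_DONTCHANGE` constants from the C headers (debug-level messages only appear in a build made with `MDBX_DEBUG=1`):

```c
#include <stdarg.h>
#include <stdio.h>
#include "mdbx.h"

/* Sketch: route libmdbx debug logging to stderr.
 * Assumes a libmdbx build configured with MDBX_DEBUG=1. */
static void logger(MDBX_log_level_t level, const char *function, int line,
                   const char *fmt, va_list args) {
  fprintf(stderr, "mdbx[%d] %s:%d: ", (int)level, function ? function : "?", line);
  vfprintf(stderr, fmt, args);
}

static void enable_mdbx_debug_logging(void) {
  /* MDBX_DBG_DONTCHANGE keeps the debug flags as-is; only the log level
   * and the logger callback are changed here. */
  mdbx_setup_debug(MDBX_LOG_DEBUG, MDBX_DBG_DONTCHANGE, logger);
}
```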
FYI, I believe disk performance > single-core CPU performance might be your bottleneck; then it's easier to run into that corruption after some IO/call causes deadlocks etc. I run reth archive nodes for the most demanding L2 EVM chains like Base via docker-compose, struggled for weeks but now all good.
You need the best disks (NVMe RAID 0) on a filesystem without compression, and may also consider seccomp:unconfined for a performance boost.
This is all entirely irrelevant.
However, it is obvious that in case of a failure or shutdown, it is necessary to wait for the container with Reth to fully terminate/finalize before launching a new one (especially if Reth uses the exclusive MDBX mode, but I have no information about this).
@erthink FYI, I tried switching libmdbx to v0.13.6 but this error still happens while creating an env. Probably it's just an issue in Docker.
reth version
osal_lck_seize:34987 lock-against-without-lck, err 11
v0.13.6
lck_seize:24674 lock-against-without-lck, err 11
mdbx_env_open:9815 error 11 (Resource temporarily unavailable)
ref: https://github.com/erigontech/erigon/issues/8552
Although this is not a problem that frequently occurs on host machines, it significantly impacts the use of libmdbx in Docker, especially in test environments where we need to start numerous services, which makes this issue particularly challenging. Libmdbx is the only key-value store I know that offers good support for multiple processes. We are temporarily unable to switch to other key-value stores for testing.
@0x8f701, thanks for the information.
However, it very much seems that it's not about transients during restarts, etc., but that Reth uses MDBX_EXCLUSIVE mode for the DB and thus simply prevents the DB from being opened by other process(es)/container(s).
Nonetheless, I've prepared a rough patch that just does a few retries of the corresponding lock-taking step. Please check out the release-engineering branch on Gitflic.
If this helps solve the problem, then I will finalize this fix and add it to the stable branch (on which libmdbx v0.13.7 will be based, released at the end of May). Otherwise, we should dig deeper.
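For illustration only (this is not the patch itself, which retries inside libmdbx's lock-seizing step), the same retry idea sketched on the caller side in C; the helper name, attempt count, and one-second back-off are assumptions:

```c
#include <errno.h>
#include <unistd.h>
#include "mdbx.h"

/* Sketch: retry opening an MDBX environment when mdbx_env_open()
 * transiently fails with the system EAGAIN (11), e.g. while a previous
 * container is still releasing the lock during shutdown. */
static int open_env_with_retry(const char *path, MDBX_env_flags_t flags,
                               unsigned attempts, MDBX_env **out) {
  int rc = EAGAIN;
  for (unsigned i = 0; i < attempts; ++i) {
    MDBX_env *env = NULL;
    rc = mdbx_env_create(&env);
    if (rc != MDBX_SUCCESS)
      return rc;
    rc = mdbx_env_open(env, path, flags, 0644);
    if (rc == MDBX_SUCCESS) {
      *out = env;
      return rc;
    }
    mdbx_env_close(env);  /* a failed open leaves the handle unusable */
    if (rc != EAGAIN)
      return rc;          /* non-transient error: give up immediately */
    sleep(1);             /* wait for the previous holder to release the lock */
  }
  return rc;
}
```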
@erthink thanks! BTW, I didn't use libmdbx in reth. I use it in our own multi-process case where there is one writer and multiple readers. So definitely not EXCLUSIVE.
I'll try that branch soon and let you know.
I use it in our own multi-process case where there is one writer and multiple readers. So definitely not EXCLUSIVE.
@0x8f701,
There may be additional difficulties in scenarios that spawn processes while using libmdbx. Basically, either mdbx_env_resurrect_after_fork() should be used, or the DB should be opened after the fork().
However, basic/general multi-process cases are well tested, since they are used as the basis for libmdbx's test framework. Thus, if you encounter an EAGAIN error in such cases, most likely there is some special, particular reason for it.
Therefore, I doubt that my trial/experimental patch will help. I think it will be necessary to look at the whole picture and analyze everything that happens with the DB, including the activity of all containers and processes working with it.
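A minimal C sketch of the two options mentioned above, assuming a libmdbx version (v0.13+) that provides mdbx_env_resurrect_after_fork() with the signature `int mdbx_env_resurrect_after_fork(MDBX_env *)`; the helper name is made up for illustration:

```c
#include <sys/types.h>
#include <unistd.h>
#include "mdbx.h"

/* Sketch: forking a process that already has the environment open. */
static pid_t fork_with_env(MDBX_env *env) {
  pid_t pid = fork();
  if (pid == 0) {
    /* Child: the inherited environment handle must not be used as-is;
     * either resurrect it ... */
    int rc = mdbx_env_resurrect_after_fork(env);
    if (rc != MDBX_SUCCESS)
      _exit(1);
    /* ... or, alternatively, close it and open the DB anew here. */
  }
  return pid;
}
```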
hi @erthink I tried the new patch, I didn't see unknown error code 11 anymore but I got another error: Another write transaction
I'm encountering a similar error when trying to launch Reth. I am attempting to launch Reth in Docker on Ubuntu 24.04 (as part of an Ansible Molecule test).
Docker image: jrei/systemd-ubuntu:24.04
Host OS: macOS Sequoia 15.1.1 (ARM)
Steps to reproduce:
docker run -it jrei/systemd-ubuntu:24.04 /bin/bash
wget https://github.com/paradigmxyz/reth/releases/download/v1.3.12/reth-v1.3.12-x86_64-unknown-linux-gnu.tar.gz
tar -xvzf reth-v1.3.12-x86_64-unknown-linux-gnu.tar.gz
./reth node
Logs:
root@reth-ubuntu24:~# reth node
<jemalloc>: MADV_DONTNEED does not work (memset will be used instead)
<jemalloc>: (This is the expected behaviour if you are running under QEMU)
2025-05-10T11:40:59.510019Z INFO Initialized tracing, debug log directory: /root/.cache/reth/logs/mainnet
2025-05-10T11:40:59.524539Z INFO Starting reth version="1.3.12 (6f8e725)"
2025-05-10T11:40:59.525333Z INFO Opening database path="/root/.local/share/reth/mainnet/db"
2025-05-10T11:40:59.546317Z ERROR shutting down due to error
Error: failed to open the database: unknown error code: 95 (95)
Location:
/home/runner/work/reth/reth/crates/storage/db/src/mdbx.rs:28:8
From the libmdbx developer's point of view:
- Bindings should use mdbx_strerror_r() to get readable error-code information.
- The 95 is the EOPNOTSUPP ("Operation not supported on transport endpoint") error from the system/kernel (not from libmdbx). Most likely the reason is different and has nothing to do with the EAGAIN=11 discussed in this issue.
- To dig into the reason, debug logging from libmdbx and the strace utility should be used.
- Preliminarily, I can suppose that the reason is that the libmdbx build was made for glibc and/or kernel version(s) that differ from the one(s) actually available in the container.
@No0key, especially for your case:
Docker image: jrei/systemd-ubuntu:24.04
Host OS: macOS Sequoia 15.1.1 (ARM)
docker run -it jrei/systemd-ubuntu:24.04 /bin/bash
wget https://github.com/paradigmxyz/reth/releases/download/v1.3.12/reth-v1.3.12-x86_64-unknown-linux-gnu.tar.gz
You are trying to run AMD64 binary code on an ARM64 system. In general this is not possible with libmdbx, since low-level structures (both locking and lock-free) are used and placed within shared memory.
So, for such a possibility, for instance, the kernel futex and shared POSIX mutex structures and implementations would have to match bit-to-bit between the host ARM64 kernel and the AMD64 guest user code -- but this is not possible, unless such full cross-architecture compatibility/interoperability was requested/required and provided initially.
Further, docker/qemu/etc. lack support for a lot of syscalls, their options/flags, etc. So Docker just returns the 95.
hi @erthink I tried the new patch, I didn't see unknown error code 11 anymore but I got another error: Another write transaction
I suggest you prepare a docker image with a ready-made script inside that will reproduce the problem in a self-sufficient and confident manner. I think this is the easiest way to provide me with the information I need to investigate the problem.
checking in here to see if this is still an issue with the latest reth versions? if so, could anyone provide more logs for us to track down the issue?
checking in here to see if this is still an issue with the latest reth versions? if so, could anyone provide more logs for us to track down the issue?
There were no changes in libmdbx that could have affected the situation (except for the experiment, which did not yield results and was purged long ago).
This issue is stale because it has been open for 21 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Hi guys, latest update:
libmdbx uses the PID in its file lock, so in Docker you need to add some extra parameters to make it work; for the details please check libmdbx's README.