Production Caplin nodes stall after snapshot aggregation while Prysm on an identical setup continues
System information
Erigon version: v3.3.2 (may affect other 3.x versions - similar symptoms observed previously but root cause only now identified)
OS & Version: Linux (Fedora 42/43)
Commit hash: f1c398aaaacbe3f1517c953761c0a878830625d2 (current test environment; issue observed across multiple versions/commits)
Erigon Command (with flags/config):

```
erigon --datadir /var/lib/erigon --prune.mode=archive --private.api.addr=127.0.0.1:9090 --http --http.compression --http.addr=0.0.0.0 --http.api eth,net,web3,engine,debug,trace,erigon --ws --ws.compression --rpc.gascap 0 --rpc.returndata.limit=10000000 --authrpc.jwtsecret=/etc/jwt.hex --db.size.limit=8TB --db.pagesize=16k --torrent.download.rate=256mb --torrent.upload.rate=8mb --downloader.disable.ipv6=true --nat=extip:aa.bb.cc.dd
```
Consensus Layer: Caplin (internal)
Consensus Layer Command (with flags/config): N/A (using internal Caplin)
Chain/Network: Ethereum mainnet
Hardware tested:
- AWS i4i.2xlarge (8 vCPU, 64GB RAM, NVMe) - x86_64
- AWS x2gd.xlarge (4 vCPU, 32GB RAM, NVMe) - ARM64/Graviton
- AWS i7i.xlarge (4 vCPU, 32GB RAM, NVMe) - x86_64
We're happy to provide additional logs, run diagnostic builds, or test patches to help resolve this.
Expected behaviour
After snapshot aggregation completes, Caplin should continue delivering blocks to the execution layer without interruption. `[NewPayload]` messages should continue appearing in the logs, and head updates should proceed normally.
Actual behaviour
Caplin stops delivering blocks for 3-20 minutes following snapshot aggregation operations. Specifically:
- Snapshot aggregation completes normally: `[snapshots] aggregated step=XXXX took=Xm`
- Caplin stops delivering blocks to the execution layer
- Logs show only periodic P2P heartbeats (`P2P app=caplin peers=127`)
- No `[NewPayload]` messages appear
- No head updates occur
- Execution layer remains idle, waiting for consensus input
- Condition persists indefinitely until service restart
Log pattern during stall:

```
[INFO] P2P app=caplin peers=127
[INFO] P2P app=caplin peers=127
[INFO] P2P app=caplin peers=127
[INFO] P2P app=caplin peers=127
```
Recovery behavior: In some cases, affected nodes do eventually resume block processing without a restart, but in a degraded state: they process blocks slowly and fall steadily further behind the chain tip, never catching up. A service restart resolves this and allows sync to complete normally. This suggests Caplin may enter a throttled or resource-constrained state that persists until restart.
Side effect - merge interruption: If the service is restarted during an in-progress merge, the operation is interrupted:
```
[WARN] [snapshots] state merge failed err=build idx: loadIntoTable : stopped
```
The merge must then restart from the beginning after service recovery.
Steps to reproduce the behaviour
Note on reproduction: This issue cannot be triggered on demand - it requires waiting for snapshot aggregation events which occur hours or sometimes days apart. However, when it does occur, it affects multiple nodes simultaneously and is critically disruptive to our operations, requiring immediate intervention across the fleet. We've documented multiple occurrences to provide as much data as possible.
- Run Erigon with Caplin (default, no `--externalcl`) in archive mode
- Wait for a snapshot aggregation event (occurs periodically as the chain progresses)
- Observe logs after the `[snapshots] aggregated step=XXXX` message appears
- Note the absence of `[NewPayload]` messages and head updates while P2P heartbeats continue
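For monitoring, the stall check can be scripted. The sketch below flags a node when no `[NewPayload]` line has been logged recently; the systemd unit name `erigon`, the threshold, and the journal format are assumptions about a particular deployment, not part of Erigon itself:

```shell
#!/usr/bin/env sh
# Hypothetical stall check: reports a stall when no [NewPayload] log line
# has appeared for more than STALL_THRESHOLD seconds. The unit name
# "erigon" and the use of journald are deployment assumptions.
STALL_THRESHOLD="${STALL_THRESHOLD:-180}"

# Pure helper: succeed (exit 0) when now - last_seen exceeds the threshold.
is_stalled() {
  last_seen="$1"; now="$2"; threshold="$3"
  [ $(( now - last_seen )) -gt "$threshold" ]
}

check_erigon() {
  # -o short-unix prefixes each line with an epoch timestamp.
  last_line=$(journalctl -u erigon -o short-unix --since '-1 hour' \
    | grep '\[NewPayload\]' | tail -n 1)
  if [ -z "$last_line" ]; then
    echo 'stall: no [NewPayload] in the last hour'
    return 0
  fi
  last_seen="${last_line%%.*}"
  if is_stalled "$last_seen" "$(date +%s)" "$STALL_THRESHOLD"; then
    echo "stall: $(( $(date +%s) - last_seen ))s since last [NewPayload]"
  fi
}
```

In our experience the gap between heartbeats-only logging and the last `[NewPayload]` line is the clearest signal that a node has entered the stalled state.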
Validation performed:
To confirm the issue is Caplin-specific, we ran a comparative test with three nodes hitting step 2039 aggregation simultaneously:
| Node | Hardware | Aggregation Time | Consensus Client | Post-Aggregation Behavior |
|---|---|---|---|---|
| Node 1 | i4i.2xlarge | 5m24s | Caplin | Stall, no blocks for 4+ minutes |
| Node 2 | x2gd.xlarge (ARM64) | 6m44s | Caplin | Stall, no blocks for 4+ minutes |
| Node 3 | i7i.xlarge | 7m29s | Prysm (`--externalcl`) | Continuous block delivery |
The Prysm node continued processing blocks throughout (with `[NewPayload]` messages appearing continuously), confirming the issue is specific to Caplin rather than the execution layer or snapshot system.
Isolated stall example (Node 2, Dec 18 ~19:25 UTC):
- 19:24:28: Last head update (block 24041258)
- 19:25:01: `[5/5 Execution][agg] computing trie` begins
- 19:25:08 - 19:29:08: Only P2P heartbeats visible, no block delivery
- 19:29:08: Service restarted by external watchdog
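Because restarting during a merge forces the merge to start over (see the side effect above), our watchdog defers restarts while a merge appears to be in progress. A minimal sketch of that logic, with the log pattern and unit name as assumptions specific to our setup:

```shell
#!/usr/bin/env sh
# Sketch of an external watchdog's restart decision (names hypothetical).
# It restarts the erigon unit on a stall, but defers while a snapshot
# merge seems to be running, to avoid the "state merge failed ... stopped"
# interruption described in this report.
merge_in_progress() {
  # Assumption: in-progress merges emit "[snapshots] ... merge" lines;
  # tune the pattern to the Erigon version in use.
  journalctl -u erigon --since '-10 min' | grep -q '\[snapshots\] .*merge'
}

restart_if_safe() {
  if merge_in_progress; then
    echo 'stall detected but merge in progress; deferring restart'
    return 1
  fi
  systemctl restart erigon
}
```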
Backtrace
No crash/backtrace - the service remains running but Caplin stops delivering blocks. The issue manifests as a functional stall rather than a crash.
Possible investigation areas:
- What internal state does Caplin enter after snapshot aggregation completes?
- Why do some nodes resume processing but in a degraded state that never catches up?
- Are block requests being made to peers but not processed?
- Could heavy I/O during aggregation cause a deadlock or blocking condition in Caplin's processing loop?
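To help with the deadlock question, we can capture goroutine dumps during a stall. A sketch of what we would run, assuming the node was started with `--pprof` (the 127.0.0.1:6060 endpoint and the filename scheme are assumptions about a default configuration):

```shell
#!/usr/bin/env sh
# Capture a full goroutine dump from Erigon's pprof endpoint during a
# stall. Assumes the node was started with --pprof (debug HTTP server,
# 127.0.0.1:6060 by default in geth-style configurations).
PPROF="${PPROF:-http://127.0.0.1:6060}"

# Pure helper: build a timestamped output filename.
dump_file() {
  printf 'goroutines-%s.txt' "$1"
}

capture() {
  out="$(dump_file "$(date +%s)")"
  # debug=2 returns the full stack of every goroutine, which should show
  # where Caplin's processing loop is blocked, if it is blocked.
  curl -sf --max-time 5 "$PPROF/debug/pprof/goroutine?debug=2" -o "$out" \
    && echo "wrote $out"
}
```

We're happy to collect such dumps the next time a stall occurs if that would help the investigation.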
We're running Caplin in production because of its performance advantages and architectural elegance as an integrated solution. The goal of this report is to help identify and resolve this issue. Happy to assist however we can!