
Production Caplin nodes freeze after snapshot aggregation while Prysm on identical setup continues


System information

Erigon version: v3.3.2 (may affect other 3.x versions; similar symptoms were observed previously, but the trigger was only now identified)

OS & Version: Linux (Fedora 42/43)

Commit hash: f1c398aaaacbe3f1517c953761c0a878830625d2 (current test environment; issue observed across multiple versions/commits)

Erigon Command (with flags/config):

erigon --datadir /var/lib/erigon --prune.mode=archive --private.api.addr=127.0.0.1:9090 --http --http.compression --http.addr=0.0.0.0 --http.api eth,net,web3,engine,debug,trace,erigon --ws --ws.compression --rpc.gascap 0 --rpc.returndata.limit=10000000 --authrpc.jwtsecret=/etc/jwt.hex --db.size.limit=8TB --db.pagesize=16k --torrent.download.rate=256mb --torrent.upload.rate=8mb --downloader.disable.ipv6=true --nat=extip:aa.bb.cc.dd

Consensus Layer: Caplin (internal)

Consensus Layer Command (with flags/config): N/A (using internal Caplin)

Chain/Network: Ethereum mainnet

Hardware tested:

  • AWS i4i.2xlarge (8 vCPU, 64GB RAM, NVMe) - x86_64
  • AWS x2gd.xlarge (4 vCPU, 32GB RAM, NVMe) - ARM64/Graviton
  • AWS i7i.xlarge (4 vCPU, 32GB RAM, NVMe) - x86_64

We're happy to provide additional logs, run diagnostic builds, or test patches to help resolve this.

Expected behaviour

After snapshot aggregation completes, Caplin should continue delivering blocks to the execution layer without interruption. [NewPayload] messages should continue appearing in logs, and head updates should proceed normally.

Actual behaviour

Caplin stops delivering blocks for 3-20 minutes, and in some cases until restarted, following snapshot aggregation operations. Specifically:

  1. Snapshot aggregation completes normally: [snapshots] aggregated step=XXXX took=Xm
  2. Caplin stops delivering blocks to the execution layer
  3. Logs show only periodic P2P heartbeats (P2P app=caplin peers=127)
  4. No [NewPayload] messages appear
  5. No head updates occur
  6. Execution layer remains idle, waiting for consensus input
  7. Condition persists until service restart (the partial recoveries described under Recovery behavior below are the exception)

Log pattern during stall:

[INFO] P2P    app=caplin peers=127
[INFO] P2P    app=caplin peers=127
[INFO] P2P    app=caplin peers=127
[INFO] P2P    app=caplin peers=127
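
For reference, the external watchdog mentioned in the Node 2 timeline below follows roughly this shape. A minimal sketch, assuming journald logging and a systemd unit named erigon.service (the unit name and thresholds are illustrative):

#!/usr/bin/env bash
# Minimal stall watchdog (sketch): restart erigon.service when no
# [NewPayload] line has reached the journal for STALL_SECS, even though
# the process is still up (P2P heartbeats alone do not count as progress).
STALL_SECS=300
UNIT=erigon.service

while sleep 60; do
    # Epoch timestamp of the most recent [NewPayload] line in the last hour.
    last=$(journalctl -u "$UNIT" --since "1 hour ago" -o short-unix \
             | grep -F '[NewPayload]' | tail -n 1 | awk '{print int($1)}')
    now=$(date +%s)
    if [[ -n "$last" ]] && (( now - last > STALL_SECS )); then
        echo "no [NewPayload] for $((now - last))s, restarting $UNIT"
        systemctl restart "$UNIT"
    fi
done

Note that a blind restart like this is exactly what triggers the merge-interruption side effect described below.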

Recovery behavior: In some cases, affected nodes do eventually resume block processing without a restart, but in a degraded state: they process blocks slowly and fall ever further behind the chain tip, never catching up. A service restart resolves this and allows normal sync to complete. This suggests Caplin may enter a throttled or resource-constrained state that persists until restart.
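
A quick way to spot this degraded state without scraping logs is to sample eth_blockNumber over the HTTP RPC and compare against the roughly 5 blocks per minute that mainnet produces at 12-second slots. A rough sketch (endpoint and threshold are illustrative):

# Degraded-sync probe (sketch): at the tip, a healthy node should
# advance about 5 blocks per minute (one block per 12-second slot).
RPC=http://127.0.0.1:8545
get_head() {
    curl -s -X POST "$RPC" -H 'Content-Type: application/json' \
        -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
      | sed -E 's/.*"result":"0x([0-9a-f]+)".*/\1/'
}
b1=$((16#$(get_head))); sleep 60; b2=$((16#$(get_head)))
echo "advanced $((b2 - b1)) blocks in 60s (expect ~5 when synced)"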

Side effect - merge interruption: if the service is restarted while a snapshot merge is in progress, the merge is aborted:

[WARN] [snapshots] state merge failed err=build idx: loadIntoTable : stopped

The merge must then restart from the beginning after service recovery.

Steps to reproduce the behaviour

Note on reproduction: This issue cannot be triggered on demand - it requires waiting for snapshot aggregation events which occur hours or sometimes days apart. However, when it does occur, it affects multiple nodes simultaneously and is critically disruptive to our operations, requiring immediate intervention across the fleet. We've documented multiple occurrences to provide as much data as possible.

  1. Run Erigon with Caplin (default, no --externalcl) in archive mode
  2. Wait for a snapshot aggregation event (occurs periodically as chain progresses)
  3. Observe logs after [snapshots] aggregated step=XXXX message appears
  4. Note the absence of [NewPayload] messages and head updates while P2P heartbeats continue (a log-watching one-liner is sketched after this list)
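
For steps 3-4, a journalctl one-liner along these lines works (assuming journald logging and a unit named erigon.service, both illustrative):

# Follow the journal, surfacing only the lines relevant to the stall
# pattern: aggregation completion, payload delivery, P2P heartbeats.
journalctl -fu erigon.service | grep --line-buffered -E 'aggregated step|NewPayload|app=caplin'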

Validation performed:

To confirm the issue is Caplin-specific, we ran a comparative test with three nodes hitting step 2039 aggregation simultaneously:

Node    Hardware             Aggregation Time  Consensus Client      Post-Aggregation Behavior
Node 1  i4i.2xlarge          5m24s             Caplin                Stall, no blocks for 4+ minutes
Node 2  x2gd.xlarge (ARM64)  6m44s             Caplin                Stall, no blocks for 4+ minutes
Node 3  i7i.xlarge           7m29s             Prysm (--externalcl)  Continuous block delivery

The Prysm node continued processing blocks throughout (with [NewPayload] messages appearing continuously), confirming the issue is specific to Caplin rather than the execution layer or snapshot system.

Isolated stall example (Node 2, Dec 18 ~19:25 UTC):

  • 19:24:28: Last head update (block 24041258)
  • 19:25:01: [5/5 Execution][agg] computing trie begins
  • 19:25:08 - 19:29:08: Only P2P heartbeats visible, no block delivery
  • 19:29:08: Service restarted by external watchdog

Backtrace

No crash/backtrace - the service remains running but Caplin stops delivering blocks. The issue manifests as a functional stall rather than a crash.

Possible investigation areas:

  1. What internal state does Caplin enter after snapshot aggregation completes?
  2. Why do some nodes resume processing but in a degraded state that never catches up?
  3. Are block requests being made to peers but not processed?
  4. Could heavy I/O during aggregation cause a deadlock or blocking condition in Caplin's processing loop?
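
To help with item 4 in particular, we can capture goroutine stack dumps during the next stall. A sketch, assuming the node is started with Erigon's --pprof flag (we believe 6060 is the default pprof port):

# Full goroutine stack dump from Go's standard net/http/pprof endpoint,
# taken while the stall is in progress (requires starting with --pprof).
curl -s 'http://127.0.0.1:6060/debug/pprof/goroutine?debug=2' > goroutines-stall.txt

If a blocked processing loop or a lock held across aggregation is the culprit, it should be visible in that dump.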

We're running Caplin in production because of its performance advantages and architectural elegance as an integrated solution. The goal of this report is to help identify and resolve this issue. Happy to assist however we can!

lemenkov, Dec 19 '25