OOM while gathering headers (Polygon Mainnet / BorHeimdall)

Open darkxeno opened this issue 1 year ago • 1 comments

System information

Erigon Version: 2.56.1-9e63c927

OS & Version: Ubuntu 22.04.3 LTS

Erigon Command (with flags/config):

GOMEMLIMIT=16GiB GOGC=50 ./build/bin/erigon \
        --chain bor-mainnet \
        --datadir /chaindata/polygon/ \
        --bor.heimdall https://heimdall-api.polygon.technology \
        --port 30303 \
        --http \
        --http.addr "0.0.0.0" \
        --http.port 8545 \
        --http.api eth,debug,net,trace,web3,erigon \
        --http.vhosts "*" \
        --http.corsdomain "*" \
        --ws \
        --torrent.port 42069 \
        --txpool.pricelimit 30000000000 \
        --bootnodes='enode://b8f1cc9c5d4403703fbf377116469667d2b1823c0daf16b7250aa576bacf399e42c3930ccfcb02c5df6879565a2b8931335565f0e8d3f8e72385ecf4a4bf160a@3.36.224.80:30303,enode://8729e0c825f3d9cad382555f3e46d>
        --torrent.upload.rate="1024mb" \
        --torrent.download.rate="1024mb" \
        --torrent.conns.perfile=4 \
        --batchSize "1GB" \
        --etl.bufferSize "1GB" \
        --bodies.cache 21474836480 \
        --db.size.limit="12TB" \
        --db.pagesize="16KB" \
        --db.read.concurrency=1000 \
        --rpc.batch.concurrency=1000 \
        --downloader.verify \
        --pprof

Chain/Network:

Polygon Mainnet

Expected behaviour

Erigon should not be terminated by an OOM error or kill. It should respect the limits and settings given by GOMEMLIMIT, GOMAXPROCS and GOGC.

Actual behaviour

On a server with 32 GB of RAM, the process reserves 30.2 GB within a few minutes. Soon after, it gets killed and exits with exit code 137.

Feb 23 10:45:59 polygon-1 systemd[1]: erigon-polygon.service: Main process exited, code=exited, status=137/n/a
Feb 23 10:45:59 polygon-1 systemd[1]: erigon-polygon.service: Failed with result 'exit-code'.
Feb 23 10:45:59 polygon-1 systemd[1]: erigon-polygon.service: Consumed 1h 33min 50.627s CPU time.
Feb 23 11:10:39 polygon-1 systemd[1]: erigon-polygon.service: A process of this unit has been killed by the OOM killer.
Feb 23 11:10:41 polygon-1 systemd[1]: erigon-polygon.service: Failed with result 'oom-kill'.
Feb 23 11:10:41 polygon-1 systemd[1]: erigon-polygon.service: Consumed 1h 15min 8.853s CPU time.
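For readers unfamiliar with the 137 status in the excerpt above: by shell convention, an exit status above 128 usually means 128 plus a signal number, so 137 points at signal 9 (SIGKILL), which is what the kernel OOM killer sends. A minimal sketch of that arithmetic follows; `decode_exit_status` is a hypothetical helper name, not part of Erigon or systemd.

```shell
# Hypothetical helper: interpret a shell-style exit status.
# Statuses above 128 conventionally encode 128 + signal number.
decode_exit_status() {
    status="$1"
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # `kill -l N` prints the name of signal N without the SIG prefix.
        echo "exit status $status = 128 + $sig (SIG$(kill -l "$sig"))"
    else
        echo "exit status $status: not a signal-style status"
    fi
}

decode_exit_status 137
# prints: exit status 137 = 128 + 9 (SIGKILL)
```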

Steps to reproduce the behaviour

  • Start the process
  • Wait for 15 - 30 mins
  • Check the systemd logs for an OOM-caused restart
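The last step can be sketched as a small filter over journal text. This is only an illustration: `count_oom_kills` is a hypothetical helper, the unit name `erigon-polygon.service` is taken from the logs in this report, and the sample lines are copied from it so the snippet runs without systemd.

```shell
# Hypothetical helper: count OOM-kill events in journal text read from stdin.
count_oom_kills() {
    grep -c "killed by the OOM killer"
}

# On the affected host, feed the real journal instead of the sample, e.g.:
#   journalctl -u erigon-polygon.service --since "1 hour ago" --no-pager | count_oom_kills
sample='Feb 23 10:45:59 polygon-1 systemd[1]: erigon-polygon.service: Main process exited, code=exited, status=137/n/a
Feb 23 11:10:39 polygon-1 systemd[1]: erigon-polygon.service: A process of this unit has been killed by the OOM killer.'

printf '%s\n' "$sample" | count_oom_kills
# prints: 1
```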

Backtrace / Logs

Feb 23 11:05:18 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:05:18.861] [p2p] GoodPeers                          eth68=1
Feb 23 11:05:20 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:05:20.070] [txpool] stat                            pending=12 baseFee=0 queued=1938 alloc=22.6GB sys=24.6GB
Feb 23 11:05:48 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:05:48.225] [3/15 BorHeimdall] Gathering headers for validator proposer prorities (backwards) blockNum=29089987
Feb 23 11:06:18 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:06:18.226] [3/15 BorHeimdall] Gathering headers for validator proposer prorities (backwards) blockNum=27886352
Feb 23 11:06:48 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:06:48.226] [3/15 BorHeimdall] Gathering headers for validator proposer prorities (backwards) blockNum=26724741
Feb 23 11:07:18 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:07:18.238] [3/15 BorHeimdall] Gathering headers for validator proposer prorities (backwards) blockNum=25548755
Feb 23 11:07:48 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:07:48.477] [3/15 BorHeimdall] Gathering headers for validator proposer prorities (backwards) blockNum=24420396
Feb 23 11:08:18 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:08:18.622] [3/15 BorHeimdall] Gathering headers for validator proposer prorities (backwards) blockNum=23229773
Feb 23 11:08:18 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:08:18.860] [p2p] GoodPeers                          eth68=1
Feb 23 11:08:20 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:08:20.571] [txpool] stat                            pending=18 baseFee=0 queued=2826 alloc=28.9GB sys=31.6GB
Feb 23 11:08:48 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:08:48.771] [3/15 BorHeimdall] Gathering headers for validator proposer prorities (backwards) blockNum=22249133
Feb 23 11:09:18 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:09:18.240] [3/15 BorHeimdall] Gathering headers for validator proposer prorities (backwards) blockNum=22176295
Feb 23 11:09:48 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:09:48.515] [3/15 BorHeimdall] Gathering headers for validator proposer prorities (backwards) blockNum=22151759
Feb 23 11:10:18 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:10:18.352] [3/15 BorHeimdall] Gathering headers for validator proposer prorities (backwards) blockNum=22150299
Feb 23 11:10:37 polygon-1 erigon-polygon[223650]: [WARN] [02-23|11:10:37.819] [bor.heimdall] an error while fetching   path=/milestone/lastNoAck queryParams= attempt=1 err="Get \"https://heimdall-api.polygon.technology/milestone/lastNoAck\": context deadline exceeded"
Feb 23 11:10:39 polygon-1 systemd[1]: erigon-polygon.service: A process of this unit has been killed by the OOM killer.
Feb 23 11:10:41 polygon-1 systemd[1]: erigon-polygon.service: Failed with result 'oom-kill'.
Feb 23 11:10:41 polygon-1 systemd[1]: erigon-polygon.service: Consumed 1h 15min 8.853s CPU time.
Feb 23 11:10:41 polygon-1 systemd[1]: erigon-polygon.service: Scheduled restart job, restart counter is at 104.

Heap profile captured with:

go tool pprof -inuse_space -png http://127.0.0.1:6060/debug/pprof/heap > mem5.png

[Image: mem5.png (heap profile graph)]
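The `[txpool] stat` lines in the log above already show `alloc` well past the configured GOMEMLIMIT of 16GiB. Go documents GOMEMLIMIT as a soft limit, so the runtime can exceed it when live heap keeps growing. The sketch below pulls `alloc` and `sys` out of those lines and flags values above a threshold. `check_alloc` is a hypothetical helper, and treating the log's "GB" and the limit's "GiB" as the same unit is an approximation; the logged values exceed 16 under either reading.

```shell
# Sample lines copied from the log in this report.
log='Feb 23 11:05:20 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:05:20.070] [txpool] stat                            pending=12 baseFee=0 queued=1938 alloc=22.6GB sys=24.6GB
Feb 23 11:05:48 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:05:48.225] [3/15 BorHeimdall] Gathering headers for validator proposer prorities (backwards) blockNum=29089987
Feb 23 11:08:20 polygon-1 erigon-polygon[223650]: [INFO] [02-23|11:08:20.571] [txpool] stat                            pending=18 baseFee=0 queued=2826 alloc=28.9GB sys=31.6GB'

# Hypothetical helper: for each "[txpool] stat" line on stdin, print the
# timestamp, alloc and sys, and flag alloc values above the given limit.
check_alloc() {
    awk -v limit="$1" '
        /\[txpool\] stat/ {
            alloc = ""; sys = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^alloc=/) { alloc = $i; sub(/^alloc=/, "", alloc) }
                if ($i ~ /^sys=/)   { sys = $i;   sub(/^sys=/, "", sys) }
            }
            # "22.6GB" + 0 evaluates to 22.6: awk uses the numeric prefix.
            flag = (alloc + 0 > limit + 0) ? "OVER" : "ok"
            printf "%s alloc=%s sys=%s %s\n", $3, alloc, sys, flag
        }'
}

printf '%s\n' "$log" | check_alloc 16
# prints:
# 11:05:20 alloc=22.6GB sys=24.6GB OVER
# 11:08:20 alloc=28.9GB sys=31.6GB OVER
```

Between the two samples, three minutes apart, `alloc` grows by about 6 GB while the BorHeimdall stage walks backwards through headers, which matches the growth described in this report.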

darkxeno, Feb 23 '24 11:02

Gj

Fm8914, Feb 24 '24 04:02

Fixed by: https://github.com/ledgerwatch/erigon/pull/10027

mh0lt, Apr 23 '24 11:04