bsc icon indicating copy to clipboard operation
bsc copied to clipboard

Repeatative OOM since 24 hours from now

Open Programmierus opened this issue 2 years ago • 10 comments

System information

Geth version: 1.1.8 OS & Version: Linux

Expected behaviour

No OOM

Actual behaviour

image

Steps to reproduce the behaviour

This occurs on all of my 4 nodes and from what I hear by other operators too. The node simply runs out of memory minutes after start, crashes, restarts, catches up sync, crashes again etc... Something changed yesterday (see graph above)

Logs

Logs are clear, it just dies and restarts (due to my docker setup) - restart at 04:33:38:

INFO [03-18|04:33:05.935] Deep froze chain segment                 blocks=18 elapsed=24.189ms  number=16,067,141 hash=68fc5f..c6e2da
INFO [03-18|04:33:12.882] Imported new chain segment               blocks=1  txs=211  mgas=25.530  elapsed=6.698s    mgasps=3.811   number=16,157,142 hash=7420e9..bde1d3 dirty=368.59MiB
INFO [03-18|04:33:14.384] Imported new chain segment               blocks=1  txs=193  mgas=19.015  elapsed=1.501s    mgasps=12.665  number=16,157,143 hash=45e41f..7b01ae dirty=369.45MiB
INFO [03-18|04:33:15.461] do light process success at block        num=16,157,144
INFO [03-18|04:33:15.524] Imported new chain segment               blocks=1  txs=175  mgas=25.484  elapsed=1.083s    mgasps=23.517  number=16,157,144 hash=0b1946..f38018 dirty=371.60MiB
INFO [03-18|04:33:15.653] do light process success at block        num=16,157,144
INFO [03-18|04:33:15.702] Chain reorg detected                     number=16,157,143 hash=45e41f..7b01ae drop=1 dropfrom=0b1946..f38018 add=1 addfrom=b62420..dbe214
INFO [03-18|04:33:15.704] Imported new chain segment               blocks=1  txs=192  mgas=28.281  elapsed=179.876ms mgasps=157.227 number=16,157,144 hash=b62420..dbe214 dirty=371.26MiB
INFO [03-18|04:33:16.030] Re-queue blocks                          number=16,157,145 hash=0xb052b61c25a0bafff69a222a2d49e010c9c30d3b3c71edac04de63dd1a92743a
INFO [03-18|04:33:19.147] do light process success at block        num=16,157,145
INFO [03-18|04:33:19.291] Imported new chain segment               blocks=1  txs=119  mgas=14.642  elapsed=499.742ms mgasps=29.299  number=16,157,145 hash=b052b6..92743a dirty=373.30MiB
INFO [03-18|04:33:20.066] do light process success at block        num=16,157,146
INFO [03-18|04:33:20.336] Imported new chain segment               blocks=1  txs=156  mgas=21.399  elapsed=982.937ms mgasps=21.771  number=16,157,146 hash=b2f10e..39ae59 dirty=375.01MiB
INFO [03-18|04:33:22.736] Imported new chain segment               blocks=1  txs=119  mgas=14.475  elapsed=1.489s    mgasps=9.720   number=16,157,147 hash=7abd60..da01fd dirty=375.90MiB
INFO [03-18|04:33:30.007] Imported new chain segment               blocks=1  txs=167  mgas=25.477  elapsed=2.279s    mgasps=11.178  number=16,157,148 hash=b1c1a2..fa7b07 dirty=377.75MiB
INFO [03-18|04:33:32.917] Imported new chain segment               blocks=1  txs=272  mgas=37.766  elapsed=2.348s    mgasps=16.078  number=16,157,149 hash=094050..bf3fc2 dirty=380.62MiB
INFO [03-18|04:33:36.320] do light process success at block        num=16,157,150
INFO [03-18|04:33:36.789] Imported new chain segment               blocks=1  txs=178  mgas=20.199  elapsed=2.870s    mgasps=7.037   number=16,157,150 hash=86d09c..a5ccbb dirty=381.59MiB
INFO [03-18|04:33:38.745] Imported new chain segment               blocks=1  txs=113  mgas=14.315  elapsed=1.910s    mgasps=7.492   number=16,157,151 hash=1b581a..2afc24 dirty=382.04MiB
/entrypoint.sh: assuming arguments for bsc
/entrypoint.sh: setting data directory to /data

INFO [03-18|04:33:42.291] Starting pprof server                    addr=http://0.0.0.0:6060/debug/pprof
INFO [03-18|04:33:42.293] Starting Geth on Ethereum mainnet...
INFO [03-18|04:33:42.293] Enabling metrics collection
INFO [03-18|04:33:42.299] Maximum peer count                       ETH=350 LES=0 total=350
INFO [03-18|04:33:42.299] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
WARN [03-18|04:33:42.299] Option nousb is deprecated and USB is deactivated by default. Use --usb to enable
INFO [03-18|04:33:42.300] Set global gas cap                       cap=25,000,000
INFO [03-18|04:33:42.301] Allocated trie memory caches             clean=9.23GiB dirty=6.59GiB
INFO [03-18|04:33:42.301] Allocated cache and file handles         database=/data/geth/chaindata cache=10.55GiB handles=524,288
INFO [03-18|04:33:56.541] Opened ancient database                  database=/data/geth/chaindata/ancient readonly=false
INFO [03-18|04:33:56.556] Initialised chain configuration          config="{ChainID: 56 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 Constantinople: 0 Petersburg: 0 Istanbul: 0, Muir Glacier: 0, Ramanujan: 0, Niels: 0, MirrorSync: 5184000, Bruno: 13082000, Berlin: <nil>, YOLO v3: <nil>, Engine: parlia}"
INFO [03-18|04:33:56.562] Deep froze chain segment                 blocks=10 elapsed=8.340ms number=16,067,151 hash=1686df..6a862d
INFO [03-18|04:34:07.204] Initialising Ethereum protocol           network=56 dbversion=8
INFO [03-18|04:34:15.660] Loaded most recent local header          number=16,157,151 hash=1b581a..2afc24 td=32,116,076 age=39s
INFO [03-18|04:34:15.660] Loaded most recent local full block      number=16,157,151 hash=1b581a..2afc24 td=32,116,076 age=39s
INFO [03-18|04:34:15.660] Loaded most recent local fast block      number=16,157,151 hash=1b581a..2afc24 td=32,116,076 age=39s
INFO [03-18|04:34:15.661] Loaded last fast-sync pivot marker       number=6,920,578
WARN [03-18|04:34:15.663] Head state missing, repairing            number=16,157,151 hash=1b581a..2afc24
INFO [03-18|04:34:15.990] Loaded most recent local header          number=16,157,151 hash=1b581a..2afc24 td=32,116,076 age=39s
INFO [03-18|04:34:15.990] Loaded most recent local full block      number=16,156,997 hash=93e8f7..9a8974 td=32,115,782 age=8m47s
INFO [03-18|04:34:15.990] Loaded most recent local fast block      number=16,157,151 hash=1b581a..2afc24 td=32,116,076 age=39s
INFO [03-18|04:34:15.990] Loaded last fast-sync pivot marker       number=6,920,578
WARN [03-18|04:34:16.036] Sanitizing invalid txpool reannounce time provided=0s updated=1m0s
WARN [03-18|04:34:16.038] Sanitizing invalid gasprice oracle price cap provided=<nil> updated=500,000,000,000
WARN [03-18|04:34:16.039] Old unclean shutdowns found              count=109
WARN [03-18|04:34:16.039] Unclean shutdown detected                booted=2022-03-18T02:54:20+0000 age=1h39m56s
WARN [03-18|04:34:16.039] Unclean shutdown detected                booted=2022-03-18T03:03:42+0000 age=1h30m34s
WARN [03-18|04:34:16.039] Unclean shutdown detected                booted=2022-03-18T03:13:46+0000 age=1h20m30s
WARN [03-18|04:34:16.039] Unclean shutdown detected                booted=2022-03-18T03:22:50+0000 age=1h11m26s
WARN [03-18|04:34:16.039] Unclean shutdown detected                booted=2022-03-18T03:32:23+0000 age=1h1m53s
WARN [03-18|04:34:16.039] Unclean shutdown detected                booted=2022-03-18T03:39:28+0000 age=54m48s
WARN [03-18|04:34:16.039] Unclean shutdown detected                booted=2022-03-18T03:48:08+0000 age=46m8s
WARN [03-18|04:34:16.039] Unclean shutdown detected                booted=2022-03-18T03:57:13+0000 age=37m3s
WARN [03-18|04:34:16.039] Unclean shutdown detected                booted=2022-03-18T04:07:42+0000 age=26m34s
WARN [03-18|04:34:16.039] Unclean shutdown detected                booted=2022-03-18T04:15:09+0000 age=19m7s
WARN [03-18|04:34:16.039] Unclean shutdown detected                booted=2022-03-18T04:23:25+0000 age=10m51s
INFO [03-18|04:34:16.039] Starting peer-to-peer node               instance=Geth/v1.1.8-859186f2/linux-amd64/go1.16.13
INFO [03-18|04:34:16.083] New local node record                    seq=242 id=ec5982912a5ab8e6 ip=127.0.0.1 udp=30311 tcp=30311
INFO [03-18|04:34:16.084] Started P2P networking                   self=enode://761bae040c78a72b2a0c1d6bdb242439bf5887cd6f1f0361b6e091683e6b75c464c3e3b42878a8d6d927858f9ad50262a1e056784d1bcaea64f42f5392c99efa@127.0.0.1:30311
INFO [03-18|04:34:16.086] IPC endpoint opened                      url=/data/geth.ipc
INFO [03-18|04:34:16.086] HTTP server started                      endpoint=[::]:8545 prefix= cors= vhosts=*
INFO [03-18|04:34:16.086] WebSocket enabled                        url=ws://[::]:8546
INFO [03-18|04:34:16.197] Block synchronisation started
INFO [03-18|04:34:16.609] Downloader queue stats                   receiptTasks=0 blockTasks=0 itemSize=15.88KiB throttle=4128
INFO [03-18|04:34:16.685] Importing sidechain segment              start=16,156,998 end=16,157,153
INFO [03-18|04:34:19.169] New local node record                    seq=243 id=ec5982912a5ab8e6 ip=162.55.82.39 udp=30311 tcp=30311
INFO [03-18|04:34:25.315] Imported new chain segment               blocks=11 txs=1687 mgas=223.734 elapsed=8.630s  mgasps=25.925 number=16,157,008 hash=e70b6d..881569 age=8m24s    dirty=27.11MiB
INFO [03-18|04:34:33.481] Imported new chain segment               blocks=14 txs=2185 mgas=299.933 elapsed=8.166s  mgasps=36.728 number=16,157,022 hash=713aca..ffdd46 age=7m50s    dirty=63.56MiB
INFO [03-18|04:34:41.797] Imported new chain segment               blocks=17 txs=2706 mgas=372.304 elapsed=8.315s  mgasps=44.772 number=16,157,039 hash=7dd0b9..0f5292 age=7m7s     dirty=106.64MiB
INFO [03-18|04:34:50.007] Imported new chain segment               blocks=18 txs=2722 mgas=367.450 elapsed=8.210s  mgasps=44.753 number=16,157,057 hash=14bc71..94e025 age=6m22s    dirty=151.00MiB
INFO [03-18|04:34:58.647] Imported new chain segment               blocks=17 txs=2827 mgas=390.121 elapsed=8.639s  mgasps=45.155 number=16,157,074 hash=f8ea1c..61f206 age=5m35s    dirty=197.57MiB

Programmierus avatar Mar 18 '22 04:03 Programmierus

got it, we will check that

SolidityGo avatar Mar 18 '22 05:03 SolidityGo

The same happens to me. I start the node indicating 30GB as maximum memory in the parameter but after several hours the geth process occupies the 126 GB of RAM and shuts down without giving any error message.

dgdeivid avatar Mar 18 '22 18:03 dgdeivid

~Issue seems to be fixed~ Not fixed.

image

~I haven't changed anything on my side though. Would be interesting to hear some comments what has happened.~

I have pulled off any requests from the node and it worked. Now I just returned it back to get normal load (which are regular websockets requests) - they killed it in just a matter of minute being just a few of them... I will try to research further what particular request is causing it.

UPDATE: It again started to happen without any specific load:

image

Programmierus avatar Mar 19 '22 04:03 Programmierus

Guys, any update on this? Seems any BSC node running current version is affected badly...

Programmierus avatar Mar 21 '22 07:03 Programmierus

Guys, any update on this? Seems any BSC node running current version is affected badly...

We are checking that, currently you can try using V1.1.7.

SolidityGo avatar Mar 22 '22 05:03 SolidityGo

Take a look at your startup command, --cache can't be set more than 1/3 * your ram size

cosinlink avatar Mar 22 '22 06:03 cosinlink

Take a look at your startup command, --cache can't be set more than 1/3 * your ram size

Here is my current startup command. --cache is 27GB on a host with 128GB RAM... It's bare metal, nothing besides BSC node runs there. Graphs above also giving a picture of normal state of 27GB and spike of roughly 4-5x before it dies.

image

As I mentioned before this setup was running smooth and steady since months... Even 1.1.8 was running fine, I upgraded it in February already. It just started to OOM out of the blue on 17-th of March. I can see other external nodes dying with exactly same pattern, they keep crashing and restarting.

Programmierus avatar Mar 22 '22 06:03 Programmierus

When the memory surges, can you take a profile for the node? You can use curl -sK -v http://127.0.0.1:6060/debug/pprof/heap > heap.profile. You can enable the pprof server by adding --pprof in the command line.

yutianwu avatar Mar 23 '22 02:03 yutianwu

I am currently trying to research some pattern why it happens. Seems to be some sort of requests. I'll update this ticket once I have more details (will collect profile too).

Programmierus avatar Mar 23 '22 05:03 Programmierus

I am currently trying to research some pattern why it happens. Seems to be some sort of requests. I'll update this ticket once I have more details (will collect profile too).

Is there any news?

dgdeivid avatar Jul 26 '22 20:07 dgdeivid

This OOM behaviour could be addressed to a portion of private code. Closing.

Programmierus avatar Oct 17 '22 14:10 Programmierus