Sui oom-kill
After updating to the new Sui version, the sui node has become very unstable and is frequently OOM-killed. Previously, servers with 64 GB of memory ran smoothly; now even servers with 128 GB of memory are all being OOM-killed:
Oct 8 22:16:11 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 9 17:41:34 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 00:48:31 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 05:27:15 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 09:35:03 rockx-mainnet-merlin-sg-01 kernel: [15035328.923887] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 10 09:35:07 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 12:53:18 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 16:42:08 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 18:58:15 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 19:56:33 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 21:18:20 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 22:12:38 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 23:03:57 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 23:56:50 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 00:30:36 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 01:01:02 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 01:53:52 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:04:53 rockx-mainnet-merlin-sg-01 kernel: [15098318.993834] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:04:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:34:27 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:58:24 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 04:55:38 rockx-mainnet-merlin-sg-01 kernel: [15104964.510899] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 05:03:19 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 05:38:30 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 06:15:05 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 07:51:47 rockx-mainnet-merlin-sg-01 kernel: [ 1883.157041] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 07:51:53 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 08:19:59 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 09:06:43 rockx-mainnet-merlin-sg-01 kernel: [ 6378.583779] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 10:36:48 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 13:30:27 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 15:59:22 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 17:34:36 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 18:42:59 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 19:31:25 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 20:19:29 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 21:49:21 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 23:28:27 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 12 01:08:12 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 12 03:13:51 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: A process of this unit has been killed by the OOM killer.
Oct 12 03:13:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 12 03:13:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Failed with result 'oom-kill'.
Oct 12 03:13:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Consumed 1h 26min 28.180s CPU time.
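For anyone debugging the same symptom, a minimal sketch of how to confirm the kernel OOM killer is responsible and keep the kill scoped to sui.service (the unit name comes from the logs above; the MemoryHigh/MemoryMax values are illustrative for a 128 GB host, not tuned recommendations):

# Confirm the kernel OOM killer chose sui-node and check per-unit memory use
journalctl -k | grep -iE 'out of memory|oom-kill'
systemctl status sui.service
systemd-cgtop --order=memory

# Optionally cap the unit below physical RAM so the kill stays scoped to
# sui.service instead of host-wide memory pressure also taking out other
# services (journald was killed too in the log above).
sudo systemctl edit sui.service     # creates a drop-in; add:
#   [Service]
#   MemoryHigh=110G
#   MemoryMax=120G
sudo systemctl restart sui.service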
@zhy827827 which version are you trying to run?
@stefan-mysten The latest; to run Sui you can only use the latest release.
full.yml:
authority-store-pruning-config:
num-latest-epoch-dbs-to-retain: 3
epoch-db-pruning-period-secs: 3600
num-epochs-to-retain: 0
max-checkpoints-in-batch: 10
max-transactions-in-batch: 1000
#use-range-deletion: true
pruning-run-delay-seconds: 60
num-epochs-to-retain-for-checkpoints: 2
periodic-compaction-threshold-days: 1
smooth: true
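For context, this block sits at the top level of fullnode.yaml alongside the rest of the node settings; a rough sketch of the placement is below (db-path and the addresses are illustrative placeholders, not taken from this report):

# fullnode.yaml (sketch)
db-path: /opt/sui/db
metrics-address: "0.0.0.0:9184"
json-rpc-address: "0.0.0.0:9000"
authority-store-pruning-config:
  num-latest-epoch-dbs-to-retain: 3
  num-epochs-to-retain: 0
  num-epochs-to-retain-for-checkpoints: 2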
What is the status of this now? I am running into the same problem. Is there a solution?
Has the network's TPS increased? https://suiscan.xyz/mainnet/analytics/cps
For folks having memory growth issues, can you follow https://gist.github.com/mwtian/0f473325a1ad5a74982fcf91737653b4 and upload the heap profile (and metrics if there are interesting findings)? cc @AndyCYB @zhy827827
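While waiting for the heap profile, a simple way to capture the metrics side is to periodically snapshot the process RSS and the Prometheus endpoint. The sketch below assumes the binary is named sui-node and the default metrics-address of 0.0.0.0:9184; adjust both if your setup differs:

#!/bin/bash
# Snapshot resident memory and the node's Prometheus metrics every 5 minutes,
# so memory growth can later be correlated with node metrics.
while true; do
  ts=$(date +%Y%m%dT%H%M%S)
  { echo "=== $ts"; ps -o pid,rss,vsz,comm -C sui-node; } >> rss-history.txt
  curl -s http://localhost:9184/metrics > "metrics-$ts.txt"
  sleep 300
done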
I have collected some data, though I don't know whether it is useful: sui-oom.txt, sui-monitored.txt
Thanks a lot @zhy827827. Is it possible to take the memory profile as well?
I am still learning how to capture the memory profile; I don't know how to do it yet.
And to confirm, is your fullnode running in Asia?
yes!
Interesting. We saw another instance of memory growth from fullnodes running in Asia as well.
Yes, we have two servers, one with 128 GB of RAM and one with 64 GB of RAM. The 64 GB server hasn't been able to run at all recently because it keeps getting OOM-killed.
@zhy827827 we are still investigating. Can you share your config under authority-store-pruning-config:?
Hello, this issue has been resolved; there are no more OOM kills now. The reason is that TPS has decreased and returned to its previous level.
authority-store-pruning-config:
num-latest-epoch-dbs-to-retain: 3
epoch-db-pruning-period-secs: 3600
num-epochs-to-retain: 0
max-checkpoints-in-batch: 10
max-transactions-in-batch: 1000
#use-range-deletion: true
pruning-run-delay-seconds: 60
num-epochs-to-retain-for-checkpoints: 2
periodic-compaction-threshold-days: 1
smooth: true
Sounds good! Thanks for sharing the config.