
Sui oom-kill

Open zhy827827 opened this issue 1 year ago • 12 comments

After updating to the new Sui version, the Sui node is very unstable and is frequently OOM-killed. Previously, servers with 64GB of memory ran smoothly, but now even servers with 128GB of memory are being OOM-killed.

Oct  8 22:16:11 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct  9 17:41:34 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 00:48:31 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 05:27:15 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 09:35:03 rockx-mainnet-merlin-sg-01 kernel: [15035328.923887] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 10 09:35:07 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 12:53:18 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 16:42:08 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 18:58:15 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 19:56:33 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 21:18:20 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 22:12:38 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 23:03:57 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 10 23:56:50 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 00:30:36 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 01:01:02 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 01:53:52 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:04:53 rockx-mainnet-merlin-sg-01 kernel: [15098318.993834] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:04:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:34:27 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 03:58:24 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 04:55:38 rockx-mainnet-merlin-sg-01 kernel: [15104964.510899] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 05:03:19 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 05:38:30 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 06:15:05 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 07:51:47 rockx-mainnet-merlin-sg-01 kernel: [ 1883.157041] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 07:51:53 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 08:19:59 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 09:06:43 rockx-mainnet-merlin-sg-01 kernel: [ 6378.583779] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
Oct 11 10:36:48 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 13:30:27 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 15:59:22 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 17:34:36 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 18:42:59 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 19:31:25 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 20:19:29 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 21:49:21 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 11 23:28:27 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 12 01:08:12 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL

Oct 12 03:13:51 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: A process of this unit has been killed by the OOM killer.
Oct 12 03:13:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL
Oct 12 03:13:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Failed with result 'oom-kill'.
Oct 12 03:13:56 rockx-mainnet-merlin-sg-01 systemd[1]: sui.service: Consumed 1h 26min 28.180s CPU time.
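For anyone comparing kill frequency across servers, a log like the one above can be summarized with a short script. This is only a minimal sketch: it assumes syslog-style lines in exactly the format shown, it assumes the year is 2024 (the log lines do not carry one), and the helper names `parse_kills`, `kills_per_day`, and `gaps_seconds` are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines emitted directly by systemd[1], e.g.
# "Oct 10 09:35:07 host systemd[1]: sui.service: Main process exited, code=killed, status=9/KILL"
# Lines relayed through "kernel:" (the journald kills) do not match on purpose.
KILL_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ systemd\[1\]: "
    r"(?P<unit>\S+): Main process exited, code=killed, status=9/KILL"
)


def parse_kills(lines, unit="sui.service", year=2024):
    """Return the timestamps of SIGKILL exits for `unit`.

    `year` is an assumption: syslog timestamps omit it.
    """
    kills = []
    for line in lines:
        m = KILL_RE.match(line.strip())
        if m and m.group("unit") == unit:
            # Collapse the padded day ("Oct  8") before parsing.
            ts = " ".join(m.group("ts").split())
            kills.append(datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S"))
    return kills


def kills_per_day(kills):
    """Count kills per calendar day."""
    return Counter(k.date() for k in kills)


def gaps_seconds(kills):
    """Seconds between consecutive kills, in chronological order."""
    ordered = sorted(kills)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    import sys

    found = parse_kills(sys.stdin)
    for day, count in sorted(kills_per_day(found).items()):
        print(day, count)
```

Saved under any file name, it can be fed the relevant journal or syslog lines on standard input. A shrinking gap between kills, as the Oct 10 and Oct 11 entries suggest, would point to memory filling up faster over time rather than a one-off spike.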

zhy827827 avatar Oct 12 '24 02:10 zhy827827

@zhy827827 which version are you trying to run?

stefan-mysten avatar Oct 12 '24 04:10 stefan-mysten

@stefan-mysten We can only run the latest version of Sui.

full.yml:

authority-store-pruning-config:
  num-latest-epoch-dbs-to-retain: 3
  epoch-db-pruning-period-secs: 3600
  num-epochs-to-retain: 0
  max-checkpoints-in-batch: 10
  max-transactions-in-batch: 1000
  #use-range-deletion: true
  pruning-run-delay-seconds: 60
  num-epochs-to-retain-for-checkpoints: 2
  periodic-compaction-threshold-days: 1
  smooth: true

zhy827827 avatar Oct 12 '24 06:10 zhy827827

How is the progress now? I encountered the same problem. Is there a solution?

AndyCYB avatar Oct 13 '24 01:10 AndyCYB

Has the TPS gone up? https://suiscan.xyz/mainnet/analytics/cps

zhy827827 avatar Oct 14 '24 07:10 zhy827827

For folks having memory growth issues, can you follow https://gist.github.com/mwtian/0f473325a1ad5a74982fcf91737653b4 and upload the heap profile (and metrics if there are interesting findings)? cc @AndyCYB @zhy827827
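Separately from the heap profile steps in the gist (this is not taken from it), a rough record of how fast resident memory grows can be useful context for the metrics. The following is a minimal sketch that assumes a Linux host with `/proc` available; the function names are hypothetical and the PID of the node process has to be supplied by the operator.

```python
import time
from pathlib import Path
from typing import List, Optional, Tuple


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current resident set size of `pid`, or None if unavailable."""
    try:
        return parse_vmrss_kib(Path(f"/proc/{pid}/status").read_text())
    except OSError:
        return None


def growth_kib_per_min(samples: List[Tuple[float, int]]) -> float:
    """Average growth between the first and last (seconds, KiB) samples."""
    if len(samples) < 2:
        return 0.0
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (r1 - r0) * 60 / (t1 - t0)


def sample_rss(pid: int, interval_s: float = 60.0, count: int = 10):
    """Collect `count` samples of (monotonic seconds, RSS KiB) for `pid`."""
    samples = []
    for _ in range(count):
        rss = read_rss_kib(pid)
        if rss is None:
            break  # process exited or /proc is not readable
        samples.append((time.monotonic(), rss))
        time.sleep(interval_s)
    return samples
```

Calling `growth_kib_per_min(sample_rss(pid))` with the PID of the running node gives an approximate growth rate that can be posted alongside the heap profile. It only shows that memory is growing, not where it is allocated, so it does not replace the profile.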

mwtian avatar Oct 14 '24 22:10 mwtian

I have collected the data, though I don't know whether it is useful: sui-oom.txt, sui-monitored.txt

zhy827827 avatar Oct 15 '24 03:10 zhy827827

Thanks a lot @zhy827827. Is it possible to take the memory profile as well?

mwtian avatar Oct 15 '24 03:10 mwtian

I am still learning how to capture a memory profile, and I don't know how to use the tooling yet.

zhy827827 avatar Oct 15 '24 03:10 zhy827827

And to confirm, is your fullnode running in Asia?

mwtian avatar Oct 15 '24 03:10 mwtian

yes!

zhy827827 avatar Oct 15 '24 03:10 zhy827827

Interesting. We saw another instance of memory growth from fullnodes running in Asia as well.

mwtian avatar Oct 15 '24 03:10 mwtian

Yes, we have two servers, one with 128GB of RAM and one with 64GB of RAM. The server with 64GB of RAM hasn't been able to run at all recently because it keeps getting OOM-killed.

zhy827827 avatar Oct 15 '24 04:10 zhy827827

@zhy827827 we are still investigating. Can you share your config under authority-store-pruning-config:?

mwtian avatar Oct 17 '24 18:10 mwtian

Hello, this issue has been resolved. There have been no more OOM kills. The reason is that TPS has decreased and returned to its previous level.

authority-store-pruning-config:
  num-latest-epoch-dbs-to-retain: 3
  epoch-db-pruning-period-secs: 3600
  num-epochs-to-retain: 0
  max-checkpoints-in-batch: 10
  max-transactions-in-batch: 1000
  #use-range-deletion: true
  pruning-run-delay-seconds: 60
  num-epochs-to-retain-for-checkpoints: 2
  periodic-compaction-threshold-days: 1
  smooth: true

zhy827827 avatar Oct 18 '24 02:10 zhy827827

Sounds good! Thanks for sharing the config.

mwtian avatar Oct 18 '24 04:10 mwtian