
[BUG] DEVNET D1.3.31 kernel: Out of memory: Killed process 13390 (node)

Open frankf1957 opened this issue 2 years ago • 23 comments

Executing details
Application version string:  Elrond Node CLI App version D1.3.31.0-0-gde0e7e887/go1.17.6/linux-amd64/be36dd5835
Hosting VPS:  Hetzner.com/cloud - CPX51 - 16 vCPU - 32G memory - 240 G disk 

The running node process was killed by the kernel OOM killer.

$ sudo grep -i oom /var/log/syslog.1
Jul  4 01:53:32 elrond-devnet-en6-hel1-1 kernel: [123650.236094] node invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Jul  4 01:53:32 elrond-devnet-en6-hel1-1 kernel: [123650.236117]  oom_kill_process.cold+0xb/0x10
Jul  4 01:53:32 elrond-devnet-en6-hel1-1 kernel: [123650.236261] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jul  4 01:53:32 elrond-devnet-en6-hel1-1 kernel: [123650.236344] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/elrond-node-0.service,task=node,pid=13390,uid=1001
Jul  4 01:53:32 elrond-devnet-en6-hel1-1 kernel: [123650.236492] Out of memory: Killed process 13390 (node) total-vm:27668616kB, anon-rss:23623240kB, file-rss:0kB, shmem-rss:4kB, UID:1001 pgtables:46832kB oom_score_adj:0
Jul  4 01:53:33 elrond-devnet-en6-hel1-1 kernel: [123651.494429] oom_reaper: reaped process 13390 (node), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB

I am running 6 DEVNET nodes on this system, so the combined size of the log files may be too big to upload to this issue. I have extracted the logs from their archives and made a compressed tar file of each individual log.
I will upload the compressed tar file for node-0 since that is the node that was in trouble. If you require the other logs, please update the issue and I will upload them too.

Kind regards, frankf1957

frankf1957 avatar Jul 05 '22 16:07 frankf1957

Are those nodes used for resolving API requests? If yes, there are known issues that are fixed in the upcoming v1.3.33 binary release: https://github.com/ElrondNetwork/elrond-go/releases/tag/v1.3.33. ETA for the proper release: this week (04-10.07.2022).

iulianpascalau avatar Jul 05 '22 17:07 iulianpascalau

These are validator nodes only; they are not full-copy nodes, and they are not used to resolve API requests. I am running these nodes in support of the development community; I do not do any development myself using these nodes.

frankf1957 avatar Jul 05 '22 17:07 frankf1957

Strange, the log does not contain any reference to a panic (usually we get this logged). The last line states that the heap in-use is ~740 MB and the heap idle is ~4.73 GB. Heap idle can be reclaimed by the OS at any time.

Jul 05 16:38:59 elrond-devnet-en6-hel1-1 node[40876]: DEBUG[2022-07-05 16:38:59.208] [common/statistics]  [1/1177/1414589/(END_ROUND)] node statistics                          uptime = 2h22m3s timestamp = 1657039139 num go = 1195 alloc = 478.40 MB heap alloc = 478.40 MB heap idle = 4.73 GB heap inuse = 740.12 MB heap sys = 5.45 GB heap num objs = 1744127 sys mem = 5.78 GB num GC = 9817 FDs = 330 num opened files = 329 num conns = 88 accountsTrieDbMem = 0 B evictionDbMem = 0 B peerTrieDbMem = 0 B peerTrieEvictionDbMem = 0 B

iulianpascalau avatar Jul 05 '22 17:07 iulianpascalau

My devnet machine that hosts 4 nodes takes around 7.9 GB as reported by the htop utility. The real usage is most certainly lower: since the OS does not perform any other tasks and does not need the RAM for anything else, it does not bother to reclaim the idle heap.

iulianpascalau avatar Jul 05 '22 17:07 iulianpascalau
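
For context, the "heap inuse" / "heap idle" figures in the node statistics line above come straight from the Go runtime's memory counters. The following is a minimal, self-contained Go sketch (illustration only, not mx-chain-go code) that reads the same counters and then forces the idle heap to be returned to the OS:

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func printHeap(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	const mb = 1024 * 1024
	fmt.Printf("%s: heap inuse=%d MB, heap idle=%d MB, released to OS=%d MB, heap sys=%d MB\n",
		label,
		m.HeapInuse/mb,    // spans currently holding live objects
		m.HeapIdle/mb,     // spans with no objects; candidates for return to the OS
		m.HeapReleased/mb, // the part of HeapIdle already given back to the OS
		m.HeapSys/mb,      // HeapInuse + HeapIdle: what the process has reserved from the OS
	)
}

func main() {
	printHeap("before")

	// Force a GC and return as much idle heap as possible to the OS.
	// The runtime normally does this lazily, which is why tools such as htop
	// can report considerably more memory than the program actually needs.
	debug.FreeOSMemory()

	printHeap("after")
}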

I will change my systemd unit files to add --log-save to cause the nodes to log to disk. If this happens again we may find details in the disk logs that do not get recorded to the systemd journal.

I will update this issue if the problem occurs again.

frankf1957 avatar Jul 05 '22 17:07 frankf1957

Great! Can you also add the flag -profile-mode? That will automatically create a new directory called healthRecords near the node's binary and write profile files in that directory, so we can try to trace the OOM root cause.

iulianpascalau avatar Jul 05 '22 17:07 iulianpascalau
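
For reference, heap profiles of the kind such a profiling mode produces can be written with Go's standard runtime/pprof package. The sketch below is an illustration only: the 10-minute interval, the file naming and the reuse of the healthRecords directory name are assumptions, not the node's actual -profile-mode implementation. Files produced this way can later be inspected with go tool pprof.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
	"runtime/pprof"
	"time"
)

// dumpHeapProfile writes a timestamped heap profile into dir.
func dumpHeapProfile(dir string) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	name := filepath.Join(dir, fmt.Sprintf("heap-%s.pprof", time.Now().Format("20060102-150405")))
	f, err := os.Create(name)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // collect first so the profile reflects live memory only
	return pprof.WriteHeapProfile(f)
}

func main() {
	for {
		if err := dumpHeapProfile("healthRecords"); err != nil {
			fmt.Fprintln(os.Stderr, "heap profile dump failed:", err)
		}
		time.Sleep(10 * time.Minute) // assumed interval
	}
}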

Done and nodes restarted.

The ExecStart directives in the systemd unit files are all set like this one now:

ExecStart=/home/elrond/elrond-nodes/node-0/node -log-save -profile-mode -log-logger-name -log-correlation -log-level *:DEBUG -rest-api-interface localhost:8080

frankf1957 avatar Jul 05 '22 17:07 frankf1957

Looks good :+1: Thanks.

iulianpascalau avatar Jul 05 '22 18:07 iulianpascalau

I have another instance of a running node killed by the kernel OOM killer.

Messages recorded in /var/log/syslog:

Jul  7 21:52:20 elrond-devnet-en6-hel1-1 kernel: [454779.767489] apps.plugin invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
Jul  7 21:52:20 elrond-devnet-en6-hel1-1 kernel: [454779.767515]  oom_kill_process.cold+0xb/0x10
Jul  7 21:52:20 elrond-devnet-en6-hel1-1 kernel: [454779.767669] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jul  7 21:52:20 elrond-devnet-en6-hel1-1 kernel: [454779.767743] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/elrond-node-4.service,task=node,pid=54480,uid=1001
Jul  7 21:52:20 elrond-devnet-en6-hel1-1 kernel: [454779.767888] Out of memory: Killed process 54480 (node) total-vm:26826588kB, anon-rss:23219180kB, file-rss:0kB, shmem-rss:4kB, UID:1001 pgtables:46008kB oom_score_adj:0
Jul  7 21:52:22 elrond-devnet-en6-hel1-1 kernel: [454781.563224] oom_reaper: reaped process 54480 (node), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB

I am attaching the log for node-4 for this occurrence. elrond-go-2022-07-07-17-48-29.log.tar.gz

frankf1957 avatar Jul 09 '22 16:07 frankf1957

Thank you for the logs! It seems like the validator went through a shuffle out from shard 2 to shard 1. During the syncing of the trie for the new shard, it failed with out of memory. We will look into it and try to reproduce. Also, I assume your machine has enough resources (RAM in particular), right? Just asking in order to rule that out.

bogdan-rosianu avatar Jul 11 '22 14:07 bogdan-rosianu
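
To illustrate the failure mode described above in isolation (conceptual sketch only; the types and functions below are hypothetical placeholders, not mx-chain-go APIs, and this is not the fix from #4248): memory during a trie sync stays bounded only if downloaded nodes are persisted as they arrive and the pending work is processed in limited batches, instead of accumulating the whole trie in RAM.

package main

import "fmt"

type nodeHash string

// fetchNode stands in for a network request returning a trie node's payload
// and the hashes of its children.
func fetchNode(h nodeHash) (payload []byte, children []nodeHash) {
	return []byte("data for " + string(h)), nil
}

// persist stands in for writing a node straight to disk storage.
func persist(h nodeHash, payload []byte) {}

func syncTrie(root nodeHash, batchSize int) {
	queue := []nodeHash{root}
	for len(queue) > 0 {
		// Take at most batchSize hashes so the in-flight working set stays small.
		n := batchSize
		if n > len(queue) {
			n = len(queue)
		}
		batch := queue[:n]
		queue = queue[n:]

		for _, h := range batch {
			payload, children := fetchNode(h)
			persist(h, payload) // write immediately instead of keeping every payload in RAM
			queue = append(queue, children...)
		}
	}
	fmt.Println("trie sync finished")
}

func main() {
	syncTrie("rootHash", 100)
}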

Hi @bogdan-rosianu - there should be plenty of resources: this machine has 32 GB RAM and 16 vCPUs, with no swap configured:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31354       13225        1873           1       16255       17665
Swap:             0           0           0

$ nproc
16

frankf1957 avatar Jul 15 '22 04:07 frankf1957

Hi @frankf1957, there is a tentative fix for the out-of-memory issue during trie sync, planned for the July release and merged via PR #4248.

AdoAdoAdo avatar Jul 15 '22 09:07 AdoAdoAdo

Hi @AdoAdoAdo I am still experiencing this issue with my DEVNET nodes.

When I reported this issue the node version was D1.3.31. It was upgraded to D1.3.36 as soon as the update was available.

I have collected validator log files, and also extracted the kernel oom killer info from syslog.

My executing details are:
Application version string: Elrond Node CLI App version D1.3.36.0-0-3ec53be47a/go1.17.6/linux-amd64/be36dd5835
Hosting VPS: Hetzner.com/cloud - CPX51 - 16 vCPU - 32G memory - 240 G disk

Kind regards.

frankf1957 avatar Aug 18 '22 18:08 frankf1957

I got a "File size too big" error trying to upload the node logs tar.gz file. I have unpacked the logs and will upload them individually.

frankf1957 avatar Aug 18 '22 18:08 frankf1957

Hi @frankf1957, sorry I was not clear in my previous message: the fix is already part of #4245 (#4248 was merged into it), which is the next planned release and is currently going through final internal testing.

It was initially planned for July but there were some issues that took longer to fix, so the release got delayed.

AdoAdoAdo avatar Aug 22 '22 12:08 AdoAdoAdo

Closing this issue as the fix will be included in the upcoming rc/2022-july release.

frankf1957 avatar Sep 20 '22 13:09 frankf1957