
[Client][Mainnet] Nethermind client reaches 95% RAM and crashes

Open AliakseiMalyshau opened this issue 3 years ago • 30 comments

Describe the bug

Client:

  • Nethermind client: v1.12.3
  • Running type: Docker
  • Run command: --config mainnet --HealthChecks.Enabled true --datadir /nethermind/data --log DEBUG --JsonRpc.Enabled true --JsonRpc.Host 0.0.0.0 --JsonRpc.Port 8545 --Pruning.Enabled true --Pruning.CacheMb 4096 --Init.MemoryHint 2048000000

System:

  • OS: Ubuntu 20.04.3 LTS
  • RAM: 16GB
  • CPU: 4 vCPU

Issue: all system parameters are described above. After the service starts, RAM usage keeps growing. Once it reaches about 95%, the service crashes and tries to restart. Sometimes the restart succeeds and syncing continues, but often it cannot start again.

Logs: last 50 lines

2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|47.206.176.63:30303|Geth/v1.10.13-stable-7a0c19f8/windows-amd64/go1.17.2|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|76.89.226.39:30303|Geth/v1.10.13-stable-7a0c19f8/windows-amd64/go1.17.2|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|144.232.178.99:53346|Geth/v1.10.15-stable-8be800ff/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|51.81.167.147:30303|Geth/v1.10.13-stable-7a0c19f8/linux-amd64/go1.17.3|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|162.55.101.40:2010|Geth/v1.10.14-stable-11a3a350/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|82.154.24.30:30303|Geth/v1.10.15-stable-8be800ff/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|5.189.128.226:30303|Geth/v1.10.13-stable-7a0c19f8/linux-amd64/go1.17.2|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|45.76.227.89:30303|Nethermind/v1.12.1-0-563f0a36f-20220106/X64-Linux/6.0.0-rtm.21522.10|wit0] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|95.216.67.178:51976|Geth/v1.10.13-stable-7a0c19f8/linux-amd64/go1.17.2|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|136.243.35.117:30303|Geth/v1.10.13-stable-7a0c19f8/linux-amd64/go1.17.2|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|38.68.134.201:30303|Geth/v1.10.14-unstable-db03faa1-20211207/linux-amd64/go1.15.6|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|150.230.47.127:30303|Geth/v1.10.13-stable/linux-amd64/go1.16.7|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|3.235.194.35:30303|Nethermind/v1.10.79-0-e45db5fb5-20211209/X64-Linux/5.0.8|wit0] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|172.112.18.97:30306|Nethermind/v1.12.3-0-f5f0d940a-20211231/X64-Linux/6.0.1|wit0] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|139.99.120.33:30303|Geth/v1.10.15-stable-8be800ff/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|13.209.16.169:30303|Geth/v1.10.12-stable/linux-amd64/go1.16.7|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|157.90.130.215:30322|Geth/v1.10.14-stable-11a3a350/linux-amd64/go1.17|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|167.71.97.19:30303|Geth/v1.10.13-stable/linux-amd64/go1.16.10|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4059|No pong received in response to the 19:19:47 ping at [Node|121.50.44.44:30303|Nethermind/v1.12.3-0-f5f0d940a-20220109/X64-Windows/6.0.1|eth66] | last pong time 19:18:28 
2022-01-09 19:20:02.4072|No pong received in response to the 19:19:47 ping at [Node|18.179.112.170:39418|Geth/v1.10.15-stable-8be800ff/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:05.6661|Saving data in /nethermind/data/nethermind_db/mainnet/discoveryNodes/SimpleFileDb.db | backup stored in /nethermind/data/nethermind_db/mainnet/discoveryNodes/SimpleFileDb.db_09b6cc14-0549-4649-a877-e8e6f3908e20 
2022-01-09 19:20:08.4823|No pong received in response to the 19:19:47 ping at [Node|51.77.117.30:30303|Geth/v1.10.15-stable-1a7ae2c5/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:10.1543|100 % of nodes did not respond to a Ping message - 100/100 
2022-01-09 19:20:10.3729|Changing state WaitingForBlock to WaitingForBlock at processed:13973172|state:13973172|block:13973173|header:13973173|peer block:13973173 
2022-01-09 19:20:22.4120|Dropped 0 useless peers 
2022-01-09 19:20:22.6237|Broadcasting transactions to all peers 
2022-01-09 19:20:34.5246|Broadcasting transactions to all peers 
2022-01-09 19:20:44.8013|Broadcasting transactions to all peers 
2022-01-09 19:20:46.8319|No pong received in response to the 19:20:34 ping at [Node|157.90.130.215:30322|Geth/v1.10.14-stable-11a3a350/linux-amd64/go1.17|eth66] | last pong time 19:18:28 
2022-01-09 19:20:48.3228|No pong received in response to the 19:20:34 ping at [Node|167.71.97.19:30303|Geth/v1.10.13-stable/linux-amd64/go1.16.10|eth66] | last pong time 19:18:28 
2022-01-09 19:20:48.3227|No pong received in response to the 19:20:34 ping at [Node|184.59.227.121:30303|Geth/v1.10.15-stable-8be800ff/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:50.5405|No pong received in response to the 19:20:34 ping at [Node|121.50.44.44:30303|Nethermind/v1.12.3-0-f5f0d940a-20220109/X64-Windows/6.0.1|eth66] | last pong time 19:18:28 
2022-01-09 19:20:50.8935|No pong received in response to the 19:20:34 ping at [Node|13.209.16.169:30303|Geth/v1.10.12-stable/linux-amd64/go1.16.7|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|159.138.93.135:54178|Geth/v1.10.15-stable-8be800ff/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|51.77.117.30:30303|Geth/v1.10.15-stable-1a7ae2c5/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|18.179.112.170:39418|Geth/v1.10.15-stable-8be800ff/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|136.243.34.42:30303|Geth/v1.10.14-stable-6d31a8cc/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|125.176.105.21:30313|Geth/v1.10.14-stable-11a3a350/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|69.172.179.183:46151|Geth/v1.10.13-stable-7a0c19f8/linux-amd64/go1.17.3|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|35.195.59.29:50932|Geth/v1.10.15-stable-8be800ff/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|8.209.68.154:30303|Geth/v1.10.14-stable-11a3a350/linux-amd64/go1.17.5|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|3.92.233.74:35810|Geth/v1.10.13-stable-7a0c19f8/linux-amd64/go1.16.8|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|52.193.254.254:15175|Geth/v1.10.12-stable-6c4dc6c3/linux-amd64/go1.17.2|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|47.252.39.69:30303|Nethermind/v1.12.2-3-6a3c5d803-20211230/X64-Linux/6.0.0-rtm.21522.10|wit0] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|51.89.175.159:30303|Geth/v1.10.13-stable-7a0c19f8/linux-amd64/go1.17.2|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|3.89.142.29:30303|Geth/v1.10.13-stable-7aa91b75/linux-amd64/go1.16.3|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|90.189.216.29:30303|Nethermind/v1.12.2-0-5bfaae6bc-20220108/X64-Linux/6.0.0-rtm.21522.10|wit0] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|65.108.77.136:30303|Geth/v1.10.13-stable-7a0c19f8/linux-amd64/go1.17.2|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|47.241.45.159:30303|Geth/v1.10.13-stable/linux-amd64/go1.16.3|eth66] | last pong time 19:18:28 
2022-01-09 19:20:51.1538|No pong received in response to the 19:20:34 ping at [Node|47.89.184.210:30303|Nethermind/v1.12.2-3-6a3c5d803-20211230/X64-Linux/6.0.0-rtm.21522.10|wit0] | last pong time 19:18:28

Unfortunately the logs contain no errors or other critical entries; the service simply stopped because of OOM.

To Reproduce Steps to reproduce the behaviour:

  1. Start the Nethermind client with the configuration above; the node starts syncing without errors
  2. After 1 - 1.5 days, RAM usage reaches ~95% and the service goes down

Expected behaviour The client keeps running without any OOM crashes.

Screenshots image

Desktop (please complete the following information): Described above.

Additional context As far as I understand, the Nethermind client runs on .NET (C#) via the CLR. Perhaps the CLR accumulates garbage in RAM and only collects it once memory is nearly exhausted. Maybe I'm wrong.

Thanks for the help!

AliakseiMalyshau avatar Jan 10 '22 07:01 AliakseiMalyshau

Hello! Any updates here?

I tried a quick workaround on the VM side (still testing):

sysctl -w vm.drop_caches=3
echo "* * * * * sync; echo 3 | sudo tee /proc/sys/vm/drop_caches" >> /var/spool/cron/crontabs/root

Maybe it'll help. Thanks!
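Note that drop_caches only frees the kernel page cache (the buff/cache column of free), not memory held by the Nethermind process itself. A sketch of installing the same job through crontab instead of writing to the spool file directly (assuming root):

(crontab -l 2>/dev/null; echo "* * * * * sync; echo 3 > /proc/sys/vm/drop_caches") | crontab -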

AliakseiMalyshau avatar Jan 19 '22 07:01 AliakseiMalyshau

Two things that look unusual about your config are: --Pruning.CacheMb 4096 --Init.MemoryHint 2048000000. You ask for 4 GB of pruning cache but hint the app to use only ~2 GB of RAM overall. I wasn't able to replicate this in my environment; I will try under Ubuntu + Docker. It might be helpful to post full logs.
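For illustration only, a sketch of bringing those two settings in line (the values here are assumptions, not an official recommendation): Init.MemoryHint is in bytes, so it should comfortably exceed the pruning cache, e.g.

--config mainnet \
--Pruning.Enabled true \
--Pruning.CacheMb 2048 \
--Init.MemoryHint 6000000000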

LukaszRozmej avatar Jan 19 '22 08:01 LukaszRozmej

@LukaszRozmej thanks for the reply! My last logs: mainnet.logs.txt

Also, I noticed interesting behaviour: when I started dropping caches via crontab, the client's CPU load increased (screenshot below)

Screenshot 2022-01-19 at 12 26 56

Previous behaviour without dropping caches:

image

After some RAM spikes, the CPU load increased.

AliakseiMalyshau avatar Jan 19 '22 09:01 AliakseiMalyshau

Sorry, I forgot to mention: I increased the VM's RAM to 24 GB and added a 4 GB swap file.

My current configuration:

  • 4 vCPU
  • 24 GB RAM (+ 4 GB swap)

Load:

  • ~75 - 80% CPU load
  • 80% RAM load
free -h
                  total        used        free      shared  buff/cache   available
Mem:               23Gi        18Gi       4.3Gi       0.0Ki       394Mi       4.4Gi
Swap:              4.0Gi       2.2Gi       1.8Gi

AliakseiMalyshau avatar Jan 19 '22 09:01 AliakseiMalyshau

@matilote can you comment on Linux caches? I'm not a Linux expert. Also, it shows 4.3Gi as free memory?

LukaszRozmej avatar Jan 19 '22 13:01 LukaszRozmej

@LukaszRozmej anyway, which configuration would be better for my VM to avoid the growing RAM usage? Thanks!

AliakseiMalyshau avatar Jan 19 '22 13:01 AliakseiMalyshau

I've had this happen to me many times in the last few months: CPU and/or RAM usage takes everything available, even crashing 32-core, 128 GB RAM dedicated servers, with RAM usage over 95 GB. I've been looking for anyone else with this issue. I run Nethermind in Docker on a Debian-based system and have experienced the same problems.

alexpeterson91 avatar Feb 10 '22 21:02 alexpeterson91

I'm done. I deleted the Nethermind node and switched to the stable Erigon client.

AliakseiMalyshau avatar Feb 16 '22 15:02 AliakseiMalyshau

It's a Nethermind issue affecting all Nethermind users; it has been going on for a while and there's still no fix. Geth remains the most stable ETH1 client in general, but Erigon is coming into its own, and client diversity is very important.

alexpeterson91 avatar Mar 01 '22 10:03 alexpeterson91

@alexpeterson91 we are sorry for the issues. There are multiple circumstances slowing us down on this one: problems with memory diagnostic tools after the .NET 6 upgrade, a lot of other work such as snap sync and the Merge, and currently even direct help for Ukrainian refugees (much of the core team is from Poland).

Still, I want to assure you that this is being worked on and will be resolved.

Anyone can help by sending us memory information when they encounter a memory issue, using one of two tools: https://www.jetbrains.com/help/dotmemory/Working_with_dotMemory_Command-Line_Profiler.html https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-gcdump

Unfortunately we haven't received any community dumps yet, and the issue doesn't manifest often on our VMs.
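For anyone willing to help, a minimal sketch of capturing a GC heap dump with dotnet-gcdump on the host (assuming the .NET tooling can see the Nethermind process; the PID is a placeholder):

dotnet tool install --global dotnet-gcdump
dotnet-gcdump ps                 # list running .NET processes to find Nethermind.Runner
dotnet-gcdump collect -p <PID>   # writes a .gcdump file to the current directory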

LukaszRozmej avatar Mar 01 '22 12:03 LukaszRozmej

> @alexpeterson91 we are sorry for the issues. There are multiple circumstances slowing us down on this one: problems with memory diagnostic tools after the .NET 6 upgrade, a lot of other work such as snap sync and the Merge, and currently even direct help for Ukrainian refugees (much of the core team is from Poland).
>
> Still, I want to assure you that this is being worked on and will be resolved.
>
> Anyone can help by sending us memory information when they encounter a memory issue, using one of two tools:
>
> https://www.jetbrains.com/help/dotmemory/Working_with_dotMemory_Command-Line_Profiler.html
>
> https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-gcdump
>
> Unfortunately we haven't received any community dumps yet, and the issue doesn't manifest often on our VMs.

Thank you for reaching out and reopening this issue. And while we are an apolitical organization at DAppNode, we certainly support freedom and your team's direct support during the crisis and conflict in Ukraine. I'm happy to try running a memory profiler, as I have many machines running Nethermind in Docker, and most if not all have experienced the memory leak issue, even dedicated servers with usage reaching 90+ GB of RAM. I'll try to get some dumps captured and share them.

alexpeterson91 avatar Mar 05 '22 21:03 alexpeterson91

@LukaszRozmej I would be happy to try to gather data, since this issue occurs commonly for me; I'm just not familiar with this tool (https://www.jetbrains.com/help/dotmemory/Working_with_dotMemory_Command-Line_Profiler.html) and don't know what information you want specifically. On all my machines I run Nethermind in a Docker container, so do I just install the tool on the host system and have it monitor the Nethermind container? Or would I need to rebuild the Docker image with the tool inside in order to gather the data properly for you? @Tropicar

alexpeterson91 avatar Mar 05 '22 23:03 alexpeterson91

We have to check whether the new version of the Nethermind client solves the RAM issues. In the meantime I am researching the best way to gather this information with these tools in our environment; I have never used them before.

tropicar avatar Mar 08 '22 15:03 tropicar

It happened to me today. Nethermind version 1.13.3, Ubuntu 20.04 LTS, 64 GB RAM. I attached a log file covering the hours before and after the crash/restart: mainnet.log. Also, Grafana charts around the crash/restart: image image

73649 avatar Jun 28 '22 01:06 73649

My issue never went away. Nethermind can still take up all the RAM and CPU and crash Docker, taking my system down. Some containers self-heal and reboot, but critical ones often go down and need Docker properly restarted, or a full reboot. This happens even on overpowered machines running basically nothing but Nethermind (no RPC or WS calls, just the node): usage builds and builds, then comes crashing down after overloading the system. I can't continue testing on more than one machine anymore.

alexpeterson91 avatar Jun 28 '22 02:06 alexpeterson91

Hi @alexpeterson91, we currently have @kkokosa looking into memory issues. Hopefully we can have some improvements soon.

LukaszRozmej avatar Jun 28 '22 09:06 LukaszRozmej

Can we get an update on this issue?

pk27734 avatar Aug 16 '22 06:08 pk27734

I don't think there's been progress. We solved the problem for our DAppNode users by implementing resource limits to keep Nethermind from using too much RAM; it's the only Docker package out of nearly 200 web3 apps with this issue, and it persisted all year, for me at least, until we put ulimits on the Docker containers.
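For reference, a sketch of the kind of per-container limit involved (the values are illustrative, not DAppNode's actual settings):

docker run -d --name nethermind \
  --memory=12g --memory-swap=14g \
  nethermind/nethermind:latest --config mainnet

With --memory set, an out-of-memory condition kills processes inside the container instead of destabilizing the whole host.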

alexpeterson91 avatar Aug 16 '22 10:08 alexpeterson91

@LukaszRozmej @dceleda I would like to follow up on this issue since I'm seeing similar behavior.

I've been running Nethermind (EL) + Prysm CL/validator for more than 24 hours in full sync and see a slight downtrend in available RAM.

My specs: Azure-hosted VM, 4 vCPU, 16 GiB RAM, 2 TB SSD (7200 IOPS, 250 MB/s throughput)

I'm running Nethermind in Docker with the following config:

sudo docker run -t -d \
-v $HOME/nethermind:/var/lib/nethermind \
-v $HOME/jwtsecret:/var/lib/jwtsecret \
--network host \
--name nethermind nethermind/nethermind:latest \
--config mainnet \
--datadir /var/lib/nethermind \
--Sync.SnapSync true \
--Sync.AncientBodiesBarrier 11052984 \
--Sync.AncientReceiptsBarrier 11052984 \
--HealthChecks.Enabled true \
--JsonRpc.Enabled true \
--JsonRpc.EngineHost 0.0.0.0 \
--JsonRpc.EnginePort 8551 \
--JsonRpc.JwtSecretFile /var/lib/jwtsecret/jwt.hex \
--JsonRpc.AdditionalRpcUrls "http://localhost:8551|http;ws|net;eth;subscribe;engine;web3;client" \
--Pruning.Mode=Full \
--Merge.Enabled true

Resource utilization for the last 24 hours:

Disk IO bandwidth (MAX) image

Disk IOPS (MAX) image

Network In/Out total, MB per minute (MAX) image

Available RAM, GiB (MIN) image

As you can see, there was a spike upward due to a Nethermind restart.

Just for reference, the same VM running only the Prysm CL/validator on 8 GiB RAM the previous week showed no downtrend (data for 3 days): image

Please let me know what I should do as a workaround (cache option tuning, etc.). Thanks!
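(For completeness, the memory-related flags mentioned earlier in this thread could be appended to the docker run command above; the values below are illustrative assumptions only, not guidance from the Nethermind team.)

--Pruning.CacheMb 1024 \
--Init.MemoryHint 4000000000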

nazar-kuzo avatar Sep 12 '22 16:09 nazar-kuzo

@LukaszRozmej

Anyone can help with that by sending us memory information when they encounter memory issue by using one of two tools: https://www.jetbrains.com/help/dotmemory/Working_with_dotMemory_Command-Line_Profiler.html https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-gcdump

Unfortunately we didn't get any community ones yet and the issues don't manifest often on our VM's.

I would like to prepare the memory dump for you, but I'm having a hard time collecting it from the Docker container.

Could you please prepare a step-by-step guide on what I should do to collect it from Nethermind running in a Docker container? Then I'll happily do it for you.

As of now, I have installed all the tools on my machine, but they still complain that there is no process that supports the .NET version in question. Based on your repo you are running .NET 6, which is what I tried, but it didn't work.

Based on the Microsoft docs there may be some additional steps required to collect it, but I don't want to experiment with my only working mainnet setup until I'm sure it will work.

Microsoft docs reference

nazar-kuzo avatar Sep 12 '22 19:09 nazar-kuzo

@LukaszRozmej

Okay, I managed to do that on my own.

Reference

  • Download the Microsoft "dotnet-monitor" Docker image: sudo docker pull mcr.microsoft.com/dotnet/monitor:latest

  • Restart the Nethermind Docker container with a shared volume that exposes the diagnostics IPC socket: -v /tmp/container_sockets:/tmp

  • Run the dotnet-monitor Docker container:

sudo docker run -t -d \
-v /tmp/container_sockets:/tmp \
-e ASPNETCORE_URLS=http://+:52323 \
--network host \
--name monitor mcr.microsoft.com/dotnet/monitor:latest \
--no-auth
  • Run curl to find the Nethermind process UID: curl -i -H "Accept: application/json" -X GET http://localhost:52323/processes

  • Example response:

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Date: Mon, 12 Sep 2022 20:39:25 GMT
Server: Kestrel
Transfer-Encoding: chunked

[{"pid":1,"uid":"62e846d8-db2b-4a38-a65c-e4891677dc21","name":"Nethermind.Runner","isDefault":true}]
  • Run curl to collect the Nethermind process dump (docs here): curl --output $HOME/nethermind.dump -i -X GET "http://localhost:52323/dump?uid=<nethermind-uid>&type=Full" (quote the URL so the shell does not treat & as a background operator)

  • Keep in mind that while the dump is being collected, the Nethermind app goes into "Disconnected" mode and freezes, so you will see logs like these in it:

[createdump] Gathering state for process 1 Nethermind.Runn
[createdump] Writing minidump with heap to file /tmp/86917fcf-3d34-449d-9f01-bd374f84ec24_1
[createdump] Written 7657054208 bytes (1869398 pages) to core file
[createdump] Target process is alive
[createdump] Dump successfully written
2022-09-12 20:27:39.1403|Waiting for peers... 9s
2022-09-12 20:27:40.1405|Waiting for peers... 10s
2022-09-12 20:27:41.1401|Waiting for peers... 11s
2022-09-12 20:27:42.1405|Peers | with known best block: 2 | all: 2 |
2022-09-12 20:27:42.1405|Waiting for peers... 12s
2022-09-12 20:27:42.1434|Changing state Disconnected to Full at processed:15522874|state:15522874|block:15522874|header:15522874|chain difficulty:58573937956890884427703|peer block:15522876|peer total difficulty:58573962405892214866949
  • Download the file with an scp command
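A hypothetical example of that last step (host, user, and paths are placeholders, not from my setup):

scp ubuntu@your-node-host:~/nethermind.dump ./nethermind.dump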

nazar-kuzo avatar Sep 12 '22 20:09 nazar-kuzo

@LukaszRozmej Since I had to restart Nethermind to enable the sidecar diagnostics, that ruins the chance of collecting an "interesting" dump for you right away, so I'll get back to you here tomorrow once I'm running low on RAM.

Please let me know which type of dump you need: a process dump (heap/full) or a GC dump?

nazar-kuzo avatar Sep 12 '22 20:09 nazar-kuzo

@kkokosa could you also help here?

dceleda avatar Sep 13 '22 09:09 dceleda

Hi @nazar-kuzo, I'll jump in to help investigate this issue.

Since you have already configured dotnet-monitor, which is great, we are ready to diagnose. First of all, I would like to see a trace recorded periodically with the help of the /trace?pid={pid}&durationSeconds=300&profile=cpu endpoint. Could you run such a 5-minute session several times, roughly every hour, also noting the time so we can correlate it with the Available RAM graph? Thanks!
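If it helps, a sketch of automating that from the host (assuming the dotnet-monitor endpoint from your earlier setup, with Nethermind running as PID 1 in its container):

while true; do
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  curl -o "nethermind-$ts.nettrace" "http://localhost:52323/trace?pid=1&durationSeconds=300&profile=cpu"
  sleep 3300   # ~55 min pause, so each 5-minute trace starts roughly once an hour
done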

kkokosa avatar Sep 13 '22 21:09 kkokosa

@kkokosa I did a test run with this command: curl -o nethermind.nettrace "http://localhost:52323/trace?durationSeconds=300&profile=cpu&pid=1" (URL quoted so the shell does not split it on &). Timestamp: "2022-09-13T23:20:00Z"

This gave me a 172 MB file; by default it is in nettrace format, so I was able to open it in VS 2022 on Windows: image

Here is link to it https://1drv.ms/u/s!Ai5XSIx7haA0k99fEl8xwNZQumR65A?e=BOMli1

Please give it a try and let me know whether you can open it, then I'll continue collecting.

Available RAM during that time: image

nazar-kuzo avatar Sep 13 '22 23:09 nazar-kuzo

@kkokosa Another trace https://1drv.ms/u/s!Ai5XSIx7haA0k99gJYCReLdRFdXz-g?e=cgGICm

"Timestamp":"2022-09-14T09:32:00Z"

Available RAM during that time: image

sudo docker stats image

nazar-kuzo avatar Sep 14 '22 09:09 nazar-kuzo

@kkokosa Yet another trace https://1drv.ms/u/s!Ai5XSIx7haA0k99hergSq2-GFj8b3w?e=HXOsTG

"Timestamp":"2022-09-14T11:02:00Z"

Available RAM during that time: image

nazar-kuzo avatar Sep 14 '22 11:09 nazar-kuzo

Thank you very much, I'm on it. I need some time to look at them and will return with next steps.

kkokosa avatar Sep 14 '22 17:09 kkokosa

@kkokosa you're welcome. I would also like to add that, based on my observations, RAM fluctuations are not always reflected in the RAM used by Nethermind (per docker container stats) but often in the "buff/cache" RAM reported by the free -m command; I observed this yesterday throughout the day.

So it might be an issue of RAM utilization and recycling rather than a memory leak. Please look at the output below; it may help you come to some other ideas or conclusions.

2022-09-13T12:30:00Z
CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O        PIDS
e4cabe21cdc2   nethermind    155.64%   8.255GiB / 15.63GiB   52.82%    0B / 0B   328GB / 136GB    67
744ba7fb924a   beacon-node   17.58%    4.069GiB / 15.63GiB   26.04%    0B / 0B   10.9GB / 234GB   19
4257e099dca9   validator     0.10%     61.83MiB / 15.63GiB   0.39%     0B / 0B   313MB / 46.5MB   11
3cb870e2e188   mev-boost     0.00%     14.02MiB / 15.63GiB   0.09%     0B / 0B   28.6MB / 0B      9

2022-09-13T13:00:00Z
CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O       PIDS
e4cabe21cdc2   nethermind    75.55%    8.272GiB / 15.63GiB   52.93%    0B / 0B   343GB / 144GB   66
744ba7fb924a   beacon-node   75.85%    4.028GiB / 15.63GiB   25.77%    0B / 0B   11GB / 237GB    19
4257e099dca9   validator     0.10%     62.54MiB / 15.63GiB   0.39%     0B / 0B   320MB / 47MB    11
5fe726a80a2a   mev-boost     0.00%     8.586MiB / 15.63GiB   0.05%     0B / 0B   549kB / 0B      7
              total        used        free      shared  buff/cache   available
Mem:          16002       12830         162           1        3009        2834

2022-09-13T14:00:00Z
CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O        PIDS
e4cabe21cdc2   nethermind    46.53%    8.264GiB / 15.63GiB   52.88%    0B / 0B   361GB / 154GB    66
744ba7fb924a   beacon-node   5.49%     3.976GiB / 15.63GiB   25.44%    0B / 0B   11.2GB / 241GB   19
4257e099dca9   validator     0.10%     61.79MiB / 15.63GiB   0.39%     0B / 0B   325MB / 47.9MB   11
5fe726a80a2a   mev-boost     0.00%     9.086MiB / 15.63GiB   0.06%     0B / 0B   1.84MB / 0B      8
              total        used        free      shared  buff/cache   available
Mem:          16002       12765         160           1        3075        2897

2022-09-13T14:30:00Z
CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O        PIDS
e4cabe21cdc2   nethermind    62.71%    8.561GiB / 15.63GiB   54.78%    0B / 0B   375GB / 160GB    67
744ba7fb924a   beacon-node   6.55%     4.282GiB / 15.63GiB   27.40%    0B / 0B   11.4GB / 245GB   19
4257e099dca9   validator     0.12%     62.43MiB / 15.63GiB   0.39%     0B / 0B   330MB / 48.7MB   11
5fe726a80a2a   mev-boost     0.00%     9.285MiB / 15.63GiB   0.06%     0B / 0B   1.95MB / 0B      8
              total        used        free      shared  buff/cache   available
Mem:          16002       13103         241           1        2658        2568

2022-09-13T15:30:00Z
CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O        PIDS
e4cabe21cdc2   nethermind    37.34%    8.47GiB / 15.63GiB    54.20%    0B / 0B   389GB / 165GB    65
744ba7fb924a   beacon-node   38.26%    4.096GiB / 15.63GiB   26.21%    0B / 0B   11.5GB / 249GB   19
4257e099dca9   validator     0.17%     62.05MiB / 15.63GiB   0.39%     0B / 0B   333MB / 49.6MB   11
5fe726a80a2a   mev-boost     0.00%     13.4MiB / 15.63GiB    0.08%     0B / 0B   2.27MB / 0B      8
              total        used        free      shared  buff/cache   available
Mem:          16002       12769         319           1        2913        2891

2022-09-13T16:30:00Z
CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O        PIDS
e4cabe21cdc2   nethermind    237.93%   8.298GiB / 15.63GiB   53.10%    0B / 0B   411GB / 175GB    67
744ba7fb924a   beacon-node   2.00%     4.423GiB / 15.63GiB   28.30%    0B / 0B   11.8GB / 254GB   19
4257e099dca9   validator     0.09%     62.32MiB / 15.63GiB   0.39%     0B / 0B   338MB / 50.7MB   11
5fe726a80a2a   mev-boost     0.00%     13.42MiB / 15.63GiB   0.08%     0B / 0B   2.43MB / 0B      8
              total        used        free      shared  buff/cache   available
Mem:          16002       13209         160           1        2632        2450

2022-09-13T17:30:00Z
CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O        PIDS
e4cabe21cdc2   nethermind    34.51%    8.526GiB / 15.63GiB   54.56%    0B / 0B   427GB / 182GB    65
744ba7fb924a   beacon-node   40.82%    4.419GiB / 15.63GiB   28.28%    0B / 0B   11.9GB / 258GB   19
4257e099dca9   validator     0.15%     61.78MiB / 15.63GiB   0.39%     0B / 0B   353MB / 51.5MB   11
5fe726a80a2a   mev-boost     0.00%     13.83MiB / 15.63GiB   0.09%     0B / 0B   4.76MB / 0B      8
              total        used        free      shared  buff/cache   available
Mem:          16002       13028         392           1        2581        2631

2022-09-13T18:00:00Z
CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O        PIDS
e4cabe21cdc2   nethermind    35.78%    8.542GiB / 15.63GiB   54.66%    0B / 0B   439GB / 187GB    67
744ba7fb924a   beacon-node   25.62%    4.16GiB / 15.63GiB    26.62%    0B / 0B   12.1GB / 261GB   19
4257e099dca9   validator     0.18%     61.76MiB / 15.63GiB   0.39%     0B / 0B   357MB / 52.1MB   11
5fe726a80a2a   mev-boost     0.00%     13.42MiB / 15.63GiB   0.08%     0B / 0B   4.89MB / 0B      8
              total        used        free      shared  buff/cache   available
Mem:          16002       12870         372           1        2759        2792


2022-09-13T19:00:00Z
CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O        PIDS
e4cabe21cdc2   nethermind    118.73%   8.46GiB / 15.63GiB    54.13%    0B / 0B   459GB / 196GB    68
744ba7fb924a   beacon-node   23.90%    3.965GiB / 15.63GiB   25.37%    0B / 0B   12.3GB / 265GB   19
4257e099dca9   validator     0.10%     61.82MiB / 15.63GiB   0.39%     0B / 0B   366MB / 53MB     11
5fe726a80a2a   mev-boost     0.00%     13.59MiB / 15.63GiB   0.08%     0B / 0B   5.14MB / 0B      8
              total        used        free      shared  buff/cache   available
Mem:          16002       13298         162           1        2541        2362

2022-09-13T21:30:00Z
CONTAINER ID   NAME          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O        PIDS
e4cabe21cdc2   nethermind    188.09%   8.524GiB / 15.63GiB   54.54%    0B / 0B   514GB / 221GB    68
744ba7fb924a   beacon-node   2.79%     3.941GiB / 15.63GiB   25.22%    0B / 0B   13.1GB / 280GB   19
4257e099dca9   validator     0.08%     62.05MiB / 15.63GiB   0.39%     0B / 0B   385MB / 56.1MB   11
5fe726a80a2a   mev-boost     0.00%     13.67MiB / 15.63GiB   0.09%     0B / 0B   6.05MB / 0B      8
              total        used        free      shared  buff/cache   available
Mem:          16002       12890         170           1        2941        2772

nazar-kuzo avatar Sep 14 '22 17:09 nazar-kuzo

@kkokosa I'm also attaching the available RAM chart for the last 48 hours to show the overall dynamics. (The spike at the beginning was due to the Nethermind restart to enable dotnet-monitor diagnostics.)

image

nazar-kuzo avatar Sep 14 '22 17:09 nazar-kuzo