
nethermind memory leak?

Open kiss81 opened this issue 9 months ago • 9 comments

Nethermind is leaking memory for me. I run it in an Ubuntu Noble Proxmox LXC container with 6 CPU cores / 32 GB RAM, with Nimbus as the consensus client. CPU load is very low (<5%). I thought this might have had to do with this issue (https://github.com/NethermindEth/nethermind/issues/5197), but it's not solved for me in the latest version:

/usr/local/bin/nethermind/nethermind --version
Version:    1.31.6+4e68f8ee
Commit:     4e68f8eefac9a7aff7d04538088780826ee64f81
Build date: 2025-03-19 15:10:39Z
Runtime:    .NET 9.0.3
Platform:   Linux x64

startup command: /usr/local/bin/nethermind/nethermind --config mainnet --datadir /var/lib/nethermind --Sync.SnapSync true --Sync.AncientBodiesBarrier 11052984 --Sync.AncientReceiptsBarrier 11052984 --JsonRpc.JwtSecretFile /var/lib/jwtsecret/jwt.hex --Init.BaseDbPath /var/lib/nethermind --HealthChecks.Enabled true --JsonRpc.Enabled true --JsonRpc.EngineHost 0.0.0.0 --JsonRpc.EnginePort 8551 --Network.MaxActivePeers 15 --init-memoryhint 2000000000
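As a side note on the command above: per the Nethermind configuration docs, `Init.MemoryHint` is given in bytes, so the value can be sanity-checked with a little shell arithmetic (a sketch, not from the thread):

```shell
# rough conversion of the Init.MemoryHint value (bytes) to MiB
HINT=2000000000
echo "$((HINT / 1024 / 1024)) MiB"   # prints "1907 MiB", i.e. just under 2 GiB
```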

Not sure if it's normal, but htop is showing lots of Nethermind processes (I didn't count them, but >30). Not sure what more I can provide, but let me know and I will add the info.

(Screenshot attached; restarting the nethermind process regularly to work around the problem for now...)

If I let the memory keep growing, it will eventually crash the LXC container (100% CPU, etc.).

kiss81 avatar Apr 12 '25 10:04 kiss81

Hello, what is the use case of the node? If it's RPC, which RPC method does it predominantly call?

asdacap avatar Apr 15 '25 02:04 asdacap

> Hello, what is the use case of the node? If it's RPC, which RPC method does it predominantly call?

It's running an Ethereum validator node (Nimbus as the consensus client).

kiss81 avatar Apr 15 '25 17:04 kiss81

Can you share more on "htop is showing lots of nethermind processes"? And some logs?

asdacap avatar Apr 16 '25 15:04 asdacap

> Can you share more on "htop is showing lots of nethermind processes"? And some logs?

To see all Nethermind threads I ran `ps -eLf | grep nethermind > threads.log`. The number of threads is more than the specified 6...

threads.log

I cleared /var/lib/nethermind/logs/, as a lot of old logs were showing up. After an hour or so I checked the log files in /var/lib/nethermind/logs/ again: all normal INFO logging. Anything I should look for or enable?

kiss81 avatar Apr 16 '25 21:04 kiss81
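Side note: `ps -eLf` prints one line per thread (LWP), not per process, so a count well above `Sync.MaxProcessingThreads` is expected; the .NET runtime and the database layer spawn many worker threads of their own. A sketch for counting threads of a single process (the `pidof` lookup is an assumption about how the binary is named):

```shell
# count the threads of a single process (Linux); NLWP = number of LWPs (threads)
PID=$$   # placeholder: substitute the real PID, e.g. $(pidof nethermind)
ps -o nlwp= -p "$PID"                 # thread count as reported by ps
grep '^Threads:' "/proc/$PID/status"  # the same number straight from procfs
```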

The only WARN log I see is this one (6 times now): `mainnet.logs.0.txt:2025-04-18 12:33:46.8816|WARN|HealthChecks.HealthChecksPlugin|50|No incoming messages from the consensus client that is required for sync.`

No error log...

Is there anything I should try to change in my config?

kiss81 avatar Apr 20 '25 21:04 kiss81

Currently testing v1.31.11, as it seems to fix some memory leaks. I will report back.

kiss81 avatar May 22 '25 13:05 kiss81

I also had a memory leak, and it was related to filters. It was fixed by https://github.com/NethermindEth/nethermind/pull/8633 and https://github.com/NethermindEth/nethermind/pull/8632.

jesusvico avatar May 23 '25 09:05 jesusvico

Hard to tell if the leak is solved, as during the first 24 hours the RAM increases and settles a bit. Let's hope for the best. :)

kiss81 avatar May 23 '25 14:05 kiss81

Still an issue with v1.31.11: RAM grows from ~5 GB to 18 GB in 2 days. My workaround is OK for now: a crontab script that restarts the nethermind service if RAM usage goes above 80%... Would be nice if that weren't necessary, of course.

kiss81 avatar May 26 '25 16:05 kiss81

@jesusvico @kiss81, are you still experiencing this issue with the latest version?

MarekM25 avatar Jul 01 '25 13:07 MarekM25

> @jesusvico @kiss81, are you still experiencing this issue with the latest version?

I'm on v1.32.2 now and the issue is still there. RAM usage increases by about 4 GB every 24 hours. As a workaround I made a "watchdog" script in Linux that restarts the nethermind service when I reach my RAM limit. It kind of works, but it would be nice if this eventually gets fixed. :) Let me know if you need any more logging/info.

kiss81 avatar Jul 01 '25 21:07 kiss81
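For anyone wanting the same stopgap, here is a minimal sketch of such a watchdog, run from cron. The 80% threshold and the `nethermind` service name are taken from the comments above; everything else is an assumption, not the commenter's actual script:

```shell
#!/usr/bin/env sh
# restart the nethermind service when system memory usage crosses a threshold
THRESHOLD=80  # percent

# used-memory percentage from MemTotal and MemAvailable (both in KiB)
mem_pct() {
    total=$1; avail=$2
    echo $(( (total - avail) * 100 / total ))
}

total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)

if [ "$(mem_pct "$total" "$avail")" -ge "$THRESHOLD" ]; then
    systemctl restart nethermind
fi
```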

I also have a memory leak issue with Nethermind, using 1.31.9. These are my settings:

ExecStart=/usr/local/bin/nethermind/nethermind \
  --config mainnet \
  --datadir /var/lib/nethermind \
  --Sync.SnapSync true \
  --Sync.AncientBodiesBarrier 11052984 \
  --Sync.AncientReceiptsBarrier 11052984 \
  --JsonRpc.JwtSecretFile /var/lib/jwtsecret/jwt.hex \
  --JsonRpc.AdditionalRpcUrls http://127.0.0.1:1337|http|admin \
  --JsonRpc.EnginePort 8551 \
  --JsonRpc.EngineHost 127.0.0.1 \
  --Pruning.FullPruningCompletionBehavior AlwaysShutdown \
  --Pruning.FullPruningTrigger=VolumeFreeSpace \
  --Pruning.FullPruningThresholdMb=375810 \
  --Pruning.FullPruningMaxDegreeOfParallelism 4 \
  --Pruning.FullPruningMemoryBudgetMb=16384 \
  --Network.ActivePeersMaxCount 30 \
  --Metrics.Enabled true \
  --JsonRpc.Enabled true \
  --JsonRpc.Host localIP

The thing is, I also have another node running 1.31.9 with almost the same settings (except that it doesn't have the JsonRpc.Enabled true and JsonRpc.Host flags), and it is doing fine with no memory leak issue.

Some screenshots of the memory-leak node. It looks like it has something to do with file descriptors, which grow alongside memory usage (the drop in memory around 1 July is because I restarted Nethermind):

(Screenshots attached: memory usage and file descriptor count.)

I don't know why one node has the issue but the other doesn't. OP, did you use any metrics-related settings? This comment from the EthStaker Discord mentions it may have something to do with Pushgateway metrics: https://discord.com/channels/694822223575384095/1024252427840540683/1383471122879615058

but I don't have Pushgateway installed on the node with the memory leak issue.

One difference I can think of: the node with no memory leak is running Ubuntu 20.04, while the node with the memory leak is running Ubuntu 24.04. Which OS are you using?

chong-he avatar Jul 06 '25 14:07 chong-he
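The file-descriptor growth can also be tracked directly from procfs, without node exporter; a sketch (the PID lookup is a placeholder, not from the thread):

```shell
# count the open file descriptors of a process via procfs (Linux)
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

PID=$$   # placeholder: substitute the nethermind PID, e.g. $(pidof nethermind)
echo "open fds: $(fd_count "$PID") / soft limit: $(ulimit -n)"
```

Running this periodically and watching the count climb toward `ulimit -n` would confirm the leak is fd-backed (sockets, log files, or database handles) rather than purely heap growth.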

Thank you for your extensive research @chong-he ! These are my current settings:

ExecStart=/usr/local/bin/nethermind/nethermind \
  --config mainnet \
  --datadir /var/lib/nethermind \
  --Sync.SnapSync true \
  --Sync.AncientBodiesBarrier 11052984 \
  --Sync.AncientReceiptsBarrier 11052984 \
  --JsonRpc.JwtSecretFile /var/lib/jwtsecret/jwt.hex \
  --Init.BaseDbPath /var/lib/nethermind \
  --Network.MaxActivePeers 15 \
  --Init.MemoryHint 4096000000 \
  --Sync.MaxProcessingThreads 4 \
  --Pruning.CacheMb 2048

kiss81 avatar Jul 06 '25 21:07 kiss81

> Thank you for your extensive research @chong-he ! These are my current settings: [...]

Well, we both want to get the issue resolved, don't we? I am trying different things before the last resort, which is changing clients. You don't have the metrics-related flag, so that may not be it. Do you have node exporter running, and are you seeing the same file descriptor pattern as in my case? Which OS are you on?

Have you tried a fresh sync before? I am wondering whether something in the database "triggers" it and whether a fresh sync would help.

chong-he avatar Jul 06 '25 23:07 chong-he

@chong-he yeah, it would be great if we could find the cause, and even better a solution :) I don't have any metrics enabled now, but I had them enabled in the past. I haven't tried a fresh sync, so that could be worth a try. I am on Ubuntu Noble (24.04) in a Proxmox LXC container.

kiss81 avatar Jul 07 '25 12:07 kiss81

> @chong-he yeah, it would be great if we could find the cause, and even better a solution :) I don't have any metrics enabled now, but I had them enabled in the past. I haven't tried a fresh sync, so that could be worth a try. I am on Ubuntu Noble (24.04) in a Proxmox LXC container.

We are on the same OS. But I presume many others are also using Ubuntu 24.04, and I haven't seen more reports of memory leaks apart from this issue.

The next thing I want to try is removing --Metrics.Enabled true, but since you mention having this issue without metrics too, it probably won't help. If you do try a resync, I would appreciate an update here on whether it resolves the issue. I might also try a resync if nothing else works out.

chong-he avatar Jul 07 '25 12:07 chong-he

I doubt it's an OS-related issue, but you never know... I now run with --Metrics.Enabled false to make sure metrics are disabled. I am not keen on doing a full resync, but it's worth a try.

kiss81 avatar Jul 07 '25 13:07 kiss81

> I doubt it's an OS-related issue, but you never know... I now run with --Metrics.Enabled false to make sure metrics are disabled. I am not keen on doing a full resync, but it's worth a try.

Yeah, that's what I mean: there are many people on Ubuntu 24.04, so if it were the issue we would have seen many more reports, but we haven't. Let me know if you find the flag helps, thanks!

chong-he avatar Jul 07 '25 15:07 chong-he

The same issue persists for me, but on an RPC node that receives a significant number of eth_blockNumber calls. Version 1.31.11.

naviat avatar Jul 28 '25 12:07 naviat

> The same issue persists for me, but on an RPC node that receives a significant number of eth_blockNumber calls. Version 1.31.11.

So is it an issue caused by Nimbus, or should it be solved in Nethermind?

kiss81 avatar Jul 28 '25 17:07 kiss81

@kiss81 it's Nethermind. I'm using Nethermind with Lighthouse as the CL. This is my config, on a machine with 20 GB of memory:

/nethermind/nethermind --config="gnosis" \
         --datadir=/data \
         --baseDbPath=/data \
         --Mining.MinGasPrice=1 \
         --Blocks.MinGasPrice=1 \
         --Network.P2PPort=30303 \
         --Network.DiscoveryPort=30303 \
         --JsonRpc.Enabled=true \
         --JsonRpc.Host=0.0.0.0 \
         --JsonRpc.Port=8545 \
         --JsonRpc.EnabledModules=net,eth,web3,subscribe,debug,trace,parity \
         --Init.WebSocketsEnabled=true \
         --Init.StateDbKeyScheme=HalfPath \
         --JsonRpc.WebSocketsPort=8546 \
         --Metrics.Enabled=true \
         --Metrics.ExposePort=6060 \
         --HealthChecks.Enabled=true \
         --JsonRpc.GasCap=100000000 \
         --JsonRpc.Timeout=20000 \
         --JsonRpc.MaxRequestBodySize=30000000 \
         --JsonRpc.EthModuleConcurrentInstances=2048 \
         --JsonRpc.JwtSecretFile=/consensus/jwtsecret \
         --JsonRpc.EngineHost=0.0.0.0 \
         --JsonRpc.EnginePort=8551 \
         --JsonRpc.EngineEnabledModules="net,eth,web3,subscribe,debug,trace,parity" \
         --Pruning.CacheMb=1024 \
         --Pruning.FullPruningMaxDegreeOfParallelism=0 \
         --Pruning.FullPruningMinimumDelayHours=240 \
         --Pruning.FullPruningThresholdMb=256000 \
         --Pruning.FullPruningTrigger=StateDbSize \
         --Pruning.Mode="Hybrid" \
         --Pruning.PersistenceInterval=8192 \
         --Pruning.FullPruningMemoryBudgetMb=8000
(Screenshot: memory usage over time.)

naviat avatar Jul 29 '25 03:07 naviat

> @kiss81 it's Nethermind. I'm using Nethermind with Lighthouse as the CL. This is my config, on a machine with 20 GB of memory:

Are you referring to the memory increase from about 15:00? What happens before that? Is there no memory leak before that?

chong-he avatar Aug 04 '25 03:08 chong-he

@chong-he you're right, the memory increased before that as well

(Screenshot attached.)

naviat avatar Aug 07 '25 21:08 naviat

Any comment @chong-he ?

avinashbo avatar Aug 13 '25 06:08 avinashbo

After a fresh sync, my Nethermind no longer leaks memory, at least I think so, just from re-syncing a fresh Nethermind (also using 1.31.9). Before 6 Aug it had the memory leak issue; after a resync on 6 Aug it is good now. (Note that the sudden increase in memory after 6 Aug was caused by another process, not Nethermind; I checked htop and confirmed this.)

(Screenshot attached.)

The reason I took so long to respond is that each round of testing (changing some flags, etc.) needs 2-3 days to see a pattern and 4-5 days to confirm whether the memory leak is there. In the end nothing worked, so I just went ahead with a fresh sync. After running for 10 days, I can say the newly synced Nethermind no longer has the memory leak, with memory usage staying in the 3.5-4.5 GB range over those 10 days.

So my theory is that something related to the database caused the memory leak (of course, I could be wrong). But what else can explain this? The flags I used before and after are pretty much the same (I removed some for the fresh sync to keep things simpler). These are the flags I use with the fresh sync:

ExecStart=/usr/local/bin/nethermind/nethermind \
  --config mainnet \
  --datadir /var/lib/nethermind2 \
  --Pruning.Mode=Hybrid \
  --JsonRpc.JwtSecretFile /var/lib/jwtsecret/jwt.hex \
  --Network.P2PPort 30330 \
  --Network.DiscoveryPort 30330 \
  --JsonRpc.EnginePort 8551 \
  --JsonRpc.EngineHost 127.0.0.1 \
  --Pruning.FullPruningCompletionBehavior AlwaysShutdown \
  --Pruning.FullPruningTrigger=VolumeFreeSpace \
  --Pruning.FullPruningThresholdMb=375810 \
  --Pruning.FullPruningMaxDegreeOfParallelism 4 \
  --Pruning.FullPruningMemoryBudgetMb=16384 \
  --Network.ActivePeersMaxCount 30 \
  --JsonRpc.Enabled true \
  --JsonRpc.Host local-IP

I hope this helps

chong-he avatar Aug 15 '25 10:08 chong-he

I removed everything and did a full resync: disk space is reduced, but unfortunately I still have the memory leak issue...

edit: not sure yet; it seems like it is a bit less. Will report back in a few days.

kiss81 avatar Aug 23 '25 15:08 kiss81

Still no luck after a full resync:

Image

What I also notice is that shutting down Nethermind doesn't really work: it seems to time out when stopping via systemd and gets killed... I tried different systemd startup scripts. Currently I'm using this:

[Service]
Type=simple
User=nethermind
Group=nethermind
Restart=always
RestartSec=5
KillSignal=SIGINT
TimeoutStopSec=900
WorkingDirectory=/var/lib/nethermind
Environment="DOTNET_BUNDLE_EXTRACT_BASE_DIR=/var/lib/nethermind"
ExecStart=/usr/local/bin/nethermind/nethermind \
  --config mainnet \
  --datadir /var/lib/nethermind \
  --JsonRpc.JwtSecretFile /var/lib/jwtsecret/jwt.hex \
  --Init.BaseDbPath /var/lib/nethermind \
  --Network.DiscoveryPort 30303 \
  --Network.P2PPort 30303 \
  --Network.MaxActivePeers 50 \
  --JsonRpc.Port 8545 \
  --JsonRpc.EnginePort 8551 \
  --Metrics.Enabled true \
  --Metrics.ExposePort 6060 \
  --Pruning.Mode=Hybrid \
  --Pruning.FullPruningTrigger=VolumeFreeSpace \
  --Pruning.FullPruningThresholdMb=375810 \
  --Pruning.FullPruningMemoryBudgetMb=16384 \
  --Pruning.FullPruningMaxDegreeOfParallelism=2 \
  --Pruning.FullPruningCompletionBehavior=AlwaysShutdown

edit: will retry with metrics disabled

kiss81 avatar Aug 25 '25 13:08 kiss81
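One way to tell whether the process really exits on `SIGINT` or is eventually SIGKILLed at `TimeoutStopSec` (900 s in the unit above) is to poll the main PID while stopping; a generic sketch, not from the thread:

```shell
# poll a PID until it exits or a timeout (in seconds) elapses; returns 0 on exit
wait_for_exit() {
    pid=$1; timeout=$2; waited=0
    while kill -0 "$pid" 2>/dev/null; do
        # still alive at the deadline: this is where systemd would SIGKILL
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1; waited=$((waited + 1))
    done
    return 0
}

# usage sketch (capture the PID before stopping):
#   PID=$(systemctl show -p MainPID --value nethermind)
#   systemctl stop nethermind & wait_for_exit "$PID" 900 || echo "hit TimeoutStopSec"
```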

> shutting down nethermind is not really working. Seems like it times out stopping it using systemd and gets killed...

Are you using `sudo systemctl stop xx` to stop Nethermind? If so, it should work normally. What you are describing doesn't sound right. Does stopping fail and get killed every time?

What is your SSD model?

chong-he avatar Aug 25 '25 13:08 chong-he

> > shutting down nethermind is not really working. Seems like it times out stopping it using systemd and gets killed...
>
> Are you using `sudo systemctl stop xx` to stop Nethermind? If so, it should work normally. What you are describing doesn't sound right. Does stopping fail and get killed every time?
>
> What is your SSD model?

It's running on a Samsung 990 Pro 4 TB. I checked the I/O speed and it's super fast (2000+ MB/s). Yes, I do use `systemctl stop` to stop it. I checked the log and the shutdown looks normal, but after shutdown the `INFO|Synchronization.Reporting.SyncReport` entries keep coming... Is that caused by metrics?

2025-08-25 13:24:35.3803|INFO|Program|212|Nethermind is shutting down... Please wait until all activities are stopped.
2025-08-25 13:24:35.3836|INFO|lambda_method188|212|Stopping session monitor
2025-08-25 13:24:35.4026|INFO|lambda_method188|212|Stopping session sync mode selector
2025-08-25 13:24:35.4056|INFO|Synchronization.ParallelSync.MultiSyncModeSelector|212|Sync mode selector stopped
2025-08-25 13:24:35.4056|INFO|lambda_method188|212|Stopping discovery app
2025-08-25 13:24:35.4135|INFO|Network.Discovery.DiscoveryConnectionsPool|212|Stopping discovery udp channel on port 30303
2025-08-25 13:24:35.4190|INFO|lambda_method188|212|Stopping block producer
2025-08-25 13:24:35.4220|INFO|lambda_method188|212|Stopping peer pool
2025-08-25 13:24:35.4229|INFO|Network.Discovery.DiscoveryApp|247|Discovery shutdown complete.. please wait for all components to close
2025-08-25 13:24:35.4256|INFO|lambda_method188|212|Stopping peer manager
2025-08-25 13:24:35.4256|INFO|Network.PeerPool|98|Peer Pool shutdown complete.. please wait for all components to close
2025-08-25 13:24:35.4291|INFO|Network.PeerManager|212|Peer Manager shutdown complete.. please wait for all components to close
2025-08-25 13:24:35.4291|INFO|lambda_method188|212|Stopping blockchain processor
2025-08-25 13:24:35.4329|INFO|lambda_method188|212|Stopping RLPx peer
2025-08-25 13:24:35.4329|INFO|Consensus.Processing.BlockchainProcessor|247|Blockchain Processor shutdown complete.. please wait for all components to close
2025-08-25 13:24:35.7719|INFO|Network.PeerManager|176|Peer update loop canceled
2025-08-25 13:24:36.4481|INFO|Network.Rlpx.RlpxHost|176|Local peer shutdown complete.. please wait for all components to close
2025-08-25 13:24:36.4490|INFO|lambda_method188|176|Disposing plugin Ethash
2025-08-25 13:24:36.4499|INFO|lambda_method188|176|Disposing plugin Merge
2025-08-25 13:24:36.4499|INFO|lambda_method188|176|Disposing plugin HealthChecks
2025-08-25 13:24:36.4539|INFO|lambda_method188|176|Disposing plugin Flashbots
2025-08-25 13:24:36.4565|INFO|Core.DisposableStack|176|Disposing Nethermind.Consensus.Producers.ProducedBlockSuggester
2025-08-25 13:24:36.4575|INFO|Core.DisposableStack|176|Disposing Nethermind.Runner.JsonRpc.JsonRpcIpcRunner
2025-08-25 13:24:36.4590|INFO|Runner.JsonRpc.JsonRpcIpcRunner|176|IPC JSON RPC service stopped
2025-08-25 13:24:36.4590|INFO|Core.DisposableStack|176|Disposing Nethermind.Core.Reactive+AnonymousDisposable
2025-08-25 13:24:36.4730|INFO|Core.DisposableStack|176|Disposing Nethermind.HealthChecks.ClHealthRequestsTracker
2025-08-25 13:24:36.4730|INFO|Core.DisposableStack|176|Disposing Nethermind.Core.Reactive+AnonymousDisposable
2025-08-25 13:24:36.4741|INFO|Core.DisposableStack|176|Disposing Nethermind.JsonRpc.Modules.Eth.FeeHistory.FeeHistoryOracle
2025-08-25 13:24:36.4755|INFO|Core.DisposableStack|176|Disposing Nethermind.Consensus.Scheduler.BackgroundTaskScheduler
2025-08-25 13:24:36.4785|INFO|Runner.Ethereum.JsonRpcRunner|247|JSON RPC service stopped
2025-08-25 13:26:32.0219|INFO|Synchronization.Reporting.SyncReport|594|Peers: 50 | with best block: 50 | eth68 (100 %) | Active: None | Sleeping: All
2025-08-25 13:29:02.0243|INFO|Synchronization.Reporting.SyncReport|487|Peers: 50 | node diversity :  Reth (66 %), Geth (18 %), Nethermind (12 %), Erigon (2 %), Besu (2 %)
2025-08-25 13:31:32.0265|INFO|Synchronization.Reporting.SyncReport|240|Peers: 50 | with best block: 50 | eth68 (100 %) | Active: None | Sleeping: All
2025-08-25 13:34:02.0295|INFO|Synchronization.Reporting.SyncReport|242|Peers: 50 | node diversity :  Reth (66 %), Geth (18 %), Nethermind (12 %), Erigon (2 %), Besu (2 %)
2025-08-25 13:36:32.0351|INFO|Synchronization.Reporting.SyncReport|347|Peers: 50 | with best block: 50 | eth68 (100 %) | Active: None | Sleeping: All
2025-08-25 13:39:02.0351|INFO|Synchronization.Reporting.SyncReport|347|Peers: 50 | node diversity :  Reth (66 %), Geth (18 %), Nethermind (12 %), Erigon (2 %), Besu (2 %)

kiss81 avatar Aug 25 '25 14:08 kiss81

> SyncReport

The SyncReport log is emitted regularly (I don't know the exact frequency), so that part is fine. You mentioned it gets killed, but the logs you posted look normal to me?

chong-he avatar Aug 26 '25 00:08 chong-he