OOM and Utilization Issues when using Prysm v5

Open fhildeb opened this issue 1 year ago • 15 comments

Describe the bug

I'm running a Prysm validator on LUKSO (a Layer 1 EVM chain, up to date with Shanghai-Capella). In preparation for the upcoming Cancun-Deneb fork, other homestakers and I upgraded to Prysm v5.0.3.

Since upgrading, I have seen:

  • much higher CPU usage (11-30% instead of 3-5%) and temperatures (75-85°C instead of 35-40°C)
  • physical memory usage that grows constantly until I get an OOM error

The EL client stayed the same the entire time (Geth v1.14.0).

After reaching the memory limit, the CPU spikes up (to 75%) until Prysm crashes. Up to the OOM error from the OS, there are no visible warnings or errors in the logs. I'm using 32GB of RAM, so the Prysm client crashes after 48-55 hours. Other LUKSO community members running Prysm validators reported similar errors after upgrading; for those with only 16GB of RAM, the client crashes after roughly a day.

Every time it crashed, I reverted back one version to narrow down where the root cause was introduced. So far, I've hit the same OOM issue on v5.0.3, v5.0.2, v5.0.1, and v5.0.0, which leads me to conclude that it was introduced with v5. When downgrading to v4.2.1, everything returns to normal, and the physical memory of the validator and consensus client combined does not grow beyond 5GB.

Each time Prysm crashed, I started from a clean setup, removing all blockchain data gathered during the previous attempt, and used checkpoint sync to quickly get back online. It might therefore be that this memory issue only exists while the EL client is syncing; however, I did not investigate this further, and it is purely speculative.
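For reference, the checkpoint-sync part of the Prysm configuration roughly looks like the sketch below; the endpoint is a placeholder and not the beacon API I actually used.

```yaml
# Sketch of the checkpoint-sync keys in a prysm.yaml (flag names map to YAML keys).
# The URL is a placeholder; point it at a trusted beacon API for your network.
checkpoint-sync-url: "https://trusted-beacon-node.example:3500"
genesis-beacon-api-url: "https://trusted-beacon-node.example:3500"
```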

I've also seen other issues being opened about OOM lately:

  • https://github.com/prysmaticlabs/prysm/issues/13963
  • https://github.com/prysmaticlabs/prysm/issues/13964
  • https://github.com/prysmaticlabs/prysm/issues/13845
  • ...

As well as a draft PR about a potential memory bugfix:

  • https://github.com/prysmaticlabs/prysm/pull/14011

Would love to know:

  • What is going on with the increased CPU usage, and whether it is related to the growing memory
  • Whether there are certain flags/configurations necessary to reduce resource usage

Monitoring v5.0.2

[screenshot: monitoring_node_prysm_v5]

Returning to v4.2.1 after it crashed

[screenshot: 2024-05-17 at 11:35:59]

Has this worked before in a previous version?

Yes. 4.2.1

🔬 Minimal Reproduction

  1. Start Geth v1.14.0 with these Geth parameters
  2. Start Prysm v5.0.0, v5.0.1, v5.0.2, or v5.0.3 with these Prysm and these Validator parameters
  3. Wait to see the physical memory grow indefinitely
  4. After using up all available physical memory, the client crashes

To simplify starting the clients, I've used the LUKSO CLI Tool to create a JWT and load the network configuration. However, it just starts up the EL/CL clients and should not be related to the issue.
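Since the linked parameter files aren't reproduced here, the sketch below only illustrates the shape of the prysm.yaml used for the reproduction; the values are examples rather than my exact settings.

```yaml
# Illustrative prysm.yaml for the reproduction above (example values only,
# not the exact parameter files linked in this report).
accept-terms-of-use: true
execution-endpoint: "http://localhost:8551"
jwt-secret: "/path/to/jwt.hex"     # placeholder; created via the LUKSO CLI in my setup
p2p-max-peers: 100
subscribe-all-subnets: true        # enabled in the original configuration
```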

Error

OS ERROR: OOM (Out of Memory). The Prysm process crashed.

Platform(s)

Linux (x86)

What version of Prysm are you running? (Which release)

v5.0.0 and above

Anything else relevant (validator index / public key)?

Used OS/Hardware:

  • Operating System: Ubuntu 22.04.2 Server
  • Processor: Intel Core i7-10710U (4.7 GHz, 6 Cores, 12 Threads)
  • Motherboard: Intel NUC 10 (NUC10i7FNHN)
  • RAM: 32GB DDR4

fhildeb · May 17 '24

gm, what flags are you using to run Prysm?

prestonvanloon · May 17 '24

Actually, I see your flags. Thanks

Try turning off subscribe-all-subnets. That uses a huge amount of memory and is rarely necessary.

prestonvanloon · May 17 '24

Also lower your max peers to something sensible like 100. Both of those flags will require more and more memory.

We are still investigating the OOMs you have linked, but we know that --subscribe-all-subnets often doubles the memory requirement, and your peer count is really high, so I suspect these are the reasons your node is crashing.
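In prysm.yaml terms, that suggestion amounts to roughly the following (a sketch, assuming the CLI flags map one-to-one to config keys):

```yaml
# Suggested change (sketch): cap peers and stop subscribing to every subnet.
p2p-max-peers: 100
# subscribe-all-subnets: true   <- remove this line (or leave the flag unset)
```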

prestonvanloon · May 17 '24

GM, wow, thanks for the quick response.

Yeah my flags are included in the config files/parameters above: prysm.yaml and validator.yaml

Actually, I've already adjusted the max peers to 100, should've stated that 😅 But will try to turn off the --subscribe-all-subnets and report back. 🙏🏻

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

fhildeb · May 17 '24

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

prestonvanloon · May 17 '24

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

Is this subnet required for Deneb validators?

mxmar · May 17 '24

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

Is this subnet required for Deneb validators?

Yes. Blobs are required in Deneb.

prestonvanloon · May 21 '24

I just wanted to say this is an awesome issue report @fhildeb

rkapka · May 27 '24

Try turning off subscribe-all-subnets. That uses a huge amount of memory and is rarely necessary. Also lower your max peers to something sensible like 100. Both of those flags will require more and more memory.

I used a max peer count of 100 and also tried with just 50. I also set subscribe-all-subnets to false.

The issue remains (I used v5.0.3 in this case and waited 2 days again).

The LUKSO network does not have blobs yet, as it's only up to Shanghai-Capella (as stated in the report), so the blob subnet configuration should not cause these increases compared to v4.2.1.

fhildeb · May 27 '24

Hey @prestonvanloon :wave: Is there any new update about this issue? Does https://github.com/prysmaticlabs/prysm/releases/tag/v5.0.4 solve this problem?

externalman · Jul 15 '24

We are still experiencing the issue on v5.0.4

MrKoberman · Jul 26 '24

We are still experiencing the issue on v5.1.0

git-ljm · Sep 12 '24

What flags are you running this with, @git-ljm, and which network is this on?

nisdas · Sep 12 '24

Thanks for the active replies and work on this front, @prestonvanloon, @nisdas, and @rkapka 🙏🏻 I've seen many bug fixes and improvements in versions 5.0.4 and 5.1.0 that might be related to my troubles.

I've updated my nodes this week and experimented with versions 5.0.4 and 5.1.0 using the same configs described above (having --subscribe-all-subnets turned off). The problem did not occur within the respective version's 2-3 day runtime and can be marked as solved. ✅

If you still have problems, @git-ljm & @MrKoberman, I suggest opening your own issue with specific details.

fhildeb · Oct 05 '24

Update: After testing, I updated my production environment to 5.0.4, the latest version that LUKSO currently supports @mxmar @wolmin. Unfortunately, after four days of running, Prysm builds up a lot of memory again, as described above. CPU utilization increases as well, and temperatures have doubled compared to Prysm 4.2.1 on the same network and configuration.

[screenshot: Node Utility]

I recommend leaving this issue open until I can test 5.1.0 in production, as it fixed some memory leaks that I consider essential. However, I only want to test on mainnet once the LUKSO team whitelists the newer version in their CLI. Hopefully, this will be soon, as these issues affect some homestakers in the community.

fhildeb · Oct 07 '24

It seems the issue was that setting subscribe-all-subnets to false does not work correctly. I resolved the issue by simply removing the parameter.

I also turned the slasher off to ensure lower memory usage. The node has been running smoothly for the last 2 months using 5.0.4.
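A minimal before/after sketch of the relevant prysm.yaml lines, assuming the CLI flags map to config keys of the same name:

```yaml
# Config that still leaked memory for me (relevant keys only):
#   subscribe-all-subnets: false   # explicitly false, but did not seem to take effect
#   slasher: true
#
# Working config on v5.0.4: both keys removed entirely, so Prysm falls back to its
# defaults (no full subnet subscription, slasher disabled).
p2p-max-peers: 100                   # kept from the earlier advice in this thread
```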

fhildeb · Jan 10 '25