OOM and Utilization Issues when using Prysm v5

Open fhildeb opened this issue 1 year ago • 15 comments

Describe the bug

I'm running a Prysm validator on LUKSO (a Layer 1 EVM chain, up to date with Shanghai-Capella). In preparation for the upcoming Cancun-Deneb fork, other homestakers and I upgraded to Prysm v5.0.3.

Since upgrading, I have seen:

  • much higher CPU usage (11-30% instead of 3-5%) and temperatures (75-85°C instead of 35-40°C)
  • physical memory usage that grows constantly until I get an OOM error

The EL client stayed the same the entire time (Geth v1.14.0).

After reaching the memory limit, the CPU spikes up (to 75%) until Prysm crashes. Up to the OOM error from the OS, there are no visible warnings or errors in the logs. I'm using 32GB of RAM, so the Prysm client crashes after 48-55 hours. Other LUKSO community members running Prysm validators reported similar errors after upgrading; for those with only 16GB of RAM, the client crashes after roughly a day.

Every time it crashed, I reverted back one version to narrow down where the root cause was introduced. So far, I've hit the same OOM issue on v5.0.3, v5.0.2, v5.0.1, and v5.0.0, which leads me to conclude that it was introduced with v5. When downgrading to v4.2.1, everything returns to normal, and the physical memory of the validator and consensus client combined does not grow beyond 5GB.

Each time Prysm crashed, I started from a clean setup, removing all blockchain data gathered during the previous attempt, and used checkpoint sync to quickly get back online. It might therefore be that this memory issue only exists while the EL client is syncing; however, I did not investigate this further, and it is purely speculative.
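For reference, the checkpoint-sync part of the Prysm configuration roughly looks like the sketch below; the endpoint is a placeholder and not the beacon API I actually used.

```yaml
# Sketch of the checkpoint-sync keys in a prysm.yaml (flag names map to YAML keys).
# The URL is a placeholder; point it at a trusted beacon API for your network.
checkpoint-sync-url: "https://trusted-beacon-node.example:3500"
genesis-beacon-api-url: "https://trusted-beacon-node.example:3500"
```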

I've also seen other issues being opened about OOM lately:

  • https://github.com/prysmaticlabs/prysm/issues/13963
  • https://github.com/prysmaticlabs/prysm/issues/13964
  • https://github.com/prysmaticlabs/prysm/issues/13845
  • ...

As well as a draft PR about a potential memory bugfix:

  • https://github.com/prysmaticlabs/prysm/pull/14011

Would love to know:

  • What is going on with the increased CPU usage, and whether it is related to the growing memory
  • Whether there are certain flags/configurations necessary to reduce resource usage

Monitoring v5.0.2

[screenshot: monitoring_node_prysm_v5]

Returning to v4.2.1 after it crashed

[screenshot: 2024-05-17 at 11:35:59]

Has this worked before in a previous version?

Yes. 4.2.1

🔬 Minimal Reproduction

  1. Start Geth v1.14.0 with these Geth parameters
  2. Start Prysm v5.0.0, v5.0.1, v5.0.2, or v5.0.3 with these Prysm and these Validator parameters
  3. Wait to see the physical memory grow indefinitely
  4. After using up all available physical memory, the client crashes

To simplify starting the clients, I've used the LUKSO CLI Tool to create a JWT and load the network configuration. However, it just starts up the EL/CL clients and should not be related to the issue.
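Since the linked parameter files aren't reproduced here, the sketch below only illustrates the shape of the prysm.yaml used for the reproduction; the values are examples rather than my exact settings.

```yaml
# Illustrative prysm.yaml for the reproduction above (example values only,
# not the exact parameter files linked in this report).
accept-terms-of-use: true
execution-endpoint: "http://localhost:8551"
jwt-secret: "/path/to/jwt.hex"     # placeholder; created via the LUKSO CLI in my setup
p2p-max-peers: 100
subscribe-all-subnets: true        # enabled in the original configuration
```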

Error

OS ERROR: OOM (Out of Memory). The Prysm process crashed.

Platform(s)

Linux (x86)

What version of Prysm are you running? (Which release)

v5.0.0 and above

Anything else relevant (validator index / public key)?

Used OS/Hardware:

  • Operating System: Ubuntu 22.04.2 Server
  • Processor: Intel Core i7-10710U (4.7 GHz, 6 Cores, 12 Threads)
  • Motherboard: Intel NUC 10 (NUC10i7FNHN)
  • RAM: 32GB DDR4

fhildeb · May 17 '24

gm, what flags are you using to run Prysm?

prestonvanloon · May 17 '24

Actually, I see your flags. Thanks

Try turning off subscribe-all-subnets. That uses a huge amount of memory and is rarely necessary.

prestonvanloon · May 17 '24

Also lower your max peers to something sensible like 100. Both of those flags will require more and more memory.

We are still investigating the OOMs you have linked, but we know that --subscribe-all-subnets often doubles the memory requirement, and your peer count is really high, so I suspect these are the reasons your node is crashing.
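In prysm.yaml terms, that suggestion amounts to roughly the following (a sketch, assuming the CLI flags map one-to-one to config keys):

```yaml
# Suggested change (sketch): cap peers and stop subscribing to every subnet.
p2p-max-peers: 100
# subscribe-all-subnets: true   <- remove this line (or leave the flag unset)
```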

prestonvanloon · May 17 '24

GM, wow, thanks for the quick response.

Yeah my flags are included in the config files/parameters above: prysm.yaml and validator.yaml

Actually, I've already adjusted the max peers to 100, should've stated that 😅 But will try to turn off the --subscribe-all-subnets and report back. 🙏🏻

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

fhildeb · May 17 '24

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

prestonvanloon · May 17 '24

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

Is this subnet required for Deneb validators?

mxmar · May 17 '24

Is there any reason why this is not an issue in 4.2.1, tho? If it takes tremendously more memory?

v4 doesn't have subnets for blobs

Is this subnet required for Deneb validators?

Yes. Blobs are required in Deneb.

prestonvanloon · May 21 '24

I just wanted to say this is an awesome issue report @fhildeb

rkapka · May 27 '24

Try turning off subscribe-all-subnets. That uses a huge amount of memory and is rarely necessary. Also lower your max peers to something sensible like 100. Both of those flags will require more and more memory.

I used a max peer count of 100 and also tried with just 50. I also set subscribe-all-subnets to false.

The issue remains (I used v5.0.3 in this case and waited 2 days again).

The LUKSO network does not have blobs yet, as it's only up to Shanghai-Capella (as stated in the report), so the blob subnet configuration should not cause these increases compared to v4.2.1.

fhildeb · May 27 '24

Hey @prestonvanloon :wave: Is there any new update about this issue? Does https://github.com/prysmaticlabs/prysm/releases/tag/v5.0.4 solve this problem?

externalman · Jul 15 '24

We are still experiencing the issue on v5.0.4

MrKoberman · Jul 26 '24

We are still experiencing the issue on v5.1.0

git-ljm · Sep 12 '24

What flags are you running this with, @git-ljm, and which network is this on?

nisdas · Sep 12 '24

Thanks for the active replies and work on this front, @prestonvanloon, @nisdas, and @rkapka 🙏🏻 I've seen many bug fixes and improvements in versions 5.0.4 and 5.1.0 that might be related to my troubles.

I've updated my nodes this week and experimented with versions 5.0.4 and 5.1.0 using the same configs described above (having --subscribe-all-subnets turned off). The problem did not occur within the respective version's 2-3 day runtime and can be marked as solved. ✅

If you still have problems, @git-ljm & @MrKoberman, I suggest opening your own issue with specific details.

fhildeb · Oct 05 '24

Update: After testing, I updated my production environment to 5.0.4, the latest version that LUKSO currently supports @mxmar @wolmin. Unfortunately, after four days of running, Prysm builds up a lot of memory again, as described above. CPU utilization increases as well, and temperatures have doubled compared to Prysm 4.2.1 on the same network and configuration.

[screenshot: Node Utility]

I recommend leaving this issue open until I can test 5.1.0 in production, as it fixed some memory leaks that I consider essential. However, I only want to test on mainnet once the LUKSO team whitelists the newer version in their CLI. Hopefully, this will be soon, as these issues affect some homestakers in the community.

fhildeb · Oct 07 '24

It seems the issue was that setting subscribe-all-subnets to false does not work correctly. I resolved the issue by simply removing the parameter.

I also turned the slasher off to ensure lower memory usage. The node has been running smoothly for the last 2 months using 5.0.4.
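A minimal before/after sketch of the relevant prysm.yaml lines, assuming the CLI flags map to config keys of the same name:

```yaml
# Config that still leaked memory for me (relevant keys only):
#   subscribe-all-subnets: false   # explicitly false, but did not seem to take effect
#   slasher: true
#
# Working config on v5.0.4: both keys removed entirely, so Prysm falls back to its
# defaults (no full subnet subscription, slasher disabled).
p2p-max-peers: 100                   # kept from the earlier advice in this thread
```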

fhildeb · Jan 10 '25