
[Bug]: increased RAM usage on cosmos-sdk v50 full nodes

Open MSalopek opened this issue 1 year ago • 10 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

What happened?

Given recent user reports on the Hub and at least one other network, we're looking for guidance about potential performance bottlenecks on full nodes running cosmos-sdk v50.

Related issue on Gaia:

  • https://github.com/cosmos/gaia/issues/3417

This comment is especially concerning as it points to increased RAM usage related to goleveldb and iavl.

If possible, we would like to move this issue to cosmos-sdk. We can also provide information coming from other chains and node operators.

Please advise.

Thank you!

Cosmos SDK Version

v0.50.x

How to reproduce?

Not clear at this point.

It seems that changing iavl-cache-size can help mitigate the issues.

Nodes seem to be used as query nodes.
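
For operators wanting to try that mitigation, the knob lives in app.toml; a minimal sketch (781250 is, to my knowledge, the v0.50 template default, and lowering it trades query speed for a smaller resident set):

```toml
# app.toml - the iavl-cache-size mitigation mentioned above
# (the v0.50 template default is 781250; lower values shrink the
# in-memory node cache at the cost of slower reads)
iavl-cache-size = 200000
```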

MSalopek avatar Nov 26 '24 16:11 MSalopek

This is the first we have heard of this; on my nodes I haven't observed any sort of increase. Do you know if it's reproducible?

tac0turtle avatar Nov 27 '24 08:11 tac0turtle

Thank you for checking it out!

We were not able to reproduce it. I checked with another team that experienced a similar issue, but theirs was related to a custom module - fixing the module alleviated the problem.

In the past we have seen this with RPC query nodes with no rate limits/caching. There are some inefficiencies in the staking module that were documented earlier.

I'm still waiting for further details, but as it stands I cannot link this to any previous issues or reproduce it reliably.

MSalopek avatar Nov 27 '24 13:11 MSalopek

Amazing, let us know; we are around to help if needed. There is gas metering on queries, so the node should cancel a query that gets too large. But it could be a memory leak somewhere. I'm not sure it's in iavl, as that has been running for a while with no issues.
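
For context, the query gas metering mentioned here is configurable in v0.50's app.toml; a minimal sketch, assuming your template exposes the query-gas-limit key (the value shown is an arbitrary example):

```toml
# app.toml - a sketch, assuming your v0.50 template exposes
# query-gas-limit (0 leaves queries unmetered; a positive value makes
# the node abort any single query that consumes more gas than this)
query-gas-limit = 3000000
```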

tac0turtle avatar Nov 27 '24 13:11 tac0turtle

@MSalopek personally I do agree that there's a memory leak somewhere. I have not been able to find it, and I'm really curious exactly which SDK version you upgraded from and which you upgraded to, because that can of course influence memory consumption.

So to be clear:

  • Yes, I think the SDK leaks RAM
  • Yes, I have hunted the leak
  • Sadly, my hunt failed

The version-change angle is a new twist for me, but I'm happy to help investigate.

faddat avatar Nov 27 '24 16:11 faddat

I can confirm that we too are seeing a gradual increase in RAM for one of the projects we are running. We will perform some memory profiling and share it here.

dillu24 avatar Jan 08 '25 13:01 dillu24

Thank you, that will help immensely in seeing where it's happening.

tac0turtle avatar Jan 08 '25 13:01 tac0turtle

Here is some memory usage data; I have been running pyroscope on a validator for the past week:

[Image: pyroscope memory profile from a validator over one week]

dillu24 avatar Feb 11 '25 09:02 dillu24

I believe I have the same issue. I'm not sure how best to reproduce it other than spinning up a local dev node and stressing it for 60 minutes with many transactions that create new resources on chain. The memory consumed by the node increases slowly and won't go back down unless you manually restart it.
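
As a rough illustration of that stress pattern, a hypothetical shell loop; the binary name, module, and key below are placeholders, not a real chain's CLI:

```sh
# hypothetical stress loop - simd, somemodule, and mykey are
# placeholders for whatever your chain actually exposes
for i in $(seq 1 10000); do
  simd tx somemodule create-resource "resource-$i" \
    --from mykey --chain-id localnet -y
done
```

Watching the node's RSS while something like this runs should show the ratcheting behavior described above.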

I believe this is due to the IAVL cache accounting for Node objects by count instead of by total size in bytes. Over time, the size of an IAVL "Node" can grow, and a grown node will sit in the IAVL cache forever (until restart).

https://github.com/cosmos/iavl/blob/master/cache/cache.go#L63-L80
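
To make the suspected failure mode concrete, here is a minimal, self-contained sketch of a count-bounded LRU cache. This is not the linked iavl code, just a simplified stand-in showing why per-entry growth is invisible to a count-based limit:

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key string
	val []byte
}

// countCache evicts by entry count, like the behavior described
// above: it never looks at how many bytes each entry holds.
type countCache struct {
	max   int
	order *list.List               // LRU order, front = most recent
	items map[string]*list.Element // key -> element holding *entry
}

func newCountCache(max int) *countCache {
	return &countCache{max: max, order: list.New(), items: map[string]*list.Element{}}
}

func (c *countCache) Add(key string, val []byte) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val // the entry can silently grow in place
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, val: val})
	if c.order.Len() > c.max { // the bound is on COUNT, not bytes
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Bytes reports total resident value bytes across all cached entries.
func (c *countCache) Bytes() int {
	total := 0
	for _, el := range c.items {
		total += len(el.Value.(*entry).val)
	}
	return total
}

func main() {
	c := newCountCache(1000)
	// Same 1000 keys each round, but every round the values are 10x
	// larger: the entry count never exceeds the limit while resident
	// bytes climb tenfold each time.
	for size := 100; size <= 10000; size *= 10 {
		for i := 0; i < 1000; i++ {
			c.Add(fmt.Sprintf("node-%d", i), make([]byte, size))
		}
		fmt.Printf("entries=%d resident=%d bytes\n", c.order.Len(), c.Bytes())
	}
}
```

A byte-bounded cache would start evicting as entries grow; a count-bounded one lets resident memory climb without ever hitting its limit.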

This explains why:

  • Hard to reproduce, as you need a way to stress the node in such a way that the size of IAVL Node increases.
  • Lowering the iavl-cache-size helps but doesn't solve the issue.
  • Memory doesn't get freed.
  • It also matches the flame graph provided by @dillu24

I also notice there is a parallel cache for "fast nodes" that has a fixed 100k max size. This may contribute to the leak as well.

UPDATE: Never mind, I don't think this is it. I did some more tests, and it seems the cache structures barely take up a few MB of space; the IAVL Nodes are not growing in size. I did notice that the left/right pointers can reference nodes not included in the cache, which can accumulate over time.
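
That last observation is worth a sketch of its own: a simplified illustration (again, not the real iavl types) of how a single small cached entry can pin an arbitrarily large uncached subtree through its child pointers:

```go
package main

import "fmt"

// node mimics a tree node with child pointers, as in the comment
// above; the struct itself is tiny, but anything reachable from
// leftNode/rightNode stays alive for as long as the node is cached.
type node struct {
	key                 []byte
	leftNode, rightNode *node
}

// deepChain builds n linked nodes, each holding a 1 KiB payload,
// standing in for an uncached subtree hanging off a cached node.
func deepChain(n int) *node {
	var head *node
	for i := 0; i < n; i++ {
		head = &node{key: make([]byte, 1024), leftNode: head}
	}
	return head
}

func main() {
	cache := map[string]*node{}
	// The cache holds ONE entry, yet that entry pins ~10 MiB of
	// descendants that no count-based (or even per-entry byte-based)
	// accounting would ever see.
	cache["root"] = &node{key: []byte("root"), leftNode: deepChain(10_000)}
	fmt.Println("cached entries:", len(cache))
}
```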

conorpp avatar May 31 '25 14:05 conorpp

Hi all, we're also observing memory problems with our nodes on the SEDA testnet. After some digging we're also suspecting that it has something to do with the IAVL store.

One of our validator nodes was set up to create snapshots every 20 blocks (a misconfiguration; we don't intend to use that frequency) and it was running into OOM errors faster than our other nodes. We have the following graphs of memory as a percentage of the whole node (i.e. OS level) and of the Go memory.

The first section, until 16:00, is with the 20-block frequency; after that we changed the interval to 200000, and since then memory has been much more stable (though still ever so slightly increasing).

[Image: val1-20.png - node memory % (OS level), 20-block vs 200000-block snapshot interval]

[Image: val1-20-gomem.png - Go memory over the same period]
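
For reference, the snapshot cadence described above is configured under [state-sync] in app.toml; a minimal sketch with the corrected interval:

```toml
# app.toml - state-sync snapshot settings discussed above
[state-sync]
# blocks between snapshots (20 was the misconfiguration, 200000 the
# corrected value; 0 disables snapshotting entirely)
snapshot-interval = 200000
# how many recent snapshots to keep on disk
snapshot-keep-recent = 2
```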

Comparing this to another validator node that is set up with an interval of 2000 blocks (roughly 4 hours with our block time), the spikes in memory usage align with the snapshot interval.

[Image: val2-2000.png - memory usage of the validator with a 2000-block snapshot interval]

Other nodes that were not taking snapshots also displayed spikes in memory that aligned with our load tests.

[Image: rpc2.png - memory spikes on non-snapshotting nodes during load tests]

We think this points to something leaking in IAVL. Looking at a heap profile taken from the validator with the 2000-block interval, we see that the IAVL exporter retains a significant amount of memory. But I'll be the first to admit I have very little experience with pprof and reading the traces, so it may be nothing. I attached the profile with a .txt extension so GH doesn't complain; hopefully this can help someone dig into it.

val_2_heap.txt
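
For anyone wanting to capture a comparable profile, a short sketch, assuming the node's pprof listener is enabled (pprof_laddr under [rpc] in CometBFT's config.toml, e.g. set to localhost:6060):

```sh
# pull a heap profile from the running node
curl -s http://localhost:6060/debug/pprof/heap -o heap.pb.gz

# summarize the biggest live allocations (in-use space by default)
go tool pprof -top heap.pb.gz

# or explore interactively with flame graphs in a browser
go tool pprof -http=:8080 heap.pb.gz
```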

Thomasvdam avatar Jun 17 '25 09:06 Thomasvdam

Hey, this is good data!

Keep it up!

I have known for a long time that the SDK leaks RAM... Yet another reason to move the iavl library into the SDK.

faddat avatar Jun 17 '25 20:06 faddat

Another pprof profile, from a validator node that was not taking any snapshots: profile_heap.txt

Compared to prior profiles, more and more memory is being retained by the consensus state (apologies for the differently sized images):

[Image: earlier heap profile]
[Image: newer heap profile showing increased consensus-state retention]

Thomasvdam avatar Jun 19 '25 15:06 Thomasvdam

Thanks so much for raising this and sharing this info. As part of our plans to fix the IAVL implementation, we'll be looking to bring the library into the SDK and productionize the existing IAVL v2 work.

aljo242 avatar Jun 23 '25 17:06 aljo242

Closing for now as this is stale.

aljo242 avatar Jul 28 '25 19:07 aljo242

How is this stale? Most of the ecosystem is still on v50.

tac0turtle avatar Jul 28 '25 20:07 tac0turtle

Hi @aljo242 , what's the latest on this memory leak? I was reminded of this issue after once again having to restart our nodes.

Thomasvdam avatar Sep 12 '25 15:09 Thomasvdam

> How is this stale? Most of the ecosystem is still on v50.

Still on v50 here, still seeing OOM issues.

SpicyLemon avatar Nov 10 '25 16:11 SpicyLemon