
Unbounded memory growth when using obtain and replacing an item in the cache

Open sprutton1 opened this issue 1 year ago • 9 comments

We're attempting to use this project as a replacement for our homegrown memory/disk cache built around Moka and Cacache. We're seeing an issue where memory grows unbounded over time, eventually leading to the service going OOM. We've added measures to ensure we always leave a percentage of the host OS memory reserved. As far as I can tell, Foyer always reports that the memory used by the cache is within its limits.

Our current suspicion is around how we are using the obtain method here. Heaptrack implies that there is a memory leak in this function call.

We have complicated types that we cache, which get serialized and gossiped across services. To avoid repeated deserialization costs, when something retrieved from the cache is still serialized, we deserialize it and insert the new value behind the same key before returning. It should be noted that the deserialized value will always be wrapped in an Arc.
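A minimal sketch of this deserialize-on-read pattern, using a plain `HashMap` as a stand-in for the cache (the `Payload` type and `deserialize` function here are hypothetical placeholders, not the project's actual types):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical payload; stands in for the complicated gossiped types.
#[derive(Debug, PartialEq)]
struct Payload(String);

// A cached entry is either still-serialized bytes or an already-deserialized Arc.
#[derive(Clone)]
enum Entry {
    Serialized(Vec<u8>),
    Deserialized(Arc<Payload>),
}

// Toy "deserializer": bytes -> Payload. Real code would use serde or similar.
fn deserialize(bytes: &[u8]) -> Payload {
    Payload(String::from_utf8_lossy(bytes).into_owned())
}

// On retrieval, if the entry is still serialized, deserialize it and
// re-insert the Arc-wrapped value behind the same key before returning.
fn get(cache: &mut HashMap<String, Entry>, key: &str) -> Option<Arc<Payload>> {
    match cache.get(key)?.clone() {
        Entry::Deserialized(v) => Some(v),
        Entry::Serialized(bytes) => {
            let v = Arc::new(deserialize(&bytes));
            cache.insert(key.to_string(), Entry::Deserialized(v.clone()));
            Some(v)
        }
    }
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert("k".to_string(), Entry::Serialized(b"hello".to_vec()));
    let v = get(&mut cache, "k").unwrap();
    assert_eq!(*v, Payload("hello".to_string()));
    // A second get hits the already-deserialized Arc.
    assert!(matches!(cache.get("k"), Some(Entry::Deserialized(_))));
    println!("ok");
}
```

Note that with a real concurrent cache, nothing in this pattern prevents several readers from hitting the `Serialized` arm at once.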

So, my questions are:

  1. Could our re-insertion technique be interfering with how obtain does its deduplication, possibly causing a leak?
  2. Is it appropriate to store Arcs in Foyer, or do you think relying on the pointers you already manage is sufficient?

CC @fnichol

sprutton1 avatar Nov 27 '24 00:11 sprutton1

Hi, @sprutton1 . Thanks for reporting.

I've checked your obtain() method usage. IIUC, if there are multiple concurrent obtain() calls and a serialized value is returned, every caller will deserialize the value and reinsert the deserialized copy into the cache. The memory used by those concurrent deserializations could cause OOM. Besides, each reinsertion triggers a disk cache write, which consumes more memory than expected. (Currently, foyer writes the disk cache on insertion, not on memory cache eviction.)

MrCroxx avatar Nov 27 '24 06:11 MrCroxx

BTW, have you set up the admission picker for the disk cache? It would be helpful to provide the foyer configuration. 🙏

MrCroxx avatar Nov 27 '24 06:11 MrCroxx

Apologies for the delay.

All of our configuration happens in the same file. We do the work here. The defaults are set here. Let me know if this gives you any insight.

To be more clear, it seems like the memory is growing continuously, not necessarily that we're bursting into an OOM situation. Here's an example screenshot showing growth over a few days.

[screenshot: memory usage growing steadily over a few days]

I suppose we could introduce locking around the get calls to block when we get a serialized value so we only do that work a single time.
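One way to sketch that "do the work a single time" idea (an illustration only, not the project's actual code): keep a `OnceLock` alongside each serialized entry, so concurrent readers block on a single deserialization instead of each doing their own.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

// Counts how many times the expensive deserialization actually runs.
static DESER_CALLS: AtomicUsize = AtomicUsize::new(0);

// Toy "deserializer" standing in for the real (expensive) one.
fn deserialize(bytes: &[u8]) -> String {
    DESER_CALLS.fetch_add(1, Ordering::SeqCst);
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let bytes = Arc::new(b"hello".to_vec());
    // One OnceLock per cache entry: get_or_init runs its closure at most
    // once, so concurrent readers share one deserialization instead of
    // each deserializing and re-inserting independently.
    let slot: Arc<OnceLock<Arc<String>>> = Arc::new(OnceLock::new());

    let handles: Vec<_> = (0..8)
        .map(|_| {
            let slot = slot.clone();
            let bytes = bytes.clone();
            thread::spawn(move || {
                let v = slot.get_or_init(|| Arc::new(deserialize(&bytes)));
                assert_eq!(**v, "hello");
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // Despite 8 concurrent readers, deserialize ran exactly once.
    assert_eq!(DESER_CALLS.load(Ordering::SeqCst), 1);
    println!("deserialize calls: {}", DESER_CALLS.load(Ordering::SeqCst));
}
```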

sprutton1 avatar Dec 02 '24 17:12 sprutton1

Hi, @sprutton1 . I found the admission rate limit is set to 1 GiB/s here:

https://github.com/systeminit/si/blob/1963e27a26adeb4f15877dda15458f92a6ab8e1e/lib/si-layer-cache/src/hybrid_cache.rs#L22

May I ask if that value is intentional? It is a little large for disks without PCIe 4.0 or NVMe support.

For debugging, if you are using jemalloc in your project, you can use jeprof to generate a heap flamegraph. Related issues and PRs: #747 #748 (Not much information with the links, sorry about that)

And, is there any way to reproduce it locally? I can help debug. 🙌

MrCroxx avatar Dec 04 '24 03:12 MrCroxx

One more thing, would you like to integrate the foyer metrics in your env? That would help debug.

UPDATES: I sent a PR to upgrade the foyer version, with which you can use the new metrics framework. FYI https://github.com/systeminit/si/pull/5062

MrCroxx avatar Dec 04 '24 03:12 MrCroxx

May I ask if that value is intentional? It is a little large for disks without PCIe 4.0 or NVMe support.

I landed on this number tinkering on my dev machine, which likely has faster disks than the machines we run in production. I didn't put a lot of thought into it, to be honest. I can tune that setting to match the production environment more closely, where I believe we are using AWS EBS gp2 volumes (250 MB/s max) for the cache disks.

What's the risk of tuning this too low? I assume items just won't get written to the disk portion of the cache if the write rate exceeds the admission limit?
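For intuition, rate-limited admission is often implemented as a token bucket: a write is admitted only if the budget allows, and a rejected write is simply skipped (the entry stays memory-only) rather than queued, so writes don't pile up in memory. A minimal sketch of that idea (illustrative only, not foyer's actual implementation):

```rust
use std::time::Instant;

// Token-bucket admission: each admitted write spends tokens; tokens refill
// at the configured throughput (e.g. 240.0 * 1024.0 * 1024.0 for ~240 MB/s).
struct RateLimiter {
    capacity_bytes: f64, // burst budget: one second of throughput
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl RateLimiter {
    fn new(refill_per_sec: f64) -> Self {
        Self {
            capacity_bytes: refill_per_sec,
            tokens: refill_per_sec,
            refill_per_sec,
            last: Instant::now(),
        }
    }

    // Returns true if a write of `len` bytes may proceed now.
    fn admit(&mut self, len: f64) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity_bytes);
        self.last = now;
        if self.tokens >= len {
            self.tokens -= len;
            true
        } else {
            false // rejected write is dropped, not queued
        }
    }
}

fn main() {
    let mut rl = RateLimiter::new(100.0); // tiny budget for the demo
    assert!(rl.admit(60.0)); // fits in the initial budget
    assert!(rl.admit(40.0)); // exhausts it
    assert!(!rl.admit(10.0)); // rejected: skipped, not persisted
    println!("ok");
}
```

Under this scheme, setting the limit too low just means fewer entries get persisted to disk; it shouldn't cause memory growth by itself.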

For debugging, if you are using jemalloc in your project, you can use jeprof to generate a heap flamegraph.

I was using heaptrack to the same end, but maybe we can get more detail using jemalloc/jeprof. I'll try that out today.

And, is there any way to reproduce it locally? I can help debug. 🙌

Thanks for the offer! The project readme has local setup instructions. The most reliable way I found to recreate the issue was to create a consistent amount of traffic on the site over the course of 30-60 minutes and look at flamegraphs to see where we were leaking. Our api tests can be adapted to this purpose.

sprutton1 avatar Dec 04 '24 15:12 sprutton1

where I believe we are using AWS EBS gp2 volumes (250 MB/s max) for the cache disks.

In this case, the admission rate limiter should be set to a value close to but below 250 MB/s, e.g. 240 MB/s.

BTW, may I ask why you are using gp2? In my experience, gp3 is always better than gp2 in both performance and pricing.

MrCroxx avatar Dec 05 '24 06:12 MrCroxx

May I ask why you are using gp2? In my experience, gp3 is always better than gp2 in both performance and pricing.

Just a legacy decision we haven't rectified yet. It's something I can probably fix up here. Once the other PR merges, I'll see what kind of telemetry we can pull out and come back with some numbers.

sprutton1 avatar Dec 05 '24 14:12 sprutton1

I have some nice graphs set up and the rate limit set to ~240 MB/s. We're continuing to investigate on our side, but I'll keep you posted on whether this helps alleviate our issues.

sprutton1 avatar Dec 10 '24 20:12 sprutton1