
Restrict historical query cache size

letmaik opened this issue 3 years ago · 6 comments

Currently, in-memory cached historical state is evicted either when it expires (time-based) or when its handle is manually freed, whichever happens first. If memory consumption grows to the point where it exceeds the available enclave heap size, the node will crash. This issue is for discussing an implementation plan to prevent that from happening.

One option could be to track the cache size (e.g. sum of KV key/value byte sizes) together with a configurable maximum size. If the maximum is reached, then the oldest cache entries would be evicted. I'm sure the devil is in the detail here.
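
For concreteness, a minimal standalone sketch of that option: an LRU-ordered cache that tracks a running byte total (approximated as key plus value sizes, as suggested above) and evicts from the least-recently-used end once a configured cap is exceeded. All names are illustrative; this is not CCF's actual cache implementation.

```cpp
// Hypothetical sketch, not CCF's actual historical query cache.
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>

class SizeCappedCache
{
  struct Entry
  {
    std::string key;
    std::string value; // stands in for serialised KV state
  };

  std::list<Entry> entries; // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index;
  std::size_t current_bytes = 0;
  const std::size_t max_bytes;

  static std::size_t entry_size(const Entry& e)
  {
    return e.key.size() + e.value.size();
  }

public:
  explicit SizeCappedCache(std::size_t max_bytes_) : max_bytes(max_bytes_) {}

  void insert(const std::string& key, std::string value)
  {
    erase(key); // replace any existing entry for this key
    entries.push_front({key, std::move(value)});
    index[key] = entries.begin();
    current_bytes += entry_size(entries.front());

    // Evict least-recently-used entries until back under the cap
    while (current_bytes > max_bytes && entries.size() > 1)
    {
      const auto& oldest = entries.back();
      current_bytes -= entry_size(oldest);
      index.erase(oldest.key);
      entries.pop_back();
    }
  }

  std::optional<std::string> get(const std::string& key)
  {
    auto it = index.find(key);
    if (it == index.end())
    {
      return std::nullopt;
    }
    // Touch: move to the front so this entry is evicted last
    entries.splice(entries.begin(), entries, it->second);
    return it->second->value;
  }

  void erase(const std::string& key)
  {
    auto it = index.find(key);
    if (it != index.end())
    {
      current_bytes -= entry_size(*it->second);
      entries.erase(it->second);
      index.erase(it);
    }
  }

  std::size_t size_bytes() const
  {
    return current_bytes;
  }
};
```

Note the `entries.size() > 1` guard keeps the most recently inserted entry even if it alone exceeds the cap; a real implementation would need to decide how to handle such oversized entries, which is one of the details the devil lives in.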

letmaik · Nov 22 '21 17:11

I sketched out an extension to our LRU to handle automatic size-based eviction for this kind of use case, but eventually convinced myself it wasn't a useful approach.

There are lots of ways that apps could allow the heap size limit to be exceeded, and I think the app has to be written to avoid that - a size-based auto-eviction is likely just hiding the issue, and removing control of it from the app.

Currently an app writer should say "I'm willing to support 100 parallel historical query requests, and each can be requesting 1000 entries". Those counts are based on the memory budget they're willing to dedicate to historical query storage. When the 101st request comes in, the endpoint can decide what to do - does it prioritise the existing streams, and tell the new request that the server is overloaded? Or have some heuristic to prefer the newer request and evict the stalest? I don't think that's a decision we should make for the app writer.

If we implement this with a size-based eviction, then the 101st request automatically evicts the stalest (or the smallest?) request, naively in a way that is opaque to the app. And in the worst case you have 101 users round-robining their requests and never getting an answer, because of a caching decision that we made.

Obviously that's a worst-case scenario and you can design an eviction API to mitigate it. But I think those mitigations tend towards "give the app full visibility of the caching behaviour, and control over eviction choices". So if we implement this, I think it should be by exposing the current memory used by the historical query cache, per-handle, and implementing a size-based culling policy as an example of what an app could do. But not a static size cap within the cache itself.
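
To illustrate what such an app-level policy could look like, here is a hedged sketch assuming the framework exposed a per-handle size query and a way to drop a handle's cached states. `get_cache_size_bytes` and `drop_cached_states` are hypothetical names standing in for that API, stubbed out below so the example is self-contained.

```cpp
// Hypothetical app-side culling policy; the two framework calls are
// illustrative names, NOT CCF's actual API, and are stubbed out here.
#include <cstddef>
#include <cstdint>
#include <map>

using RequestHandle = std::uint64_t;

// Stubs standing in for framework calls: in a real app these would query and
// mutate the node's historical query cache.
static std::map<RequestHandle, std::size_t> stub_sizes;

std::size_t get_cache_size_bytes(RequestHandle handle)
{
  return stub_sizes[handle];
}

void drop_cached_states(RequestHandle handle)
{
  stub_sizes.erase(handle);
}

// App-chosen policy: when the total cached bytes exceed the app's budget,
// drop the stalest handles first. last_used maps each live handle to a
// monotonic counter that the app bumps whenever the handle is touched.
void cull_to_budget(
  std::map<RequestHandle, std::uint64_t>& last_used, std::size_t budget_bytes)
{
  std::size_t total = 0;
  for (const auto& entry : last_used)
  {
    total += get_cache_size_bytes(entry.first);
  }

  while (total > budget_bytes && !last_used.empty())
  {
    // Find the least-recently-used handle
    auto stalest = last_used.begin();
    for (auto it = last_used.begin(); it != last_used.end(); ++it)
    {
      if (it->second < stalest->second)
      {
        stalest = it;
      }
    }
    total -= get_cache_size_bytes(stalest->first);
    drop_cached_states(stalest->first);
    last_used.erase(stalest);
  }
}

int main()
{
  stub_sizes = {{1, 400}, {2, 300}, {3, 500}};
  std::map<RequestHandle, std::uint64_t> last_used = {{1, 10}, {2, 12}, {3, 11}};

  // Total is 1200 bytes against an 800-byte budget: handle 1 (stalest) is
  // dropped, leaving 800 bytes, which fits.
  cull_to_budget(last_used, 800);
}
```

The stalest-first ordering here is purely the app's choice; an app could equally prefer dropping the largest handle, or rejecting new requests outright, which is exactly the flexibility argued for above.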

eddyashton · Nov 23 '21 09:11

> So if we implement this, I think it should be by exposing the current memory used by the historical query cache, per-handle, and implementing a size-based culling policy as an example of what an app could do. But not a static size cap within the cache itself.

I think this may slow down historical endpoints if the expectation is that on each request you iterate over all handles, get their sizes, add them up, possibly evict some, and then continue.

letmaik · Nov 24 '21 10:11

@letmaik that's true regardless of whether it's the app or the framework doing it though.

achamayou · Nov 24 '21 10:11

To be reviewed in 3.x, if it affects historical queries in JS Generic.

achamayou · Sep 30 '22 16:09

@achamayou Now that we have the JS runtime memory limit exposed, which could prevent the node from crashing, does it still make sense to expose the historical query cache size as well for JS Generic?

MahatiC · Nov 07 '22 09:11

@MahatiC yes, because the cache is allocated and kept on the C++ side, and isn't covered by JS sandbox limits.

achamayou · Nov 07 '22 09:11