
Thread pool starvation on reaching the grain directory cache size limit

Open krasin-ga opened this issue 2 years ago • 6 comments

It seems that the cache size increases constantly, and when the limit is reached, all the threads of the thread pool are busy trying to clear it.

.NET Thread Pool is exhibiting delays of 148.667305s. This can indicate .NET Thread Pool starvation, very long .NET GC pauses, or other runtime or machine pauses.
image

Hot path: image

Orleans version: 7.2.1

I tried updating to 7.2.3, but looking at the graph it seems to be affected as well:

image

krasin-ga avatar Nov 22 '23 15:11 krasin-ga

We're experiencing this issue as well. It seems to be related to the LRU implementation. The count property managed by the LRU class seems to be leaking. Once the LRU has reached its max size (or rather, when it thinks it has reached its max size), inserts become an O(n) operation instead of an O(1) operation, which in turn causes significant CPU spikes (we went up from ~2 CPUs to maxing out the node at 32 CPUs).

image

koenbeuk avatar Nov 24 '23 17:11 koenbeuk

The LRU counting issue is tracked by #8741. Meanwhile, this would still be an issue when legitimately reaching the grain cache limit.

koenbeuk avatar Nov 24 '23 18:11 koenbeuk

Is there an ETA for the release of this fix? @ReubenBond

mrblonde91 avatar Nov 30 '23 16:11 mrblonde91

The counting fix has been released, @mrblonde91. We intend to replace the LRU implementation entirely soon (likely in the next minor release), which should also alleviate pressure.

ReubenBond avatar Dec 07 '23 19:12 ReubenBond

We started seeing this issue in our prod environments. One silo would spike to 100% CPU and stay there until restarted.

We're on version 7.2.3. Here is a screenshot of the trace output from one of our dotnet monitor sessions:

image

It looks like we never exit the while (Count > MaximumSize) loop in the AdjustSize method. Maybe because the:

if (entryGeneration <= targetGeneration)
{
    if (RemoveKey(e.Key)) RaiseFlushEvent?.Invoke();
}

portion is never called. I have no way of reproducing this behavior though.

fjoly-hilo-energie avatar Apr 05 '24 19:04 fjoly-hilo-energie

Here is a proof-of-concept cache built on the ConcurrentLru from BitFaster.Caching: https://gist.github.com/ReubenBond/c438867e9660407c0b71f5af2272aaf5

If you are able to try this with Orleans v8.1.0-preview3 or v7.2.6 (both include a required fix) and provide feedback, that would be most appreciated. This API proposal is based on the same library and we are trying to push that forward so that we can use it as the default in future.

An alternative workaround is to increase GrainDirectoryOptions.CacheSize to a larger value (e.g., 10 million instead of the default of 1 million).
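
For reference, a minimal sketch of how that workaround might be applied at silo startup. It assumes a generic-host setup on Orleans 7.x; the localhost clustering call is only a placeholder for whatever clustering provider is actually in use, and 10 million is just the example value mentioned above, not a tuned recommendation.

```csharp
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(siloBuilder =>
    {
        // Placeholder for local testing only (assumption);
        // replace with your real clustering configuration.
        siloBuilder.UseLocalhostClustering();

        // Raise the grain directory cache limit so the eviction
        // path is reached far less often.
        siloBuilder.Configure<GrainDirectoryOptions>(options =>
        {
            options.CacheSize = 10_000_000;
        });
    })
    .Build();

await host.RunAsync();
```

A larger cache limit means more memory may be used for directory entries, so this postpones the eviction problem rather than removing it.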

ReubenBond avatar Apr 05 '24 22:04 ReubenBond