Thread pool starvation on reaching the grain directory cache size limit
It seems that the cache size is constantly increasing, and when the limit is reached, all the threads of the thread pool are busy trying to clear it.
.NET Thread Pool is exhibiting delays of 148.667305s. This can indicate .NET Thread Pool starvation, very long .NET GC pauses, or other runtime or machine pauses.
Hot path:
Orleans version: 7.2.1
I tried updating to 7.2.3, but looking at the graph it seems to be affected as well:
We're experiencing this issue as well. It seems to be related to the LRU implementation. The Count property managed by the LRU class seems to be leaking. Once the LRU has reached its max size (or rather, when it thinks it has reached its max size), inserts become an O(n) operation instead of O(1), which in turn causes significant CPU spikes (we went from ~2 CPU to maxing out the node at 32 CPU).
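To illustrate why inserts get expensive once a cache of this kind believes it is full, here is a toy model. It is a sketch under stated assumptions and not the Orleans LRU class: the type ScanEvictingCache and the counter EntriesScanned are hypothetical names, and the only thing it borrows from the discussion is the idea of evicting by scanning entries for the oldest generation.

```csharp
using System;
using System.Collections.Generic;

var cache = new ScanEvictingCache(maximumSize: 1_000);

// Phase 1: fill the cache up to its limit. No eviction scans are needed yet.
for (var key = 0; key < 1_000; key++) cache.Add(key);
Console.WriteLine($"Entries scanned while filling: {cache.EntriesScanned}");

// Phase 2: every further insert of a new key triggers a full scan.
for (var key = 1_000; key < 2_000; key++) cache.Add(key);
Console.WriteLine($"Entries scanned after reaching the limit: {cache.EntriesScanned}");

// Hypothetical toy cache for illustration only; NOT the Orleans implementation.
sealed class ScanEvictingCache
{
    // Maps each key to the generation (insertion counter) at which it was last written.
    private readonly Dictionary<int, long> _generationByKey = new();
    private readonly int _maximumSize;
    private long _nextGeneration;

    public ScanEvictingCache(int maximumSize) => _maximumSize = maximumSize;

    // Total number of entries visited by eviction scans so far.
    public long EntriesScanned { get; private set; }

    public void Add(int key)
    {
        _generationByKey[key] = _nextGeneration++;

        // Below the limit this loop body never runs, so Add is O(1).
        while (_generationByKey.Count > _maximumSize) EvictOldest();
    }

    private void EvictOldest()
    {
        var oldestKey = 0;
        var oldestGeneration = long.MaxValue;

        // Full pass over the dictionary to find the oldest entry: O(n).
        foreach (var entry in _generationByKey)
        {
            EntriesScanned++;
            if (entry.Value < oldestGeneration)
            {
                oldestGeneration = entry.Value;
                oldestKey = entry.Key;
            }
        }

        _generationByKey.Remove(oldestKey);
    }
}
```

In this toy, filling the cache scans nothing, while the next 1,000 inserts each walk all 1,001 entries. If the size counter overstates the real size, that second phase would start earlier than it should.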
The LRU counting issue is tracked by #8741. Meanwhile, this would still be an issue when legitimately reaching the grain cache limit.
Is there an ETA for the release of this fix? @ReubenBond
The counting fix has been released, @mrblonde91. We intend to replace the LRU implementation entirely soon (likely in the next minor), which should also alleviate pressure.
We started seeing this issue in our prod environments. One silo would see a CPU spike (100%) and never go down until restarted.
We're at version 7.2.3. Here is a screenshot of one of our dotnet monitor trace outputs:
It looks like we never exit the while (Count > MaximumSize) loop in the AdjustSize method, maybe because this portion is never reached:
    if (entryGeneration <= targetGeneration)
    {
        if (RemoveKey(e.Key)) RaiseFlushEvent?.Invoke();
    }
I have no way of reproducing this behavior, though.
Here is a proof-of-concept cache built on the ConcurrentLru from BitFaster.Caching: https://gist.github.com/ReubenBond/c438867e9660407c0b71f5af2272aaf5
If you are able to try this with Orleans v8.1.0-preview3 or v7.2.6 (both include a required fix) and provide feedback, that would be most appreciated. This API proposal is based on the same library and we are trying to push that forward so that we can use it as the default in future.
An alternative workaround is to increase GrainDirectoryOptions.CacheSize to a larger value (e.g., 10M instead of the default of 1M).
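The CacheSize workaround could be applied roughly as follows. This is a minimal sketch that assumes a silo hosted with the generic host and UseOrleans. The localhost clustering call is only a placeholder for whatever clustering provider you actually use, and 10_000_000 mirrors the 10M figure above rather than a tuned recommendation.

```csharp
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(siloBuilder =>
    {
        // Assumption: local development clustering; replace with your real provider.
        siloBuilder.UseLocalhostClustering();

        // Raise the grain directory cache limit above the 1M default.
        siloBuilder.Configure<GrainDirectoryOptions>(options =>
        {
            options.CacheSize = 10_000_000;
        });
    })
    .Build();

await host.RunAsync();
```

A larger cache uses more memory per silo, and it only postpones the point at which the limit is reached, so it is best treated as a stopgap until an upgraded version is in place.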