ConnectionMultiplexer holding references to 4+ GB memory
We have been seeing a few instances of our machines running low on memory, so we were doing a bunch of investigation. In one of those investigations, something peculiar showed up which I could not make much sense of.
Here is what dotMemory shows for a specific memory dump.
We are aware of the string problem visible here and are working on fixing it (it is unrelated to caching). The concerning part is the 4.04 GB retained by ConnectionMultiplexer:
[image: image] https://user-images.githubusercontent.com/13517857/202769497-fee8a4e5-57c8-41cf-98de-3761d576096a.png

I tried looking deeper into it. It looks like the Redis SDK is holding onto a long Queue of messages:
[image: image] https://user-images.githubusercontent.com/13517857/202768673-d97f9df4-e0ea-4e9e-930e-d67f7c68e2d5.png

I wanted to understand whether this is expected or whether we are seeing/hitting a bug somewhere, either in the way we are set up or in the way the SDK behaves. Also, how should this retained-memory snapshot be interpreted in general?
We are using the .NET Core 3.1.x Redis caching extension library, which uses StackExchange.Redis 2.0.593.
I'm very surprised by a queue of that length. Does the tooling indicate which field this is rooted by? I'm presuming that this is the "sent, pending result" queue, but that really shouldn't be 2M entries long! We could perhaps put a max size on this backlog queue. Any idea what was happening more holistically here? High throughput? A disconnect? Any clues?
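Later releases of the library did add exactly this kind of control: newer versions of StackExchange.Redis (2.5+) expose a `BacklogPolicy` on `ConfigurationOptions` that governs whether commands are queued while the connection is down. A minimal sketch, assuming an upgraded client and an illustrative local endpoint:

```csharp
// Sketch: assumes StackExchange.Redis 2.5+ (the version discussed in this
// thread, 2.0.593, predates this option) and an illustrative local server.
using StackExchange.Redis;

var options = ConfigurationOptions.Parse("localhost:6379");

// FailFast: fail commands immediately while disconnected rather than
// queueing them (and retaining their payloads) in an unbounded backlog.
options.BacklogPolicy = BacklogPolicy.FailFast;

var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```

With `BacklogPolicy.FailFast`, commands fail immediately during a disconnect instead of accumulating, along with the payloads they reference, in an in-memory queue.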
I am not sure, honestly, and am kind of stumped. I do have access to the individual values too (since this is coming from a memory dump); is there anything in particular I could look for? I had a random hunch that maybe the value of the fetched key was being pushed to the LOH, but that isn't the case either (and confirmed in code: it should never go to the LOH).
Is there any chance you have a timeout error when this happens? The text in that might help us advise quickly.
I looked a bit deeper and don't really see any timeouts. Assuming that I need to look in a window of days, I see a few failures.
The general trend is "No connection is available to service this operation".
Sample error
Exception 'No connection is available to service this operation: HMGET <RedactedKey>; IOCP: (Busy=1,Free=999,Min=400,Max=1000), WORKER: (Busy=183,Free=32584,Min=400,Max=32767), Local-CPU: n/a'. StackTrace 'StackExchange.Redis.RedisConnectionException: No connection is available to service this operation: HMGET <RedactedKey>; IOCP: (Busy=1,Free=999,Min=400,Max=1000), WORKER: (Busy=183,Free=32584,Min=400,Max=32767), Local-CPU: n/a'.
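The multiplexer also raises events when connections drop, which can capture this failure text at the moment it happens rather than after the fact. A minimal logging sketch (the endpoint and log destination are illustrative):

```csharp
using System;
using StackExchange.Redis;

var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6379");

// Surface connection failures (and their reasons) in the application log.
muxer.ConnectionFailed += (_, e) =>
    Console.WriteLine($"Connection failed to {e.EndPoint}: {e.FailureType} ({e.Exception?.Message})");
muxer.ConnectionRestored += (_, e) =>
    Console.WriteLine($"Connection restored to {e.EndPoint}");
muxer.ErrorMessage += (_, e) =>
    Console.WriteLine($"Server error from {e.EndPoint}: {e.Message}");
```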
@RamjotSingh That 183 busy worker threads indicates thread pool overload, which would result in a ton of backup happening. Are you doing synchronous calls overall? (I'm about to add a new stat for this.) In any case, I can tell from the info here that this is a very old version of the library. You'll want to upgrade, because a lot of changes have been made and we can't really solve anything for an old version; the debug info available will be far greater as well.
We run at a rather high requests-per-second rate, so 183 busy worker threads isn't that high a value for us (our minimum limit is set at 400).
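For reference, the IOCP and WORKER numbers in these errors come straight from the .NET thread pool, and "Busy" is simply Max minus Free. A sketch of how they map (the 400 here mirrors the minimum mentioned above, not a recommendation):

```csharp
using System;
using System.Threading;

// Raise the floor so the pool doesn't throttle thread creation under bursts.
ThreadPool.SetMinThreads(workerThreads: 400, completionPortThreads: 400);

ThreadPool.GetMinThreads(out var minWorker, out var minIocp);
ThreadPool.GetMaxThreads(out var maxWorker, out var maxIocp);
ThreadPool.GetAvailableThreads(out var freeWorker, out var freeIocp);

// Reproduces the WORKER stat from the exception text: Busy = Max - Free.
Console.WriteLine($"WORKER: (Busy={maxWorker - freeWorker},Free={freeWorker}," +
                  $"Min={minWorker},Max={maxWorker})");
```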
Regarding the SDK: yes, as mentioned, we don't pull StackExchange.Redis directly but get it through .NET Core 3.1.x.
That being said, if we were to upgrade from 2.0.593 to the latest version, should we anticipate or be on the lookout for any breaking changes?
@RamjotSingh can you quantify "rather high" here? What numbers are we talking on the machine, and how many cores?
To answer your other question: there are no breaking changes known - we try to follow SemVer here.
These are typically 16-core machines handling about 400-500 requests in parallel each, but each machine is running a lot of other services (and doing a lot of processing too). We can approximate that a machine running this particular process handles about 100 requests per second.
I am upgrading us to the latest SDK version.
@RamjotSingh Your thread count is extremely high, likely due to either too much load on the CPU in general or synchronous calls that are stalled waiting on a lock or an I/O operation to complete. I'd recommend using your memory dump to see where all those threads are in their stacks. This looks like a pile-up from every standpoint.
I looked at the memory dump and only 13 threads were active (our minimum is set to 400, so .NET won't clean them up). I did not see any other stalling, etc. We are upgrading the package to the latest version because we saw this same thing happen in a bunch of memory dumps (not always, but multiple cases where GBs worth of references were kept).
This might be related to #2070. Have you looked at a memory dump to see if you have a large number of RawResult[] objects taking up live and dead memory?