
Under low-memory conditions, StackExchange.Redis can return incorrect results to simple StringGetAsync calls

Open burns47 opened this issue 7 months ago • 14 comments

This is another report of #2804

Environment details:

  • We're seeing this on StackExchange.Redis 2.8.16, but I've reproduced it on 2.8.37.
  • We're running dockerized .NET 8 on Linux, FROM mcr.microsoft.com/dotnet/aspnet:8.0
  • I'm running against an AWS ElastiCache Redis cluster with these options: clustercfg.mycluster.amazonaws.com:6379,password=REDACTED,ssl=True,abortConnect=False

I'm hoping I can help identify this issue because I've attached a reproduction to this issue. It's not a perfect repro - you may have to run it several times (15+ for at least 30 seconds each) before seeing the bug. It may also OOM and you may have to restart it.

The TL;DR of the repro (a sketch follows this list) is:

  • I'm using 1000 async Tasks to call StringGetAsync and StringSetAsync with a GUID key and the same GUID as the value
  • I'm using a separate thread to consume as much process memory as I can by stuffing large int[]s into a static list.
  • There's one additional thread that's attempting to read and write large values (100KB) to and from redis
  • The process memory is limited to 2GB. That's configured in launchSettings.json, so you may need to make sure you're using the correct launch profile
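
A simplified sketch of that structure (this is not the attached project itself; the allocation sizes, key names, and loop shape here are illustrative):

```csharp
// Simplified sketch of the repro structure (illustrative only; the real project caps
// process memory at 2GB via launchSettings.json and is attached as RedisRepro.zip).
using StackExchange.Redis;

var muxer = await ConnectionMultiplexer.ConnectAsync(
    "clustercfg.mycluster.amazonaws.com:6379,password=REDACTED,ssl=True,abortConnect=False");
var db = muxer.GetDatabase();

// 1000 tasks, each repeatedly writing a GUID key whose value is the same GUID,
// then reading it back and comparing.
var guidTasks = Enumerable.Range(0, 1000).Select(async _ =>
{
    while (true)
    {
        var guid = Guid.NewGuid().ToString();
        await db.StringSetAsync(guid, guid);
        string? value = await db.StringGetAsync(guid);
        if (value != guid)
            Console.WriteLine($"Received {value}, but expected {guid}");
    }
}).ToArray();

// A thread that hoards memory to create pressure (this is the part that can OOM the process).
var hoard = new List<int[]>();
new Thread(() => { while (true) hoard.Add(new int[10_000_000]); }) { IsBackground = true }.Start();

// A thread repeatedly reading and writing a large (~100KB) value.
new Thread(() =>
{
    var big = new string('x', 100_000);
    while (true) { db.StringSet("big-key", big); db.StringGet("big-key"); }
}) { IsBackground = true }.Start();

await Task.WhenAll(guidTasks);
```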

When the bug occurs, most or all calls to StringGetAsync return a value from a different call. In some cases the connection heals itself, but in other cases (including in production) we've seen the connection stay broken even after the memory pressure goes away, and the calls continue to be mixed up. In our production scenario this mostly resulted in deserialization errors (since the data returned did not match the format we were expecting), but in some cases it resulted in data being returned to clients that did not own that data (for mixups that happened to have the correct format).

The program will report exceptions from the GUID-writing tasks and the large-value writer. It will also report when the GUID-writing tasks receive values they do not expect. When this happens, it happens in big bursts:

[Screenshot: console output showing a burst of mismatch reports and exceptions]

Any help with this issue would be appreciated! From my experimentation it seems that HighIntegrity mode helps, but...why? How can memory pressure lead to what seems to be a protocol desync?

RedisRepro.zip

burns47 avatar May 15 '25 19:05 burns47

That's very weird. I seem to have secured some dedicated library time, so I'll see what I can find. Interesting!

mgravell avatar May 15 '25 21:05 mgravell

I'm struggling with repro on this, which might be in part because I have a stupid amount of RAM on this machine; I have repro'd once in x64 and once on x86, but the error is... weird:

Example 1:

Received f3964cd1-37a4-4b0c-83bc-6243f2c22c44, but expected f3964cd1-37a4-4b0c-83bc-6243f2c22c44

Example 2:

Received 56b98a63-ab74-495b-8426-b50abb4e2b98, but expected 56b98a63-ab74-495b-8426-b50abb4e2b98
Received acf15a79-460f-4ef0-8b39-83d1490c23a0, but expected acf15a79-460f-4ef0-8b39-83d1490c23a0
Received ddd02278-307f-45bf-8ca8-4ed1def3b282, but expected ddd02278-307f-45bf-8ca8-4ed1def3b282

(the weird thing being: the left and right sides are equal; this is not an off-by-one desync failure)

To me this suggests some weird timing issue. Regardless: I'm very concerned by whatever is going on here, and I'm continuing to investigate - I just wanted to post an update.


adding Example 3:

Received cea912c6-c865-4313-ba6d-1d15ff6bec18, but expected cea912c6-c865-4313-ba6d-1d15ff6bec18
Received b70a6d35-a047-4664-83a8-fd30054504b9, but expected b70a6d35-a047-4664-83a8-fd30054504b9
Received cd17c451-aecc-4085-921b-e701fd0ba9b6, but expected cd17c451-aecc-4085-921b-e701fd0ba9b6
Received 5d3effaf-2b4d-4d53-8e2a-d9eaa19c3cba, but expected 5d3effaf-2b4d-4d53-8e2a-d9eaa19c3cba

mgravell avatar May 19 '25 14:05 mgravell

Thank you for the quick update! I don't recall seeing that behavior on my machine (Windows, 64-bit, 32GB RAM), and I can't imagine what would cause it!

I almost didn't post this because the repro is very inconsistent on my machine as well, and I tried playing with many different combinations of memory consumption / number of readers & writers / sizes of cache members and was never really able to get it to be more consistent. Let me know if you need any more details!

burns47 avatar May 19 '25 16:05 burns47

I would speculate that this is the cause of the thing that made us add "high integrity mode", so I'm very motivated to keep digging here

mgravell avatar May 19 '25 19:05 mgravell

I am facing the same issue in one of my apps. We are using StackExchange.Redis version 2.6.70. The app is running on Kubernetes with multiple pods. The Docker image is mcr.microsoft.com/dotnet/aspnet:6.0-alpine.

This happened back in February of this year, but at the time we weren't able to pinpoint the issue as coming from the lib returning the wrong values, so we implemented some guards to throw exceptions and register logs and metrics, to try to catch the issue before it causes collateral damage as it did the first time. One of the guards checks whether the values returned from Redis match the key that requested them (roughly like the sketch below).
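
A simplified version of that guard (illustrative only; CachedItem and the JSON shape here are not our real types):

```csharp
// Illustrative guard: the key is stored inside the cached payload, so after a GET
// we can verify that the value Redis handed back belongs to the key we asked for.
using System.Text.Json;
using System.Threading.Tasks;
using StackExchange.Redis;

public sealed record CachedItem(string Key, string Payload);

public static class GuardedCache
{
    public static async Task<CachedItem?> GetGuardedAsync(IDatabase db, string key)
    {
        RedisValue raw = await db.StringGetAsync(key);
        if (raw.IsNull) return null;

        var item = JsonSerializer.Deserialize<CachedItem>(raw.ToString());
        if (item is null || item.Key != key)
        {
            // This is the check that fired during the incident: log/metric here.
            throw new InvalidOperationException(
                $"Redis returned data for key '{item?.Key}', but '{key}' was requested.");
        }
        return item;
    }
}
```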

Well, that guard blew up yesterday.

The first thing we see is a few System.OutOfMemoryException errors; then, two minutes later, we start seeing lots of StackExchange.Redis.RedisConnectionException, in these two variations:

StackExchange.Redis.RedisConnectionException: Unexpected response to UNLINK: SimpleString: string
StackExchange.Redis.RedisConnectionException: Unexpected response to TYPE: Integer: 1

There are also some deserialization errors, which probably happened when a key request returned a JSON string that did not match the target type.

The worst part for us is that the majority of the gets this app does are for keys with the same item type, so a lot of the wrong responses actually fit the target type during deserialization, leading to silent errors (the guards mentioned above prevent these errors now).

At the time of the occurrence we had 6 pods of this app running; only one of them exhibited this behavior.

I am updating the lib to version 2.8.37 and setting the highIntegrity flag to true, as mentioned in issue #2804.
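
For reference, I'm doing something along these lines (the connection-string flag name comes from #2804; I'm assuming the matching ConfigurationOptions property is called HighIntegrity, so verify against the version you're on; the host is a placeholder):

```csharp
using StackExchange.Redis;

var options = ConfigurationOptions.Parse(
    "my-redis:6379,password=REDACTED,ssl=True,abortConnect=False");
options.HighIntegrity = true; // assumed property name for the highIntegrity flag; check your version
var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```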

Question: when this happens again and the highIntegrity flag is enabled, will the connection just throw a RedisConnectionException, or does the lib have a specific exception type for this case? I would like to catch this specific error and emit a warning so that we can take some action.

jamir-araujo avatar Jun 06 '25 20:06 jamir-araujo

I'd need to check the exception.

Re the rest: I think we can conclude that there's definitely a gremlin in here somewhere. I'm working on this as one of my main things currently.

mgravell avatar Jun 06 '25 21:06 mgravell

If there's anything I can do to help (maybe run some kind of debug build and give you some log output when I'm able to repro?) let me know. HighIntegrity is enabled on our environments but at this point I'm emotionally invested : )

burns47 avatar Jun 06 '25 21:06 burns47

Well: we've had difficulty with hard repro locally, but I have a plan; if you are currently able to observe the problem (albeit intermittently), maybe the most useful thing is to try future changes (when available) to see if it goes away!

mgravell avatar Jun 07 '25 20:06 mgravell

Out of curiosity, any update?

petrroll avatar Jun 28 '25 20:06 petrroll

Not at the current time, but: this is a priority and understood as a serious item, and I'm in the process of changing roles (and employer), which should make fixing things like this part of my day job, so: it will be addressed.

mgravell avatar Jun 29 '25 07:06 mgravell

At the risk of just adding more noise to this issue - I think I can confirm that HighIntegrity=true fixes this problem. We had a similar event happen in our prod environment, but this time we got:

stackexchange.redis.redisconnectionexception: ResponseIntegrityFailure on REDACTED.cache.amazonaws.com:6379/Interactive, Idle/ResponseSequenceCheck, last: GET, origin: ResponseSequenceCheck, outstanding: 55, last-read: 0s ago, last-write: 0s ago, unanswered-write: 0s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 7 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.8.16.12844

Our monitoring that detects whether the Redis response matches our expectations did not trip.
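
For the earlier question about catching this: we now surface it with something roughly like the snippet below. I haven't verified the exact ConnectionFailureType enum member name, so this matches on the text instead; db, key, and the logging call are placeholders.

```csharp
// Rough sketch of surfacing the integrity failure for alerting before rethrowing.
// Matching on the exception text avoids depending on the exact enum member name.
using StackExchange.Redis;

static async Task<string?> GetWithIntegrityAlertAsync(IDatabase db, string key)
{
    try
    {
        return await db.StringGetAsync(key);
    }
    catch (RedisConnectionException ex)
        when (ex.FailureType.ToString().Contains("Integrity") ||
              ex.Message.Contains("ResponseIntegrityFailure"))
    {
        Console.Error.WriteLine($"Redis response integrity failure: {ex.Message}");
        throw; // fall back to the normal connection-exception handling (retry/reconnect)
    }
}
```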

burns47 avatar Jul 29 '25 21:07 burns47

Any update? Cc: @mgravell

Is the assumption that the recent perf rewrite will solve it?

petrroll avatar Sep 08 '25 19:09 petrroll

Yes. Exactly. The gnarly bug is somewhere in the guts of arenas or pipelines. The rewrite sidesteps all of that completely. Progress is going well - right now it is rock-solid stable and performant on single connections, and I'm piecing together the routing core. I hope to have an early public build maybe this week.

mgravell avatar Sep 08 '25 19:09 mgravell