StackExchange.Redis icon indicating copy to clipboard operation
StackExchange.Redis copied to clipboard

StringIncrement() returning the same value on two different machines

Open kgidewall opened this issue 3 years ago • 9 comments

We are using StackExchange.Redis (v2.0.601) to connect to Azure Redis. We are using StringIncrement() to increment the value of a key while inserting a record into another storage system, and then stop inserting when the Redis value exceeds a certain number. This is running on multiple machines with each service multi-threaded. This is a high load scenario where we are hitting StringIncrement() with hundreds or even thousands of requests per second, spread across multiple machines.

What I am seeing is that we are actually inserting more records than we expect. When I log the values being returned by StringIncrement(), ** I see that we are getting the same value returned on different machines**. Here is the code we are executing:

returnedValue = AzureCache(clusterIndex).StringIncrement(cacheKey, quantity)

The list below is shows a count of the value returned and the count of them. |returned-value-after-increment|count-of-this-value-across-all-machines| So, the value 208 was returned to 4 different threads. The value 3 was returned to 3 different threads.

  | 208 | 4 |     | 207 | 4 |     | 211 | 4 |     | 209 | 4 |     | 206 | 4 |     | 210 | 4 |     | 205 | 4 |     | 213 | 4 |     | 212 | 4 |     | 3 | 3 |     | 4 | 3 |     | 215 | 3 |     | 2 | 3

I thought that StringIncrement() is atomic and could never return the same value to two different threads/machines. Can anyone explain what is going on and how to fix this issue? Am I misunderstanding how this should work?

kgidewall avatar Oct 12 '21 05:10 kgidewall

You're not misunderstanding; the increment is meant to be atomic. However, it is also a server-side primitive (incr/incrby), so all the library does is invoke that and return the result the server nominates. If something weird is happening: it is server side, so not really a question for "here", but some thoughts:

  • is this the only way you mutate this key? Do you ever decrement or set the value to something?
  • some kind of node instability? Failover? Split brain?
  • Something managed-service specific?

mgravell avatar Oct 13 '21 06:10 mgravell

More thoughts:

  • Is the key hardcoded? Can you produce a table of result counts that also includes the key?
  • Is there an expiration set on the key? Or an eviction policy that's causing the keys to be deleted and then recreated by INCR?

philon-msft avatar Oct 13 '21 13:10 philon-msft

You're not misunderstanding; the increment is meant to be atomic. However, it is also a server-side primitive (incr/incrby), so all the library does is invoke that and return the result the server nominates. If something weird is happening: it is server side, so not really a question for "here", but some thoughts:

  • is this the only way you mutate this key? Do you ever decrement or set the value to something? [Kenton] We do decrement in other places of the code, but I verified in the logs that that was not happening during this issue.
  • some kind of node instability? Failover? Split brain?
  • Something managed-service specific? [Kenton] These are both great questions, which I'm not sure how to investigate. Maybe Philo can help me with this part?

kgidewall avatar Oct 13 '21 16:10 kgidewall

Maybe Philo can help me with this part?

Well, it might be on Azure; but I wouldn't want to conclude prematurely that Azure is doing anything wrong. I imagine that it would help @philon-msft or myself or anyone investigate is: is there some minimal but reliable way of reproducing this?

mgravell avatar Oct 13 '21 16:10 mgravell

I was just able to repro it in our INT environment, so that gives me a lot more options for testing. I'll see what I can come up with to package something that is easily repro-able for you.

kgidewall avatar Oct 13 '21 18:10 kgidewall

Awesome. Can I ask: is your INT environment also cloud redis? Or is it on-prem? Just curious as to whether we can rule anything out already.

On Wed, 13 Oct 2021, 19:03 Kenton Gidewall, @.***> wrote:

I was just able to repro it in our INT environment, so that gives me a lot more options for testing. I'll see what I can come up with to package something that is easily repro-able for you.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/StackExchange/StackExchange.Redis/issues/1886#issuecomment-942578314, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAEHMCD5AMYH4XRCWGLIRTUGXCXXANCNFSM5FZVCXEA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

mgravell avatar Oct 13 '21 18:10 mgravell

My INT environment is also cloud redis.

kgidewall avatar Oct 13 '21 23:10 kgidewall

More thoughts:

  • Is the key hardcoded? Can you produce a table of result counts that also includes the key?
  • Is there an expiration set on the key? Or an eviction policy that's causing the keys to be deleted and then recreated by INCR?

I added logging to show the cache key that are using and verified that it is always using the same key. I don't believe that there are any expirations or evictions on these. If there were, we would see the keys start over at 1, right? We aren't seeing that. They are consistently increasing, but duplicate incremented quantities are being returned.

kgidewall avatar Oct 13 '21 23:10 kgidewall

@kgidewall I had a thought here: are you using one of the Active-Active Redis multiple-primary hosts?

NickCraver avatar Aug 21 '22 14:08 NickCraver