aws-dotnet-session-provider
Session locking -> increased conditional write failures -> unrecoverable throttling
Hi there,
We are using AWS .NET SDK Session Provider and occasionally experience a problem that proceeds as follows:
- Some part of the system causes a temporary slow down in response time
- Users appear to hit F5, trigger multiple AJAX requests, or open the site in multiple tabs, all of which cause multiple parallel requests. Because the first slow request is holding a lock on the session, the parallel requests are blocked during AcquireRequestState and poll DynamoDB every 500ms. From code inspection, it appears each request receives a conditional write failure from DynamoDB on its first attempt, and then consumes read capacity on each successive poll.
- Once enough users are in this state, we start to experience throttling, which we believe is happening at the partition/shard level (as we scale up to 300 read / 3,000 write units for a short period during our peak, our understanding is that we may have three or more shards).
- Our ops team manually increases the DynamoDB provisioned throughput at this point, but no amount seems to be enough to recover.
Once we hit stage four (which can happen within 15 minutes of the initial root cause), the only way we have found to recover is to kill all the instances in our ASG and let them start up again. It seems that some of them end up with background threads spinning, polling DynamoDB and consuming as much provisioned capacity as we can throw at them.
I have reviewed the source code in this repo, and it does not appear that anything in here is responsible for this, so it seems like an interaction effect between the default .NET Session Store logic (which controls the poll retry rate etc.) and the DynamoDB Session Provider's use of conditional failures to manage the locks.
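For anyone unfamiliar with the mechanism, here is a minimal sketch of the locking pattern as I understand it from reading the source. This is illustrative only, not the provider's actual code, and the table and attribute names are assumptions:

```csharp
using System;
using System.Collections.Generic;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

class SessionLockSketch
{
    static readonly AmazonDynamoDBClient Client = new AmazonDynamoDBClient();

    // Returns true if the lock was acquired; false if another request holds it.
    static bool TryAcquireSessionLock(string sessionId, string lockId)
    {
        try
        {
            // Conditional write: succeeds only if no other request holds the lock.
            Client.UpdateItem(new UpdateItemRequest
            {
                TableName = "ASP.NET_SessionState",   // assumed table name
                Key = new Dictionary<string, AttributeValue>
                {
                    ["SessionId"] = new AttributeValue { S = sessionId }
                },
                UpdateExpression = "SET LockId = :lockId, LockDate = :now",
                ConditionExpression = "attribute_not_exists(LockId)",
                ExpressionAttributeValues = new Dictionary<string, AttributeValue>
                {
                    [":lockId"] = new AttributeValue { S = lockId },
                    [":now"] = new AttributeValue { S = DateTime.UtcNow.ToString("o") }
                }
            });
            return true;
        }
        catch (ConditionalCheckFailedException)
        {
            // The conditional write failure: another request holds the lock,
            // so ASP.NET sleeps ~500ms and then polls the lock status again.
            return false;
        }
    }
}
```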
So a few questions:
- Has anyone else experienced anything like this?
- Is anyone aware of what could be causing background threads to spin and poll Dynamo attempting to obtain the lock?
- Is there anybody with a more in-depth understanding of .NET session providers, and the AWS DynamoDB session provider in particular, who might shed light on this problem?
This has caused too many incidents for us now, and we are on the verge of abandoning DynamoDB as a session store, but I am really hoping that with a last-ditch effort we can salvage it, as it does seem like a natural fit for the workload.
Thanks in advance, Chris.
Hi Chris,
We've experienced this exact problem three times in the last two weeks. We've raised it with Amazon, as the downtime plus cost to run massive DynamoDB overprovisioning is pretty bad.
Did you get any reply from them, or find any other workarounds? We'll let you know if we find anything.
Cheers,
Robin
Hi Robin,
Unfortunately we never got a reply, but we did resolve it, in a manner of speaking. The cause of the problem ended up being the "multiple parallel requests" referenced in stage 2 of my first comment. The trick was identifying that the bulk of these parallel requests came from just a few spots in our app where the client-side JavaScript submitted 2 or 3 concurrent AJAX requests.
2 or 3 doesn't seem like much, but it adds up. The first request locks the session; the 2nd and 3rd then each carry out the following sequence (in normal operation):
- attempt to obtain the lock, fail, sleep 500ms (DynamoDB Write -> ConditionalUpdateFailure)
- check the lock status, discover it is now unlocked (DynamoDB Read)
- obtain the lock (DynamoDB Write)
And this is during normal operation. Depending on the frequency of the operation, this alone can greatly increase your provisioned throughput requirement.
When the first request is delayed by any other slowdown (database, cache, CPU or whatever the cause), the 2nd and 3rd requests will look more like:
- attempt to obtain the lock, fail, sleep 500ms (DynamoDB Write -> ConditionalUpdateFailure)
- check the lock status, it's still locked, sleep 500ms (DynamoDB Read)
- check the lock status, it's still locked, sleep 500ms (DynamoDB Read)
- repeat * N
- check the lock status, discover it is now unlocked (DynamoDB Read)
- obtain the lock (DynamoDB Write)
This rapidly ramps up your read consumption: the extra reads are roughly NumBlockedRequests * 2 (polls per second) * NumSecondsBlocked.
If your app is such that users might go ahead and initiate a 2nd or 3rd operation on the same page (as ours was), then that spawns the above process all over again, increasing the number of ConditionalUpdateFailures and DynamoDB reads proportionally.
Even worse, once you have a large number of parallel requests competing for the lock, it's possible that between steps 5 and 6 in the above list the lock gets obtained by another request, which is only discovered via another ConditionalUpdateFailure when attempting to obtain the lock in step 6, and the process repeats again.
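Putting that cycle together, each blocked request is effectively running something like the loop below. This is a sketch of the behaviour we inferred, reusing the hypothetical TryAcquireSessionLock from the sketch in my original post, plus an equally hypothetical IsSessionLocked read; neither is the provider's real code:

```csharp
using System.Threading;

// Each acquire attempt is one DynamoDB write (failed ones still consume
// capacity); each iteration of the inner loop is one DynamoDB read.
// IsSessionLocked stands in for a GetItem on the session item that
// inspects the lock attribute.
static void WaitForSessionLock(string sessionId, string lockId)
{
    while (!TryAcquireSessionLock(sessionId, lockId))    // write
    {
        do
        {
            Thread.Sleep(500);                           // ASP.NET's fixed poll interval
        }
        while (IsSessionLocked(sessionId));              // read, every 500ms, while blocked
        // The lock looked free, but another blocked request can win the
        // race, in which case the acquire fails again and the cycle repeats.
    }
}
```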
We had a situation with 2,000 users on the site at one time, and within 10 minutes of a very small slowdown at the Redis cache layer we had over 650,000 DynamoDB operations per minute being attempted (roughly 10,800 per second, consistent with several thousand blocked requests each polling twice a second) - almost all of which failed with a 400 throughput exceeded error.
Ultimately, the problem here is that the .NET Session locking model uses pessimistic concurrency with the assumption that attempting to obtain a lock (or check if the lock is free) is a cheap operation. With DynamoDB it is actually quite costly (literally) due to the provisioned throughput model.
At this point, we decided we had a few options:
- Re-architect the application to not require session storage. Discarded due to the magnitude of the task (a 10-year-old legacy app).
- Implement a custom session provider for DynamoDB that does not lock the session. Discarded because the risk of dirty data in the session was too high for our app.
- Implement a custom session provider that uses optimistic concurrency. Discarded because we couldn't find a model whereby handling a detected concurrency exception could lead to a good user experience.
- Switch to a session store with cheap locking and lock-status-check operations (such as a Redis cache). We seriously evaluated this, but eventually discarded it because our scalability requirements made us nervous about being constrained by a fixed-size session store. Even with the ease of adding new ElastiCache services, it's still a lot more work than scaling DynamoDB.
- Modify our client-side code to not issue concurrent requests, effectively locking the session at a higher level, so that when requests hit the server they are almost never blocked by an existing request.
We ended up choosing the last path. We implemented a basic JavaScript lock object and wrapped the relevant AJAX calls in the lock. Each AJAX call is passed in as a callback to the lock object, and the lock object invokes the callback once it detects the lock is free. If it's not free, it uses setTimeout on a very short interval to trigger another lock check.
The impact was significant. Our app hasn't had a repeat of this issue since implementing this technique, even though it has experienced a few lower-level 'slow down' events since. The site slows down, the lower level recovers, and the site gets back to normal - which is what you want to happen!
In addition, the previously-parallel requests are now much faster. If the request operation itself normally takes 100ms, then as the 2nd or 3rd request issued in parallel it was actually taking 600ms or 1,100ms in user-experienced time (500ms lock-poll sleep * 1 or 2, plus the 100ms operation time). Now all 3 operations complete within 300ms, one after the other.
You may find you have a different root cause, but from our experience I'd start by looking for any parts of your app where it is normal to issue multiple parallel requests to the server and see if you can avoid doing that.
Good luck!
Chris.
Hi Chris,
Thanks very much for the detailed reply! We were looking at options 1-4; your option 5 is something we'll consider, but weirdly, we don't do that much AJAX, so I'm not sure we'll be able to find the cause that way.
For now, we've gone with option 2, as, ironically, our sessions don't change much, nor do they store data we're worried about being dirty.
Thanks again, if you're ever in London I'll buy you a beer.
Robin
Hi Chris and Robin
I just wanted to let you know, this isn't going unheard. I am taking another look at the design of the session provider to see if there is something we can do to lower the impact of the ASP.NET pessimistic locking design.
Thanks @normj - this would make a big difference to us!
Thanks @normj. We are considering writing a provider that gives as much safety as possible without any locking; I'll see if we can open-source it afterwards as an alternative for others. One tricky part is sliding the expiry forward as the session is used, with a minimal number of writes. We'll let you know how it goes.
@robinmessage - no problems, hope it helps!
If you don't do much AJAX, that is a bit more mysterious. For it to be related to session locking, there must be some process by which a second concurrent request is issued while the first request is still processing. Without AJAX requests (and depending on your average response time, user expectations and user flow), I can only assume that users are clicking through to a second page while the first is still processing. Alternatively, it could happen if you have an API that depends on session state and an API client issues multiple concurrent requests on the same session.
There are some other possibilities unrelated to session locking. The New Relic APM tool is extremely helpful at isolating this particular issue and might help rule it out. If you have access to it, look at the application request response time histogram. If you see a peak-frequency response time, and then a second smaller peak 500ms later, then the requests in the second peak are frequently being session-blocked by the requests in the first peak.
The following images illustrate the impact on the frequency histogram and request time (from the Chrome developer tools network tab).
With session locking, due to parallel requests: [response time histogram screenshot]
Without locking, after serializing the requests using a JavaScript session lock: [response time histogram screenshot]
Note that the above symptoms would arise in any .NET application using session state - the only reason it causes a problem with DynamoDB is the impact on consumed throughput and the risk of throttling.
If you don't have New Relic, and the absence of AJAX requests makes the Chrome dev tools network tab analysis harder, look for a marked increase in conditional request failures in the CloudWatch monitoring for the DynamoDB table just prior to the onset of throttling, as each conditional request failure is likely to indicate a request attempting to obtain a locked session.
If you don't spot either of these symptoms, it may not be session locking related, and your non-locking session provider may not end up helping.
Another possible root cause is a more subtle leaky abstraction of the DynamoDB provisioned throughput model: partition-level throttling (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions has more details). If you scale out and in, the scale-out may create extra partitions, but the scale-in will not necessarily remove them. The impact is that your low level of provisioned throughput is spread over a large number of partitions, leaving very little per-partition throughput, so the risk of per-partition throttling goes up (for example, 300 provisioned read units spread over 10 leftover partitions gives each partition only 30 reads per second). If you see this issue under low load, but shortly after a high-load scale-out, this could be implicated. The telltale symptom is evidence of throttling while total consumed throughput is still below the provisioned level.
Unfortunately the only workaround we found for this issue was to rotate the table (drop/create).
Best of luck with it!
@normj - thanks for the response - hopefully your efforts are fruitful! For what it's worth, IMO the use of the conditional update to lock the session is a very elegant usage of that DynamoDB facility - it's just a shame about the implications of the .NET polling algorithm!
If I may offer some thoughts I've had for different areas to explore...
- Alternative distributed lock implementation
  - Leave the session data in DynamoDB, but find an alternative distributed lock. I'm not sure what that could be yet (whilst preserving the scalability and availability offered by DynamoDB), but as the only issue is polling the lock, it seems DynamoDB is not a good resource for holding a distributed lock in this context.
- Implement exponential backoff-retry (see the sketch after this list)
  - This would violate the .NET session provider's implied requirements, but if you could maintain in-process state about how long the current request has been blocked by a session lock, you could fudge some poll operations from the .NET layer with a 'still locked' response without actually hitting DynamoDB. There is an obvious negative impact on response time as the lock duration grows, but it might stem the 'runaway' DynamoDB throughput consumption.
- Reference-count the number of blocked requests
  - I'm not sure what you could do with this information, and the extra operations to increment and decrement the count may offset any gain, but perhaps as the number of blocked requests goes up you could alter your poll strategy, again to protect DynamoDB from runaway throughput consumption.
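To illustrate the backoff-retry idea, here is a hypothetical sketch; none of these names exist in the provider, and this is just one way the in-process fudging could look:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical: once a session has been seen locked, short-circuit some of
// ASP.NET's 500ms polls in-process for a growing interval instead of issuing
// a DynamoDB read for every one.
static class LockPollThrottle
{
    // sessionId -> (time of last real DynamoDB poll, current backoff interval).
    // The initial failed lock acquisition counts as the first real poll.
    static readonly ConcurrentDictionary<string, Tuple<DateTime, TimeSpan>> State =
        new ConcurrentDictionary<string, Tuple<DateTime, TimeSpan>>();

    // True = really hit DynamoDB; false = answer "still locked" to ASP.NET
    // without making a network call.
    public static bool ShouldPollDynamo(string sessionId)
    {
        var now = DateTime.UtcNow;
        var entry = State.GetOrAdd(sessionId,
            _ => Tuple.Create(now, TimeSpan.FromMilliseconds(500)));

        if (now - entry.Item1 < entry.Item2)
            return false;                     // fudged poll: no read consumed

        // Double the backoff (capped at 8s) and record this real poll.
        var nextBackoff = TimeSpan.FromMilliseconds(
            Math.Min(entry.Item2.TotalMilliseconds * 2, 8000));
        State[sessionId] = Tuple.Create(now, nextBackoff);
        return true;
    }

    // Call once the lock is acquired or the session is released.
    public static void Reset(string sessionId)
    {
        Tuple<DateTime, TimeSpan> ignored;
        State.TryRemove(sessionId, out ignored);
    }
}
```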
Sadly, none of the above are really solutions... just some thoughts on possible directions to explore, to get the creative juices flowing :)
Good luck with it!
I have been looking at the code trying to find alternatives. I understand the pessimistic lock framework that ASP.NET has defined can really eat up the capacity of the DynamoDB table, but I'm hesitant to try and force some sort of optimistic locking scheme onto a framework it wasn't meant for. There would be just too many cases of undefined behavior when we attempt to merge competing session states.
An option I have been debating, and would appreciate feedback on, is using a combination of SNS and SQS to communicate the state of the lock for a session. Basically, every running instance of the session provider would keep an in-memory cache of the current locks and poll an SQS queue for lock changes. The queues would be subscribed to a common SNS topic that each process sends a message to whenever the state of a lock changes. This would be an opt-in option.
The pro is that it would move the polling to SQS, which is what that service is great at, and stop eating up capacity on the table. The cons are that it is more difficult to set up (although we should be able to ease that during session initialization) and that your application would require more permissions to run. There would also be additional cost for the topic and queues, but that should be less than the DynamoDB capacity.
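To make the shape of that concrete, a rough sketch of the per-instance consumer might look like this; the queue wiring, message format and class names are all assumptions, not a design:

```csharp
using System.Collections.Concurrent;
using Amazon.SQS;
using Amazon.SQS.Model;

class LockStateConsumer
{
    // sessionId -> locked? Shared, in-memory view of lock state.
    static readonly ConcurrentDictionary<string, bool> LockCache =
        new ConcurrentDictionary<string, bool>();

    static readonly AmazonSQSClient Sqs = new AmazonSQSClient();

    // Runs on a background thread in each web instance.
    static void PollLockChanges(string queueUrl)
    {
        while (true)
        {
            // Long poll: cheap, and consumes no DynamoDB capacity.
            var response = Sqs.ReceiveMessage(new ReceiveMessageRequest
            {
                QueueUrl = queueUrl,
                WaitTimeSeconds = 20,
                MaxNumberOfMessages = 10
            });

            foreach (var message in response.Messages)
            {
                // Assumed payload: "<sessionId>:<locked|unlocked>" (a real
                // implementation would unwrap the SNS JSON envelope first).
                var parts = message.Body.Split(':');
                LockCache[parts[0]] = parts[1] == "locked";

                Sqs.DeleteMessage(queueUrl, message.ReceiptHandle);
            }
        }
    }
}
```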
Hi @normj,
The SNS SQS idea is interesting - my initial reaction to the concept is around two areas:
- Latency on lock status updates
  - In other usage of SNS and SQS I've seen latency ranging from seconds to minutes on messages, which could be an unacceptable delay for lock status notification, especially if a request is waiting on an unlock (it might time out and claim the lock while the status notification is still en route).
- Scalability of the in-memory lock cache
  - Maintaining an in-memory cache of lock status on each host is an interesting idea, but we would need to understand the per-host resource (memory) usage profile under scale.
I agree, optimistic locking doesn't seem doable, as there's no sensible way to recover from a concurrency failure given the location of the session write in the ASP.NET request lifecycle.
I think there might be applications (such as Robin's) that could benefit from a non-locking implementation - MS even includes a non-locking session implementation in their sample documentation - so perhaps an initial offering could be a configuration switch to enable or disable locking for those use cases where locking is not required (defaulting to enabled).
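For example, the opt-out might look something like this in web.config. The DisableLocking attribute is purely hypothetical (it doesn't exist today); the provider type and Table attribute follow the existing setup documentation:

```xml
<sessionState mode="Custom" customProvider="DynamoDBSessionStoreProvider">
  <providers>
    <add name="DynamoDBSessionStoreProvider"
         type="Amazon.SessionProvider.DynamoDBSessionStateStore"
         Table="ASP.NET_SessionState"
         DisableLocking="true" /> <!-- hypothetical switch, not implemented -->
  </providers>
</sessionState>
```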
Adding some notes to the ASP.NET SDK Session Provider for DynamoDB documentation page on the potential impact of lock polling on DynamoDB provisioned capacity utilisation might also help new players.
Cheers, Chris.
Hey all,
We found another factor recently that can be related to this situation: session size. We found a bug in our app that was causing some sessions to be much larger than others (e.g. 154KB or even >300KB in some cases). While the average is still low, we wondered whether these occasional large sessions might cause DynamoDB to exceed its burst capacity, causing throttling on a partition with consequent impact on the other sessions on that partition. We're still investigating that (and will of course fix the bug leading to the rogue session sizes), but it prompted another thought:
- You could add an optional, or dynamically selected, compression option to compress the binary-serialized SessionStateItems and then Base64-encode the compressed stream (a sketch follows below). Even for sessions of only a few KB, this could drop the size enough to have a reasonably significant impact on the consumed throughput.
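A minimal sketch of the idea, assuming the provider exposes the serialized bytes at a convenient point (the names here are illustrative, not provider code):

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class SessionCompression
{
    // Gzip the binary-serialized session and Base64-encode the result
    // before writing the item attribute.
    public static string Compress(byte[] serializedSession)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            {
                gzip.Write(serializedSession, 0, serializedSession.Length);
            }
            return Convert.ToBase64String(buffer.ToArray());
        }
    }

    // Reverse the process when reading the session back.
    public static byte[] Decompress(string stored)
    {
        using (var compressed = new MemoryStream(Convert.FromBase64String(stored)))
        using (var gzip = new GZipStream(compressed, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }
}
```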
Just a thought!
Cheers, Chris.
Hi Chris,
We started gzip-ing our sessions ages ago because it would cut our costs significantly, so we agree this seems like a good thought, and something it'd be nice for the store to do transparently.
We've had no issue since we stopped doing any locking, so that seems like the culprit to us.
Cheers,
Robin
Interesting idea to have the session provider offer an option to turn on compression. I'd want to avoid taking on a third-party dependency for the compression, so I would only add this to the .NET 4.5 version, since that is when the improved compression support was added to .NET.
Thanks guys - an interesting discussion.
Hi @normj ,
Do you have any update on your progress with the redesign of the session provider?
Cheers, Brody
Closing for staleness.
⚠️COMMENT VISIBILITY WARNING⚠️
Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.
@normj any chance this can be reopened? It is still an issue
This is the oldest feature request on SessionProvider. We did a recent triage on this ticket - the effort to address this request is very large and the associated SDK version is old too. We will NOT implement this feature request.
I am sorry, but this makes the whole project unusable. It causes regular loops of constant writes to DynamoDB, which causes downtime. We are looking at moving off this session provider as it isn't viable in production.
As the original raiser of this ticket, I agree with @afinzel - the project is unusable, at least for moderate-volume or production workloads.
I appreciate the difficulty of resolving it - the reality is that DynamoDB is just not a great storage medium for .NET Framework ASP.NET session data, due to the default session-locking and session-state polling algorithm.
My request is that, at the very least, the documentation page be updated with a warning that this provider is only suitable for non-production or low-volume usage, due to high-impact, unsolvable performance issues.
Realistically, few people are starting new projects in .NET Framework - this is mostly going to continue to bite teams who are migrating legacy .NET Framework apps to AWS and aren't yet ready to update to ASP.NET Core. There are a lot of them, they probably all have session state, and they deserve a warning that this session provider is probably not fit for purpose.
Hi @chrissimon-au,
We would recommend using the AWS .NET Distributed Cache Provider moving forward.
Regards, Chaitanya