
Eliminate lock in ConcurrencyLimitingRequestThrottler

Open jasonk000 opened this issue 9 months ago • 5 comments

Following up from #1957, this moves to a completely lock-free implementation.
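To make the shape of the change easier to review, here is a minimal sketch of a lock-free admission path, assuming an atomic counter updated in a CAS loop (names and structure are illustrative only, not the actual code in this PR):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: a CAS loop on an AtomicInteger replaces the
// lock that previously guarded the concurrent-request count.
final class LockFreePermits {
  private final int maxConcurrentRequests;
  private final AtomicInteger concurrentRequests = new AtomicInteger();

  LockFreePermits(int maxConcurrentRequests) {
    this.maxConcurrentRequests = maxConcurrentRequests;
  }

  /** Tries to claim a slot without blocking; returns false if at the limit. */
  boolean tryAcquire() {
    while (true) {
      int current = concurrentRequests.get();
      if (current >= maxConcurrentRequests) {
        return false; // caller enqueues or rejects the request
      }
      if (concurrentRequests.compareAndSet(current, current + 1)) {
        return true; // slot claimed
      }
      // Lost a race with another thread; re-read the counter and retry.
    }
  }

  /** Releases a slot when a request completes or fails to start. */
  void release() {
    concurrentRequests.decrementAndGet();
  }
}
```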

(cc @tolbertam who reviewed the previous PR, and @akhaku and @clohfink who have reviewed internally).

Commits are broken into smaller steps in case it's helpful for review. I will squash/rebase once it looks OK.

jasonk000 avatar Mar 03 '25 22:03 jasonk000

reviewed internally, LGTM

akhaku avatar Mar 03 '25 22:03 akhaku

nice @jasonk000! Looking forward to reviewing this, will try to give it a look this week.

tolbertam avatar Mar 04 '25 03:03 tolbertam

Thanks, @tolbertam, I've squashed it now. Ready to go!

jasonk000 avatar Mar 17 '25 16:03 jasonk000

Good catch; I've updated the related comment and javadocs.

Separately, @adutra, we do not use the RateLimitingRequestThrottler internally. If there are other users, I would be happy to put some code together to make the same change, but I would need some support to test and verify it. Let me know if that would be useful.

jasonk000 avatar Mar 17 '25 18:03 jasonk000

If there are other users, I would be happy to put some code together to make the same change, but I would need some support to test and verify it. Let me know if that would be useful.

I'd say, let's wait until there is demand for that, but thanks for the initiative!

adutra avatar Mar 17 '25 18:03 adutra

@adutra, @tolbertam, I've updated the docs and squashed to a single commit. Do let me know if there's anything left to do!

jasonk000 avatar Apr 09 '25 17:04 jasonk000

Hi @tolbertam, putting this on the radar as well - hoping it can make it for 4.19.1. Thanks!

jasonk000 avatar Jul 31 '25 18:07 jasonk000

Thanks for bringing this to my attention, @jasonk000. I'll amend the commit to add the "Patch; Reviewed by" section and merge this shortly!

tolbertam avatar Jul 31 '25 19:07 tolbertam

Done, thank you again!

tolbertam avatar Jul 31 '25 19:07 tolbertam

@tolbertam I want to flag that we found an issue with this limiter implementation, since you are considering a release soon. It has taken a few months to show up, so it is quite rare, but when it happens it jams up the throttler very quickly.

The scenario is the logger throwing unanticipated exceptions (when we are nearing OOM) at a point that corrupts the counter state. We can fix this by tracking and unwinding the counter at the right moment and ensuring correct operation ordering before calling the logger.
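In rough terms, the fix makes sure the counter bookkeeping happens before (or is otherwise protected from) the logging call. A hypothetical sketch of that ordering, using illustrative names rather than the actual patch:

```java
import java.util.concurrent.atomic.AtomicInteger;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical illustration of the ordering fix, not the actual patch:
// update the counter before calling the logger, so an exception thrown by
// logging (e.g. near OOM) cannot strand a slot.
final class SafeRelease {
  private static final Logger LOG = LoggerFactory.getLogger(SafeRelease.class);
  private final AtomicInteger concurrentRequests = new AtomicInteger();

  void onRequestDone() {
    // 1. Unwind the counter first; this must never be skipped.
    concurrentRequests.decrementAndGet();
    // 2. Only then log. If the logger throws, the accounting is already
    //    consistent, so no slot is leaked.
    LOG.trace("Request done, slot released");
  }

  // Where the counter update has to come after other work, a try/finally
  // gives the same guarantee:
  void onRequestStartFailed(Throwable error) {
    try {
      LOG.warn("Request failed to start", error);
    } finally {
      concurrentRequests.decrementAndGet();
    }
  }
}
```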

Secondly, we found an issue that hampered the investigation: since we are pushing more load now, the recursion in the request-done-and-submit-next flow leads to very deep stacks, so we will switch that to a loop to keep the stack manageable.
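The loop version would look roughly like this (again an illustrative sketch under assumed names, not the draft's actual code): each completion decrements the counter and then drains the queue iteratively instead of recursing through the next request's completion path.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the recursion-to-loop change: a single drain loop
// submits queued requests while slots are free, keeping stack depth flat.
final class DrainLoopSketch {
  private final int maxConcurrentRequests = 1024;
  private final AtomicInteger concurrentRequests = new AtomicInteger();
  private final Queue<Runnable> queued = new ConcurrentLinkedQueue<>();

  /** Called when a request finishes and its slot is released. */
  void onRequestDone() {
    concurrentRequests.decrementAndGet();
    drainQueue();
  }

  private void drainQueue() {
    // Iterate instead of recursing through completion callbacks.
    while (true) {
      int current = concurrentRequests.get();
      if (current >= maxConcurrentRequests) {
        return; // no slot free; the next completion re-enters this loop
      }
      if (!concurrentRequests.compareAndSet(current, current + 1)) {
        continue; // lost the race for the slot, re-check
      }
      Runnable next = queued.poll();
      if (next == null) {
        concurrentRequests.decrementAndGet(); // nothing waiting, give the slot back
        return;
      }
      next.run(); // runs synchronously in this sketch
    }
  }
}
```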

A draft is available, but we have not yet tested it against a real workload: https://github.com/apache/cassandra-java-driver/compare/4.x...jasonk000:cassandra-java-driver:jkoch/throttler-depth?expand=1

jasonk000 avatar Sep 09 '25 16:09 jasonk000

thanks for the heads up @jasonk000 (FYI @absurdfarce)

The scenario is the logger throwing unanticipated exceptions (when we are nearing OOM) at a time which messes up the counter state. We can fix this by tracking and unwinding at the right moment and ensuring correct operation ordering before calling the logger.

Oh, that sounds fun. OOMs causing runtime exceptions to be thrown in unexpected places can definitely screw things up in unpredictable ways. I suppose the logging this PR added might explain why we hadn't seen this before? Although I do see there was logging previously in some places.

Secondly, we found an issue that hampered the investigation -- since we are pushing more load now, the recursion in the request-done-and-submit-next flow is leading to very deep stacks, so we will be switching that to a loop to keep the stack manageable.

Is that a new issue, or something that already existed prior to this PR? It has been a while since I looked at this PR so I would need to re-establish context, but I don't recall it changing how things are called, so I'm guessing that would have been a pre-existing problem? Or is it something subtle in the queuing changes that contributes?

tolbertam avatar Sep 10 '25 13:09 tolbertam

I suppose the logging this PR added might explain why we hadn't seen this before? Although I do see there was logging previously in some places.

Yeah, good question. The existing released code in 4.19.0 DOES contain a similar issue; however, it is in a different piece of code that is less likely to be triggered.

In the old code we have one spot that can trigger it.

If the logger throws here, then the concurrentRequests counter is never decremented. If this happens enough times, you will eventually consume all the slots.
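Roughly speaking, the problematic shape is something like this (an illustrative reconstruction, not the driver's exact code):

```java
import java.util.concurrent.atomic.AtomicInteger;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative reconstruction of the failure mode: the decrement sits after
// the log call, so if the logger throws, the slot is never returned.
final class LeakyRelease {
  private static final Logger LOG = LoggerFactory.getLogger(LeakyRelease.class);
  private final AtomicInteger concurrentRequests = new AtomicInteger();

  void onRequestDone() {
    LOG.trace("Request done");            // may throw under memory pressure
    concurrentRequests.decrementAndGet(); // never reached if it does
  }
}
```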

In the new code we have one, two, three. You can see they are more directly in the line of fire and much easier to trigger. "Easier" is relative, of course, as we've only seen it twice in seven months across all of our clusters!

You can see in this commit that fixes it that we only need to enforce the ordering in a few specific areas.


... recursion ... Is that a new issue, or something that already existed prior to this PR?

You are correct; this is a long-standing issue with a new potential fix, and it can probably be considered separately. Maybe I should put together a separate PR just for the smaller logger-handling changes and deal with the loop/recursion one separately?

jasonk000 avatar Sep 10 '25 17:09 jasonk000