returnPermits exception stalls connection pool
We are seeing an issue with drainLoop in SimpleDequePool that is similar to this issue. We see the following pattern:
- Elevated errors over a period of hours (in our case, we see a high count of StackOverflowErrors). We are still attempting to understand the cause of these errors.
- We observe a "Too many permits returned" error in our logs. In our case, the connection pool is configured with 500 connections, so the error is "Too many permits returned: returned=1, would bring to 501/500".
- Idle and active connections go to 0 (observed via metrics) and pending connections continue to build (we've seen many thousands).
- No more outgoing requests are observed.
We have manually attempted to reproduce this issue by implementing a custom AllocationStrategy that delegates to a real allocation strategy but throws an exception in returnPermits (either randomly, or after a given number of calls). After this exception is thrown, we observe the same behaviour as above. We don't currently understand how we reach a state where the PERMITS count is off. (We are unable to enable DEBUG logging in production due to PII concerns, but we are looking to deploy changes that log the number of permits as well as the state of the connection pool. Any pointers you have would be appreciated.)
One hypothesis is that the exception is not handled and WIP is not decremented. When doAcquire() is called, elements are added to the pending queue as before, but drainLoop can never be entered again because WIP never returns to 0.
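To illustrate the hypothesis, here is a minimal sketch of a WIP-guarded drain loop in the general style used by reactive queues. This is not the actual SimpleDequePool code; all names here are made up.

import java.util.concurrent.ConcurrentLinkedDeque
import java.util.concurrent.atomic.AtomicInteger

// Illustrative only: a WIP-guarded drain loop, not reactor-pool's implementation.
class DrainSketch {
    private val wip = AtomicInteger()
    private val pending = ConcurrentLinkedDeque<() -> Unit>()

    fun offer(task: () -> Unit) {
        pending.offer(task)
        drain()
    }

    private fun drain() {
        // Only the caller that moves wip from 0 to 1 enters the loop; everyone else just enqueues.
        if (wip.getAndIncrement() != 0) return
        var missed = 1
        while (true) {
            // If a task throws here (e.g. a failing returnPermits), wip is never decremented
            // back to 0, so every future drain() call bails out above and queued work
            // (pending acquisitions) is never processed again.
            while (true) {
                val task = pending.poll() ?: break
                task()
            }
            missed = wip.addAndGet(-missed)
            if (missed == 0) break
        }
    }
}

In this shape, a single escaped exception permanently blocks re-entry, which would match what we observe: pending keeps growing while idle and active connections stay at 0.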
Another option is that it's hitting this exception in destroyPoolable.
Expected Behavior
Ideally, the connection pool would continue to function. I don't know if that is realistic given that PERMITS would continue to be wrong.
Actual Behavior
No more connections are made.
Steps to Reproduce
We realize this is a contrived example, but it matches what we end up seeing in production.
import kotlin.random.Random

class ThrowingAllocationStrategy(...) {
    private val allocationStrategyDelegate = ... // your actual allocation strategy here, e.g. Http2AllocationStrategy, SizeBasedAllocationStrategy

    // implement the remaining methods by delegating straight to allocationStrategyDelegate

    override fun returnPermits(returned: Int) {
        randomlyThrow()
        allocationStrategyDelegate.returnPermits(returned)
    }

    private fun randomlyThrow() {
        // fail roughly 1% of the time
        if (Random.nextInt(100) == 0) {
            throw RuntimeException("Boom!")
        }
    }
}
<SNIP>
// init a custom connection provider with the throwing allocation strategy
return ConnectionProvider
    .builder("custom")
    .allocationStrategy(throwingAllocationStrategy!!)
    .build()
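For context, this is roughly how the provider is wired into our client stack. This is a hedged sketch, not part of the original reproduction: the function and variable names are ours, and it only assumes the standard HttpClient.create(ConnectionProvider) and WebClient.Builder.clientConnector(...) APIs.

import org.springframework.http.client.reactive.ReactorClientHttpConnector
import org.springframework.web.reactive.function.client.WebClient
import reactor.netty.http.client.HttpClient
import reactor.netty.resources.ConnectionProvider

// Hypothetical wiring: plug the custom provider into a reactor-netty HttpClient
// and, from there, into a WebFlux WebClient.
fun webClient(provider: ConnectionProvider): WebClient {
    val httpClient = HttpClient.create(provider)
    return WebClient.builder()
        .clientConnector(ReactorClientHttpConnector(httpClient))
        .build()
}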
Possible Solution
Given that the PERMITS count is off, it's unclear to me what the proper resolution would be. We noticed that disposing of the connection provider did resolve the issue, as expected, presumably because the new connection pool is fresh and not in a bad state. I'm not sure what the implications of continuing to use the existing connection pool would be, as I would expect the permits exception to be thrown repeatedly.
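The mitigation we used boils down to something like the following sketch. The replaceProvider helper is hypothetical; how the wedged state is detected (e.g. pending-acquire metrics crossing a threshold) is application-specific and not shown.

import reactor.netty.resources.ConnectionProvider

// Sketch of the mitigation described above: throw away the wedged provider and
// build a fresh one, so new requests go through a clean pool.
fun replaceProvider(
    wedged: ConnectionProvider,
    buildFresh: () -> ConnectionProvider
): ConnectionProvider {
    wedged.dispose() // ConnectionProvider is a Disposable; this shuts down the old pool
    return buildFresh()
}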
Your Environment
- reactor-pool == 0.2.13
- io.projectreactor:reactor-core:3.5.20 -> 3.6.9
- org.springframework:spring-webflux -> 6.1.12
- io.projectreactor.netty:reactor-netty-core:1.1.22
- netty == 4.1.112.Final
- JVM version (java -version): 21, 17
- Also tried the latest main commit for reactor-core and reactor-pool.
@erikbeerepoot Something is off here: Reactor Netty 1.1.x uses Reactor Pool 1.0.x. https://github.com/reactor/reactor-netty/blob/1.1.x/gradle.properties
Also, Reactor Pool 0.2.x has reached OSS EOL: https://github.com/reactor/.github/blob/main/SUPPORT.adoc#support-timeline
@violetagg Interesting, that is unexpected for sure. I will look into why this version is being used.
EDIT: It turns out the wrong reactor-pool version was an artifact of some local debugging steps I took and doesn't reflect the version used in the application.
Taking a glance at the current code, though, it's similar enough that it would have the same issue if this exception were thrown. I imagine it's pretty rare.
To close the loop on this -- the above is definitely a bug in the reactor-pool code. What was happening in our case is that we were adding too many onErrorDropped hooks in our application (a bug in our code). When enough hooks are added, this recursive call results in a StackOverflowError. This SOE appears to have exposed multiple bugs in reactor-pool: (1) the permit count goes off, and (2) WIP never reaches 0 -- hence the connection pool locks up.
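For anyone who hits the same thing, here is a minimal, hedged illustration of that application-side bug (our real code was more convoluted): each Hooks.onErrorDropped registration composes with the previously registered hook, so registering it repeatedly builds a deeply nested consumer chain that can overflow the stack when a dropped error is finally delivered.

import reactor.core.publisher.Hooks
import reactor.core.publisher.Operators
import reactor.util.context.Context

fun main() {
    // Each registration wraps the previously registered hook, so doing this in a hot
    // path (our bug) produces an ever-deeper chain of composed consumers.
    repeat(100_000) {
        Hooks.onErrorDropped { _ -> /* swallowed */ }
    }
    // Delivering a dropped error now walks the whole composed chain and can end in a
    // StackOverflowError, which is what surfaced the pool bug for us.
    Operators.onErrorDropped(RuntimeException("dropped"), Context.empty())
    // Hooks.resetOnErrorDropped() clears the accumulated hooks again.
}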
I imagine this is pretty niche, but a try/catch wouldn't be out of place here, imo.
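Concretely, this is the kind of guard I mean, sketched against the illustrative drain loop from earlier in this issue rather than against SimpleDequePool itself:

import java.util.concurrent.ConcurrentLinkedDeque
import java.util.concurrent.atomic.AtomicInteger

// Same illustrative shape as the DrainSketch above, with defensive guards added:
// a failing task is caught so draining continues, and the WIP bookkeeping sits in a
// finally block so a stray exception cannot wedge the loop forever.
class GuardedDrainSketch {
    private val wip = AtomicInteger()
    private val pending = ConcurrentLinkedDeque<() -> Unit>()

    fun offer(task: () -> Unit) {
        pending.offer(task)
        drain()
    }

    private fun drain() {
        if (wip.getAndIncrement() != 0) return
        var missed = 1
        while (true) {
            try {
                while (true) {
                    val task = pending.poll() ?: break
                    try {
                        task() // e.g. a returnPermits call that might throw
                    } catch (t: Throwable) {
                        // log and keep draining instead of letting the exception escape
                    }
                }
            } finally {
                missed = wip.addAndGet(-missed)
            }
            if (missed == 0) break
        }
    }
}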