
returnPermits exception stalls connection pool

Open erikbeerepoot opened this issue 1 year ago • 2 comments

We are seeing an issue with drainLoop in SimpleDequePool that is similar to this issue. We see the following pattern:

  1. Elevated errors over a period of hours (in our case, a high count of StackOverflowErrors). We are still attempting to understand the cause of these errors.
  2. We observe a "Too many permits returned" error in our logs. In our case, the connection pool is configured with 500 connections, so the error reads "Too many permits returned: returned=1, would bring to 501/500".
  3. Idle and active connections go to 0 (observed via metrics) and pending connections continue to build up (we've seen many thousands).
  4. No more outgoing requests are observed.

We have manually attempted to reproduce this issue by implementing a custom AllocationStrategy that delegates to a real allocation strategy but throws an exception in returnPermits (either randomly, or after a given number of calls). After this exception is thrown, we observe the same behaviour as above. We don't currently understand how we reach a state where the PERMITS count is off. (We are unable to enable DEBUG logging in production due to PII concerns, but are looking to deploy some changes that will log the number of permits as well as the state of the connection pool, as sketched below. Any pointers you have would be appreciated.)
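For reference, here is a minimal sketch of the permit logging we plan to deploy, assuming reactor-netty's ConnectionProvider.AllocationStrategy; the wrapper and logger names are illustrative, and the remaining interface methods would simply delegate:

import org.slf4j.LoggerFactory
import reactor.netty.resources.ConnectionProvider

class LoggingAllocationStrategy(
    private val delegate: ConnectionProvider.AllocationStrategy<*>
) {
    private val log = LoggerFactory.getLogger(javaClass)

    // Log permit counts around every return so we can spot the moment
    // the count first goes off.
    fun returnPermits(returned: Int) {
        log.info(
            "returnPermits({}) before: granted={} max={}",
            returned, delegate.permitGranted(), delegate.permitMaximum()
        )
        delegate.returnPermits(returned)
    }
}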

One hypothesis is that the exception is not handled and WIP is not decremented. When doAcquire() is called, elements are added to the pending queue as before, but drainLoop can never be entered because WIP will never return to 0.
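To illustrate the hypothesis, here is a minimal sketch of the classic work-in-progress guard pattern (illustrative only, not the actual SimpleDequePool code). If drainLoop() throws and nothing catches the exception, WIP never returns to 0, so every later drain() call bails out immediately and pending acquisitions are never served:

import java.util.concurrent.atomic.AtomicInteger

class DrainGuardSketch {
    private val wip = AtomicInteger()

    fun drain() {
        // Only the caller that moves WIP from 0 to 1 may run the loop.
        if (wip.getAndIncrement() != 0) return
        var missed = 1
        while (missed != 0) {
            drainLoop() // if this throws, WIP stays above 0 forever
            missed = wip.addAndGet(-missed)
        }
    }

    private fun drainLoop() {
        // process pending acquisitions, release permits, etc.
    }
}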

Another possibility is that it's hitting this exception in destroyPoolable.

Expected Behavior

Ideally, the connection pool would continue to function. I don't know if that is realistic given that PERMITS would continue to be wrong.

Actual Behavior

No more connections are made.

Steps to Reproduce

We realize this is a contrived example, but it matches what we end up seeing in production.

import reactor.netty.resources.ConnectionProvider
import kotlin.random.Random

class ThrowingAllocationStrategy(
    // your actual allocation strategy here, e.g. Http2AllocationStrategy or SizeBasedAllocationStrategy
    private val allocationStrategyDelegate: ConnectionProvider.AllocationStrategy<*>
) : ConnectionProvider.AllocationStrategy<ThrowingAllocationStrategy> {
    // implement the remaining interface methods (getPermits, permitGranted,
    // copy, ...) by just delegating to allocationStrategyDelegate

    override fun returnPermits(returned: Int) {
        randomlyThrow()

        allocationStrategyDelegate.returnPermits(returned)
    }

    private fun randomlyThrow() {
        // fail roughly 1 in 100 calls
        if (Random.nextInt(100) == 0) {
            throw RuntimeException("Boom!")
        }
    }
}

<SNIP>
// init a custom connection provider with the custom allocation strategy
return ConnectionProvider
        .builder("throwing") // builder name is illustrative
        .allocationStrategy(throwingAllocationStrategy)
        .build()
  

Possible Solution

Given that the PERMITS count is off, it's unclear to me what the proper resolution would be. We noticed that disposing of the connection provider did resolve the issue, presumably because the new connection pool is fresh and not in a bad state. I'm not sure what the implications would be of continuing to use the existing connection pool, as I would expect the permits exception to be thrown repeatedly.
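A sketch of the workaround we used (the builder name and settings are illustrative; disposeLater() is ConnectionProvider's graceful disposal):

import reactor.netty.resources.ConnectionProvider

fun recover(stuck: ConnectionProvider): ConnectionProvider {
    stuck.disposeLater().block() // blocking only for brevity
    // rebuild from scratch so the pool's PERMITS and WIP state start clean
    return ConnectionProvider.builder("recovered")
            .maxConnections(500)
            .build()
}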

Your Environment

reactor-pool == 0.2.13
io.projectreactor:reactor-core:3.5.20 -> 3.6.9
org.springframework:spring-webflux -> 6.1.12
io.projectreactor.netty:reactor-netty-core:1.1.22
netty:4.1.112.Final

Also tried the latest main commit for reactor-core and reactor-pool.

  • Other relevant library versions (eg. netty, ...): netty == 4.1.x
  • JVM version (java -version): 21, 17.
  • OS and version (eg uname -a):

erikbeerepoot avatar Dec 09 '24 17:12 erikbeerepoot

@erikbeerepoot Something is off here: Reactor Netty 1.1.x uses Reactor Pool 1.0.x. https://github.com/reactor/reactor-netty/blob/1.1.x/gradle.properties

Also, Reactor Pool 0.2.x has reached OSS EOL: https://github.com/reactor/.github/blob/main/SUPPORT.adoc#support-timeline

violetagg avatar Jan 14 '25 08:01 violetagg

@violetagg Interesting, that is unexpected for sure. I will look into why this version is being used.

EDIT: Turns out the wrong reactor-pool version was an artifact of some local debugging steps I took, and doesn't reflect the version used in the application.

Taking a glance at the current code, though, it's similar enough to have the same issue if this exception were to be thrown. I imagine it's pretty rare.

erikbeerepoot avatar Jan 16 '25 18:01 erikbeerepoot

To close the loop on this -- the above is definitely a bug in the reactor-pool code. What was happening in our case is that we were adding too many onErrorDropped hooks in our application (a bug in our code). Because the hooks are composed, once enough hooks are added, invoking the composed chain recurses deeply enough to throw a StackOverflowError. This SOE appears to have exposed multiple bugs in reactor-pool: (1) the permit count goes off; (2) WIP never reaches 0 -- and hence the connection pool locks up.
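A minimal sketch of the hook bug on our side (the repeat count is illustrative): Hooks.onErrorDropped composes each new consumer onto the previous ones, so registering it repeatedly builds a deeply nested chain, and dropping a single error then walks the whole chain:

import reactor.core.publisher.Hooks
import reactor.core.publisher.Operators
import reactor.util.context.Context

fun main() {
    // Bug pattern: registering the hook per request instead of once at startup.
    repeat(50_000) {
        Hooks.onErrorDropped { /* log and swallow */ }
    }
    // Triggering a dropped error now recurses through the composed consumers
    // and, with enough registrations, throws a StackOverflowError.
    Operators.onErrorDropped(RuntimeException("boom"), Context.empty())
}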

I imagine this is pretty niche, but a try/catch wouldn't be out of place here, imo.
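A sketch of that guard (illustrative; not the actual reactor-pool internals, and using reactor-pool's AllocationStrategy type):

import reactor.pool.AllocationStrategy

fun safeReturnPermits(strategy: AllocationStrategy, returned: Int) {
    try {
        strategy.returnPermits(returned)
    } catch (t: Throwable) {
        // Surface the failure, but let the drain loop finish so WIP can
        // still be decremented and the pool keeps serving acquisitions.
        System.err.println("returnPermits failed: $t")
    }
}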

erikbeerepoot avatar Oct 22 '25 15:10 erikbeerepoot