
"Connection not available" - Potential thread starvation

MatthewJPhillips opened this issue 2 years ago • 4 comments

Hello,

We have been using HikariCP for a long time and it is an awesome project, so thank you! Every so often, we see

java.sql.SQLTransientConnectionException: Connection is not available, request timed out after 5000ms.

We work in a highly concurrent environment (100+ worker threads) and I think we are running into a rare race condition that leads to thread starvation. An example scenario that leads to this is:

We start with an empty pool (minimumIdle: 0, maximumPoolSize: 1)
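For concreteness, that starting state corresponds to a configuration roughly like the sketch below (the JDBC URL is a placeholder; the repro test further down uses a larger pool to make the race easier to hit):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class EmptyPoolScenario {
   // Starting state for the race described below: keep no idle connections,
   // allow at most one connection, and let waiters give up after 5 seconds.
   public static HikariDataSource buildPool() {
      HikariConfig config = new HikariConfig();
      config.setJdbcUrl("jdbc:h2:mem:scenario");   // placeholder database
      config.setMinimumIdle(0);
      config.setMaximumPoolSize(1);
      config.setConnectionTimeout(5000);
      return new HikariDataSource(config);
   }
}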

  1. Worker thread A: Calls getConnection(). Since the pool is empty, it triggers a connection create and polls the handoffQueue
  2. ConnectionAdder thread: Creates the connection, putting it in the sharedList and offering it to worker thread A.
  3. Worker thread B: Calls getConnection(). Scans the sharedList, and sets the newly added connection to IN_USE
  4. Worker thread C: Calls getConnection(). Sees there are no available connections, so it polls the handoffQueue
  5. Worker thread A: Sees the connection it was offered in step 2 is IN_USE (stolen by worker thread B in step 3) so re-polls the handoffQueue with a reduced timeout
  6. Worker thread B: Closes the connection, marking it as NOT_IN_USE and then offering it to the handoffQueue (i.e. it is picked up by thread C)
  7. Worker thread C: Marks the connection as IN_USE
  8. Worker thread C: Closes the connection, marking it as NOT_IN_USE, and then offering it to the handoffQueue
  9. Worker thread B: Calls getConnection() again. Scans the sharedList and sets the connection to IN_USE
  10. Worker thread A: Sees the connection it was offered in step 8 is IN_USE (stolen by worker thread B in step 9) so re-polls the handoffQueue with a reduced timeout

This can theoretically happen indefinitely, causing thread A to make no progress and eventually time out.
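To make steps 5 and 10 concrete, here is a heavily simplified paraphrase of the borrow behaviour as I understand it. The names follow the description above (sharedList, handoffQueue, IN_USE); this is a sketch of the pattern, not the actual ConcurrentBag code:

import java.util.List;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BorrowSketch {
   static final int NOT_IN_USE = 0;
   static final int IN_USE = 1;

   static final class PoolEntry {
      final AtomicInteger state = new AtomicInteger(NOT_IN_USE);
   }

   final List<PoolEntry> sharedList;
   final SynchronousQueue<PoolEntry> handoffQueue = new SynchronousQueue<>();

   BorrowSketch(List<PoolEntry> sharedList) {
      this.sharedList = sharedList;
   }

   PoolEntry borrow(long timeoutNs) throws InterruptedException {
      // 1. Newcomers scan the shared list first, so they can claim an entry
      //    ahead of threads that are already parked on the hand-off queue.
      for (PoolEntry entry : sharedList) {
         if (entry.state.compareAndSet(NOT_IN_USE, IN_USE)) {
            return entry;
         }
      }

      // 2. Otherwise wait on the hand-off queue, shrinking the timeout each round.
      do {
         long start = System.nanoTime();
         PoolEntry handedOff = handoffQueue.poll(timeoutNs, TimeUnit.NANOSECONDS);
         if (handedOff == null || handedOff.state.compareAndSet(NOT_IN_USE, IN_USE)) {
            return handedOff;   // null means the caller's timeout elapsed
         }
         // The handed-off entry was already stolen (IN_USE), so re-poll with
         // whatever time is left; steps 5 and 10 above are this branch repeating.
         timeoutNs -= System.nanoTime() - start;
      } while (timeoutNs > 0);

      return null;   // surfaces as SQLTransientConnectionException in getConnection()
   }
}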

I have cloned this repo and set up a test (included below) that reproduces the error. java.sql.SQLTransientConnectionException entries show up in the logs; for example, on line 59936 of the log, thread 69 times out. The JFR recording during this time shows thread 69 being re-parked on the handoffQueue with a shorter and shorter timeout until the park wait exceeds the timeout (presumably throwing the timeout exception).

https://user-images.githubusercontent.com/61445055/156873545-ed20e568-f462-4919-b6c4-cfdcf06c87fd.mov

Full JFR hikaricp-jfr.tar.gz

I think this issue comes up because we allow incoming threads to steal connections from threads that are already waiting. I am experimenting with an implementation that moves the waiting to before threads check the sharedList, which I think would fix this. Please let me know what you think, and whether this is a real issue that should be fixed.
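I don't want to over-claim what that work-in-progress change does, but the general idea is to make borrowing fair, so a newly arriving thread cannot jump the queue ahead of a thread that is already waiting. A rough sketch of the fairness idea in isolation (not my actual patch, and not ConcurrentBag code):

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// A fair semaphore admits borrowers in strict arrival order, so a newly
// arriving thread cannot start scanning the shared list ahead of a thread
// that is already waiting for a connection.
class FairBorrowGate {
   private final Semaphore gate;

   FairBorrowGate(int maxPoolSize) {
      // Permits = maximumPoolSize; "true" requests FIFO (fair) ordering.
      this.gate = new Semaphore(maxPoolSize, true);
   }

   // Call before scanning the sharedList / polling the handoffQueue.
   boolean enter(long timeoutMs) throws InterruptedException {
      return gate.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
   }

   // Call when the connection is returned to the pool.
   void exit() {
      gate.release();
   }
}

The gate only illustrates the ordering property; the real change would have to live inside the bag's scan/poll path.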

Thanks again!

package com.zaxxer.hikari.pool;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.sql.SQLException;
import java.util.concurrent.Executors;

import static com.zaxxer.hikari.pool.TestElf.newHikariConfig;
import static com.zaxxer.hikari.pool.TestElf.setConfigUnitTest;

public class ConnectionContestedTest {
   private static final Logger LOGGER = LoggerFactory.getLogger(ConnectionContestedTest.class);

   // This test will run forever. It needs to be cancelled manually.
   @Test
   public void testRaceCondition() throws Exception {
      HikariConfig config = newHikariConfig();
      config.setDataSourceClassName("com.zaxxer.hikari.mocks.StubDataSource");
      config.setTransactionIsolation("TRANSACTION_REPEATABLE_READ");
      config.validate();
      config.setConnectionTimeout(5000);
      config.setMinimumIdle(0);
      config.setMaximumPoolSize(20);
      setConfigUnitTest(true);
      try (HikariDataSource ds = new HikariDataSource(config)) {
         var pool = new HikariPool(config);

         var executor = Executors.newFixedThreadPool(128);
         for (var i = 0; i < 128; i++) {
            executor.submit(() -> getAndReleaseForever(pool));
         }
         while (true)
            Thread.sleep(10000);
      } finally {
         setConfigUnitTest(false);
      }
   }

   private void getAndReleaseForever(HikariPool pool) {
      while (true) {
         try {
            var c = pool.getConnection();
            LOGGER.info("Got connection");
            Thread.sleep(50);
            c.close();
            LOGGER.info("Released connection");
         } catch (SQLException sqlException) {
            LOGGER.error("SQLException", sqlException);
         } catch (InterruptedException e) {
            throw new RuntimeException(e);
         }
      }
   }
}

MatthewJPhillips avatar Mar 05 '22 07:03 MatthewJPhillips

I believe this has hit me recently. I am intermittently hitting a scenario in my production system (every few days) where all the connections in a pool are marked IN_USE, while all the connections on the database side are idle (screenshots attached). I am not able to reproduce this at the moment, but we just found this post, so I'll be doing some digging now that we have more info to go through.

Edit: I am seeing this with HikariCP 4.0.3.
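While I dig, I'm also turning on leak detection and the pool MBeans so I can tell an application-side leak apart from the kind of pool-side starvation described above. A minimal sketch of that (the JDBC URL and thresholds are placeholders, not our actual config):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolDiagnostics {
   public static HikariDataSource buildInstrumentedPool() {
      HikariConfig config = new HikariConfig();
      config.setJdbcUrl("jdbc:postgresql://localhost/app"); // placeholder
      config.setMaximumPoolSize(10);
      config.setLeakDetectionThreshold(60_000); // warn if a connection is held for more than 60s
      config.setRegisterMbeans(true);           // expose active/idle/waiting counts over JMX
      return new HikariDataSource(config);
   }
}

If the leak detector stays quiet while the active count sits pinned at the maximum, that points away from our code holding connections.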

Tostino avatar May 11 '22 16:05 Tostino

Seeing something similar with 4.0.3. Looking through the releases, this may have been fixed in 5.0.0. Has the fix been backported to older versions?

NishanthShajahan avatar Jun 21 '22 20:06 NishanthShajahan

That'd be quite annoying; we're not quite up to Java 11+ yet =/.

Tostino avatar Jun 21 '22 20:06 Tostino

It seems we might have encountered this issue as well in our production environment, although it is difficult to say with certainty. Most of the time we have about 0 or 1, maybe 2, active connections in the Hikari connection pool. However, at a certain point in time this quickly shot up to 10 (the default limit we use). For one of our application instances, the active connections were never freed, according to our Prometheus data (screenshot attached). We are using JDK 11 with HikariCP 4.0.3. For us this seems to be a relatively rare issue, as this is the first time we have encountered it in about a month, which also means we are unable to reproduce the results. As far as we can tell, this problem is not caused by a resource leak in our own code.
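For context, the active-connection numbers in that graph come from HikariCP's pool gauges exported to Prometheus. A minimal sketch of that wiring follows; it is illustrative only (placeholder pool name and JDBC URL, and it may not match our exact setup), and it needs the Prometheus simpleclient on the classpath:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.metrics.prometheus.PrometheusMetricsTrackerFactory;

public class PoolWithPrometheusMetrics {
   public static HikariDataSource buildPool() {
      HikariConfig config = new HikariConfig();
      config.setJdbcUrl("jdbc:postgresql://localhost/app"); // placeholder
      config.setMaximumPoolSize(10);                        // the default limit mentioned above
      config.setPoolName("app-pool");                       // placeholder name, used as a metric label
      config.setMetricsTrackerFactory(new PrometheusMetricsTrackerFactory());
      return new HikariDataSource(config);
   }
}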

Viserius avatar Aug 03 '22 10:08 Viserius