HikariCP
"Connection not available" - Potential thread starvation
Hello,
We have been using HikariCP for a long time and it is an awesome project, so thank you! Every so often, we see:
java.sql.SQLTransientConnectionException: Connection is not available, request timed out after 5000ms.
We work in a highly concurrent environment (100+ worker threads) and I think we are hitting a rare race condition that leads to thread starvation. An example scenario:
We start with an empty pool (minimumIdle: 0, maximumPoolSize: 1).

1. Worker thread A: Calls getConnection(). Worker thread A triggers the create and polls the handoffQueue.
2. ConnectionAdder thread: Creates the connection, putting it in the sharedList and offering it to worker thread A.
3. Worker thread B: Calls getConnection(). Scans the sharedList and sets the newly added connection to IN_USE.
4. Worker thread C: Calls getConnection(). Sees there are no available connections, so polls the handoffQueue.
5. Worker thread A: Sees the connection it was offered in step 2 is IN_USE (stolen by worker thread B in step 3), so re-polls the handoffQueue with a reduced timeout.
6. Worker thread B: Closes the connection, marking it as NOT_IN_USE and then offering it to the handoffQueue (i.e. picked up by thread C).
7. Worker thread C: Marks the connection as IN_USE.
8. Worker thread C: Closes the connection, marking it as NOT_IN_USE, and then offering it to the handoffQueue.
9. Worker thread B: Calls getConnection() again. Scans the sharedList and sets the connection to IN_USE.
10. Worker thread A: Sees the connection it was offered in step 8 is IN_USE (stolen by worker thread B in step 9), so re-polls the handoffQueue with a reduced timeout.
This can theoretically happen indefinitely, causing thread A to make no progress and eventually time out.
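To make the interleaving easier to follow, below is a self-contained, simplified model of the borrow/requite path I am describing. It is my own paraphrase rather than HikariCP's actual ConcurrentBag code; the names and structure are approximations, but it shows why a thread parked on the handoffQueue can keep losing to threads that scan the sharedList:

import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the borrow/requite path (my paraphrase, not HikariCP source).
// Entries live in a shared list; a SynchronousQueue hands returned/new entries to
// parked waiters. Any thread that scans the shared list can claim an entry that was
// just handed to a waiter, so the waiter has to re-poll with whatever time is left.
final class BagSketch {

   static final int NOT_IN_USE = 0;
   static final int IN_USE = 1;

   static final class Entry {
      final AtomicInteger state = new AtomicInteger(NOT_IN_USE);
   }

   private final CopyOnWriteArrayList<Entry> sharedList = new CopyOnWriteArrayList<>();
   private final SynchronousQueue<Entry> handoffQueue = new SynchronousQueue<>(true);

   void add(Entry entry) {
      sharedList.add(entry);
      handoffQueue.offer(entry); // mirrors the ConnectionAdder offering to a waiter (step 2)
   }

   Entry borrow(long timeout, TimeUnit unit) throws InterruptedException {
      // Fast path: scan the shared list and claim anything idle (the "steal" in steps 3 and 9).
      for (Entry entry : sharedList) {
         if (entry.state.compareAndSet(NOT_IN_USE, IN_USE)) {
            return entry;
         }
      }
      // Slow path: park on the handoff queue, shrinking the timeout on each round.
      long remaining = unit.toNanos(timeout);
      while (remaining > 0) {
         long start = System.nanoTime();
         Entry entry = handoffQueue.poll(remaining, TimeUnit.NANOSECONDS);
         // The offered entry may already be IN_USE if another thread claimed it via
         // the shared list between the offer and this CAS (steps 5 and 10).
         if (entry != null && entry.state.compareAndSet(NOT_IN_USE, IN_USE)) {
            return entry;
         }
         remaining -= System.nanoTime() - start; // re-poll with a reduced timeout
      }
      return null; // the caller turns this into SQLTransientConnectionException
   }

   void requite(Entry entry) {
      entry.state.set(NOT_IN_USE);
      handoffQueue.offer(entry); // hand it to a parked waiter, if one is polling
   }
}

Nothing in this path orders existing waiters ahead of newly arriving threads, so steps 5 and 10 can repeat while the parked thread's remaining timeout only ever shrinks.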
I have cloned this repo and set up a test (included below) that reproduces the error; java.sql.SQLTransientConnectionException appears in the logs. For example, on line 59936 of the log, thread 69 times out. The JFR recording over this period shows thread 69 being re-parked on the handoffQueue with a progressively shorter timeout until the park wait exceeds the connection timeout (presumably throwing the timeout exception).
https://user-images.githubusercontent.com/61445055/156873545-ed20e568-f462-4919-b6c4-cfdcf06c87fd.mov
Full JFR recording: hikaricp-jfr.tar.gz
I think this issue comes up because incoming threads are allowed to steal connections from threads that are already waiting. I am playing around with an implementation that moves the waiting to before threads check the sharedList, which I think would fix this error; a rough sketch of the idea follows. Please let me know what you think and whether this is a real issue/something that should be fixed.
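For illustration only, the rough shape of what I am experimenting with looks like the sketch below. It uses a fair Semaphore as a stand-in for "wait your turn before scanning the sharedList"; it is not the actual change, and a real fix would need to avoid serializing connection creation behind the gate:

import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Idea sketch only: a fair gate in front of the borrow path so that threads are
// served in arrival order and a newly arriving thread cannot overtake (steal
// from) a thread that is already waiting.
final class FairBorrowGate {

   private final Semaphore gate;

   FairBorrowGate(int maxPoolSize) {
      this.gate = new Semaphore(maxPoolSize, true); // fair = FIFO ordering of waiters
   }

   // Acquire a turn before scanning the sharedList / polling the handoffQueue.
   boolean enter(long timeout, TimeUnit unit) throws InterruptedException {
      return gate.tryAcquire(timeout, unit);
   }

   // Release the turn when the borrowed connection is returned to the pool.
   void exit() {
      gate.release();
   }
}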
Thanks again!
package com.zaxxer.hikari.pool;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.sql.SQLException;
import java.util.concurrent.Executors;

import static com.zaxxer.hikari.pool.TestElf.newHikariConfig;
import static com.zaxxer.hikari.pool.TestElf.setConfigUnitTest;

public class ConnectionContestedTest {

   private static final Logger LOGGER = LoggerFactory.getLogger(ConnectionContestedTest.class);

   // This test will run forever. It needs to be cancelled manually.
   @Test
   public void testRaceCondition() throws Exception {
      HikariConfig config = newHikariConfig();
      config.setDataSourceClassName("com.zaxxer.hikari.mocks.StubDataSource");
      config.setTransactionIsolation("TRANSACTION_REPEATABLE_READ");
      config.validate();
      config.setConnectionTimeout(5000);
      config.setMinimumIdle(0);
      config.setMaximumPoolSize(20);

      setConfigUnitTest(true);
      try (HikariDataSource ds = new HikariDataSource(config)) {
         var pool = new HikariPool(config);
         var executor = Executors.newFixedThreadPool(128);
         for (var i = 0; i < 128; i++) {
            executor.submit(() -> getAndReleaseForever(pool));
         }
         while (true)
            Thread.sleep(10000);
      } finally {
         setConfigUnitTest(false);
      }
   }

   private void getAndReleaseForever(HikariPool pool) {
      while (true) {
         try {
            var c = pool.getConnection();
            LOGGER.info("Got connection");
            Thread.sleep(50);
            c.close();
            LOGGER.info("Released connection");
         } catch (SQLException sqlException) {
            LOGGER.error("SQLException", sqlException);
         } catch (InterruptedException e) {
            throw new RuntimeException(e);
         }
      }
   }
}
I believe this has hit me recently. I am intermittently hitting a scenario in my production system (every few days) where all the connections in a pool are marked "IN_USE" while all the corresponding connections on the database are idle.
I am not able to reproduce this at the moment, but we just found this post, so I'll be doing some digging now that we have more information to go through.
Edit: I am seeing this with HikariCP 4.0.3.
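While digging, the pool's MXBean is one way to capture HikariCP's own accounting (active/idle/total/waiting) at the moment the pool looks exhausted. A minimal sketch follows; how the HikariDataSource reference is obtained, and the plain printf, are assumptions for illustration:

import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;

// Illustrative sketch: dump the pool's own counters so the "all IN_USE while the
// database is idle" state can be captured with numbers at the time it occurs.
public final class PoolStateLogger {

   public static void logState(HikariDataSource dataSource) {
      HikariPoolMXBean pool = dataSource.getHikariPoolMXBean();
      System.out.printf("active=%d idle=%d total=%d threadsAwaiting=%d%n",
            pool.getActiveConnections(),
            pool.getIdleConnections(),
            pool.getTotalConnections(),
            pool.getThreadsAwaitingConnection());
   }
}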
Seeing something similar with 4.0.3. Looking through the releases, this may have been fixed in 5.0.0. Has the fix been backported to older versions?
That'd be quite annoying; we're not quite up to Java 11+ yet =/.
It seems we might have encountered this issue as well in our production environment, although it is difficult to say with certainty.
Most of the time, we have 0 or 1, occasionally 2, active connections in the Hikari connection pool.
However, at a certain point in time, this quickly shot up to 10 (the default maximum pool size, which we use).
For one of our application instances, the active connections were never freed, according to our Prometheus data.
We are using JDK 11 with HikariCP 4.0.3.
For us, this seems to be a relatively rare issue, as this is the first time we have encountered it in about a month.
This also means that we are unable to reproduce it.
As far as we can tell, this problem does not seem to be caused by a resource leak in our own code.
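For what it's worth, HikariCP's built-in leak detection is one way to double-check that side of things: it logs the borrowing thread's stack trace whenever a connection is held longer than the threshold, so a genuine leak in application code shows up with a call site, while silence is at least consistent with the pool-side theory above. A minimal sketch (the JDBC URL and the 60-second threshold are placeholder values, not our configuration):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Illustrative sketch: enable leak detection so that any connection held longer
// than the threshold logs a warning with the stack trace of the code that
// borrowed it.
public class LeakDetectionExample {

   public static void main(String[] args) {
      HikariConfig config = new HikariConfig();
      config.setJdbcUrl("jdbc:postgresql://localhost:5432/exampledb"); // placeholder URL
      config.setMaximumPoolSize(10);
      config.setLeakDetectionThreshold(60_000); // 60 seconds, illustrative value

      try (HikariDataSource ds = new HikariDataSource(config)) {
         // ... hand ds to the application and use ds.getConnection() as usual ...
      }
   }
}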