
Connection refused after DB failover for 1 out of 2 connection pools

Open mschwitalla opened this issue 4 years ago • 8 comments

Hi, we use HikariCP and observed some weird behavior in our service after a PostgreSQL DB failover was executed.

Some background information:

  • The failover was executed by pointing the Elastic IP to the new primary database.
  • The application has 2 connection pools to the same DB with different timeout values etc.
  • We are using PostgreSQL 12
  • HikariCP version 3.4.5
  • PostgreSQL driver version 42.2.14
  • Spring Boot 2.3.2 application
  • Application is running on Kubernetes

Configuration of the datasources:

spring:
  datasource:
    default:
      driver-class-name: org.postgresql.Driver
      jdbc-url: jdbc:postgresql://localhost:5432/postgres
      username: ${DB_USER:postgres}
      password: ${DB_PASSWORD:postgres}
      pool-name: default
      connection-timeout: 5000
      maximum-pool-size: 10
      validation-timeout: 800
    critical-query:
      driver-class-name: ${spring.datasource.default.driver-class-name}
      jdbc-url: ${spring.datasource.default.jdbc-url}
      username: ${spring.datasource.default.username}
      password: ${spring.datasource.default.password}
      pool-name: critical-query
      maximum-pool-size: 10
      connection-timeout: 300
      validation-timeout: 250
      connection-init-sql: "set statement_timeout to '400ms'; set application_name to 'critical-ops';"

The problem occurred on the default datasource, but only for 1 out of 10 pods, and on that one pod only the default connection pool was affected. The critical-query connection pool recovered quickly after the failover. The problem was only fixed once the affected pod was manually terminated after ~30 minutes.

We got the following exceptions:

org.springframework.web.util.NestedServletException: Request processing failed; nested exception is org.springframework.transaction.CannotCreateTransactionException: Could not open JDBC Connection for transaction; nested exception is java.sql.SQLTransientConnectionException: default - Connection is not available, request timed out after 5000ms.
...
Caused by: org.postgresql.util.PSQLException: Connection to {URL}:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
...
Caused by: java.net.ConnectException: Connection refused (Connection refused)

As no DNS changes were involved, the JVM DNS cache couldn't be the problem. Additionally, I would expect both connection pools to be affected in such a scenario. The same goes for any infrastructure-related issue.

We suspect some kind of race condition or concurrency issue that leaves HikariCP in a bad state, because not all DB failovers are affected. Any help would be appreciated. In case you need more information, please let me know.

mschwitalla avatar Oct 08 '20 08:10 mschwitalla

@mschwitalla What is the maximum number of connections supported by postgres in 'Default'? You're saying that you got a connection refused to localhost? If the issue occurred on 'Default', and the URL in the configuration is localhost, then can I assume that {URL}:5432 in the exception is referring to localhost:5432?

brettwooldridge avatar Oct 12 '20 16:10 brettwooldridge

@brettwooldridge Sorry for the confusion. I pasted the jdbc-url from the default environment, because I didn't want to paste the production URL. Therefore it's not localhost, but a DB server used in production.

mschwitalla avatar Oct 12 '20 16:10 mschwitalla

We are facing similar issues with the Hikari pool in Kubernetes pods connecting to PostgreSQL databases in IBM Cloud. After a switchover of the DB, Hikari cannot connect to the database anymore, with this error message:

[...]
Caused by: org.postgresql.util.PSQLException: Connection to <DB-HOST>:<DB-PORT> refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:285) ~[postgresql-42.2.14.jar!/:42.2.14]
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49) ~[postgresql-42.2.14.jar!/:42.2.14]
	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:217) ~[postgresql-42.2.14.jar!/:42.2.14]
	[...]

gutschet avatar Nov 09 '20 08:11 gutschet

@gutschet, we are seeing the same issue. If you have resolved it, please share how. Thanks.

qianliniu avatar Feb 02 '21 03:02 qianliniu

@qianliniu, we resolved the problem by setting a socket timeout of 90 seconds for the connection pool against Postgres. We found a lot of information here: https://github.com/pgjdbc/pgjdbc/issues/1508

Since that change, our pool is able to reconnect to the database after a switchover. But we still get tons of error logs when the database switchover happens and the pool loses its connections. Not a clean solution, but it works.
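For anyone who wants to try the same workaround, here is a minimal sketch of how such a socket timeout could be set through HikariCP; the class name, host, and credentials are placeholders, and pgjdbc's socketTimeout parameter is specified in seconds:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class SocketTimeoutExample {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/postgres"); // placeholder host
        config.setUsername("postgres");                               // placeholder credentials
        config.setPassword("postgres");
        // pgjdbc's socketTimeout is in seconds: a socket read blocked longer than
        // this fails, so a dead peer after a failover is detected instead of the
        // connection hanging indefinitely.
        config.addDataSourceProperty("socketTimeout", "90");
        try (HikariDataSource ds = new HikariDataSource(config)) {
            // use ds.getConnection() as usual
        }
    }
}

The same parameter can also be appended to the JDBC URL, e.g. ?socketTimeout=90.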

gutschet avatar Feb 02 '21 08:02 gutschet

@gutschet Thanks for your reply. I have already added the socketTimeout setting, but with 30 seconds: String url = "jdbc:postgresql://" + (host + ":" + port + "/" + dbName + "?ssl=false&socketTimeout=30"); I don't think it will help, because in my case the DB switchover only took about 20 seconds, but the connection attempts kept failing with the same error for several hours.

I also found this ticket, but I'm not sure whether it is the root cause (JVM DNS caching behavior). It is really hard to reproduce the issue: I cannot trigger the DB switchover directly, and it seems not every switchover causes this issue. https://github.com/brettwooldridge/HikariCP/issues/1075
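In case JVM DNS caching does turn out to be involved, here is a rough sketch of how the positive lookup cache TTL could be lowered; the 30-second value is just an assumption, and the property has to be set before the first hostname lookup in the JVM:

import java.security.Security;

public class DnsTtlExample {
    public static void main(String[] args) {
        // "networkaddress.cache.ttl" is a JDK security property that controls how
        // long successful hostname lookups are cached, in seconds. Lowering it lets
        // the driver pick up a changed A record after a failover sooner.
        Security.setProperty("networkaddress.cache.ttl", "30");
        // Alternatively, the system property -Dsun.net.inetaddr.ttl=30 can be
        // passed on the JVM command line.
    }
}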

qianliniu avatar Feb 02 '21 12:02 qianliniu

Hi, I see this issue on my project, and it does not happen every time. During load tests with JMeter we see this connection error. My system runs on AWS Fargate and the database is on AWS RDS. I haven't found any suitable solution. My datasource configuration is below.

spring:
  jpa:
    hibernate:
      ddl-auto: none
    properties:
      hibernate:
        check_nullability: true
        default_schema: users
    show-sql: false
    open-in-view: false
  datasource:
    hikari:
      connection-timeout: 60000
      maximum-pool-size: 10
      minimum-idle: 5
      idle-timeout: 3000
      pool-name: YeniPool
      jdbc-url: jdbc:postgresql://172.18.0.2:5432/memleket?currentSchema=users
      password: admin
      username: admin
      driver-class-name: org.postgresql.Driver
      max-lifetime: 3000000

ahmetserdargeze avatar Feb 03 '21 09:02 ahmetserdargeze

I am seeing this issue even after evictConnections (a thread runs the check below every 30 seconds).

This log gets printed:

writer ip changed, old is 10.0.133.*** and new is 10.0.29.*** Active connection count is : 0 and idle connection count is 2

import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.commons.lang3.StringUtils;

import com.zaxxer.hikari.HikariDataSource;
import com.zaxxer.hikari.HikariPoolMXBean;

    // Last resolved IP per pool; assumes an SLF4J "log" instance is available.
    private final Map<HikariDataSource, String> dataSourceIpMap = new ConcurrentHashMap<>();

    // Runs periodically: resolves the pool's JDBC host and soft-evicts all
    // connections when the resolved IP has changed (i.e. the writer moved).
    private void checkForIpChange(HikariDataSource dataSource) {
        try {
            String ip = getIp(dataSource.getJdbcUrl());
            if (Objects.isNull(dataSourceIpMap.get(dataSource))) {
                dataSourceIpMap.put(dataSource, ip);
            } else if (!StringUtils.equals(ip, dataSourceIpMap.get(dataSource))) {
                log.info("writer ip changed, old is {} and new is {}", dataSourceIpMap.get(dataSource), ip);
                evictConnections(dataSource);
                dataSourceIpMap.put(dataSource, ip);
            }
        } catch (Exception e) {
            log.error("Caught exception in cname check", e);
        }
    }

    // Marks every pooled connection for eviction; idle ones are closed right away,
    // active ones as soon as they are returned to the pool.
    private void evictConnections(HikariDataSource dataSource) {
        HikariPoolMXBean pool = dataSource.getHikariPoolMXBean();
        if (pool != null) {
            log.info("Active connection count is : {} and idle connection count is {}",
                    pool.getActiveConnections(), pool.getIdleConnections());
            pool.softEvictConnections();
        }
    }

    // Resolves the host part of a jdbc:postgresql://host:port/db URL to an IP address.
    private String getIp(String url) throws URISyntaxException, UnknownHostException {
        URI aURL = new URI(url);
        aURL = new URI(aURL.getSchemeSpecificPart());
        InetAddress address = InetAddress.getByName(aURL.getHost());
        return address.getHostAddress();
    }
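For completeness, a sketch of how a check like this might be scheduled every 30 seconds; only the interval comes from the description above, the executor wiring and the dataSource variable are assumptions:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

    // Hypothetical wiring: run checkForIpChange against the pool every 30 seconds.
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> checkForIpChange(dataSource), 30, 30, TimeUnit.SECONDS);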

@brettwooldridge Is there any active work going on for this, or any workarounds?

dheeraj9198 avatar Aug 21 '23 19:08 dheeraj9198