
"gocql: no hosts available in the pool" cannot recover without restart

Open zhixinwen opened this issue 8 years ago • 47 comments

Our cluster degraded due to increased traffic load, and on the client side we saw the "gocql: no hosts available in the pool" error for ~80% of requests.

The error rate was very consistent, and the C* cluster was healthy apart from the load. We tried to reduce the load on the cluster, but the errors persisted consistently for two hours.

We restarted the client, and the errors went away immediately.

Another interesting observation: after we restarted the gocql client, we also saw an immediate drop in coordinator latency and traffic on the C* cluster (yes, the client restarted first, and then the latency and traffic on C* dropped, not the other way around). It may be that upstream users stopped retrying once we restarted the server and no longer saw "gocql: no hosts available in the pool", but we are not sure about the root cause yet.

zhixinwen avatar May 23 '17 04:05 zhixinwen

After we close a connection due to too many timeouts, when would the connection be made again?

I do see gocql wait for an event from the control connection before reconnecting. But if no node is perceived as down by the control node and it never emits a 'node up' event, would gocql still try to recreate the connections?

zhixinwen avatar May 23 '17 04:05 zhixinwen

@Zariel gocql closes a connection when it takes too long to respond, when would the connection be reconnected?

I may have missed some of the code, but it looks like it will only reconnect when it receives a "Node Up" event?

zhixinwen avatar Jun 01 '17 17:06 zhixinwen

It should periodically try to dial nodes that are down; if you can reproduce and build with the gocql_debug tag, you should see more info about the pool state.

Also if all the hosts are down, then the host list should be reset to the initial hosts and they will be attempted.

Zariel avatar Jun 27 '17 21:06 Zariel
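For reference, the periodic redial interval mentioned above is controlled by ClusterConfig.ReconnectInterval. A minimal session setup might look like this (contact points and keyspace are placeholders, not from the original report); note that, per the rest of this thread, a host that silently lost all its connections may never be marked DOWN and thus never benefit from this interval:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact points and keyspace.
	cluster := gocql.NewCluster("10.0.0.1", "10.0.0.2")
	cluster.Keyspace = "example"
	// Hosts marked DOWN are redialed at this interval.
	cluster.ReconnectInterval = 10 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```
This is a configuration sketch only; it requires a reachable Cassandra cluster to actually connect.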

I think the root cause of the problem here is:

  1. When all connections to a hosts are closed, the host is not marked as down, so ReconnectInterval would not help
  2. HandleError in hostConnPool would try to refill the connection pool when a connection is closed, but in an outage it is very likely to fail. Moreover it would not retry afterward. As a result, when the cluster recovers after all the connections are closed, gocql would not attempt to rebuild the connection pool

To fix this, we need to mark the host as DOWN, when we close all the connections on that host.

zhixinwen avatar Aug 09 '17 00:08 zhixinwen
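The proposed fix above can be modeled with a small sketch. This is not gocql's actual code; the type and field names are illustrative, simplified stand-ins for hostConnPool. The point is that when HandleError drops the last connection, the host must transition to DOWN so the periodic reconnect loop will consider it:

```go
package main

import (
	"fmt"
	"sync"
)

// hostState is a simplified stand-in for gocql's host up/down state.
type hostState int

const (
	hostUp hostState = iota
	hostDown
)

// hostConnPool is an illustrative model, not gocql's real struct.
type hostConnPool struct {
	mu    sync.Mutex
	conns int
	state hostState
}

// HandleError drops the failed connection and, if the pool is now empty,
// marks the host DOWN instead of leaving it UP with zero connections
// (the buggy state described in this thread). A DOWN host becomes
// eligible for the periodic reconnect pass.
func (p *hostConnPool) HandleError() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.conns > 0 {
		p.conns--
	}
	if p.conns == 0 {
		p.state = hostDown
	}
}

func main() {
	p := &hostConnPool{conns: 2, state: hostUp}
	p.HandleError()
	fmt.Println(p.conns, p.state == hostDown) // 1 false
	p.HandleError()
	fmt.Println(p.conns, p.state == hostDown) // 0 true
}
```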

I simulated network connectivity issues using iptables:

sudo iptables -I OUTPUT -p tcp -d 127.0.0.1 --dport 9042 -j DROP

After that, queries start to fail with this error message: gocql: no response received from cassandra within timeout period. Soon after, it becomes: gocql: no hosts available in the pool.

After deleting the rule (sudo iptables -D OUTPUT 1) gocql indeed does not recover automatically.

balta2ar avatar Aug 28 '17 11:08 balta2ar

Running a single node of Cassandra (on a Mac, so excuse the lack of iptables), I am able to query the host as expected while it is running. After kill -9 on the process, gocql receives no host-down notification but discovers the control connection is dead and marks the host down. After restarting Cassandra, the driver reconnects and queries can continue. I repeated this with STOP/CONT and had the same results.

Can you rerun your test with the gocql_debug build tag?

Zariel avatar Aug 28 '17 15:08 Zariel

I think kill -9 is not an accurate simulation; @balta2ar's simulation seems closer to the issue I see.

zhixinwen avatar Feb 14 '18 00:02 zhixinwen

Do you know of a way to replicate on OSX or reliably for regression tests?

Zariel avatar Feb 14 '18 15:02 Zariel

When this error occurs I have had to cycle the container, as there is no good way to reconnect after this event. There must be a better way, but this error has caused so many problems.

robdefeo avatar Feb 24 '18 22:02 robdefeo

Seems that setting ClusterConfig.ReconnectInterval > 0 has no effect on this.

engin avatar Jun 11 '18 18:06 engin

+1

Any fixes for this issue? On single-node Cassandra this issue is consistently reproducible.

@Zariel

shettyh avatar Feb 21 '19 09:02 shettyh

Constantly seeing this issue on a single-node Cassandra cluster. cqlsh works fine 100% of the time, while gocql seems to take a long time connecting to the cluster, often resulting in the "gocql: no hosts available in the pool" error.

steebchen avatar Jun 01 '19 20:06 steebchen

We have this problem too, and have had it since at least October 2018. I had to put in special code to detect this state and cause the application to be recycled. Our apps call gocql under very heavy load (millions of requests an hour). For a long time the recycle occurred once a month or so, but loads have increased recently and it now happens every day or even more often. We have 4 instances running; it is not always all 4 instances at the same time. Out of nowhere, with no preceding errors logged, we suddenly get this in a running instance. The only fix is to restart the instance. Occasionally other instances will shortly run into the problem too and need restarting. After a restart there is no problem at all connecting to the nodes.

The main problem we have is that during the detection and restart period there is a significant slowdown in processing, as the instance is unavailable.

rlk833 avatar Jun 05 '19 19:06 rlk833
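The detect-and-recycle workaround described above can be sketched as a small watchdog. This is illustrative application-side code, not part of gocql; the type, threshold, and the locally redefined sentinel (mirroring gocql's exported ErrNoConnections, whose message is "gocql: no hosts available in the pool") are assumptions for the sketch:

```go
package main

import (
	"errors"
	"fmt"
)

// errNoConnections mirrors gocql's ErrNoConnections sentinel; redefined
// locally so this sketch is self-contained.
var errNoConnections = errors.New("gocql: no hosts available in the pool")

// poolWatchdog counts consecutive "no hosts available" errors and signals
// when the session (or process) should be torn down and rebuilt.
type poolWatchdog struct {
	consecutive int
	threshold   int
}

// Observe feeds a query error into the watchdog and returns true when the
// error streak is long enough that recycling is the only way out.
func (w *poolWatchdog) Observe(err error) (recycle bool) {
	if errors.Is(err, errNoConnections) {
		w.consecutive++
	} else {
		w.consecutive = 0 // any success or different error resets the streak
	}
	return w.consecutive >= w.threshold
}

func main() {
	w := &poolWatchdog{threshold: 3}
	fmt.Println(w.Observe(errNoConnections)) // false
	fmt.Println(w.Observe(nil))              // false (streak reset)
	w.Observe(errNoConnections)
	w.Observe(errNoConnections)
	fmt.Println(w.Observe(errNoConnections)) // true
}
```
In a real application, a `true` result would trigger Session.Close followed by creating a fresh session from the original ClusterConfig, rather than restarting the whole process.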

I ran into a similar issue a couple of days ago and followed @balta2ar's instructions to try to recreate the situation.

And surprisingly gocql reconnected after removing the rule (sudo iptables -D OUTPUT 1).

Here are the logs:

{"msg":"Error reading from storage (gocql: no response received from cassandra within timeout period)","appname":"kask","time":"2020-05-21T18:05:40-04:00","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:05:47-04:00","level":"WARNING"}
{"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"kask","time":"2020-05-21T18:05:47-04:00","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
{"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"kask","time":"2020-05-21T18:05:50-04:00","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
{"msg":"Error reading from storage (gocql: no hosts available in the pool)","appname":"kask","time":"2020-05-21T18:05:51-04:00","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:05:58-04:00","level":"WARNING"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:06:09-04:00","level":"WARNING"}
{"msg":"gocql: Session.handleNodeDown: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:06:20-04:00","level":"WARNING"}
{"msg":"gocql: Session.handleNodeUp: 127.0.0.1:9042","appname":"kask","time":"2020-05-21T18:06:23-04:00","level":"WARNING"}

Any other ideas on how to replicate this issue?

clarakosi avatar May 21 '20 22:05 clarakosi

Running into this while using gocql to write to AWS Keyspaces. Any updates on this?

SubNader avatar Oct 01 '20 10:10 SubNader

I'm also experiencing this issue extremely frequently with AWS Keyspaces. It appears to be entirely due to the state of the process: other processes can access Keyspaces without issue, and restarting an affected process immediately resolves it. I'm up for helping to debug however I can.

mattmassicotte avatar Oct 07 '20 14:10 mattmassicotte

It seems this happens because of https://github.com/gocql/gocql/blob/390af0fb2915aecc9d18df7afa11501fc2d4c9c5/connectionpool.go#L248, and we then add the host back only at https://github.com/gocql/gocql/blob/390af0fb2915aecc9d18df7afa11501fc2d4c9c5/events.go#L187-L187, which, if I understand correctly, happens only when a new node is starting, but not when a node temporarily lost network connectivity or rejected requests because of overload.

Draal avatar Apr 09 '21 01:04 Draal

There is pool.addHost in reconnectDownedHosts, which periodically re-adds nodes that are marked as down. However, we set the host state to down only when we get such an event from the database, which does not occur if gocql itself lost the connections (or closed them due to errors).

martin-sucha avatar Apr 14 '21 13:04 martin-sucha
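The gap described above can be modeled with a short sketch. These are illustrative types, not gocql's actual ones: the periodic reconnect pass only considers hosts whose state is DOWN, so a host that silently lost all its connections (but was never marked DOWN) is skipped forever:

```go
package main

import "fmt"

// host is an illustrative stand-in for gocql's host metadata.
type host struct {
	addr       string
	markedDown bool // set only when the database reports the node down
	conns      int
}

// reconnectDownedHosts models one tick of the periodic reconnect loop:
// it returns the addresses that would get a redial attempt.
func reconnectDownedHosts(hosts []*host) []string {
	var redial []string
	for _, h := range hosts {
		if h.markedDown { // only DOWN hosts are considered
			redial = append(redial, h.addr)
		}
	}
	return redial
}

func main() {
	hosts := []*host{
		{addr: "10.0.0.1", markedDown: true, conns: 0},  // reported down: redialed
		{addr: "10.0.0.2", markedDown: false, conns: 0}, // lost conns silently: skipped
	}
	fmt.Println(reconnectDownedHosts(hosts)) // [10.0.0.1]
}
```
The second host has no connections but is never redialed, which is exactly the stuck state the thread describes.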

I'm trying to reproduce this using docker-compose on a Mac. I'm using the alpine image for my client, installed iptables via apk --update add iptables, then added

    cap_add:
      - NET_ADMIN

to my docker-compose.yaml so I can use iptables inside the container. I have a small Go program that does SELECT requests in a loop.
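A small Go program of that kind might look roughly like this (this is a reconstruction, not the original test program; the contact point "cassandra" is assumed to be the compose service name, and the keyspace, table, and column names are made up):

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// "cassandra" is assumed to be the docker-compose service name;
	// keyspace and table are illustrative.
	cluster := gocql.NewCluster("cassandra")
	cluster.Keyspace = "test"
	cluster.Timeout = 2 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer session.Close()

	// Issue one SELECT per second forever, logging success or failure,
	// so pool recovery (or the lack of it) is visible in the output.
	for {
		var value string
		err := session.Query(
			`SELECT value FROM kv WHERE key = ? LIMIT 1`, "k",
		).Scan(&value)
		if err != nil {
			log.Printf("SELECT failed: %v", err)
		} else {
			log.Printf("SELECTed: %q", value)
		}
		time.Sleep(time.Second)
	}
}
```
This sketch needs a live Cassandra node to run; it exists only to make the log excerpts below easier to map back to code.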

When I execute iptables -I OUTPUT -p tcp --dport 9042 -j DROP, this happens:

client_1     | 2021/04/22 15:08:22 SELECTed: "data"
client_1     | 2021/04/22 15:08:23 SELECTed: "data"
client_1     | 2021/04/22 15:08:25 SELECT failed: gocql: no response received from cassandra within timeout period
client_1     | 2021/04/22 15:08:26 gocql: Session.handleNodeDown: 172.22.0.2:9042
client_1     | 2021/04/22 15:08:26 SELECT failed: gocql: connection closed waiting for response
client_1     | 2021/04/22 15:08:26 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:08:27 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:08:28 gocql: Session.handleNodeDown: 172.22.0.2:9042
client_1     | 2021/04/22 15:08:28 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:08:29 SELECT failed: gocql: no hosts available in the pool

So far, so good. Occasionally, an automatic reconnect is apparently attempted:

client_1     | 2021/04/22 15:09:20 connection failed "172.22.0.2": dial tcp 172.22.0.2:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy

But when I then remove the rule via iptables -D OUTPUT 1, gocql recovers immediately:

client_1     | 2021/04/22 15:10:28 SELECT failed: gocql: no hosts available in the pool
client_1     | 2021/04/22 15:10:29 gocql: Session.handleNodeDown: 172.22.0.2:9042
client_1     | 2021/04/22 15:10:29 gocql: Session.handleNodeUp: 172.22.0.2:9042
client_1     | 2021/04/22 15:10:29 SELECTed: "data"
client_1     | 2021/04/22 15:10:30 SELECTed: "data"

However, I know that the issue is not entirely fixed, because we recently had an outage due to it. Now I don't know how to reproduce it.

mrwonko avatar Apr 22 '21 15:04 mrwonko

Correction: I had been testing the outdated commit 1418e1105e72. With HEAD (909f2a77f46e) I can reproduce the issue using the instructions above; it looks like the first commit that reintroduced the issue was c80d1b329030. I'll see if I can understand the issue and fix it.

mrwonko avatar Apr 26 '21 13:04 mrwonko

I'm fairly sure the regression is connected to the removal of this call to session.handleNodeUp() in controlConn.setupConn(), as I can reproduce the issue if I take 1418e11 and just remove that line. But I don't know the code well enough to fix it. I tried simply adding it back, but that didn't seem to do anything on HEAD, and randomly undoing changes I don't understand is not a good way to fix this.

I'd appreciate if somebody with a better understanding of the code could take over from here.

mrwonko avatar Apr 26 '21 18:04 mrwonko

I tried to reproduce using a similar docker-compose setup as above, and it seems that 1418e1105e720e09744dd50f9b21aed3911bf434 reconnects much more quickly than 909f2a77f46e7e16f59959bfce1f7bdcc4bb28c5. The latter reconnects up to about a minute after I delete the iptables rule, apparently because of ClusterConfig.ReconnectInterval.

Is that what you observed, or did you reproduce a case where the driver did not reconnect at all?

martin-sucha avatar Apr 30 '21 13:04 martin-sucha

I don't think I waited a full minute for it to reconnect; I probably canceled my test before then.

mrwonko avatar Apr 30 '21 13:04 mrwonko

Just hit this. Did a rolling restart of the C* cluster, and the service started throwing a lot of "gocql: no hosts available in the pool" errors. Had to restart the application. I am on the latest version, I think: gocql v0.0.0-20210425135552-909f2a77f46e.

maxstepanov avatar May 03 '21 17:05 maxstepanov

Has anyone faced this after #1555 was merged on May 13? I just encountered it using a version before that patch, and am upgrading now to see if I face it again. The issue is rare enough for me that it sometimes takes weeks to manifest.

raidancampbell avatar Jul 16 '21 17:07 raidancampbell

This issue is easily reproducible per my notes on https://github.com/yugabyte/yugabyte-db/issues/10182

I was able to reproduce this with a Yugabyte operator cluster running on k8s with a simple scale-down of all nodes and back up. The problem is exactly as described here and on the other issue: the connection pool remains down indefinitely, and the original host names are never re-resolved to pick up new IP addresses.

justinfx avatar Nov 01 '21 09:11 justinfx

Hi,

We get the problem too.

Vault: 1.8.2 Cassandra: 3.11.6

When we try to generate some credentials: failed to roll back user v_kube_xxxxxx: gocql: no hosts available in the pool

No error/warn/info in the Vault logs about this, and no logs on Cassandra either. Firewall OK, netcat OK. Only the "reset connection" call in Vault gets us back to operational.

https://www.vaultproject.io/api/secret/databases#reset-connection

Any updates ?

mouglou avatar Jan 18 '22 17:01 mouglou

So far I have had to use an application-side hack to work around the issue, but it still isn't 100%. I'm using the Yugabyte forked client, which should have the same upstream problem: https://github.com/yugabyte/yugabyte-db/issues/10182#issuecomment-993792821

justinfx avatar Jan 18 '22 18:01 justinfx

@mouglou Vault 1.8.2 uses an old version of gocql (https://github.com/hashicorp/vault/blob/main/go.mod#L53). There is a PR open to bump the version, https://github.com/hashicorp/vault/pull/12879, which would resolve https://github.com/hashicorp/vault/issues/12878, but it has not been merged yet. The fix for the issue you described was merged in gocql here: https://github.com/gocql/gocql/commit/312a614907c8d2e19f6b14aff16015436cb78423

player1-github avatar Jan 18 '22 18:01 player1-github

@player1-github Thanks for your feedback, much appreciated! Will follow that PR then.

mouglou avatar Jan 18 '22 21:01 mouglou