
gocql does not re-resolve DNS names

Open chummydog opened this issue 9 years ago • 29 comments

Using a gocql build from about two weeks ago, I noticed the following issue (we had been using a version from last April and saw the same behavior, even though the code has since changed in this area; you seemed to fix a bug in ring.go). Our application runs in a cloud environment where Cassandra instances can move from node to node, so their IP addresses change, and we use DNS to manage this. In this case we have a single-node Cassandra cluster, and at application startup we pass its DNS name to our application, which hands it to the gocql Session abstraction. All works well until the Cassandra node is restarted and comes back bound to a new IP. The control connection in gocql fails as it notices the connection to the old IP has been closed. At that point the only remedy is to restart our application, because gocql has no way to learn the node's new IP. The issue seems to be that gocql discards the DNS name we passed in. I'm new to gocql, but I can't find a config setting to address this in our application. Any help would be appreciated.
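
The re-resolution being asked for here can be approximated at the application level. The sketch below is an illustration, not gocql API: it re-resolves the DNS name with the standard library and detects when the address set has changed, at which point the application would tear down and recreate its session (that step is left as a comment because it depends on your cluster configuration).

```go
package main

import (
	"fmt"
	"net"
	"sort"
)

// resolveContactPoints re-resolves a DNS name into its current set of IPs.
func resolveContactPoints(host string) ([]string, error) {
	ips, err := net.LookupHost(host)
	if err != nil {
		return nil, err
	}
	sort.Strings(ips) // stable order so address sets can be compared
	return ips, nil
}

// addrsChanged reports whether two sorted address sets differ.
func addrsChanged(old, cur []string) bool {
	if len(old) != len(cur) {
		return true
	}
	for i := range old {
		if old[i] != cur[i] {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical scenario: the node restarted and came back on a new IP.
	old := []string{"10.48.0.54"}
	cur := []string{"10.48.0.56"}
	if addrsChanged(old, cur) {
		// Close the old gocql session here and build a new one from the
		// freshly resolved IPs (details depend on your ClusterConfig).
		fmt.Println("contact points changed; rebuilding session")
	}
}
```

Running such a check on a timer, or whenever queries start failing, would let the application recover without a full restart.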

chummydog avatar Nov 15 '16 14:11 chummydog

Any follow-up on this issue? We're having a similar problem when running in Kubernetes services. The IP of the Cassandra pod changes slightly (from 10.48.0.54 to 10.48.0.56) after the pod image is updated to a new version, and when that happens this error is thrown.

kenng avatar Jan 12 '17 02:01 kenng

Same problem when deploying in Kubernetes.

laz2 avatar Jan 30 '17 12:01 laz2

We were frequently getting this error, but for us it turned out to be Cassandra-related, not gocql-related. Our single-instance Cassandra cluster was frequently hitting stop-the-world garbage collection for hundreds of milliseconds at a time. We optimized our application's database operations, tuned the Cassandra JVM settings, changed the GC mechanism (CMS -> G1GC), lengthened the gocql timeouts (600ms -> 5000ms), and gave Cassandra more RAM. GC stoppages are now much less frequent and much shorter, and we no longer see this gocql error. Not sure if that helps any of you hitting the same error, but you may want to check your Cassandra system.log and/or gc.log to see whether no hosts are available because Cassandra is not responding to events.

jdness avatar Jan 30 '17 17:01 jdness

TL;DR: In K8s (or equivalent), you should be using Pet Sets (or equivalent) with a stable hostname and linked volumes for your stateful services, like Cassandra. Basically this: http://blog.kubernetes.io/2016/07/thousand-instances-of-cassandra-using-kubernetes-pet-set.html


Shifting hosts under DNS is sort of an anti-pattern in C*, because Cassandra relies on concrete, addressable targets in order to gossip about and maintain cluster state. Cassandra is a stateful service in constant communication with its peers about their (and their neighbors') individual states. It's unreasonable to expect the state of your local DNS and TTLs to propagate in perfect sync with the state of an arbitrary number of nodes in a distributed system.

Additionally, C* clients attempt to establish connections with all (or many) nodes in the cluster, not just the one(s) you provide in ClusterConfig. This allows clients to intelligently make decisions about query load balancing and cluster availability. Recall that Cassandra has no master nodes so all nodes are equally available to serve queries. The mantra is "no single point of failure" and, as you've discovered, DNS can be a single point of failure.

This problem is not really related to gocql in particular. I believe you'd find the same problem in any of the other stable Cassandra drivers because of how Cassandra and any client driver is (and must be) designed.


Regarding GC and high load: Running a single node of Cassandra doesn't really make sense, but I understand if you're evaluating it from a development standpoint. (Even so, a small 3-node cluster will help you familiarize yourself with consistency levels and replication.)

Long GC STW pauses are a strong sign that your "cluster" is overloaded. Tuning Cassandra and the JVM (especially the heap) in proportion to your container's allocated CPU and memory is usually necessary in any case.

Some reading on GC and C* tuning:

  • http://stackoverflow.com/questions/21992943/persistent-gc-issues-with-cassandra-long-app-pauses
  • https://gist.github.com/tobert/ea9328e4873441c7fc34
  • https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

robusto avatar Feb 04 '17 01:02 robusto

@robusto Hi, thanks for the explanation.

We run Cassandra in K8s Stateful Sets (following the official example), so clients can use a pod DNS name to connect to the servers. But when some pods are deleted (e.g. because of node migration) while their data volumes stay the same, the client still receives errors about the old IPs.

Is there any way to solve this problem?

idealhack avatar Aug 03 '17 05:08 idealhack

Can you build with the gocql_debug tag and provide the logs of the hosts being discovered? Cassandra should notify the driver that the node went down, then that another one came up.

Zariel avatar Aug 03 '17 12:08 Zariel

@Zariel Thank you.

Output from nodetool status, which is latest and correct:

Datacenter: dev
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns (effective)  Host ID                               Rack
DN  10.244.3.66  273.71 KiB  32           100.0%            57e4117c-db4a-4eeb-b51a-f24edf8da8a4  test
UN  10.244.2.79  2.29 GiB   32           100.0%            3f70d9f9-4803-46a7-b8af-41dff9b5a527  test
UN  10.244.4.47  174.86 KiB  32           100.0%            1768dbdd-addd-442d-ab8c-fb52b126307d  test

Output from gocql:

2017/08/03 21:31:48 gocql: Session.handleNodeUp: 10.244.2.79:9042
2017/08/03 21:31:50 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/03 21:31:50 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/03 21:31:51 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/03 21:31:52 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/03 21:31:52 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: getsockopt: no route to host
2017/08/03 21:31:52 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/03 21:31:54 unable to dial "10.244.4.34": dial tcp 10.244.4.34:9042: i/o timeout
2017/08/03 21:31:54 gocql: Session.handleNodeDown: 10.244.4.34:9042
2017/08/03 21:31:56 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/03 21:31:56 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/03 21:31:58 unable to dial "10.244.3.66": dial tcp 10.244.3.66:9042: i/o timeout
2017/08/03 21:31:58 gocql: Session.handleNodeDown: 10.244.3.66:9042
2017/08/03 21:31:58 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/03 21:31:59 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/03 21:32:00 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/03 21:32:00 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/03 21:32:01 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/03 21:32:01 gocql: Session.handleNodeUp: 10.244.3.57:9042
2017/08/03 21:32:01 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/03 21:32:03 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: i/o timeout
2017/08/03 21:32:03 gocql: Session.handleNodeUp: 10.244.4.34:9042
2017/08/03 21:32:03 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/03 21:32:05 unable to dial "10.244.4.34": dial tcp 10.244.4.34:9042: i/o timeout
2017/08/03 21:32:05 gocql: Session.handleNodeDown: 10.244.4.34:9042
2017/08/03 21:32:05 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/03 21:32:05 gocql: Session.handleNodeUp: 10.244.4.36:9042
2017/08/03 21:32:07 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/03 21:32:07 gocql: Session.handleNodeUp: 10.244.3.66:9042
2017/08/03 21:32:07 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/03 21:32:07 unable to dial "10.244.3.66": dial tcp 10.244.3.66:9042: getsockopt: no route to host
2017/08/03 21:32:07 gocql: Session.handleNodeUp: 10.244.2.79:9042

And output from C++ client in the same cluster (if it helps):

1501767219.595 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.57 because of the following error: Connect error 'host is unreachable'
1501767219.597 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.57 because of the following error: Connect error 'host is unreachable'
1501767219.600 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.6.10 because of the following error: Connection timeout
1501767219.602 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.6.10 because of the following error: Connection timeout
1501767219.967 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.5.5 because of the following error: Connection timeout
1501767219.970 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.5.5 because of the following error: Connection timeout
1501767220.105 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.66 because of the following error: Connect error 'host is unreachable'
1501767220.105 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.66 because of the following error: Connect error 'host is unreachable'
1501767220.120 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.4.36 because of the following error: Connect error 'host is unreachable'
1501767220.120 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.4.36 because of the following error: Connect error 'host is unreachable'
1501767220.567 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.4.34 because of the following error: Connect error 'host is unreachable'
1501767249.586 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.4.34 because of the following error: Connect error 'host is unreachable'
1501767249.588 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.4.34 because of the following error: Connect error 'host is unreachable'
1501767249.653 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.57 because of the following error: Connect error 'host is unreachable'
1501767249.654 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.57 because of the following error: Connect error 'host is unreachable'
1501767250.165 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.66 because of the following error: Connect error 'host is unreachable'
1501767250.165 [WARN] (src/pool.cpp:392:virtual void cass::Pool::on_close(cass::Connection*)): Connection pool was unable to reconnect to host 10.244.3.66 because of the following error: Connect error 'host is unreachable'

I reproduced this by following the steps described in my last comment.

idealhack avatar Aug 03 '17 13:08 idealhack

I'm assuming node 10.244.5.5 goes down for node migration here:

2017/08/03 21:31:50 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/03 21:31:50 gocql: Session.handleNodeDown: 10.244.5.5:9042

Then it returns with a new Kubernetes-assigned IP address here:

2017/08/03 21:31:58 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/03 21:31:59 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout

The question is: does C* send the UP node event with the original IP address or the new one?

This log suggests C* sends the original IP address in the UP node event (which kinda makes sense). If so, gocql uses the HostInfo from the existing ring entry when attempting to dial the node; hence the dial errors.

Perhaps gocql receives a MOVED_NODE event during this time but doesn't handle it. If we did, we could refresh the ring and connect to the updated IP address?

@idealhack are you able to reproduce with the gocql_debug compile flag enabled? Enabling that flag logs all the events received during the Kubernetes node migration; that will give us more information.

If you are using K8s StatefulSets, then when the pod returns it should have the same IP address as before, and gocql should have no trouble reconnecting to the node. Can you confirm the C* pod returns with the same IP address?
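
The failure mode described above can be sketched with a toy ring cache (an illustration of the idea, not gocql's actual data structures): if entries are matched only by IP address, a node that returns under a new address never updates its old entry, whereas matching by host ID lets the cache refresh the address in place.

```go
package main

import "fmt"

// hostInfo is a toy stand-in for a driver's per-node record.
type hostInfo struct {
	hostID string
	addr   string
	up     bool
}

// ring is keyed by host ID, so an event carrying a known host ID but a
// new address updates the existing entry instead of keeping a stale IP.
type ring map[string]*hostInfo

func (r ring) handleNodeUp(hostID, addr string) {
	if h, ok := r[hostID]; ok {
		h.addr = addr // node moved: refresh the address in place
		h.up = true
		return
	}
	r[hostID] = &hostInfo{hostID: hostID, addr: addr, up: true}
}

func main() {
	r := ring{}
	r.handleNodeUp("742cf6aa", "10.244.5.5") // pod's original IP
	r.handleNodeUp("742cf6aa", "10.244.9.5") // pod rescheduled, new IP
	fmt.Println(r["742cf6aa"].addr)          // prints "10.244.9.5"
}
```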

thrawn01 avatar Aug 03 '17 15:08 thrawn01

@thrawn01 Thank you.

I'm convinced that a pod's IP should not change when it crashes and restarts, but it will change when the pod is deleted and another comes up, which is the situation I reproduced, with the gocql_debug flag enabled.

The logs above were written after that deletion and recreation; moreover, the clients were also restarted, so I thought the old IPs should never appear in the client logs.

As I said, this leads me to one explanation: the old IPs were stored on disk (and were not removed after the pod was deleted). If so, is there a way to avoid this?

I will try to get more logs covering the moment the pods are deleted.

idealhack avatar Aug 04 '17 04:08 idealhack

I deleted all pods again (without deleting the data on most nodes), and found that the first pod still has those old IPs (I guess it reads them from the previous disk data). I then continued adding new pods.

In the end, nodetool status on the first pod reports:

Datacenter: dev
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens       Owns (effective)  Host ID                               Rack
DN  10.244.4.36  ?          32           35.1%             c1f27d75-489a-4755-93d4-a50398a76233  test
UN  10.244.9.5   72.74 KiB  32           33.2%             fd0c49a4-29d1-4d8b-8641-580bc7673ce5  test
DN  10.244.5.5   ?          32           29.1%             742cf6aa-c046-416e-8bcc-2bc5c13e423a  test
UN  10.244.3.71  180.74 KiB  32           32.7%             a45e7118-ebbe-4c98-90d0-55aeca302baa  test
UN  10.244.2.81  2.35 GiB   32           26.7%             3f70d9f9-4803-46a7-b8af-41dff9b5a527  test
DN  10.244.3.66  ?          32           30.1%             57e4117c-db4a-4eeb-b51a-f24edf8da8a4  test
DN  10.244.4.34  ?          32           34.6%             d0abe073-38b5-46ca-9065-5f82ed0e5372  test
UN  10.244.4.51  70.28 MiB  32           24.5%             1768dbdd-addd-442d-ab8c-fb52b126307d  test
DN  10.244.3.57  ?          32           24.2%             04f42c55-a415-47bf-b99a-546fa9749497  test
DN  10.244.6.10  ?          32           29.7%             425f46be-0a65-4d4b-870c-1a1417f11cce  test

In the meantime, gocql reports (some redundant parts omitted):

2017/08/04 14:39:23 Session.ring:[10.244.3.66:DOWN][10.244.2.81:UP][10.244.5.5:DOWN][10.244.6.10:DOWN][10.244.3.57:DOWN][10.244.4.34:DOWN][10.244.4.47:DOWN][10.244.4.36:DOWN]
2017/08/04 14:39:23 gocql: Session.handleNodeUp: 10.244.3.66:9042
2017/08/04 14:39:25 unable to dial "10.244.3.66": dial tcp 10.244.3.66:9042: i/o timeout
2017/08/04 14:39:25 gocql: Session.handleNodeDown: 10.244.3.66:9042
2017/08/04 14:39:25 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 14:39:27 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 14:39:27 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 14:39:27 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 14:39:28 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/04 14:39:28 gocql: Session.handleNodeUp: 10.244.3.57:9042
2017/08/04 14:39:28 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/04 14:39:30 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: i/o timeout
2017/08/04 14:39:30 gocql: Session.handleNodeUp: 10.244.4.34:9042
2017/08/04 14:39:30 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/04 14:39:32 unable to dial "10.244.4.34": dial tcp 10.244.4.34:9042: i/o timeout
2017/08/04 14:39:32 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 14:39:32 gocql: Session.handleNodeDown: 10.244.4.34:9042
2017/08/04 14:39:34 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 14:39:34 gocql: Session.handleNodeUp: 10.244.4.36:9042
2017/08/04 14:39:34 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 14:39:36 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/04 14:39:36 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/04 14:40:23 Session.ring:[10.244.4.34:DOWN][10.244.4.47:DOWN][10.244.4.36:DOWN][10.244.3.66:UP][10.244.2.81:UP][10.244.5.5:DOWN][10.244.6.10:DOWN][10.244.3.57:DOWN]

...

2017/08/04 14:44:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 14:44:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 14:44:25 gocql: Session.handleNodeUp: 10.244.4.36:9042
2017/08/04 14:44:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 14:44:27 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/04 14:44:27 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 14:44:27 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/04 14:44:28 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 14:44:29 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 14:44:29 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 14:44:30 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/04 14:44:30 gocql: Session.handleNodeUp: 10.244.3.57:9042
2017/08/04 14:44:30 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/04 14:44:32 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: i/o timeout
2017/08/04 14:44:32 gocql: Session.handleNodeUp: 10.244.4.34:9042
2017/08/04 14:44:32 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/04 14:44:34 unable to dial "10.244.4.34": dial tcp 10.244.4.34:9042: i/o timeout
2017/08/04 14:44:34 gocql: Session.handleNodeDown: 10.244.4.34:9042
2017/08/04 14:45:23 Session.ring:[10.244.3.66:UP][10.244.2.81:UP][10.244.5.5:DOWN][10.244.6.10:DOWN][10.244.3.57:DOWN][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:DOWN]

...

2017/08/04 14:52:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 14:52:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 14:52:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 14:52:25 gocql: Session.handleNodeUp: 10.244.4.36:9042
2017/08/04 14:52:26 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/04 14:52:27 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/04 14:52:27 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 14:52:28 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 14:52:28 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 14:52:28 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 14:52:30 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/04 14:52:30 gocql: Session.handleNodeUp: 10.244.3.57:9042
2017/08/04 14:52:30 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/04 14:52:32 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: i/o timeout
2017/08/04 14:52:32 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/04 14:53:23 Session.ring:[10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.3.66:UP][10.244.2.81:UP][10.244.5.5:DOWN][10.244.6.10:DOWN][10.244.3.57:UP]

...

2017/08/04 15:00:23 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:00:25 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:00:25 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:00:25 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 15:00:27 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/04 15:00:27 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/04 15:00:27 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:00:28 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:00:28 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:01:23 Session.ring:[10.244.2.81:UP][10.244.5.5:DOWN][10.244.6.10:UP][10.244.3.57:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.3.66:UP]

...

2017/08/04 15:03:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:03:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:03:25 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:03:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:03:27 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:03:27 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:03:48 gocql: handling frame: [topology_change change=NEW_NODE host=10.244.9.5 port=9042]
2017/08/04 15:03:48 gocql: handling frame: [status_change change=UP host=10.244.9.5 port=9042]
2017/08/04 15:03:49 gocql: dispatching event: &{change:UP host:[10 244 9 5] port:9042}
2017/08/04 15:03:49 gocql: Session.handleNodeUp: 10.244.9.5:9042
2017/08/04 15:04:23 Session.ring:[10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.2.81:UP][10.244.5.5:DOWN][10.244.3.57:UP][10.244.3.66:UP]

...

2017/08/04 15:15:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:15:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:15:25 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:15:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:15:27 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:15:27 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:15:51 gocql: handling frame: [topology_change change=NEW_NODE host=10.244.3.71 port=9042]
2017/08/04 15:15:51 gocql: handling frame: [status_change change=UP host=10.244.3.71 port=9042]
2017/08/04 15:15:52 gocql: dispatching event: &{change:UP host:[10 244 3 71] port:9042}
2017/08/04 15:15:52 gocql: Session.handleNodeUp: 10.244.3.71:9042
2017/08/04 15:16:23 Session.ring:[10.244.2.81:UP][10.244.5.5:DOWN][10.244.3.57:UP][10.244.3.66:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP]
2017/08/04 15:16:23 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:16:25 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:16:25 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:16:25 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:16:27 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:16:27 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:17:23 Session.ring:[10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.2.81:UP][10.244.5.5:DOWN][10.244.3.57:UP][10.244.3.66:UP]
2017/08/04 15:17:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:17:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:17:25 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:17:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:17:26 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:17:27 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:18:23 Session.ring:[10.244.3.66:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN]
2017/08/04 15:18:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:18:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:18:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:19:23 Session.ring:[10.244.3.57:UP][10.244.3.66:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP]
2017/08/04 15:19:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:19:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:19:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:20:23 Session.ring:[10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP]
2017/08/04 15:20:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:20:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:20:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:21:23 Session.ring:[10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP]
2017/08/04 15:21:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:21:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:21:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:22:23 Session.ring:[10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP]
2017/08/04 15:22:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:22:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:22:25 gocql: Session.handleNodeDown: 10.244.4.47:9042


2017/08/04 15:23:23 Session.ring:[10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP]
2017/08/04 15:23:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:23:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:23:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:24:23 Session.ring:[10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP]
2017/08/04 15:24:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:24:24 gocql: handling frame: [status_change change=UP host=10.244.4.51 port=9042]
2017/08/04 15:24:25 gocql: dispatching event: &{change:UP host:[10 244 4 51] port:9042}
2017/08/04 15:24:25 gocql: Session.handleNodeUp: 10.244.4.51:9042
2017/08/04 15:24:25 Found invalid peer '[HostInfo connectAddress="<nil>" peer="10.244.4.51" rpc_address="10.244.4.51" broadcast_address="<nil>" port=9042 data_centre="dev" rack="test" host_id="1768dbdd-addd-442d-ab8c-fb52b126307d" version="v3.9.0" state=UP num_tokens=0]' Likely due to a gossip or snitch issue, this host will be ignored
2017/08/04 15:24:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:24:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:25:23 Session.ring:[10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.4.51:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP]
2017/08/04 15:25:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:25:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:25:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:26:23 Session.ring:[10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP][10.244.4.51:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP]
2017/08/04 15:26:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:26:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:26:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:27:23 Session.ring:[10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.3.66:UP][10.244.4.51:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP]
2017/08/04 15:27:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:27:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:27:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:28:23 Session.ring:[10.244.3.66:UP][10.244.4.51:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP][10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN]
2017/08/04 15:28:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:28:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:28:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
2017/08/04 15:29:23 Session.ring:[10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.3.66:UP][10.244.4.51:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP]
2017/08/04 15:29:23 gocql: Session.handleNodeUp: 10.244.4.47:9042
2017/08/04 15:29:25 unable to dial "10.244.4.47": dial tcp 10.244.4.47:9042: i/o timeout
2017/08/04 15:29:25 gocql: Session.handleNodeDown: 10.244.4.47:9042
^C

Hmm, something seems wrong about those UP nodes, right?

I stopped gocql and ran it again; it reports:

2017/08/04 15:38:12 unable to dial "10.244.3.66": dial tcp 10.244.3.66:9042: i/o timeout
2017/08/04 15:38:12 gocql: Session.handleNodeDown: 10.244.3.66:9042
2017/08/04 15:38:12 Found invalid peer '[HostInfo connectAddress="<nil>" peer="10.244.4.51" rpc_address="10.244.4.51" broadcast_address="<nil>" port=9042 data_centre="dev" rack="test" host_id="1768dbdd-addd-442d-ab8c-fb52b126307d" version="v3.9.0" state=UP num_tokens=0]' Likely due to a gossip or snitch issue, this host will be ignored
2017/08/04 15:38:12 gocql: Session.handleNodeUp: 10.244.5.5:9042
2017/08/04 15:38:14 unable to dial "10.244.5.5": dial tcp 10.244.5.5:9042: i/o timeout
2017/08/04 15:38:14 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 15:38:14 gocql: Session.handleNodeDown: 10.244.5.5:9042
2017/08/04 15:38:16 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout
2017/08/04 15:38:16 gocql: Session.handleNodeUp: 10.244.3.71:9042
2017/08/04 15:38:16 gocql: Session.handleNodeDown: 10.244.6.10:9042
2017/08/04 15:38:16 gocql: Session.handleNodeUp: 10.244.3.57:9042
2017/08/04 15:38:18 unable to dial "10.244.3.57": dial tcp 10.244.3.57:9042: i/o timeout
2017/08/04 15:38:18 gocql: Session.handleNodeUp: 10.244.4.34:9042
2017/08/04 15:38:18 gocql: Session.handleNodeDown: 10.244.3.57:9042
2017/08/04 15:38:19 unable to dial "10.244.4.34": dial tcp 10.244.4.34:9042: i/o timeout
2017/08/04 15:38:19 gocql: Session.handleNodeDown: 10.244.4.34:9042
2017/08/04 15:38:19 gocql: Session.handleNodeUp: 10.244.4.36:9042
2017/08/04 15:38:21 unable to dial "10.244.4.36": dial tcp 10.244.4.36:9042: i/o timeout
2017/08/04 15:38:21 gocql: Session.handleNodeUp: 10.244.9.5:9042
2017/08/04 15:38:21 gocql: Session.handleNodeDown: 10.244.4.36:9042
2017/08/04 15:38:21 gocql: Session.handleNodeUp: 10.244.3.66:9042
2017/08/04 15:38:23 unable to dial "10.244.3.66": dial tcp 10.244.3.66:9042: i/o timeout
2017/08/04 15:38:23 gocql: Session.handleNodeUp: 10.244.2.81:9042
to create tables
2017/08/04 15:38:23 gocql: Session.handleNodeDown: 10.244.3.66:9042
2017/08/04 15:39:23 Session.ring:[10.244.6.10:DOWN][10.244.3.57:DOWN][10.244.4.36:DOWN][10.244.3.66:DOWN][10.244.2.81:UP][10.244.5.5:DOWN][10.244.3.71:UP][10.244.4.34:DOWN][10.244.9.5:UP]
2017/08/04 15:39:23 gocql: Session.handleNodeUp: 10.244.6.10:9042
2017/08/04 15:39:25 unable to dial "10.244.6.10": dial tcp 10.244.6.10:9042: i/o timeout

This is more like what nodetool status reports. So does this suggest something is wrong with Session.handleNodeUp?

Also, I found that kubernetes/kubernetes#49618 describes a better way to stop pods. I will change the stateful sets, clean all data, and try adding and deleting pods.

idealhack avatar Aug 04 '17 08:08 idealhack

I think what's going on here is that your Cassandra cluster has stale nodes in gossip. Gocql will get a node up event, then refresh the ring, which returns the down nodes (the system tables do not include gossip state). The question here is how long gocql should keep trying to connect to downed nodes before removing them from the local ring. If you do nodetool removenode <node>, does gocql drop the node from the cache? Gocql's local ring cache should match the output of nodetool status. Something to do would be to add a source to the events, so that when the driver triggers a node up it is apparent and has a reason.

One issue I can see is that the ring describer won't remove nodes, which is what leads to the first logs you posted: https://github.com/gocql/gocql/blob/b96c067a43582b10f95d9e9dabb926483909908a/host_source.go#L663

What issue do you see when the driver is in this state?

Zariel avatar Aug 04 '17 20:08 Zariel

Sorry I'm not that familiar with cassandra nor gocql.

I think the main issue is that gocql somehow reports a ring containing some UP nodes that are actually DOWN. As time goes on, the number of such nodes keeps growing. Eventually:

2017/08/04 15:29:23 Session.ring:[10.244.4.36:UP][10.244.9.5:UP][10.244.3.71:UP][10.244.6.10:UP][10.244.4.34:UP][10.244.4.47:DOWN][10.244.3.66:UP][10.244.4.51:UP][10.244.2.81:UP][10.244.5.5:UP][10.244.3.57:UP]

But according to nodetool status, these nodes were DOWN all along. @Zariel Are you suggesting this is because gocql is not removing these nodes?

Also, I have not tried nodetool removenode.

When I posted the first comment I thought these errors might lead to consistency problems, but it seems consistency is only affected by the consistency level.

idealhack avatar Aug 08 '17 02:08 idealhack

Since gocql does not try to reconnect, what is the best way to handle "gocql: no hosts available in the pool"?

robdefeo avatar Feb 22 '18 15:02 robdefeo

@robdefeo We use gocql for our analytics engine at Mailgun. We currently restart the service once every 2 days to ensure the connection pool is full, and we monitor the pool by emitting metrics on its size. (We modified gocql to achieve this.) This is a temporary solution until we have sufficient time to formulate a full patch to gocql.

If I don’t find time for working on a patch this quarter I’ll be very unhappy. This has been a major pain point for us.

thrawn01 avatar May 23 '18 12:05 thrawn01

Hi folks, any updates on this story? We recently had a big outage that seems to be partially related to this error. I'm testing it locally and can see this error message showing up. Basically, what I tried is pausing the Cassandra docker image and restarting it (to mimic the whole Cassandra cluster going down). gocql complains no hosts available in the pool because it doesn't recreate the session in this case. Are there any suggestions for this scenario? Do we need to manually recreate the session? I'm hesitant about this because I suspect this error could happen in other scenarios too, so recreating the session probably won't fit those cases. Could you please confirm? Thanks in advance.

guanw avatar Apr 17 '19 18:04 guanw

I am using gocql in my kubernetes cluster with a 3-node cassandra setup. It works fine. However, if I want to test locally on my machine, I usually use kubectl port-forward xxx to be able to connect to the cassandra cluster:

kubectl port-forward --namespace cassandra service/cassandra 9042:9042

gocql seems to have a problem with that, as it discovers the cluster but apparently wants to connect to the nodes directly:

2019/07/06 13:18:36 gocql: Session.handleNodeUp: 10.42.96.11:9042
2019/07/06 13:18:36 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:18:38 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:18:40 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:18:41 unable to dial "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout
2019/07/06 13:18:41 gocql: Session.handleNodeDown: 10.42.96.11:9042
2019/07/06 13:18:41 Server is running on:
http://localhost:4000
2019/07/06 13:18:41 Playground is available at:
http://localhost:4000/api/playground

10.42.96.11 is the internal IP inside the cluster, which is obviously not reachable from my machine; only localhost:9042 is.

The weird thing is that after ~5 seconds of trying, my application starts, and I can query my Cassandra cluster. I tried setting:

cluster.DisableInitialHostLookup = true
cluster.IgnorePeerAddr = true

That didn't help, though.

Also, after another 20-30 seconds, the node seems to be flapping up and down again:

2019/07/06 13:18:41 Server is running on:
http://localhost:4000
2019/07/06 13:18:41 Playground is available at:
http://localhost:4000/api/playground
2019/07/06 13:19:41 Session.ring:[10.42.96.11:DOWN][127.0.0.1:UP]
2019/07/06 13:19:41 gocql: Session.handleNodeUp: 10.42.96.11:9042
2019/07/06 13:19:42 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:19:43 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:19:45 connection failed "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout, reconnecting with *gocql.ConstantReconnectionPolicy
2019/07/06 13:19:46 unable to dial "10.42.96.11": dial tcp 10.42.96.11:9042: i/o timeout
[10x the same message]

Is there anything I can do to prevent this? Locally, it's totally fine if gocql only connects to a single node; it's just for development purposes, and as I said, gocql works perfectly fine when deployed in the production Kubernetes cluster.

steebchen avatar Jul 11 '19 19:07 steebchen

@steebchen I've not tested your setup; I only use a single node locally for development. However, disabling the initial host lookup will only keep gocql from asking the control node (the first connection) about other nodes in the cluster. It will not keep the control node from telling the client (gocql) about changes to the status of other nodes in the cluster (STATUS_CHANGE events, host UP/DOWN events, etc.). If gocql receives an event about another node from the control node, it will attempt to connect to that node at the address provided in the event. This might be why, after a few seconds, gocql attempts to connect to another node and then warns it's down when it can't connect: it's receiving cluster information about another node and attempting to connect to it. It's annoying, but it shouldn't affect anything. You should be able to send queries through the control node just fine.

thrawn01 avatar Jul 21 '19 04:07 thrawn01
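
Following the explanation above, one way to quiet the peer-dialing noise in a port-forward setup is to whitelist the loopback address so gocql never dials the unreachable pod IPs. This is a config sketch assuming a gocql version that exposes `HostFilter` and `WhiteListHostFilter`; adjust to the version you use:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Local development against `kubectl port-forward`: only 127.0.0.1 is
	// reachable, so tell gocql to ignore every peer address the control
	// connection reports instead of dialing unreachable pod IPs.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Port = 9042
	cluster.DisableInitialHostLookup = true // do not ask the node for its peers
	cluster.HostFilter = gocql.WhiteListHostFilter("127.0.0.1")
	cluster.Timeout = 5 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```

With the host filter in place, UP events for pod IPs are dropped before the driver tries to dial them, so only the forwarded loopback connection is used.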

We also ran into similar issue recently with Cassandra deployed as statefulset in a Kubernetes cluster.

A little detail about our setup: our Kubernetes cluster consists of 5 worker nodes hosting a 3-node Cassandra cluster. We have anti-affinity rules defined for Cassandra, which means all 3 Cassandra nodes run on different Kubernetes worker nodes for high availability.

On the Kubernetes cluster, Cassandra is exposed as a Kubernetes service. Go clients then connect to the Cassandra cluster through this Kubernetes service name, which is essentially a DNS name for the IP address of the running pod.

Now about the issue: the problem is visible whenever a worker node hosting a Cassandra pod goes down. As expected, Kubernetes successfully reschedules that Cassandra pod to a new available worker node, and the pod successfully rejoins the Cassandra cluster.

In the above scenario, the Cassandra pod will come up with a different IP address but under the same DNS name. However, looking at the gocql documentation, there seems to be an assumption that users only pass IP addresses and not DNS names, which is really not possible in such setups: the moment a Cassandra node goes down and is restarted, it comes up with a different IP address but the same DNS name. Could it be that the gocql driver is unable to re-establish the connection because it is still trying against the old IP address?

I feel such an assumption is not appropriate: if the driver is supplied with a DNS name, it should use that name when trying to reconnect.

This has already been resolved in the official Java Cassandra driver by the DataStax team. Here is the ticket for your reference.

So, would you please prioritize this ticket & help with necessary corrections?

sanjimoh avatar Oct 15 '19 04:10 sanjimoh

Okay, I'll try to have a look. Considering we're beginning to work with k8s as well, this may come in handy sooner than expected.

alourie avatar Oct 15 '19 08:10 alourie

@alourie thank you for looking at it!

Is there any timeline for when we could expect a resolution? Unfortunately, it's a critical need for us.

sanjimoh avatar Oct 17 '19 14:10 sanjimoh

@sanjimoh Sorry, it would be hard to give a timeline. I'm finishing up something else first, then will get to it, probably mid-next week. From there it could take some time until I figure this out.

As I said, we need it too, so I won't delay this too much.

alourie avatar Oct 18 '19 04:10 alourie

Hi, did you get a chance to check this now?

sanjimoh avatar Oct 29 '19 14:10 sanjimoh

Not yet, but planning to do it next week.

alourie avatar Oct 30 '19 01:10 alourie

I have some personal circumstances that won't allow me to look at this for a while. Sorry about that.

alourie avatar Nov 13 '19 14:11 alourie

Hi - we are facing the same problem described by @sanjimoh. Is there any resolution for this yet?

vadalikrishna avatar Nov 26 '19 02:11 vadalikrishna

This happened to me today on my local machine when I forgot to close an Iter instance. I haven't run anything on prod with a quorum setup; locally I run with one node.

elbek avatar Jan 02 '20 00:01 elbek

@alourie : Could it be worked on now? If not you, anyone else from the library maintainers?

sanjimoh avatar Jan 02 '20 02:01 sanjimoh

Of the people who have developed their own techniques to work around this, which of the two apparent strategies are you using:

  • The java driver fix which keeps the original cluster hostname around for reconnections, resolving each time it is used
  • The more catastrophic fix: recreate the cluster session from scratch when "the problem" is noticed

?

Are there additional strategies?

cdent avatar Nov 02 '20 13:11 cdent

This is related to #1575, particularly https://github.com/gocql/gocql/issues/1575#issuecomment-933757395

martin-sucha avatar Dec 22 '21 15:12 martin-sucha

I faced this problem too. I solved it by configuring a ConnectObserver on the ClusterConfig to observe connection state; when it detects an issue, I recreate the session.

vikage avatar Nov 24 '22 02:11 vikage