python-driver Session might not reconnect if all nodes are restarted at once

Observed in the following test run: https://jenkins.scylladb.com/job/scylla-master/job/next/6139/artifact/testlog/x86_64/dev/topology.test_cluster_features.2.log I'm also attaching the logs to the issue: jenkins.scylladb.com_job_scylla-master_job_next_6139_artifact_testlog_x86_64_dev_topology.test_cluster_features.2.log

Link to the source of the failing test: https://github.com/scylladb/scylladb/blob/3a73048bc9a15bebca78dc89143e4e332fb50645/test/topology/test_cluster_features.py#L150

The test test_downgrade_after_successful_upgrade_fails shuts down all nodes in the cluster, reconfigures them and then starts them again. In the linked run, the test session didn't reconnect after it happened:

14:31:46.962 DEBUG> refresh driver node list
14:31:46.963 DEBUG> [control connection] Refreshing node list and token map
14:31:46.963 DEBUG> [control connection] Error refreshing node list and token map
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3790, in cassandra.cluster.ControlConnection.refresh_node_list_and_token_map
  File "cassandra/cluster.py", line 3816, in cassandra.cluster.ControlConnection._refresh_node_list_and_token_map
  File "cassandra/connection.py", line 1097, in cassandra.connection.Connection.wait_for_responses
cassandra.connection.ConnectionShutdown: Connection <LibevConnection(139961298484496) 127.76.173.27:9042 (closed)> is already closed
14:31:47.003 DEBUG> refresh driver node list
14:31:47.003 DEBUG> [control connection] Refreshing node list and token map
14:31:47.003 DEBUG> [control connection] Error refreshing node list and token map
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3790, in cassandra.cluster.ControlConnection.refresh_node_list_and_token_map
  File "cassandra/cluster.py", line 3816, in cassandra.cluster.ControlConnection._refresh_node_list_and_token_map
  File "cassandra/connection.py", line 1097, in cassandra.connection.Connection.wait_for_responses
cassandra.connection.ConnectionShutdown: Connection <LibevConnection(139961298484496) 127.76.173.27:9042 (closed)> is already closed
14:31:49.004 DEBUG> Error querying host 127.76.173.27:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.27:9042 is closed
14:31:49.004 DEBUG> Defunct or closed connection (139961296996112) returned to pool, potentially marking host 127.76.173.27:9042 as down
14:31:49.004 DEBUG> Shutting down connections to 127.76.173.27:9042
14:31:49.004 DEBUG> Closing connection (139961296996112) to 127.76.173.27:9042
14:31:51.005 DEBUG> Error querying host 127.76.173.21:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.21:9042 is closed
14:31:51.005 DEBUG> Defunct or closed connection (139961297243024) returned to pool, potentially marking host 127.76.173.21:9042 as down
14:31:51.005 DEBUG> Shutting down connections to 127.76.173.21:9042
14:31:51.005 DEBUG> Closing connection (139961297243024) to 127.76.173.21:9042
14:31:51.005 DEBUG> Closing excess connection (139961297729488) to 127.76.173.21:9042
14:31:53.006 DEBUG> Error querying host 127.76.173.6:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.6:9042 is closed
14:31:53.006 DEBUG> Defunct or closed connection (139961297252176) returned to pool, potentially marking host 127.76.173.6:9042 as down
14:31:53.006 DEBUG> Shutting down connections to 127.76.173.6:9042
14:31:53.006 DEBUG> Closing connection (139961297252176) to 127.76.173.6:9042
14:31:53.007 INFO> Driver not connected to 127.76.173.27:9042 yet
14:31:53.007 INFO> Driver not connected to 127.76.173.21:9042 yet
14:31:53.007 INFO> Driver not connected to 127.76.173.6:9042 yet

...

14:32:46.100 INFO> Driver not connected to 127.76.173.27:9042 yet
14:32:46.100 INFO> Driver not connected to 127.76.173.21:9042 yet
14:32:46.101 INFO> Driver not connected to 127.76.173.6:9042 yet
---------------------------- Captured log teardown -----------------------------
14:32:47.140 DEBUG> after_test for test_downgrade_after_successful_upgrade_fails (success: False)

The driver doesn't reconnect automatically within a minute. IPs of the restarted nodes are the same as before

Jun 16 '23 12:06 piodul

I wonder if it's the same thing as https://github.com/scylladb/python-driver/issues/170 - which was supposedly fixed - or maybe it's some different kind of race which only happens when we restart all nodes?

Jun 16 '23 12:06 kbr-scylla

Please we need some traction on this, it affects the stability of the tests.

Nov 22 '23 18:11 kostja

@avelanarius - ping

Feb 12 '24 09:02 mykaul

@sylwiaszunejko is currently working on this issue: https://github.com/scylladb/python-driver/issues/295 which seemed more current (before issues reported by @piodul today).

I'll try to get to this soon.

Feb 12 '24 11:02 avelanarius

python-driver python-driver copied to clipboard

Session might not reconnect if all nodes are restarted at once

python-driver
python-driver copied to clipboard