scylla-ccm
scylla-ccm copied to clipboard
`cluster.populate(n).start()` does not wait for nodes to start
and in effect some dtests are failing.
e.g. with the following setup: scylla-ccm https://github.com/scylladb/scylla-ccm/commit/d1e62bacaef834f827b5bd5655c5900f0eef8221 (next and master at the moment of writing this issue) + scylla-dtest https://github.com/scylladb/scylla-dtest/commit/cd83fd8f8956befd1af256b91dee6456093413c0 (next and master at the moment of writing this issue)
running:
nosetests -v -s concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.snapshot_test
gives:
2020-01-31 15:08:22.142366 going to run tests sequentially
2020-01-31 15:08:22.142422 using the SingleClusterIdAllocator
snapshot_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges) ... 2020-01-31 15:08:22.143590 cluster ccm directory: /home/kbraun/.dtest/dtest-2k9Sq9
2020-01-31 15:08:22.143644 Starting Scylla cluster from directory /home/kbraun/dev/scylla-dtest/../scylla/build/dev
2020-01-31 15:08:22.189404 snapshot_test()
ERROR
2020-01-31 15:08:28.350926 Test failed with exception: <class 'cassandra.cluster.NoHostAvailable'>: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
2020-01-31 15:08:28.354925 stopping ccm cluster test at: /home/kbraun/.dtest/dtest-2k9Sq9
======================================================================
ERROR: snapshot_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/kbraun/dev/scylla-dtest/concurrent_schema_changes_test.py", line 499, in snapshot_test
session = self.cql_connection(node1)
File "/home/kbraun/dev/scylla-dtest/dtest.py", line 704, in cql_connection
protocol_version, port=port, ssl_opts=ssl_opts, **kwargs)
File "/home/kbraun/dev/scylla-dtest/dtest.py", line 742, in _create_session
session = cluster.connect()
File "cassandra/cluster.py", line 1335, in cassandra.cluster.Cluster.connect
with self._lock:
File "cassandra/cluster.py", line 1371, in cassandra.cluster.Cluster.connect
raise
File "cassandra/cluster.py", line 1358, in cassandra.cluster.Cluster.connect
self.control_connection.connect()
File "cassandra/cluster.py", line 2896, in cassandra.cluster.ControlConnection.connect
self._set_new_connection(self._reconnect_internal())
File "cassandra/cluster.py", line 2939, in cassandra.cluster.ControlConnection._reconnect_internal
raise NoHostAvailable("Unable to connect to any servers", errors)
NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
-------------------- >> begin captured logging << --------------------
dtest: DEBUG: - going to run tests sequentially
dtest: DEBUG: - using the SingleClusterIdAllocator
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - cluster ccm directory: /home/kbraun/.dtest/dtest-2k9Sq9
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - Starting Scylla cluster from directory /home/kbraun/dev/scylla-dtest/../scylla/build/dev
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - snapshot_test()
cassandra.cluster: WARNING: Cluster.__init__ called with contact_points specified, but load-balancing policies are not specified in some ExecutionProfiles. In the next major version, this will raise an error; please specify a load-balancing policy. (contact_points = ['127.0.0.1'], EPs without explicit LBPs = ('EXEC_P
ROFILE_DEFAULT',))
cassandra.cluster: WARNING: [control connection] Error connecting to 127.0.0.1:9042:
Traceback (most recent call last):
File "cassandra/cluster.py", line 2928, in cassandra.cluster.ControlConnection._reconnect_internal
return self._try_connect(host)
File "cassandra/cluster.py", line 2950, in cassandra.cluster.ControlConnection._try_connect
connection = self._cluster.connection_factory(host.endpoint, is_control_connection=True)
File "cassandra/cluster.py", line 1292, in cassandra.cluster.Cluster.connection_factory
return self.connection_class.factory(endpoint, self.connect_timeout, *args, **kwargs)
File "cassandra/connection.py", line 471, in cassandra.connection.Connection.factory
conn = cls(endpoint, *args, **kwargs)
File "/usr/lib64/python2.7/site-packages/cassandra/io/libevreactor.py", line 267, in __init__
self._connect_socket()
File "cassandra/connection.py", line 514, in cassandra.connection.Connection._connect_socket
raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror or sockerr))
error: [Errno 111] Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused
cassandra.cluster: ERROR: Control connection failed to connect, shutting down Cluster:
Traceback (most recent call last):
File "cassandra/cluster.py", line 1358, in cassandra.cluster.Cluster.connect
self.control_connection.connect()
File "cassandra/cluster.py", line 2896, in cassandra.cluster.ControlConnection.connect
self._set_new_connection(self._reconnect_internal())
File "cassandra/cluster.py", line 2939, in cassandra.cluster.ControlConnection._reconnect_internal
raise NoHostAvailable("Unable to connect to any servers", errors)
NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Ran 1 test in 7.293s
FAILED (errors=1)
Here's how the test starts:
def snapshot_test(self):
debug("snapshot_test()")
cluster = self.cluster
cluster.set_configuration_options(values={'experimental': True})
cluster.populate(2).start()
node1, node2 = cluster.nodelist()
wait(2)
session = self.cql_connection(node1)
By reading the code and debugging, I arrived at the conclusion that the reason is https://github.com/scylladb/scylla-ccm/commit/83bdc2dcff893728587070aee69db70b88863eab, specifically these lines:
- if no_wait and not verbose:
- # waiting 2 seconds to check for early errors and for the
- # pid to be set
- time.sleep(2)
- else:
- for node, p, mark in started:
- start_message = "Starting listening for CQL clients"
- try:
- # updated code, scylla starts CQL only by default
- # process should not be checked for scylla as the
- # process is a boot script (that ends after boot)
- node.watch_log_for(start_message, timeout=600,
- verbose=verbose, from_mark=mark)
- except RuntimeError:
- raise Exception("Not able to find start "
- "message '%s' in Node '%s'" %
- (start_message, node.name))
the else
branch was causing the wait in older ccm versions (which caused start()
to take ~25 seconds longer to finish than it currently does). I guess this is now responsible for the wait:
if wait_for_binary_proto:
for node, _, mark in started:
node.watch_log_for("Starting listening for CQL clients",
verbose=verbose, from_mark=mark)
but
- waiting for binary proto should probably be different/something more than just waiting for a certain message appearing in the node's logs (see
wait_for_binary_interface
in ccmlib/node.py) - lots of dtests do
cluster.populate(n).start()
(sowait_for_binary_proto=False
) and probably assume that the cluster is ready after this:
[kbraun@localhost scylla-dtest]$ grep -r "populate.*start()" | wc -l
185
I don't completely understand what's happening though, because e.g. snapshot_test
, which fails on my laptop 100% of the time when trying to run dtests using branches scylla-ccm:master and scylla-dtest:master, sometimes passes on BYO:
https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/909/consoleFull