scylla-ccm `cluster.populate(n).start()` does not wait for nodes to start

`cluster.populate(n).start()` does not wait for nodes to start

Open kbr- opened this issue 5 years ago • 1 comments

and in effect some dtests are failing.

e.g. with the following setup: scylla-ccm https://github.com/scylladb/scylla-ccm/commit/d1e62bacaef834f827b5bd5655c5900f0eef8221 (next and master at the moment of writing this issue) + scylla-dtest https://github.com/scylladb/scylla-dtest/commit/cd83fd8f8956befd1af256b91dee6456093413c0 (next and master at the moment of writing this issue)

running:

nosetests -v -s concurrent_schema_changes_test.py:TestConcurrentSchemaChanges.snapshot_test

gives:

2020-01-31 15:08:22.142366 going to run tests sequentially
2020-01-31 15:08:22.142422 using the SingleClusterIdAllocator
snapshot_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges) ... 2020-01-31 15:08:22.143590 cluster ccm directory: /home/kbraun/.dtest/dtest-2k9Sq9
2020-01-31 15:08:22.143644 Starting Scylla cluster from directory /home/kbraun/dev/scylla-dtest/../scylla/build/dev
2020-01-31 15:08:22.189404 snapshot_test()
ERROR
2020-01-31 15:08:28.350926 Test failed with exception: <class 'cassandra.cluster.NoHostAvailable'>: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
2020-01-31 15:08:28.354925 stopping ccm cluster test at: /home/kbraun/.dtest/dtest-2k9Sq9

======================================================================
ERROR: snapshot_test (concurrent_schema_changes_test.TestConcurrentSchemaChanges)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kbraun/dev/scylla-dtest/concurrent_schema_changes_test.py", line 499, in snapshot_test
    session = self.cql_connection(node1)
  File "/home/kbraun/dev/scylla-dtest/dtest.py", line 704, in cql_connection
    protocol_version, port=port, ssl_opts=ssl_opts, **kwargs)
  File "/home/kbraun/dev/scylla-dtest/dtest.py", line 742, in _create_session
    session = cluster.connect()
  File "cassandra/cluster.py", line 1335, in cassandra.cluster.Cluster.connect
    with self._lock:
  File "cassandra/cluster.py", line 1371, in cassandra.cluster.Cluster.connect
    raise
  File "cassandra/cluster.py", line 1358, in cassandra.cluster.Cluster.connect
    self.control_connection.connect()
  File "cassandra/cluster.py", line 2896, in cassandra.cluster.ControlConnection.connect
    self._set_new_connection(self._reconnect_internal())
  File "cassandra/cluster.py", line 2939, in cassandra.cluster.ControlConnection._reconnect_internal
    raise NoHostAvailable("Unable to connect to any servers", errors)
NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
-------------------- >> begin captured logging << --------------------
dtest: DEBUG:  - going to run tests sequentially
dtest: DEBUG:  - using the SingleClusterIdAllocator
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - cluster ccm directory: /home/kbraun/.dtest/dtest-2k9Sq9
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - Starting Scylla cluster from directory /home/kbraun/dev/scylla-dtest/../scylla/build/dev
dtest: DEBUG: concurrent_schema_changes_test.TestConcurrentSchemaChanges.snapshot_testsnapshot_test - snapshot_test()
cassandra.cluster: WARNING: Cluster.__init__ called with contact_points specified, but load-balancing policies are not specified in some ExecutionProfiles. In the next major version, this will raise an error; please specify a load-balancing policy. (contact_points = ['127.0.0.1'], EPs without explicit LBPs = ('EXEC_P
ROFILE_DEFAULT',))
cassandra.cluster: WARNING: [control connection] Error connecting to 127.0.0.1:9042:
Traceback (most recent call last):
  File "cassandra/cluster.py", line 2928, in cassandra.cluster.ControlConnection._reconnect_internal
    return self._try_connect(host)
  File "cassandra/cluster.py", line 2950, in cassandra.cluster.ControlConnection._try_connect
    connection = self._cluster.connection_factory(host.endpoint, is_control_connection=True)
  File "cassandra/cluster.py", line 1292, in cassandra.cluster.Cluster.connection_factory
    return self.connection_class.factory(endpoint, self.connect_timeout, *args, **kwargs)
  File "cassandra/connection.py", line 471, in cassandra.connection.Connection.factory
    conn = cls(endpoint, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/cassandra/io/libevreactor.py", line 267, in __init__
    self._connect_socket()
  File "cassandra/connection.py", line 514, in cassandra.connection.Connection._connect_socket
    raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror or sockerr))
error: [Errno 111] Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused
cassandra.cluster: ERROR: Control connection failed to connect, shutting down Cluster:
Traceback (most recent call last):
  File "cassandra/cluster.py", line 1358, in cassandra.cluster.Cluster.connect
    self.control_connection.connect()
  File "cassandra/cluster.py", line 2896, in cassandra.cluster.ControlConnection.connect
    self._set_new_connection(self._reconnect_internal())
  File "cassandra/cluster.py", line 2939, in cassandra.cluster.ControlConnection._reconnect_internal
    raise NoHostAvailable("Unable to connect to any servers", errors)
NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1:9042': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 7.293s

FAILED (errors=1)

Here's how the test starts:

    def snapshot_test(self):
        debug("snapshot_test()")
        cluster = self.cluster
        cluster.set_configuration_options(values={'experimental': True})
        cluster.populate(2).start()
        node1, node2 = cluster.nodelist()
        wait(2)
        session = self.cql_connection(node1)

By reading the code and debugging, I arrived at the conclusion that the reason is https://github.com/scylladb/scylla-ccm/commit/83bdc2dcff893728587070aee69db70b88863eab, specifically these lines:

-        if no_wait and not verbose:
-            # waiting 2 seconds to check for early errors and for the
-            # pid to be set
-            time.sleep(2)
-        else:
-            for node, p, mark in started:
-                start_message = "Starting listening for CQL clients"
-                try:
-                    # updated code, scylla starts CQL only by default
-                    # process should not be checked for scylla as the
-                    # process is a boot script (that ends after boot)
-                    node.watch_log_for(start_message, timeout=600,
-                                       verbose=verbose, from_mark=mark)
-                except RuntimeError:
-                    raise Exception("Not able to find start "
-                                    "message '%s' in Node '%s'" %
-                                    (start_message, node.name))

the else branch was causing the wait in older ccm versions (which caused start() to take ~25 seconds longer to finish than it currently does). I guess this is now responsible for the wait:

        if wait_for_binary_proto:
            for node, _, mark in started:
                node.watch_log_for("Starting listening for CQL clients",
                                   verbose=verbose, from_mark=mark)

but

waiting for binary proto should probably be different/something more than just waiting for a certain message appearing in the node's logs (see wait_for_binary_interface in ccmlib/node.py)
lots of dtests do cluster.populate(n).start() (so wait_for_binary_proto=False) and probably assume that the cluster is ready after this:

[kbraun@localhost scylla-dtest]$ grep -r "populate.*start()" | wc -l
185

Jan 31 '20 15:01 kbr-

I don't completely understand what's happening though, because e.g. snapshot_test, which fails on my laptop 100% of the time when trying to run dtests using branches scylla-ccm:master and scylla-dtest:master, sometimes passes on BYO: https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/909/consoleFull

Jan 31 '20 15:01 kbr-

scylla-ccm scylla-ccm copied to clipboard

`cluster.populate(n).start()` does not wait for nodes to start

scylla-ccm
scylla-ccm copied to clipboard