
cassandra/cluster.py:4249: in cassandra.cluster.ControlConnection.wait_for_schema_agreement fails in a mixed cluster, even though all nodes report the same schema

Open aleksbykov opened this issue 11 months ago • 3 comments

Base Scylla version: 6.2.3-0.20250119.bff9ddde1283 with build-id 1d62472ad27cbcd54a8de19cc9171ebf3cde0af7
Target Scylla version: 2025.2.0~dev-0.20250327.b292b5800bac with build-id 453890f380e35626a04adbac5f902826ad50550c
Test: opology_operation_during_upgrade_test::TopologyOperationWithMixedCLusterT::test_trigger_snapshot_transfer
Scylla issue: https://github.com/scylladb/scylladb/issues/22535

The Scylla issue reproduces very rarely. The latest run failed when node1 and node2 had been upgraded to the target Scylla version, node3 was stopped, and node4 and node5 still ran the base Scylla version. Upon creating a new keyspace, the wait_for_schema_agreement method failed:

dtest_class.py:217: in create_ks_query
    session.cluster.control_connection.wait_for_schema_agreement(wait_time=120)
cassandra/cluster.py:4219: in cassandra.cluster.ControlConnection.wait_for_schema_agreement
    ???
cassandra/cluster.py:4261: in cassandra.cluster.ControlConnection.wait_for_schema_agreement
    ???
cassandra/cluster.py:4249: in cassandra.cluster.ControlConnection.wait_for_schema_agreement
    ???

After checking the logs of the alive nodes, all of them reached schema agreement: node1

INFO  2025-03-27 06:58:25,789 [shard 0: gms] schema_tables - Creating keyspace ks2
INFO  2025-03-27 06:58:25,793 [shard 0: gms] migration_manager - Gossiping my schema version e1511d08-0ad8-11f0-e064-7b3045669105
INFO  2025-03-27 06:58:25,793 [shard 0: gms] schema_tables - Schema version changed to e1511d08-0ad8-11f0-e064-7b3045669105
INFO  2025-03-27 06:58:26,592 [shard 0:main] raft_group_registry - marking Raft server 5a4fbc1b-df43-4869-a189-b9ecc32bf2b3 as dead for raft groups

node2

INFO  2025-03-27 06:58:25,792 [shard 0: gms] schema_tables - Creating keyspace ks2
INFO  2025-03-27 06:58:25,795 [shard 0: gms] migration_manager - Gossiping my schema version e1511d08-0ad8-11f0-e064-7b3045669105
INFO  2025-03-27 06:58:25,795 [shard 0: gms] schema_tables - Schema version changed to e1511d08-0ad8-11f0-e064-7b3045669105
INFO  2025-03-27 06:58:26,634 [shard 0:main] raft_group_registry - marking Raft server 5a4fbc1b-df43-4869-a189-b9ecc32bf2b3 as dead for raft groups

node4

INFO  2025-03-27 06:58:25,792 [shard 0: gms] schema_tables - Creating keyspace ks2
INFO  2025-03-27 06:58:25,795 [shard 0: gms] migration_manager - Gossiping my schema version e1511d08-0ad8-11f0-e064-7b3045669105
INFO  2025-03-27 06:58:25,796 [shard 0: gms] schema_tables - Schema version changed to e1511d08-0ad8-11f0-e064-7b3045669105
INFO  2025-03-27 06:58:26,666 [shard 0:main] raft_group_registry - marking Raft server 5a4fbc1b-df43-4869-a189-b9ecc32bf2b3 as dead for raft groups

node5

INFO  2025-03-27 06:58:25,792 [shard 0: gms] schema_tables - Creating keyspace ks2
INFO  2025-03-27 06:58:25,795 [shard 0: gms] migration_manager - Gossiping my schema version e1511d08-0ad8-11f0-e064-7b3045669105
INFO  2025-03-27 06:58:25,796 [shard 0: gms] schema_tables - Schema version changed to e1511d08-0ad8-11f0-e064-7b3045669105
INFO  2025-03-27 06:58:26,674 [shard 0:main] raft_group_registry - marking Raft server 5a4fbc1b-df43-4869-a189-b9ecc32bf2b3 as dead for raft groups

Logs from the last run that reproduced the issue: dtest-gw1.log node1.log node2.log node3.log node4.log node5.log

It could be the issue discussed here: https://github.com/scylladb/python-driver/issues/453
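For context, the driver's schema-agreement wait boils down to polling the schema_version reported by system.local and system.peers and declaring agreement once every alive host reports the same version. Below is a simplified, self-contained sketch of that comparison (hypothetical data structures; not the python-driver's actual code), seeded with the version from the logs above:

```python
# Sketch of the agreement check: down hosts are excluded from the
# comparison; agreement holds when all alive hosts share one version.
def check_schema_agreement(local_version, peers):
    """peers: list of (host, schema_version, is_alive) tuples."""
    versions = {local_version}
    for host, version, is_alive in peers:
        if not is_alive:
            continue  # a stopped node (like node3 here) is skipped
        versions.add(version)
    return len(versions) == 1

# Matching the scenario in the logs: node3 is down, everyone else
# gossips the same schema version.
v = "e1511d08-0ad8-11f0-e064-7b3045669105"
peers = [("node2", v, True), ("node3", None, False),
         ("node4", v, True), ("node5", v, True)]
assert check_schema_agreement(v, peers)
```

By this logic the cluster above should agree, which is why the failure points at the driver's handling of the check rather than at an actual schema mismatch.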

aleksbykov avatar Mar 27 '25 10:03 aleksbykov

I think there's a race between these two:

06:58:25,787 629     dtest_class                    DEBUG    dtest_class.py      :211  | test_trigger_snapshot_transfer: CREATE KEYSPACE ks2 WITH replication={'class':'NetworkTopologyStrategy', 'replication_factor':3}

and

06:58:27,604 629     cassandra.cluster              WARNING  threading.py        :1075 | test_trigger_snapshot_transfer: Host 127.0.88.3:9042 has been marked down

Creation of the keyspace completes, but for some reason schema agreement indeed takes more than 2 seconds; by that time the server has been marked down, and an exception is raised because of it:

                except ConnectionShutdown:
                    if self._is_shutdown:
                        log.debug("[control connection] Aborting wait for schema match due to shutdown")
                        return None
                    else:
                        raise  # <-- this one.

I don't think we know (yet) why it takes 2 seconds to complete the schema-agreement process. You might be right that it is slow (most likely), but we don't know why. Clearly, if the node the driver tries to contact for schema agreement is going down, things will be slow (does it time out, close the connections, etc.?). I suspect that without additional traces we won't know why schema agreement is slow on the driver side (I don't see any trace/debug-level logs in that function).
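One way to tolerate this race is to keep retrying the agreement check until the overall deadline instead of letting a single control-connection shutdown propagate. This is a hedged sketch of that idea (the ConnectionShutdown class here is a stand-in for cassandra.connection.ConnectionShutdown, and wait_with_retry is a hypothetical helper, not the merged fix):

```python
import time

class ConnectionShutdown(Exception):
    """Stand-in for cassandra.connection.ConnectionShutdown."""

def wait_with_retry(wait_fn, deadline=120.0, pause=0.2):
    """Call wait_fn until it succeeds or the deadline expires,
    swallowing shutdowns of individual connections along the way."""
    end = time.monotonic() + deadline
    last_exc = None
    while time.monotonic() < end:
        try:
            return wait_fn()
        except ConnectionShutdown as exc:
            last_exc = exc        # host went down mid-check; try again
            time.sleep(pause)
    raise last_exc or TimeoutError("schema agreement not reached")

# Simulated: the first attempt hits a host being marked down,
# the second attempt succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) == 1:
        raise ConnectionShutdown("host marked down")
    return True

assert wait_with_retry(flaky, deadline=5.0, pause=0.01) is True
assert len(attempts) == 2
```

Re-raising the last exception (rather than a bare timeout) preserves the original failure for debugging, which also addresses the "improve the raise with some more debug" point below.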

mykaul avatar Mar 27 '25 11:03 mykaul

Fix is here, please take a look

dkropachev avatar Mar 27 '25 15:03 dkropachev

Fix is here, please take a look

Looks good. Perhaps we can improve the raise with some more debug, but perhaps it's not needed.

mykaul avatar Mar 27 '25 16:03 mykaul