
Coordinator node timed out waiting for replica nodes' responses while reading the `system.scylla_local` table from a newly added node (read load scenario only)

Open juliayakovlev opened this issue 1 year ago • 31 comments

Packages

Scylla version: 6.2.0~dev-20240816.afee3924b3dc with build-id c01d2a55a9631178e3fbad3869c20ef3c8dcf293

Kernel Version: 6.8.0-1013-aws

Issue description

  • [x] This issue is a regression.
  • [ ] It is unknown if this issue is a regression.

NOTE: This problem was not observed in the previous run with 6.2.0~dev-20240809.3745d0a53457 (build-id d168ac61c2ac38ef15c6deba485431e666a8fdef), but it was reproduced twice this week. It happens only with the read load scenario.

Issue scenario

The cluster has 3 nodes and a cassandra-stress read load is running. 3 new nodes are added simultaneously:

  • elasticity-test-ubuntu-db-node-5b6ddc1c-4
  • elasticity-test-ubuntu-db-node-5b6ddc1c-5
  • elasticity-test-ubuntu-db-node-5b6ddc1c-6

Scylla initialization completed on all 3 new nodes at ~2024-08-18 11:07.

A query against the system.scylla_local table was executed on the new node elasticity-test-ubuntu-db-node-5b6ddc1c-4 (10.4.5.206) at 2024-08-18 11:08:41:

< t:2024-08-18 11:08:41,109 f:common.py       l:1843 c:utils                p:DEBUG > Executing CQL 'SELECT value FROM system.scylla_local WHERE key='enabled_features'' ...

Node elasticity-test-ubuntu-db-node-5b6ddc1c-1 entered tablet migration transition state at 2024-08-18 11:09:07:

2024-08-18T11:09:07.147+00:00 elasticity-test-ubuntu-db-node-5b6ddc1c-1     !INFO | scylla[6485]:  [shard 0: gms] raft_topology - entered `tablet migration` transition state

The read on the new node elasticity-test-ubuntu-db-node-5b6ddc1c-4 timed out at 2024-08-18 11:09:11:

2024-08-18 11:09:11.850: (DisruptionEvent Severity.ERROR) period_type=end event_id=4994dcdc-bcaa-4350-b46c-9f3d4a9d47fb duration=35m8s: nemesis_name=GrowShrinkCluster target_node=Node elasticity-test-ubuntu-db-node-5b6ddc1c-3 [52.214.72.215 | 10.4.5.73] errors=Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for system.scylla_local - received only 0 responses from 1 CL=ONE." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0}
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Traceback (most recent call last):
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5213, in wrapper
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     result = method(*args[1:], **kwargs)
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4083, in disrupt_grow_shrink_cluster
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     self._grow_cluster(rack=None)
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4105, in _grow_cluster
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     self.add_new_nodes(count=add_nodes_number, rack=rack)
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 180, in wrapped
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     res = func(*args, **kwargs)
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4022, in add_new_nodes
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     self._wait_for_tablets_balanced(nodes[0])
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5091, in _wait_for_tablets_balanced
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     if not node.raft.is_enabled:
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/usr/local/lib/python3.10/functools.py", line 981, in __get__
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     val = self.func(instance)
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 739, in raft
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     return get_raft_mode(self)
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/raft/__init__.py", line 332, in get_raft_mode
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     return RaftFeature(node) if is_consistent_cluster_management_feature_enabled(session) else NoRaft(node)
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/features.py", line 63, in is_consistent_cluster_management_feature_enabled
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     return CONSISTENT_CLUSTER_MANAGEMENT_FEATURE in get_enabled_features(session)
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/features.py", line 46, in get_enabled_features
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     result = session.execute("SELECT value FROM system.scylla_local WHERE key='enabled_features'").one()
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1836, in execute_verbose
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     return execute_orig(*args, **kwargs)
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "cassandra/cluster.py", line 2729, in cassandra.cluster.Session.execute
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "cassandra/cluster.py", line 5120, in cassandra.cluster.ResponseFuture.result
< t:2024-08-18 11:09:11,852 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > cassandra.ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for system.scylla_local - received only 0 responses from 1 CL=ONE." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0
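
For context, the call chain in the traceback boils down to a single CL=ONE read of `system.scylla_local` issued on the new node. A minimal, self-contained sketch of the same check with the Python cassandra-driver (the contact point and session setup here are illustrative, not the actual SCT code):

```python
# Minimal sketch of the failing check; the real test goes through SCT's
# sdcm.utils.features.get_enabled_features() helper shown in the traceback.
from cassandra import ConsistencyLevel, ReadTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.4.5.206"])  # illustrative: the new node -4
session = cluster.connect()

# system.scylla_local is a node-local table, so a CL=ONE read should be
# served by the coordinator itself; in this run it still timed out.
stmt = SimpleStatement(
    "SELECT value FROM system.scylla_local WHERE key='enabled_features'",
    consistency_level=ConsistencyLevel.ONE,
)
try:
    row = session.execute(stmt).one()
    # 'value' holds a comma-separated list of feature names; the SCT helper
    # checks whether the consistent-cluster-management feature is among them.
    enabled_features = set(row.value.split(",")) if row and row.value else set()
    print(sorted(enabled_features))
except ReadTimeout as exc:  # surfaces as code=1200 in the log above
    print(f"coordinator timed out: {exc}")
finally:
    cluster.shutdown()
```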

Nothing else was running on the new node at that time:

Aug 18 11:08:22.468132 elasticity-test-ubuntu-db-node-5b6ddc1c-4 scylla[6553]:  [shard 0:comp] compaction - [Compact system.raft 2eb9ba80-5d52-11ef-9fa4-459de24b3c1e] Compacted 2 sstables to [/var/lib/scylla/data/system/raft-3e17774c57f539939625327cbafbf5bb/me-3gis_0uxy_2mb292feb1imi9h3by-big-Data.db:level=0]. 1MB to 1MB (~99% of original) in 26ms = 44MB/s. ~256 total partitions merged to 1.
Aug 18 11:09:17.347483 elasticity-test-ubuntu-db-node-5b6ddc1c-4 scylla[6553]:  [shard 6:strm] stream_session - [Stream #0f8c7766-5d52-11ef-9fa4-459de24b3c1e] Write to sstable for ks=keyspace1, cf=standard1, estimated_partitions=5076608, received_partitions=5076296

CQL on elasticity-test-ubuntu-db-node-5b6ddc1c-4 (screenshot: 2024-08-21 10-40-21)

No CL=ONE queries reported at this time (screenshot: 2024-08-21 10-44-30)

Cluster load (screenshot: 2024-08-21 10-35-25)

Impact


How frequently does it reproduce?

It was reproduced twice (second time - https://argus.scylladb.com/test/e5b4605c-4796-4e91-95e0-56dff1dfa341/runs?additionalRuns[]=f1c7994a-232c-4e90-a215-d3e9a6972fb8)

Installation details

Cluster size: 3 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • elasticity-test-ubuntu-db-node-5b6ddc1c-6 (52.209.140.115 | 10.4.4.231) (shards: 7)
  • elasticity-test-ubuntu-db-node-5b6ddc1c-5 (3.250.180.15 | 10.4.7.7) (shards: 7)
  • elasticity-test-ubuntu-db-node-5b6ddc1c-4 (3.255.171.127 | 10.4.5.206) (shards: 7)
  • elasticity-test-ubuntu-db-node-5b6ddc1c-3 (52.214.72.215 | 10.4.5.73) (shards: 7)
  • elasticity-test-ubuntu-db-node-5b6ddc1c-2 (3.252.155.158 | 10.4.6.81) (shards: 7)
  • elasticity-test-ubuntu-db-node-5b6ddc1c-1 (63.34.29.37 | 10.4.5.33) (shards: 7)

OS / Image: ami-0f440d7175113787f (aws: undefined_region)

Test: scylla-master-perf-regression-latency-650gb-elasticity
Test id: 5b6ddc1c-37f3-4473-aa88-ce4ca1a99f36
Test name: scylla-master/perf-regression/scylla-master-perf-regression-latency-650gb-elasticity
Test method: performance_regression_test.PerformanceRegressionTest.test_latency_read_with_nemesis
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 5b6ddc1c-37f3-4473-aa88-ce4ca1a99f36
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 5b6ddc1c-37f3-4473-aa88-ce4ca1a99f36

Logs:

Jenkins job URL Argus

juliayakovlev avatar Aug 21 '24 08:08 juliayakovlev

@gleb-cloudius can you please look into this? Cc @kbr-scylla

bhalevy avatar Aug 21 '24 09:08 bhalevy

Regarding the area/tablets label, I don't see how tablets are related to this issue.

bhalevy avatar Aug 21 '24 09:08 bhalevy

@tgrabiec can tablets logic cause a timeout of local queries?

bhalevy avatar Aug 21 '24 09:08 bhalevy

@tgrabiec can tablets logic cause a timeout of local queries?

I don't see how. Maybe tablet migration overloads the disk for some reason.

tgrabiec avatar Aug 21 '24 09:08 tgrabiec

@gleb-cloudius can you please look into this? Cc @kbr-scylla

What do you want me to look at? I see that all nodes (including the three new ones) were successfully booted. Each took a couple of seconds. Stunning success. Who reads from the local table? The test?

gleb-cloudius avatar Aug 21 '24 09:08 gleb-cloudius

@gleb-cloudius can you please look into this? Cc @kbr-scylla

What do you want me to look at? I see that all nodes (including the three new ones) were successfully booted. Each took a couple of seconds. Stunning success. Who reads from the local table? The test?

yes, test reads

juliayakovlev avatar Aug 21 '24 10:08 juliayakovlev

@gleb-cloudius can you please look into this? Cc @kbr-scylla

What do you want me to look at? I see that all nodes (including the three new ones) were successfully booted. Each took a couple of seconds. Stunning success. Who reads from the local table? The test?

yes, test reads

Test reads from system.scylla_local?

gleb-cloudius avatar Aug 21 '24 10:08 gleb-cloudius

@gleb-cloudius can you please look into this? Cc @kbr-scylla

What do you want me to look at? I see that all nodes (including the three new ones) were successfully booted. Each took a couple of seconds. Stunning success. Who reads from the local table? The test?

yes, test reads

Test reads from system.scylla_local?

Yes. This is the query (from issue description):

Executing CQL 'SELECT value FROM system.scylla_local WHERE key='enabled_features'' ...

juliayakovlev avatar Aug 21 '24 10:08 juliayakovlev

Logs have:

2024-08-18T11:10:09.859+00:00 elasticity-test-ubuntu-db-node-5b6ddc1c-4  !WARNING | scylla[6553]:  [shard 6:stmt] token_metadata - topology version 166 held for 4.898 [s] past expiry, released at: 0x5fa704e 0x5fa7660 0x5fa7968 0x3c8142a 0x3c749e6 0x3c21ae3 0x2f85257 0x2f84573 0x30544ee 0x5aab3af 0x5aac92a 0x5aadb07 0x5ad1440 0x5a6e04a /opt/scylladb/libreloc/libc.so.6+0x97506 /opt/scylladb/libreloc/libc.so.6+0x11b40b

Decoded:

void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:97
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:148
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:181
locator::token_metadata_impl::~token_metadata_impl() at ././locator/token_metadata.cc:1150
std::default_delete<locator::token_metadata_impl>::operator()(locator::token_metadata_impl*) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unique_ptr.h:93
 (inlined by) std::unique_ptr<locator::token_metadata_impl, std::default_delete<locator::token_metadata_impl> >::~unique_ptr() at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unique_ptr.h:398
 (inlined by) locator::token_metadata::~token_metadata() at ././locator/token_metadata.cc:844
seastar::lw_shared_ptr_no_esft<locator::token_metadata>::~lw_shared_ptr_no_esft() at ././seastar/include/seastar/core/shared_ptr.hh:170
 (inlined by) seastar::internal::lw_shared_ptr_accessors_no_esft<locator::token_metadata>::dispose(seastar::lw_shared_ptr_counter_base*) at ././seastar/include/seastar/core/shared_ptr.hh:226
 (inlined by) seastar::lw_shared_ptr<locator::token_metadata const>::~lw_shared_ptr() at ././seastar/include/seastar/core/shared_ptr.hh:324
 (inlined by) locator::effective_replication_map::~effective_replication_map() at ././locator/abstract_replication_strategy.hh:192
 (inlined by) locator::tablet_effective_replication_map::~tablet_effective_replication_map() at ././locator/tablets.cc:687
 (inlined by) seastar::shared_ptr_count_for<locator::tablet_effective_replication_map>::~shared_ptr_count_for() at ././seastar/include/seastar/core/shared_ptr.hh:491
 (inlined by) seastar::shared_ptr_count_for<locator::tablet_effective_replication_map>::~shared_ptr_count_for() at ././seastar/include/seastar/core/shared_ptr.hh:491
seastar::shared_ptr<locator::effective_replication_map const>::~shared_ptr() at ././seastar/include/seastar/core/shared_ptr.hh:569
 (inlined by) service::digest_read_resolver::~digest_read_resolver() at ././service/storage_proxy.cc:4362
seastar::shared_ptr_count_for<service::digest_read_resolver>::~shared_ptr_count_for() at ././seastar/include/seastar/core/shared_ptr.hh:491
 (inlined by) seastar::shared_ptr_count_for<service::digest_read_resolver>::~shared_ptr_count_for() at ././seastar/include/seastar/core/shared_ptr.hh:491
seastar::shared_ptr<service::digest_read_resolver>::~shared_ptr() at ././seastar/include/seastar/core/shared_ptr.hh:569
 (inlined by) service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}::~future() at ././service/storage_proxy.cc:5175
 (inlined by) seastar::continuation<seastar::internal::promise_base_with_type<void>, service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}, seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >::then_wrapped_nrvo<void, service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}>(service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}&, seastar::future_state<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >&&)#1}, seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >::~continuation() at ././seastar/include/seastar/core/future.hh:731
 (inlined by) seastar::continuation<seastar::internal::promise_base_with_type<void>, service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}, seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >::then_wrapped_nrvo<void, service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}>(service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, service::abstract_read_executor::make_digest_requests(seastar::shared_ptr<service::digest_read_resolver>, gms::inet_address*, gms::inet_address*, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::{lambda(seastar::future<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >)#1}&, seastar::future_state<seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >&&)#1}, seastar::rpc::tuple<query::result_digest, long, cache_temperature, std::optional<full_position> > >::run_and_dispose() at ././seastar/include/seastar/core/future.hh:752
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2634
seastar::reactor::run_some_tasks() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:3104
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:3272
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4531
seastar::posix_thread::start_routine(void*) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591
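
(For reference: the decoded trace above can be reproduced offline by resolving the raw `released at:` addresses against a scylla binary with the matching build-id, e.g. with binutils `addr2line`. A rough sketch, where the binary path and the address list are placeholders taken from the warning line:)

```python
# Rough sketch: symbolize the raw addresses from the token_metadata warning
# against a scylla binary with the matching build-id (paths are placeholders).
import subprocess

SCYLLA_BINARY = "./relocatable/scylla"  # assumption: locally extracted binary + debuginfo
ADDRESSES = ["0x5fa704e", "0x5fa7660", "0x5fa7968", "0x3c8142a", "0x3c749e6"]

result = subprocess.run(
    ["addr2line", "-C", "-f", "-i", "-p", "-e", SCYLLA_BINARY, *ADDRESSES],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # one "function at file:line" entry per address
```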

So something holds the read for too long. We are in the middle of a tablet migration while this is happening @tgrabiec.

gleb-cloudius avatar Aug 21 '24 10:08 gleb-cloudius

@juliayakovlev

It was reproduced twice (second time - https://argus.scylladb.com/test/e5b4605c-4796-4e91-95e0-56dff1dfa341/runs?additionalRuns[]=f1c7994a-232c-4e90-a215-d3e9a6972fb8)

Number #10 was the first failure, correct?

It happened on commit afee3924b3dc.

This problem was not observed in previous run with 6.2.0~dev-20240809.3745d0a53457 build-id d168ac61c2ac38ef15c6deba485431e666a8fdef. But was reproduced twice this week. It happens with read load scenario only

If it's a regression between 3745d0a53457 and afee3924b3dc:

$ git log --oneline --first-parent  3745d0a53457..afee3924b3dc
afee3924b3 s3/client: check for "Key" and "Value" tag in "Tag" XML tag
f24f5b7db2 test.py: Fix boost XML conversion to allure when XML file is empty
3407ae5d8f [test.py] Add Junit logger for boost test
6d6f93e4b5 Merge 'test/nodetool: enable running nodetool tests under test/nodetool' from Kefu Chai
f7eb74e31f cql3/statements/create_service_level: forbid creating SL starting with `$`
5ce07e5d84 build: cmake: add compiler-training target
f5f65ead1e Add `.clang-format`, also add CLion build folder to the `.gitignore` file
66d72e010c distributed_loader: Lock table via global table ptr
7e3e5cfcad sstable_directory: Simplify special-purpose local-only constructor
28d3b91cce Merge 'test/perf/perf_sstables: use test_modes as the type of its option' from Kefu Chai
3de4e8f91b Merge 'cql: process LIMIT for GROUP BY select queries' from Paweł Zakrzewski
8c257db283 Merge 'Native reverse pages over RPC' from Łukasz Paszkowski
99be8de71e doc: set 6.1 as the latest stable version
4d77faa61e cql_test_env: ensure shutdown() before stop() for system_keyspace
4ba9cb0036 README: Update the version of C++ to C++23
a3d53bd224 Merge 'Prevent ALTERing non-existing KS with tablets' from Piotr Smaron
d82fd8b5f0 Merge 'Relax sstable_directory::process_descriptor() call graph' from Pavel Emelyanov
607abe96e8 test/sstable: merge test_using_reusable_sst*()
90363ce802 test: Test the SSTable validation API against malformed SSTables
fec57c83e6 Merge 'cell_locker: maybe_rehash: ignore allocation failures' from Benny Halevy
0ae04ee819 build: cmake: use $<CONFIG:cfgs> when appropriate
318278ff92 Merge 'tablets: reload only changed metadata' from Botond Dénes
e18a855abe extensions: Add exception types for IO extensions and handle in memtable write path
75829d75ec replica: Fix race between split compaction and migration
1f4b9a5300 Merge 'compaction: drop compaction executors' possibility to bypass task manager' from Aleksandra Martyniuk
0bb1075a19 Merge 'tasks: fix task handler' from Aleksandra Martyniuk

kbr-scylla avatar Aug 21 '24 11:08 kbr-scylla

@juliayakovlev

It was reproduced twice (second time - https://argus.scylladb.com/test/e5b4605c-4796-4e91-95e0-56dff1dfa341/runs?additionalRuns[]=f1c7994a-232c-4e90-a215-d3e9a6972fb8)

Number #10 was the first failure, correct?

that's right

juliayakovlev avatar Aug 21 '24 11:08 juliayakovlev

@juliayakovlev how can this be reproduced?

WARNING | scylla[6553]: [shard 6:stmt] token_metadata - topology version 166 held for 4.898 [s] past expiry, released at: ...

@denesb are there any commits by you or your team in the list above that could cause a regression by holding up the effective replication map for that long?

bhalevy avatar Aug 21 '24 12:08 bhalevy

Also, regarding service::abstract_read_executor::make_digest_requests above, does that mean this is coming from read repair?

bhalevy avatar Aug 21 '24 12:08 bhalevy

@denesb are there any commits by you or your team in the list above that could cause a regression by holding up the effective replication map for that long?

It is the other way around. The erm is held by a read that hangs.

gleb-cloudius avatar Aug 21 '24 12:08 gleb-cloudius

Also, regarding service::abstract_read_executor::make_digest_requests above, does that mean this is coming from read repair?

No. Why would we repair a local table?

gleb-cloudius avatar Aug 21 '24 12:08 gleb-cloudius

Also, regarding service::abstract_read_executor::make_digest_requests above, does that mean this is coming from read repair?

No. Why would we repair a local table?

A local read wouldn't send a digest request, just a local data request, right? So this belongs to some other, non-local read.

tgrabiec avatar Aug 21 '24 12:08 tgrabiec

Also, regarding service::abstract_read_executor::make_digest_requests above, does that mean this is coming from read repair?

No. Why would we repair a local table?

A local read wouldn't send a digest request, just a local data request, right? So this belongs to some other, non-local read.

Yes, but they still use the digest resolver to do that. The digest resolver sends both data and digest reads.

gleb-cloudius avatar Aug 21 '24 13:08 gleb-cloudius

Maybe the read fails on replica side and we just don't see the cause of failure because it's not logged or propagated?

tgrabiec avatar Aug 21 '24 13:08 tgrabiec

@denesb are there any commits by you or your team in the list above that could cause a regression by holding up the effective replication map for that long?

AFAIK the ERM is held in the storage proxy and, from a quick glance at the list above, nothing seems to change the proxy.

denesb avatar Aug 21 '24 13:08 denesb

Maybe the read fails on replica side and we just don't see the cause of failure because it's not logged or propagated?

This is a read from a local table:

SELECT value FROM system.scylla_local WHERE key='enabled_features'

gleb-cloudius avatar Aug 21 '24 14:08 gleb-cloudius

Reads from a local table should not hold the ERM, BTW.

denesb avatar Aug 21 '24 14:08 denesb

Maybe the read fails on replica side and we just don't see the cause of failure because it's not logged or propagated?

This is a read from a local table:

SELECT value FROM system.scylla_local WHERE key='enabled_features'

I'm talking about the read which held the erm from the continuation attached to make_digest_requests(). That must be a different, non-local read, because a local read, even if it calls make_digest_requests(), would exit immediately and not create any continuation, as the digest replicas are empty.

tgrabiec avatar Aug 21 '24 14:08 tgrabiec

Reads from a local table should not hold the ERM, BTW.

Yes, but this is generic code. We can add a check for the table type somewhere and drop the erm, but the problem here is that something blocks the read. We are just lucky that the erm warning let us know what is happening.

gleb-cloudius avatar Aug 22 '24 07:08 gleb-cloudius

Maybe the read fails on replica side and we just don't see the cause of failure because it's not logged or propagated?

This is a read from a local table: SELECT value FROM system.scylla_local WHERE key='enabled_features'

I'm talking about the read which held the erm from the continuation attached to make_digest_requests(). That must be a different, non-local read, because a local read, even if it calls make_digest_requests(), would exit immediately and not create any continuation, as the digest replicas are empty.

Right. It also uses tablet_effective_replication_map, so this is a read from a tablet table, which the local table is not. @juliayakovlev what other reads is the test doing?

gleb-cloudius avatar Aug 22 '24 07:08 gleb-cloudius

There are a lot of traces there. Not all of them are the same. I also found this one in handle_tablet_split_finalization:

void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:97
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:148
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:181
locator::token_metadata_impl::~token_metadata_impl() at ././locator/token_metadata.cc:1150
std::default_delete<locator::token_metadata_impl>::operator()(locator::token_metadata_impl*) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unique_ptr.h:93
 (inlined by) std::unique_ptr<locator::token_metadata_impl, std::default_delete<locator::token_metadata_impl> >::~unique_ptr() at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/unique_ptr.h:398
 (inlined by) locator::token_metadata::~token_metadata() at ././locator/token_metadata.cc:844
seastar::lw_shared_ptr_no_esft<locator::token_metadata>::~lw_shared_ptr_no_esft() at ././seastar/include/seastar/core/shared_ptr.hh:170
 (inlined by) seastar::internal::lw_shared_ptr_accessors_no_esft<locator::token_metadata>::dispose(seastar::lw_shared_ptr_counter_base*) at ././seastar/include/seastar/core/shared_ptr.hh:226
 (inlined by) seastar::lw_shared_ptr<locator::token_metadata const>::~lw_shared_ptr() at ././seastar/include/seastar/core/shared_ptr.hh:324
 (inlined by) service::topology_coordinator::handle_tablet_split_finalization(service::group0_guard) at ././service/topology_coordinator.cc:1418
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:122
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2634
seastar::reactor::run_some_tasks() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:3104
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:3272
seastar::reactor::run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:3162
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ././main.cc:700
std::function<int (int, char**)>::operator()(int, char**) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591
main at ././main.cc:2230

@tgrabiec

gleb-cloudius avatar Aug 22 '24 08:08 gleb-cloudius

There are a lot of traces there. Not all of them are the same. I also found this one in handle_tablet_split_finalization:

What's the timestamp of that one?

It's strange, because handle_tablet_split_finalization() is called from the topology coordinator, and the topology version shouldn't change concurrently with it, unless we're issuing manual tablet movements via the API. Or there is some kind of delay which makes the coroutine of handle_tablet_split_finalization() release the frame long after the new version set by update_topology_state() takes effect inside update_topology_state(). Maybe update_topology_state() is blocked somewhere after the change is already committed.

tgrabiec avatar Aug 22 '24 08:08 tgrabiec

Maybe the read fails on replica side and we just don't see the cause of failure because it's not logged or propagated?

This is a read from a local table: SELECT value FROM system.scylla_local WHERE key='enabled_features'

I'm talking about the read which held the erm from the continuation attached to make_digest_requests(). That must be a different, non-local read, because a local read, even if it calls make_digest_requests(), would exit immediately and not create any continuation, as the digest replicas are empty.

Right. It also uses tablet_effective_replication_map, so this is a read from a tablet table, which the local table is not. @juliayakovlev what other reads is the test doing?

The test runs cassandra-stress reads with RF=3. It may also read from system_schema (or other system) tables through calls made during nemeses.

juliayakovlev avatar Aug 22 '24 09:08 juliayakovlev

There are a lot of traces there. Not all of them are the same. I also found this one in handle_tablet_split_finalization:

What's the timestamp of that one?

elasticity-test-ubuntu-db-node-5b6ddc1c-1/messages.log:2024-08-18T07:45:20.647+00:00 elasticity-test-ubuntu-db-node-5b6ddc1c-1  !WARNING | scylla[6485]:  [shard 0: gms] token_metadata - topology version 14 held for 12.629 [s] past expiry, released at: 0x5fa704e 0x5fa7660 0x5fa7968 0x3c8142a 0x3c749e6 0x47f5701 0x13fbc4a 0x5aab3af 0x5aac92a 0x5aadb07 0x5aaceb8 0x5a3d2b3 0x5a3c613 0x13a6595 0x13a7f50 0x13a49b3 /opt/scylladb/libreloc/libc.so.6+0x2a087 /opt/scylladb/libreloc/libc.so.6+0x2a14a 0x13a2034

Looks like it is long before the reads, but it is still not clear why it is happening.

gleb-cloudius avatar Aug 22 '24 09:08 gleb-cloudius

Just a guess

Maybe a call to "storage_service/quiesce_topology" could be doing those reads?

I.e. the test makes those calls multiple times to figure out when the tablets are balanced.
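
(A hedged sketch of what such repeated polling could look like; the REST port 10000 and the use of POST for this endpoint are assumptions here, not taken from the test code:)

```python
# Hypothetical polling loop around the storage_service/quiesce_topology
# endpoint; port 10000 and the POST method are assumptions, not from SCT.
import time
import requests

NODE_API = "http://10.4.5.206:10000"  # illustrative: the new node's REST API


def wait_for_quiesced_topology(timeout_s: int = 3600, poll_s: int = 30) -> bool:
    """Return True once quiesce_topology succeeds, i.e. no topology/tablet
    operations (such as migrations) are still in flight."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.post(f"{NODE_API}/storage_service/quiesce_topology", timeout=120)
        if resp.ok:
            return True
        time.sleep(poll_s)
    return False
```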

fruch avatar Aug 25 '24 15:08 fruch

Just a guess

Maybe a call to "storage_service/quiesce_topology" could be doing those reads?

This call doesn't do any reads.

tgrabiec avatar Aug 26 '24 22:08 tgrabiec