Error during repair after upgrading node
Installation details
Kernel Version: 5.10.0-17-cloud-amd64
Scylla version (or git commit hash): 5.0.2-20220807.299122e78 with build-id 01a9300a0f113d968f4d199f85240a43ff7f1a5a
Cluster size: 4 nodes (n1-highmem-8)
Scylla Nodes used in this run:
- rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-4 (34.73.147.55 | 10.142.0.121) (shards: 8)
- rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-3 (34.75.185.62 | 10.142.0.120) (shards: 8)
- rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-2 (35.231.33.253 | 10.142.0.116) (shards: 8)
- rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-1 (34.148.82.254 | 10.142.0.115) (shards: 8)
OS / Image: https://www.googleapis.com/compute/v1/projects/debian-cloud/global/images/family/debian-11 (gce: us-east1)
Test: rolling-upgrade-debian11-test
Test id: e1d6ddc9-3cf0-4013-b0d5-c36ab3150245
Test name: scylla-5.1/rolling-upgrade/rolling-upgrade-debian11-test
Test config file(s):
Issue description
After upgrading a node to 5.1.0~rc1-0.20220902.d10aee15e7e9 with build-id c127c717ecffa082ce97b94100d62b2549abe486, we hit an error during repair:
2022-09-05 14:52:01.964: (NodetoolEvent Severity.ERROR) period_type=end event_id=202541d0-a1c5-4801-a719-48f2ac09a2e6 duration=38s: nodetool_command=repair node=rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-1 errors=['Encountered a bad command exit code!', ...] (the full nodetool output and SCT traceback carried by this event are reproduced below)
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2528, in run_nodetool
self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 67, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 654, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: "/usr/bin/nodetool -u cassandra -pw 'cassandra' repair "
Exit code: 2
Stdout:
[2022-09-05 14:51:27,616] Repair session 1
[2022-09-05 14:51:27,617] Repair session 1 finished
[2022-09-05 14:51:38,734] Repair session 2
[2022-09-05 14:51:38,734] Repair session 2 finished
[2022-09-05 14:51:38,740] Starting repair command #3, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)
[2022-09-05 14:51:40,842] Repair session 3
[2022-09-05 14:51:40,842] Repair session 3 finished
[2022-09-05 14:51:40,856] Starting repair command #4, repairing 1 ranges for keyspace scylla_bench (parallelism=SEQUENTIAL, full=true)
[2022-09-05 14:51:56,957] Repair session 4 failed
[2022-09-05 14:51:56,958] Repair session 4 finished
Stderr:
error: Repair job has failed with the error message: [2022-09-05 14:51:56,957] Repair session 4 failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2022-09-05 14:51:56,957] Repair session 4 failed
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:124)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
In the db logs we can see:
Sep 05 14:51:56 rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-1 scylla[146264]: [shard 0] repair - repair[426d1799-bb87-4c00-98c7-4af35827ca7b]: repair_tracker run failed: std::runtime_error ({shard 0: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 1: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 2: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 3: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 4: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 5: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 6: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 7: seastar::rpc::remote_verb_error (seastar::nested_exception)})
This happened during an upgrade test in which we did the following (a rough shell sketch follows the list):
- Upgrade node 1
- Upgrade node 2
- Rollback node 2
- Repair node 1 - this is where the error happened
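For reference, a minimal sketch of that sequence as shell commands. This is illustrative only: the real SCT test performs more steps (draining, config handling, etc.), the apt version strings are taken from this report and may not match the exact .deb versions, and the node hostnames are placeholders.

```bash
# Rough sketch of the rolling-upgrade sequence (assumes an apt-based install).
NEW_VER="5.1.0~rc1-0.20220902.d10aee15e7e9"   # upgrade target (from this report)
OLD_VER="5.0.2-20220807.299122e78"            # rollback target (base version)

# 1. Upgrade node 1
ssh node1 "sudo systemctl stop scylla-server && \
           sudo apt-get install -y scylla=$NEW_VER && \
           sudo systemctl start scylla-server"

# 2. Upgrade node 2, then 3. roll node 2 back to the old version
ssh node2 "sudo systemctl stop scylla-server && \
           sudo apt-get install -y scylla=$NEW_VER && \
           sudo systemctl start scylla-server"
ssh node2 "sudo systemctl stop scylla-server && \
           sudo apt-get install -y --allow-downgrades scylla=$OLD_VER && \
           sudo systemctl start scylla-server"

# 4. Repair node 1 - this is where the failure is observed
ssh node1 "/usr/bin/nodetool -u cassandra -pw 'cassandra' repair"
```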
This is happening in many upgrade tests for 5.1 (and enterprise 2022.2).
- Restore Monitor Stack command:
$ hydra investigate show-monitor e1d6ddc9-3cf0-4013-b0d5-c36ab3150245 - Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs e1d6ddc9-3cf0-4013-b0d5-c36ab3150245
Logs:
- db-cluster-e1d6ddc9.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/e1d6ddc9-3cf0-4013-b0d5-c36ab3150245/20220905_150159/db-cluster-e1d6ddc9.tar.gz
- monitor-set-e1d6ddc9.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/e1d6ddc9-3cf0-4013-b0d5-c36ab3150245/20220905_150159/monitor-set-e1d6ddc9.tar.gz
- loader-set-e1d6ddc9.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/e1d6ddc9-3cf0-4013-b0d5-c36ab3150245/20220905_150159/loader-set-e1d6ddc9.tar.gz
- sct-runner-e1d6ddc9.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/e1d6ddc9-3cf0-4013-b0d5-c36ab3150245/20220905_150159/sct-runner-e1d6ddc9.tar.gz
@asias can you please take a look?
The same happens for 2022.2.0~rc1.
This is blocking 5.1.
The repair follower node failed to write rows
Sep 05 14:51:43 rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-1 scylla[146264]: [shard 3] repair - repair[426d1799-bb87-4c00-98c7-4af35827ca7b]: shard=3, keyspace=scylla_bench, cf=test, range=(-4233469355322020765, -4229514567880051100], got error in row level repair: std::runtime_error (put_row_diff: Repair follower=10.142.0.120 failed in put_row_diff hanlder, status=0)
We might have messed up the features during the upgrade.
Possible problematic commits from 299122e78..d10aee15e7e9
commit 78eccd8763aaab201b8628958d70ee98121e7dad
Merge: 08ed4d7405 f81f1c7ef7
Author: Avi Kivity <[email protected]>
Date: Thu May 19 12:02:58 2022 +0300
Merge "Remove sstable_format_slector::sync()" from Pavel E
"
There's an explicit barrier in main that waits for the sstable format
selector to finish selecting it by the time node start to join a cluter.
(Actually -- not quite, when restarting a normal node it joins cluster
in prepare_to_join()).
This explicit barrier is not needed, the sync point already exists in
the way features are enabled, the format-selector just needs to use it.
branch: https://github.com/xemul/scylla/tree/br-format-selector-sync
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/351/
refs: #2795
"
* 'br-format-selector-sync' of https://github.com/xemul/scylla:
format-selector: Remove .sync() point
format-selector: Coroutinize maybe_select_format()
format-selector: Coroutinize simple methods
commit b1b22f2c2bf25657e8e41575a9b9ec266957583b
Author: Kamil Braun <[email protected]>
Date: Fri Apr 15 15:55:29 2022 +0200
service: raft: don't support/advertise USES_RAFT feature
commit cd5fec8a2306c5edc77e10649069b07388f5febd
Merge: aab052c0d5 ebc2178ea5
Author: Tomasz Grabiec <[email protected]>
Date: Fri Mar 18 12:27:11 2022 +0100
Merge "raft: re-advertise gossiper features when raft feature support changes" from Pavel
Prior to the change, `USES_RAFT_CLUSTER_MANAGEMENT` feature wasn't
properly advertised upon enabling `SUPPORTS_RAFT_CLUSTER_MANAGEMENT`
raft feature.
This small series consists of 3 parts to fix the handling of supported
features for raft:
1. Move subscription for `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` to the
`raft_group_registry`.
2. Update `system.local#supported_features` directly in the
`feature_service::support()` method.
3. Re-advertise gossiper state for `SUPPORTED_FEATURES` gossiper
value in the support callback within `raft_group_registry`.
* manmanson/track_supported_set_recalculation_v7:
raft: re-advertise gossiper features when raft feature support changes
raft: move tracking `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` feature to raft
gms: feature_service: update `system.local#supported_features` when feature support changes
test: cql_test_env: enable features in a `seastar::thread`
@asias I don't understand what the next step is /cc @eliransin
@asias we are now seeing this across the board in the master upgrade tests:
https://jenkins.scylladb.com/job/scylla-master/job/rolling-upgrade/job/rolling-upgrade-debian10-test/148/ https://jenkins.scylladb.com/job/scylla-master/job/rolling-upgrade/job/rolling-upgrade-debian11-test/12/ https://jenkins.scylladb.com/job/scylla-master/job/rolling-upgrade/job/rolling-upgrade-ubuntu20.04-test/88/ https://jenkins.scylladb.com/job/scylla-master/job/rolling-upgrade/job/rolling-upgrade-ubuntu18.04-test/216/
@asias ping, I don't understand what you wrote above - what can we do to move forward here?
/cc @eliransin
@asias ping, I don't understand what you wrote above - what can we do to move forward here?
We need to bisect to find the problematic commit between https://github.com/scylladb/scylladb/commit/299122e78d8e0a4a3500a8e3b8d97ca0ef927a16 and https://github.com/scylladb/scylladb/commit/d10aee15e7e98354897ee18c58cf8d5fa0677212.
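A minimal sketch of what that bisect could look like. The helper script name is hypothetical; each step requires building the checked-out commit, deploying it, and re-running the mixed-cluster upgrade + repair, so this is far from automatic.

```bash
# Sketch of the suggested bisect over the 299122e78..d10aee15e7e9 range.
# run_upgrade_repair_reproducer.sh is a hypothetical helper that builds the
# current checkout, runs the mixed-cluster upgrade + repair, and exits non-zero
# if the repair fails.
git clone https://github.com/scylladb/scylladb.git
cd scylladb
git bisect start d10aee15e7e98354897ee18c58cf8d5fa0677212 \
                 299122e78d8e0a4a3500a8e3b8d97ca0ef927a16   # bad, then good
./run_upgrade_repair_reproducer.sh && git bisect good || git bisect bad
# Repeat the previous line until git bisect prints the first bad commit.
```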
EDIT: moved this comment to a new issue in #11551
I listed the commits between 5.0.2 (which passed without pain) and 5.0.3, and I found this list:
b9a61c8e9 release: prepare for 5.0.3
32aa1e528 transport/server.cc: Return correct size of decompressed lz4 buffer
da6a126d7 cross-tree: fix header file self-sufficiency
d07e90298 Merge 'database: evict all inactive reads for table when detaching table' from Botond Dénes
3c0fc42f8 cql3: fix misleading error message for service level timeouts
964ccf919 type_json: support integers in scientific format
dfdc128fa Merge 'row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy' from Tomasz Grabiec
Fabio, the test passed with 5.0.2 and started to fail with 5.0.3? If that's the case, indeed d07e90298 and dfdc128fa look suspicious.
Fabio, the test passed with 5.0.2 and started to fail with 5.0.3?
Yes, that is the case, but only for the upgrade from 4.6 to 5.0.3 (the upgrade from 5.0.2 to 5.0.3 passed 1 out of 2 runs, and the failure was kind of similar).
@asias, how can we make progress from here? Should we start building images with 5.0.3 without these commits and run reproducers? cc @slivne @eliransin
I tried to reproduce locally with the 5.0.2-20220807.299122e78 and 5.1.0~rc1-0.20220902.d10aee15e7e9 as a mixed cluster + repair. I could not reproduce repair failure.
@asias, how can we make progress from here? Should we start building images with 5.0.3 without these commits and run reproducers? cc @slivne @eliransin
I suggest moving the non-repair issue you found during the upgrade to a separate issue. Your issue was that a query timed out, while 11459 was about repair failing in a mixed cluster during upgrade.
@asias, how can we make progress from here? Should we start building images with 5.0.3 without these commits and run reproducers? cc @slivne @eliransin
I suggest moving the non-repair issue you found during the upgrade to a separate issue. Your issue was that a query timed out, while 11459 was about repair failing in a mixed cluster during upgrade.
https://github.com/scylladb/scylladb/issues/11459#issuecomment-1246774019
@asias, @fgelcer, @yarongilor
Just adding some information I've noticed that isn't mentioned here on the bug.
The actual failure on node1
Sep 17 06:03:51 rolling-upgrade--ubuntu-focal-db-node-96c53cba-0-1 scylla[36181]: [shard 0] repair - repair_writer: keyspace=scylla_bench, table=test, multishard_writer failed: seastar::nested_exception: std::runtime_error (SSTable write failed due to existence of TOC file for generation 56 of scylla_bench.test) (while cleaning up after std::runtime_error (Dangling queue_reader_handle))
It seems like the failure always happens on the scylla_bench keyspace. Recently a new large-partition run with scylla-bench was added to the upgrade tests:
https://github.com/scylladb/scylla-cluster-tests/pull/5136
and this is the scylla-bench command that was used, before/during the upgrade:
stress_before_upgrade: scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=800 -clustering-row-count=5555 -clustering-row-size=uniform:100..1024 -concurrency=5 -connection-count=5 -consistency-level=quorum -rows-per-request=10 -timeout=30s -validate-data
Let’s run a test without this s-b command and see if it works or not.
Tested in: https://jenkins.scylladb.com/job/scylla-staging/job/Longevity_yaron/job/rolling-upgrade-debian10-test-large-partitions/9/
==> It looks like it runs OK without the s-b stress. The next step is to try pre-creating the s-b table in advance and retest.
@asias it was found that scylla-bench is run in the middle of the upgrade and that may be triggering the problem - can you have a look?
The direction @yarongilor is checking at the moment is the fact that this keyspace was created when the cluster was in mixed mode (2 nodes upgraded, 2 not). @asias, does it make sense that this would cause the repair to fail?
The direction @yarongilor is checking at the moment is the fact that this keyspace was created when the cluster was in mixed mode (2 nodes upgraded, 2 not). @asias, does it make sense that this would cause the repair to fail?
This is not supposed to fail. In the case of a mixed cluster, Scylla is supposed to use only the features known by the old nodes.
My guess is that
https://github.com/scylladb/scylladb/issues/11459#issuecomment-1240148316
some of the nodes started to use new features incorrectly.
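One way to sanity-check that guess (an illustrative sketch, not something the test currently does) is to compare the feature sets each node reports while the cluster is mixed, e.g. via the `system.local#supported_features` column mentioned in the commits above and via the gossiper state. The exact column and gossip key names are assumptions and may differ between versions.

```bash
# Illustrative check of feature negotiation in the mixed cluster, using the
# node IPs from this run; adjust for your cluster.
for ip in 10.142.0.115 10.142.0.116 10.142.0.120 10.142.0.121; do
    echo "== $ip =="
    cqlsh "$ip" -e "SELECT supported_features FROM system.local;"
done

# Gossiper view from any single node: what each endpoint advertises.
nodetool gossipinfo | grep -E '^/|SUPPORTED_FEATURES'
```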
I could not reproduce with scylla-bench:
start n1,n2,n3
run scylla-bench
upgrade n3
run scylla-bench in the background
run repair scylla-bench on n3
stress_before_upgrade: scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=800 -clustering-row-count=5555 -clustering-row-size=uniform:100..1024 -concurrency=5 -connection-count=5 -consistency-level=quorum -rows-per-request=10 -timeout=30s -validate-data
I could not reproduce with scylla-bench:
start n1,n2,n3; run scylla-bench; upgrade n3; run scylla-bench in the background; run repair of scylla_bench on n3. stress_before_upgrade: scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=800 -clustering-row-count=5555 -clustering-row-size=uniform:100..1024 -concurrency=5 -connection-count=5 -consistency-level=quorum -rows-per-request=10 -timeout=30s -validate-data
Maybe try without running scylla-bench before the upgrade, i.e. let scylla-bench create the schema while in a mixed cluster (again, this doesn't mean you are not correct about the root cause of the repair failure).
I could not reproduce with scylla-bench:
start n1,n2,n3; run scylla-bench; upgrade n3; run scylla-bench in the background; run repair of scylla_bench on n3. stress_before_upgrade: scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=800 -clustering-row-count=5555 -clustering-row-size=uniform:100..1024 -concurrency=5 -connection-count=5 -consistency-level=quorum -rows-per-request=10 -timeout=30s -validate-data
Maybe try without running scylla-bench before the upgrade, i.e. let scylla-bench create the schema while in a mixed cluster (again, this doesn't mean you are not correct about the root cause of the repair failure).
I tried this too. It is the same. I could not reproduce.
@yarongilor did you prove that when the s-b command that creates the schema is run before the upgrade, the issue doesn't reproduce?
If so, please provide a reproducer; it should be very simple - all stress commands and "fill_db_data" can be omitted so the reproducer can be quick. Also, it should be possible to do it with the docker backend.
CC @fgelcer
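For illustration, a rough, untested outline of what such a docker-based reproducer could look like. The image tags, the data-volume handling, and the exact scylla-bench invocation are assumptions, not the team's actual reproducer.

```bash
# Sketch: 2-node docker cluster on 5.0.2, upgrade one node to 5.1.0~rc1,
# create the scylla_bench schema while the cluster is mixed, then repair.
docker run -d --name n1 -v scylla_n1:/var/lib/scylla scylladb/scylla:5.0.2 --smp 1
SEED=$(docker inspect -f '{{ .NetworkSettings.IPAddress }}' n1)
docker run -d --name n2 -v scylla_n2:/var/lib/scylla scylladb/scylla:5.0.2 --smp 1 --seeds "$SEED"
# ...wait until both nodes are UN in `docker exec n1 nodetool status`...

# "Upgrade" n2: replace the container with a 5.1 image, reusing its data volume.
docker stop n2 && docker rm n2
docker run -d --name n2 -v scylla_n2:/var/lib/scylla scylladb/scylla:5.1.0-rc1 --smp 1 --seeds "$SEED"

# Create the scylla_bench schema and write some data while the cluster is mixed.
scylla-bench -workload=sequential -mode=write -replication-factor=2 \
    -partition-count=100 -clustering-row-count=100 -nodes "$SEED"

# Run the repair that fails in the SCT test.
docker exec n2 nodetool repair scylla_bench
```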
@yarongilor did you prove that when the s-b command that creates the schema is run before the upgrade, the issue doesn't reproduce?
If so, please provide a reproducer; it should be very simple - all stress commands and "fill_db_data" can be omitted so the reproducer can be quick. Also, it should be possible to do it with the docker backend.
CC @fgelcer
I'm right now trying to run with the s-b schema created in the pre-test, so the schemas will already exist when we reach the failure point (almost all upgrade tests are failing the same way, for 5.1 and 2022.2):
https://github.com/scylladb/scylla-cluster-tests/pull/5310
This will only prove that this is the failure point; in parallel I will try to build a small reproducer, so we will be able to narrow it down to a few minutes of runtime instead of hours...
@fgelcer please update here with your findings. AFAIU your changes proved that it is related to when we create the schema for scylla-bench. Do you have a short and easy reproducer with a docker backend?
We agreed:
- QA to provide a reproducer test - try to minimize the runtime
- We will do a bisect over the suspected range
@fgelcer please update here with your findings. AFAIU your changes proved that it is related to when we create the schema for scylla-bench. Do you have a short and easy reproducer with a docker backend?
Using the code in https://github.com/scylladb/scylla-cluster-tests/pull/5310, I moved the part that creates the schema for s-b to the beginning of the test, and then the test passed without any problems.
For now, I will be working on reducing the reproducer time to the minimum possible (or until we find a reliable reproducer), and will post here.