
Error during repair after upgrading node

Open soyacz opened this issue 3 years ago • 27 comments

Installation details

Kernel Version: 5.10.0-17-cloud-amd64
Scylla version (or git commit hash): 5.0.2-20220807.299122e78 with build-id 01a9300a0f113d968f4d199f85240a43ff7f1a5a
Cluster size: 4 nodes (n1-highmem-8)

Scylla Nodes used in this run:

  • rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-4 (34.73.147.55 | 10.142.0.121) (shards: 8)
  • rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-3 (34.75.185.62 | 10.142.0.120) (shards: 8)
  • rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-2 (35.231.33.253 | 10.142.0.116) (shards: 8)
  • rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-1 (34.148.82.254 | 10.142.0.115) (shards: 8)

OS / Image: https://www.googleapis.com/compute/v1/projects/debian-cloud/global/images/family/debian-11 (gce: us-east1)

Test: rolling-upgrade-debian11-test
Test id: e1d6ddc9-3cf0-4013-b0d5-c36ab3150245
Test name: scylla-5.1/rolling-upgrade/rolling-upgrade-debian11-test
Test config file(s):

Issue description

After upgrading a node to 5.1.0~rc1-0.20220902.d10aee15e7e9 (build-id c127c717ecffa082ce97b94100d62b2549abe486) we hit an error during repair:

2022-09-05 14:52:01.964: (NodetoolEvent Severity.ERROR) period_type=end event_id=202541d0-a1c5-4801-a719-48f2ac09a2e6 duration=38s: nodetool_command=repair node=rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-1 errors=['Encountered a bad command exit code!\n\nCommand: "/usr/bin/nodetool -u cassandra -pw \'cassandra\'  repair "\n\nExit code: 2\n\nStdout:\n\n[2022-09-05 14:51:27,616] Repair session 1 \n[2022-09-05 14:51:27,617] Repair session 1 finished\n[2022-09-05 14:51:38,734] Repair session 2 \n[2022-09-05 14:51:38,734] Repair session 2 finished\n[2022-09-05 14:51:38,740] Starting repair command #3, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)\n[2022-09-05 14:51:40,842] Repair session 3 \n[2022-09-05 14:51:40,842] Repair session 3 finished\n[2022-09-05 14:51:40,856] Starting repair command #4, repairing 1 ranges for keyspace scylla_bench (parallelism=SEQUENTIAL, full=true)\n[2022-09-05 14:51:56,957] Repair session 4 failed\n[2022-09-05 14:51:56,958] Repair session 4 finished\n\nStderr:\n\nerror: Repair job has failed with the error message: [2022-09-05 14:51:56,957] Repair session 4 failed\n-- StackTrace --\njava.lang.RuntimeException: Repair job has failed with the error message: [2022-09-05 14:51:56,957] Repair session 4 failed\n\tat org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:124)\n\tat org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)\n\tat com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)\n\tat com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)\n\tat com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)\n\tat com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)\n\n\n', 'Traceback (most recent call last):\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2528, in run_nodetool\n    self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run\n    result = _run()\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 67, in inner\n    return func(*args, **kwargs)\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run\n    return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute\n    result = connection.run(**command_kwargs)\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run\n    return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)\n  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 654, in _complete_run\n    raise UnexpectedExit(result)\nsdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!\n\nCommand: "/usr/bin/nodetool -u cassandra -pw \'cassandra\'  repair "\n\nExit code: 2\n\nStdout:\n\n[2022-09-05 14:51:27,616] Repair session 1 \n[2022-09-05 14:51:27,617] Repair session 1 finished\n[2022-09-05 14:51:38,734] Repair session 2 \n[2022-09-05 14:51:38,734] Repair session 2 finished\n[2022-09-05 14:51:38,740] Starting repair command #3, repairing 1 ranges 
for keyspace system_traces (parallelism=SEQUENTIAL, full=true)\n[2022-09-05 14:51:40,842] Repair session 3 \n[2022-09-05 14:51:40,842] Repair session 3 finished\n[2022-09-05 14:51:40,856] Starting repair command #4, repairing 1 ranges for keyspace scylla_bench (parallelism=SEQUENTIAL, full=true)\n[2022-09-05 14:51:56,957] Repair session 4 failed\n[2022-09-05 14:51:56,958] Repair session 4 finished\n\nStderr:\n\nerror: Repair job has failed with the error message: [2022-09-05 14:51:56,957] Repair session 4 failed\n-- StackTrace --\njava.lang.RuntimeException: Repair job has failed with the error message: [2022-09-05 14:51:56,957] Repair session 4 failed\n\tat org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:124)\n\tat org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)\n\tat com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)\n\tat com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)\n\tat com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)\n\tat com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)\n\n\n\n']
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2528, in run_nodetool
self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 613, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 67, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 604, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 537, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 620, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 654, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: "/usr/bin/nodetool -u cassandra -pw 'cassandra'  repair "
Exit code: 2
Stdout:
[2022-09-05 14:51:27,616] Repair session 1
[2022-09-05 14:51:27,617] Repair session 1 finished
[2022-09-05 14:51:38,734] Repair session 2
[2022-09-05 14:51:38,734] Repair session 2 finished
[2022-09-05 14:51:38,740] Starting repair command #3, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)
[2022-09-05 14:51:40,842] Repair session 3
[2022-09-05 14:51:40,842] Repair session 3 finished
[2022-09-05 14:51:40,856] Starting repair command #4, repairing 1 ranges for keyspace scylla_bench (parallelism=SEQUENTIAL, full=true)
[2022-09-05 14:51:56,957] Repair session 4 failed
[2022-09-05 14:51:56,958] Repair session 4 finished
Stderr:
error: Repair job has failed with the error message: [2022-09-05 14:51:56,957] Repair session 4 failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2022-09-05 14:51:56,957] Repair session 4 failed
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:124)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)

In the DB logs we can see:

Sep 05 14:51:56 rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-1 scylla[146264]:  [shard 0] repair - repair[426d1799-bb87-4c00-98c7-4af35827ca7b]: repair_tracker run failed: std::runtime_error ({shard 0: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 1: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 2: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 3: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 4: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 5: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 6: seastar::rpc::remote_verb_error (seastar::nested_exception), shard 7: seastar::rpc::remote_verb_error (seastar::nested_exception)})
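
For reference, the matching follower-side error can be pulled from the other node's journal. A minimal sketch, assuming the standard scylla-server systemd unit and journald logging; the time window is just the one from this run:

    journalctl -u scylla-server --since "2022-09-05 14:51" --until "2022-09-05 14:53" | grep -i 'repair'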

This happened during the upgrade test, when we did the following (a rough command-level sketch of these steps is shown right after the list):

  1. Upgrade node 1
  2. Upgrade node 2
  3. Rollback node 2
  4. Repair node 1 - this is where the error happened
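
A rough command-level sketch of those four steps outside of SCT; the package version pin and the upgrade/rollback procedure are assumptions (SCT performs them through its own helpers), only the final repair command is taken verbatim from this run:

    # Steps 1-2 (on node 1, then node 2): upgrade to the 5.1 candidate.
    # A real rolling upgrade also drains and stops the node first; omitted here.
    sudo apt-get update
    sudo apt-get install -y scylla            # assumes the 5.1.0~rc1 repo is already configured
    sudo systemctl restart scylla-server

    # Step 3 (on node 2): roll back to the previous 5.0.x packages.
    sudo apt-get install -y --allow-downgrades scylla=<previous-5.0.2-version>
    sudo systemctl restart scylla-server

    # Step 4 (on node 1): the repair that fails in this run.
    /usr/bin/nodetool -u cassandra -pw 'cassandra' repair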

This is happening in many upgrade tests for 5.1 (and enterprise 2022.2).

  • Restore Monitor Stack command: $ hydra investigate show-monitor e1d6ddc9-3cf0-4013-b0d5-c36ab3150245
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs e1d6ddc9-3cf0-4013-b0d5-c36ab3150245

Logs:

Jenkins job URL

soyacz avatar Sep 06 '22 08:09 soyacz

@asias can you please take a look?

roydahan avatar Sep 06 '22 13:09 roydahan

same happens for 2022.2.0~rc1

fgelcer avatar Sep 06 '22 13:09 fgelcer

this is blocking 5.1

slivne avatar Sep 07 '22 11:09 slivne

The repair follower node failed to write rows

Sep 05 14:51:43 rolling-upgrade--debian-bullseye-db-node-e1d6ddc9-0-1 scylla[146264]:  [shard 3] repair - repair[426d1799-bb87-4c00-98c7-4af35827ca7b]: shard=3, keyspace=scylla_bench, cf=test, range=(-4233469355322020765, -4229514567880051100], got error in row level repair: std::runtime_error (put_row_diff: Repair follower=10.142.0.120 failed in put_row_diff hanlder, status=0)

We might have messed up the features during the upgrade.
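
A quick way to sanity-check that is to compare what every node advertises. A minimal sketch, assuming cqlsh access to the node IPs from this run and the system.local#supported_features column referenced in the commits below:

    # Compare the advertised feature sets across the four nodes (IPs from this run).
    for ip in 10.142.0.115 10.142.0.116 10.142.0.120 10.142.0.121; do
        echo "== $ip =="
        cqlsh "$ip" -e "SELECT supported_features FROM system.local;"
    done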

Possible problematic commits from 299122e78..d10aee15e7e9

commit 78eccd8763aaab201b8628958d70ee98121e7dad
Merge: 08ed4d7405 f81f1c7ef7
Author: Avi Kivity <[email protected]>
Date:   Thu May 19 12:02:58 2022 +0300

Merge "Remove sstable_format_slector::sync()" from Pavel E
    
    "
    There's an explicit barrier in main that waits for the sstable format
    selector to finish selecting it by the time node start to join a cluter.
    (Actually -- not quite, when restarting a normal node it joins cluster
    in prepare_to_join()).
    
    This explicit barrier is not needed, the sync point already exists in
    the way features are enabled, the format-selector just needs to use it.
    
    branch: https://github.com/xemul/scylla/tree/br-format-selector-sync
    tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/351/
    refs: #2795
    "
    
    * 'br-format-selector-sync' of https://github.com/xemul/scylla:
      format-selector: Remove .sync() point
      format-selector: Coroutinize maybe_select_format()
      format-selector: Coroutinize simple methods
commit b1b22f2c2bf25657e8e41575a9b9ec266957583b
Author: Kamil Braun <[email protected]>
Date:   Fri Apr 15 15:55:29 2022 +0200

    service: raft: don't support/advertise USES_RAFT feature
commit cd5fec8a2306c5edc77e10649069b07388f5febd
Merge: aab052c0d5 ebc2178ea5
Author: Tomasz Grabiec <[email protected]>
Date:   Fri Mar 18 12:27:11 2022 +0100

    Merge "raft: re-advertise gossiper features when raft feature support changes" from Pavel
    
    Prior to the change, `USES_RAFT_CLUSTER_MANAGEMENT` feature wasn't
    properly advertised upon enabling `SUPPORTS_RAFT_CLUSTER_MANAGEMENT`
    raft feature.
    
    This small series consists of 3 parts to fix the handling of supported
    features for raft:
    1. Move subscription for `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` to the
       `raft_group_registry`.
    2. Update `system.local#supported_features` directly in the
       `feature_service::support()` method.
    3. Re-advertise gossiper state for `SUPPORTED_FEATURES` gossiper
       value in the support callback within `raft_group_registry`.
    
    * manmanson/track_supported_set_recalculation_v7:
      raft: re-advertise gossiper features when raft feature support changes
      raft: move tracking `SUPPORTS_RAFT_CLUSTER_MANAGEMENT` feature to raft
      gms: feature_service: update `system.local#supported_features` when feature support changes
      test: cql_test_env: enable features in a `seastar::thread`



asias avatar Sep 08 '22 02:09 asias

@asias I don't understand what the next step is. /cc @eliransin

slivne avatar Sep 08 '22 14:09 slivne

@asias we are now seeing this across the board in the master upgrade tests:

  • https://jenkins.scylladb.com/job/scylla-master/job/rolling-upgrade/job/rolling-upgrade-debian10-test/148/
  • https://jenkins.scylladb.com/job/scylla-master/job/rolling-upgrade/job/rolling-upgrade-debian11-test/12/
  • https://jenkins.scylladb.com/job/scylla-master/job/rolling-upgrade/job/rolling-upgrade-ubuntu20.04-test/88/
  • https://jenkins.scylladb.com/job/scylla-master/job/rolling-upgrade/job/rolling-upgrade-ubuntu18.04-test/216/

fruch avatar Sep 11 '22 08:09 fruch

@asias ping - I don't understand what you wrote above. What can we do to move forward here?

slivne avatar Sep 11 '22 11:09 slivne

/cc @eliransin

slivne avatar Sep 12 '22 11:09 slivne

@asias ping - I don't understand what you wrote above. What can we do to move forward here?

We need to bisect to find the problematic commit between https://github.com/scylladb/scylladb/commit/299122e78d8e0a4a3500a8e3b8d97ca0ef927a16 and https://github.com/scylladb/scylladb/commit/d10aee15e7e98354897ee18c58cf8d5fa0677212.
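
For reference, a minimal sketch of that bisect; the build-and-test step is a placeholder for the SCT upgrade + repair reproducer:

    git bisect start
    git bisect bad  d10aee15e7e98354897ee18c58cf8d5fa0677212   # 5.1.0~rc1 (repair fails)
    git bisect good 299122e78d8e0a4a3500a8e3b8d97ca0ef927a16   # 5.0.2 (repair passes)
    # At each step: build packages/an image from the checked-out commit, run the
    # mixed-cluster upgrade + repair reproducer, then mark the result:
    git bisect good    # or: git bisect bad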

asias avatar Sep 13 '22 01:09 asias

EDIT: moved this comment to a new issue in #11551

fgelcer avatar Sep 14 '22 13:09 fgelcer

I listed the commits between 5.0.2 (that passed without pain) and 5.0.3, and I found this list:

Fabio, the test passed with 5.0.2 and started to fail with 5.0.3? If that's the case, then indeed d07e90298 and dfdc128fa look suspicious.

b9a61c8e9 release: prepare for 5.0.3
32aa1e528 transport/server.cc: Return correct size of decompressed lz4 buffer
da6a126d7 cross-tree: fix header file self-sufficiency
d07e90298 Merge 'database: evict all inactive reads for table when detaching table' from Botond Dénes
3c0fc42f8 cql3: fix misleading error message for service level timeouts
964ccf919 type_json: support integers in scientific format
dfdc128fa Merge 'row_cache: Fix missing row if upper bound of population range is evicted and has adjacent dummy' from Tomasz Grabiec

asias avatar Sep 15 '22 00:09 asias

Fabio, the test passed with 5.0.2 and started to fail with 5.0.3?

Yes, that is the case, but only for the upgrade from 4.6 to 5.0.3 (the upgrade from 5.0.2 to 5.0.3 passed 1 out of 2 runs, and the failure was similar).

fgelcer avatar Sep 15 '22 06:09 fgelcer

@asias, how can we make progress from here? Should we start building images with 5.0.3 without these commits and run reproducers? cc @slivne @eliransin

fgelcer avatar Sep 15 '22 06:09 fgelcer

I tried to reproduce locally with 5.0.2-20220807.299122e78 and 5.1.0~rc1-0.20220902.d10aee15e7e9 as a mixed cluster + repair. I could not reproduce the repair failure.

asias avatar Sep 15 '22 07:09 asias

@asias, how can we make progress from here? Should we start building images with 5.0.3 without these commits and run reproducers? cc @slivne @eliransin

I suggest moving the non-repair issue you found during the upgrade to a separate issue. Your issue was a query timeout, while this issue (#11459) is about repair failing in a mixed cluster during the upgrade.

asias avatar Sep 15 '22 07:09 asias

@asias, how can we make progress from here? Should we start building images with 5.0.3 without these commits and run reproducers? cc @slivne @eliransin

I suggest moving the non-repair issue you found during the upgrade to a separate issue. Your issue was a query timeout, while this issue (#11459) is about repair failing in a mixed cluster during the upgrade.

https://github.com/scylladb/scylladb/issues/11459#issuecomment-1246774019

fgelcer avatar Sep 15 '22 07:09 fgelcer

@asias, @fgelcer, @yarongilor

Just adding some information I've noticed that isn't mentioned here on the bug.

The actual failure on node1

Sep 17 06:03:51 rolling-upgrade--ubuntu-focal-db-node-96c53cba-0-1 scylla[36181]:  [shard 0] repair - repair_writer: keyspace=scylla_bench, table=test, multishard_writer failed: seastar::nested_exception: std::runtime_error (SSTable write failed due to existence of TOC file for generation 56 of scylla_bench.test) (while cleaning up after std::runtime_error (Dangling queue_reader_handle))

It seems like the failure always happens on the scylla_bench keyspace; recently a new large-partition run with scylla-bench was added to the upgrade tests:

https://github.com/scylladb/scylla-cluster-tests/pull/5136

and this is the scylla-bench command that was used, before/during the upgrade:

stress_before_upgrade: scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=800 -clustering-row-count=5555 -clustering-row-size=uniform:100..1024 -concurrency=5 -connection-count=5 -consistency-level=quorum -rows-per-request=10 -timeout=30s -validate-data

fruch avatar Sep 18 '22 06:09 fruch

Let’s run a test without this s-b command and see if it works or not.


roydahan avatar Sep 18 '22 09:09 roydahan

Tested in: https://jenkins.scylladb.com/job/scylla-staging/job/Longevity_yaron/job/rolling-upgrade-debian10-test-large-partitions/9/

==> Looks like it runs OK without the s-b stress. The next step is to try pre-creating the s-b table in advance and retest.
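
One simple way to pre-create it is a tiny write run before any node is upgraded. A sketch with illustrative parameters; the -nodes value and the tiny sizes are assumptions, the other flags mirror the command quoted earlier in this thread:

    # Creates the scylla_bench keyspace and test table up front with a negligible amount of data.
    scylla-bench -workload=sequential -mode=write -replication-factor=3 \
        -partition-count=1 -clustering-row-count=1 -nodes <node-ip>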

yarongilor avatar Sep 19 '22 13:09 yarongilor

@asias QA has found that scylla-bench is run in the middle of the upgrade and that may be triggering the problem - can you have a look?

slivne avatar Sep 21 '22 11:09 slivne

The direction @yarongilor is checking at the moment is the fact that this keyspace was created when the cluster was in mixed mode (2 nodes upgraded, 2 not). @asias does it make sense that this would cause the repair to fail?

roydahan avatar Sep 21 '22 15:09 roydahan

The direction @yarongilor is checking at the moment is the fact that this keyspace was created when the cluster was in mixed mode (2 nodes upgraded, 2 not). @asias does it make sense that this would cause the repair to fail?

This is not supposed to fail. In a mixed cluster, Scylla is supposed to use only the features known by the old nodes.

My guess is that (see https://github.com/scylladb/scylladb/issues/11459#issuecomment-1240148316) some of the nodes started to use new features incorrectly.
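
One hedged way to check that from the outside is via the SUPPORTED_FEATURES gossiper value mentioned in the commits listed earlier in this thread:

    # Run on (or against) each node; in a healthy mixed cluster the enabled set should
    # stay the intersection of what the old nodes advertise here.
    nodetool gossipinfo | grep -i supported_features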

asias avatar Sep 22 '22 03:09 asias

I could not reproduce with scylla-bench:

start n1,n2,n3
run scylla-bench 
upgrade n3
run scylla-bench in the background
run repair scylla-bench on n3

stress_before_upgrade: scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=800 -clustering-row-count=5555 -clustering-row-size=uniform:100..1024 -concurrency=5 -connection-count=5 -consistency-level=quorum -rows-per-request=10 -timeout=30s -validate-data
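
For clarity, roughly the same steps at the command level; the node addresses and the upgrade step itself are placeholders:

    scylla-bench <same flags as above> -nodes n1     # initial load on the 3-node cluster
    # upgrade n3 to 5.1.0~rc1, keep n1 and n2 on 5.0.2
    scylla-bench <same flags as above> -nodes n1 &   # keep writing in the background
    nodetool -h n3 repair scylla_bench               # repair only the scylla_bench keyspace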

asias avatar Sep 22 '22 03:09 asias

I could not reproduce with scylla-bench:

start n1,n2,n3
run scylla-bench 
upgrade n3
run scylla-bench in the background
run repair scylla-bench on n3

stress_before_upgrade: scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=800 -clustering-row-count=5555 -clustering-row-size=uniform:100..1024 -concurrency=5 -connection-count=5 -consistency-level=quorum -rows-per-request=10 -timeout=30s -validate-data

Maybe try without running scylla-bench before the upgrade, i.e. let scylla-bench create the schema while in a mixed cluster (again, this doesn't mean you are wrong about the root cause of the repair failure).

fruch avatar Sep 22 '22 03:09 fruch

I could not reproduce with scylla-bench:

start n1,n2,n3
run scylla-bench 
upgrade n3
run scylla-bench in the background
run repair scylla-bench on n3

stress_before_upgrade: scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=800 -clustering-row-count=5555 -clustering-row-size=uniform:100..1024 -concurrency=5 -connection-count=5 -consistency-level=quorum -rows-per-request=10 -timeout=30s -validate-data

Maybe try without running scylla-bench before the upgrade, i.e. let scylla-bench create the schema while in a mixed cluster (again, this doesn't mean you are wrong about the root cause of the repair failure).

I tried this too. It is the same. I could not reproduce.

asias avatar Sep 22 '22 05:09 asias

@yarongilor did you prove that when the s-b command that creates the schema runs before the upgrade, the issue doesn't reproduce?

If so, please provide a reproducer; it should be very simple, and all stress commands and "fill_db_data" can be omitted so the reproducer is quick. Also, it should be possible to do it with the docker backend.

CC @fgelcer

roydahan avatar Sep 22 '22 14:09 roydahan

@yarongilor did you prove that when the s-b command that creates the schema runs before the upgrade, the issue doesn't reproduce?

If so, please provide a reproducer; it should be very simple, and all stress commands and "fill_db_data" can be omitted so the reproducer is quick. Also, it should be possible to do it with the docker backend.

CC @fgelcer

I'm currently trying a run that creates the s-b schema in the pre-test, so the schemas are already created when we reach the failure point (almost all upgrade tests are failing the same way, for 5.1 and 2022.2): https://github.com/scylladb/scylla-cluster-tests/pull/5310

That will only prove that this is the failure point; in parallel I will try to build a small reproducer, so we can narrow it down to a few minutes of runtime instead of hours...

fgelcer avatar Sep 22 '22 14:09 fgelcer

@fgelcer please update here with your findings. AFAIU your changes proved that it is related to when we create the schema for scylla-bench. Do you have a short and easy reproducer with a docker backend?

roydahan avatar Sep 28 '22 09:09 roydahan

We agreed:

  • QA to provide a reproducer test - try to minimize the runtime
  • We will do a bisect over the suspected range

slivne avatar Sep 28 '22 11:09 slivne

@fgelcer please update here with your findings. AFAIU your changes proved that it is related to when we create the schema for scylla-bench. Do you have a short and easy reproducer with a docker backend?

Using the code in https://github.com/scylladb/scylla-cluster-tests/pull/5310, I moved the part that creates the schema for s-b to the beginning of the test, and then the test passed without any problems.

For now, I will be working on reducing the reproducer time to the minimum possible (or until we find a reliable reproducer), and will post the results here.

fgelcer avatar Sep 28 '22 12:09 fgelcer