Alter the restore nemesis for raft (and tablets) scenarios
Issue description
- [x] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
When raft is enabled, the manager cannot execute a restore schema task on the cluster:
Command: 'sudo sctool restore -c fbb525d8-fe74-4b5a-be5f-04840bba0c72 --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1 --snapshot-tag sm_20230223105105UTC'
Exit code: 1
Stdout:
Stderr:
Error: create restore target, units and views: init target: restore into cluster with given ScyllaDB version and consistent_cluster_management is not supported. See https://manager.docs.scylladb.com/stable/restore/restore-schema.html for a workaround.
As described in https://manager.docs.scylladb.com/stable/restore/restore-schema.html, in order to restore into such a cluster, one must first disable raft, execute the restore, and then re-enable raft afterwards. *If tablets are enabled as well, they must be disabled and re-enabled alongside raft.
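For concreteness, a rough sketch of what that workaround could look like operationally, following the description above rather than the linked docs verbatim; the cluster ID, bucket, snapshot tag, and config handling are all placeholders, and a later comment questions whether raft can be disabled at all once enabled:

```bash
# Hedged sketch only - placeholders throughout; the authoritative procedure is in
# the linked Manager documentation.

# 1. On every node: disable Raft-based schema management (key name taken from the
#    error message above; add the key if it is not present) and restart Scylla.
sudo sed -i 's/consistent_cluster_management: true/consistent_cluster_management: false/' /etc/scylla/scylla.yaml
sudo systemctl restart scylla-server

# 2. Run the schema restore through the Manager (same shape as the failing command above).
sudo sctool restore -c <cluster-id> --restore-schema \
    --location s3:<backup-bucket> --snapshot-tag <snapshot-tag>

# 3. Re-enable Raft (and tablets, if they were disabled as well) on every node and restart again.
sudo sed -i 's/consistent_cluster_management: false/consistent_cluster_management: true/' /etc/scylla/scylla.yaml
sudo systemctl restart scylla-server
```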
Impact
The restore nemesis (or any other restore operation in SCT) cannot run on raft-enabled clusters.
@ShlomiBalalis as far as I know, this is something that will be fixed in Manager 3.3. Manager 3.3 adds support for tablets, so raft + tablets will always be enabled. @karol-kokoszka is my statement correct? @mikliapko - FYI
I can't recall that, but if so - fantastic
@ShlomiBalalis IIRC you can't disable raft once enabled. I think the procedure shows how to upgrade snapshots to 'raft enabled'. We need to fix our snapshots.
@mikliapko Do we know when Manager 3.3 is going to be available? If it's too far out, we need a workaround in SCT to skip the manager restore nemesis until it's available. (We have jobs failing with this error.)
There is no need for a workaround in SCT, I believe. Manager 3.3 is going to be released in the coming days. @karol-kokoszka Please correct me if I'm wrong about the release dates.
Now that Manager 3.3 is out, we need to make sure this works and backport its usage to all branches.
Reproduced in https://argus.scylladb.com/test/379b8fb7-0e42-4453-bc77-b857a8758986/runs?additionalRuns[]=c2048c63-a1d8-44a2-b0d9-8874c6d9b145 (2024-07-22 09:48:43)
I think it's not enough for the restore nemesis, since all of the backups were created with older Scylla and older Manager versions.
Someone would need to recreate the backup with Manager 3.3 for it to work.
Isn't that so, @mikliapko?
> I think it's not enough for the restore nemesis, since all of the backups were created with older Scylla and older Manager versions.
I'm not sure about that, or at least I don't remember it from the original case. It's weird to have such a limitation.
> I think it's not enough for the restore nemesis, since all of the backups were created with older Scylla and older Manager versions.
> Someone would need to recreate the backup with Manager 3.3 for it to work.
> Isn't that so, @mikliapko?

2 questions:
- Shouldn't we test the latest Manager at all times? 3.3.0 is out and soon we'll release 3.3.1.
- The flow that Israel described should be supported, AFAIK. @mikliapko WDYT?
Yes, schema restore (what is actually being performed in the test) is expected to fail. To restore schema, the backup should be created with Manager 3.3 on Scylla 6.0.
At the same time, data restore is expected to work correctly.
@mikliapko - is the original flow covered in one of the SM specific jobs? If so IMO the nemesis should use the latest scylla and SM releases. @fruch wdyt?
Yes, we have coverage for:
- Restore of backup within one version of Scylla and Manager (Manager sanity test);
- Data restore from backup made by previous Manager version (Manager upgrade test).
The kind of test we are missing in SCT:
- Restore of backup made on any previous Scylla version.
> @mikliapko - is the original flow covered in one of the SM specific jobs? If so IMO the nemesis should use the latest scylla and SM releases. @fruch wdyt?
The test is using the latest releases.
The problem is that the restore nemesis builds on top of a backup made a long time ago with older versions of Scylla and Manager.
Someone needs to recreate the backup; until then, all restores will fail.
Also, the data-only restore case is not covered in any nemesis.
Meanwhile we will disable those until this issue is resolved.
It's just wasting multiple people's time.
If this coverage is needed for Manager, someone should be assigned to handle it.
> @mikliapko - is the original flow covered in one of the SM specific jobs? If so IMO the nemesis should use the latest scylla and SM releases. @fruch wdyt?
>
> Yes, we have coverage for:
> - Restore of backup within one version of Scylla and Manager (Manager sanity test);
> - Data restore from backup made by previous Manager version (Manager upgrade test).
- Releases are tested only after the release, and only one release is targeted.
- Those tests are not happening during high utilization of the cluster.
- We need the basic manager actions working as nemeses; for example, the Scylla latency-during-ops tests don't cover backup/restore.
> The kind of test we are missing in SCT:
> - Restore of backup made on any previous Scylla version.
You mean data only?
> The kind of test we are missing in SCT:
> - Restore of backup made on any previous Scylla version.
>
> You mean data only?
I meant both types. But as I understood from your previous message, there are nemesis tests that cover some schema restore cases; I was not aware of them.
> - Those tests are not happening during high utilization of the cluster.
> - We need the basic manager actions working as nemeses; for example, the Scylla latency-during-ops tests don't cover backup/restore.
Actually, I'd agree that these are very good things to test but we don't have them covered by tests. @rayakurl FYI
@mikliapko please use this issue and fix the current nemesis to either restore data only or replace the backup that this nemesis uses with a newer one that supports schema restore.
@mikliapko please restore the schema with a backup that supports this flow.
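For clarity, the two options named above correspond to the two sctool restore modes already seen in this thread; a minimal sketch, with cluster ID, bucket, and snapshot tag as placeholders:

```bash
# Option A: data-only restore (tables), which does not hit the schema/raft limitation:
sudo sctool restore -c <cluster-id> --restore-tables \
    --location s3:<backup-bucket> --snapshot-tag <snapshot-tag>

# Option B: schema restore, which per the comments above requires a backup taken
# with Manager >= 3.3 on Scylla >= 6.0:
sudo sctool restore -c <cluster-id> --restore-schema \
    --location s3:<backup-bucket> --snapshot-tag <snapshot-tag>
```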
@roydahan @fruch @ShlomiBalalis I'm about to prepare new snapshots for restoration instead of the current ones. Just wondering how the previous snapshots were prepared - do we have some auxiliary scripts, jobs, or tests that may help with that, so we don't have to implement them again? Maybe you know something about it?
If it's not documented in those: https://github.com/scylladb/qa-tasks/issues/1128 https://github.com/scylladb/scylla-cluster-tests/pull/6083
you'll have to pick @ShlomiBalalis's brain for the actual information on how they were prepared.
AFAIR, it doesn't really matter, as long as it's not restoring keyspace1 or any keyspace we're using as the main stress target in our longevities.
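For reference, a minimal sketch of how a replacement snapshot could be produced; this is an assumption about the procedure, not a record of how the existing snapshots were made, and the keyspace name, cluster ID, and bucket are placeholders:

```bash
# 1. Populate a dedicated keyspace (anything other than keyspace1 or the main
#    longevity stress keyspaces), e.g. with cassandra-stress.
# 2. Take a Manager backup of that keyspace to the permanent-snapshots bucket:
sudo sctool backup -c <cluster-id> --keyspace '<snapshot_keyspace>' \
    --location s3:<permanent-snapshots-bucket>
# 3. The resulting snapshot tag (sm_<timestamp>UTC) is what the restore nemesis
#    would then reference, presumably via defaults/manager_persistent_snapshots.yaml.
```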
So, I've updated the snapshots and re-executed the longevity-twcs-3h-test.
It performed the restore schema and restore data operations successfully, but failed when running the verification stress:
2024-08-08 13:23:21.822: (CassandraStressEvent Severity.ERROR) period_type=end event_id=2d5a6c50-00b9-431d-8418-4c5ad0d2ca55 duration=9s: node=Node longevity-twcs-3h-2024-2-m-loader-node-663e2012-2 [34.232.109.10 | 10.12.3.244]
stress_cmd=cassandra-stress read cl=QUORUM n=3579200 -schema 'keyspace=10gb_sizetiered_2024_2_0_rc1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=50 -col 'size=FIXED(64) n=FIXED(16)' -pop seq=3579201..7158400
errors:
Stress command completed with bad status 1: Failed to connect over JMX; not collecting these stats
java.lang.RuntimeException: Failed to execute
Argus link
@fruch Could you please help to investigate it?
the nemesis failed:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5213, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2979, in disrupt_mgmt_restore
assert is_passed, (
AssertionError: Data verification stress command, triggered by the 'mgmt_restore' nemesis, has failed
and all of the stress commands failed. Opening the loader logs, we can see these errors:
com.datastax.driver.core.exceptions.InvalidQueryException: Unrecognized name C1
at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:50)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
at com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:86)
at org.apache.cassandra.stress.util.JavaDriverClient.prepare(JavaDriverClient.java:124)
at org.apache.cassandra.stress.operations.predefined.CqlOperation$JavaDriverWrapper.createPreparedStatement(CqlOperation.java:318)
at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:77)
at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
so I can guess you wrote the data with only one column, while the verification stress command expected 16 columns?
see defaults/manager_persistent_snapshots.yaml:
confirmation_stress_template: "cassandra-stress read cl=QUORUM n={num_of_rows} -schema 'keyspace={keyspace_name} replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=50 -col 'size=FIXED(64) n=FIXED(16)' -pop seq={sequence_start}..{sequence_end}"
I guess your write command should match this command for things to work, or the other way around: the verification command should match the command used for writing the data.
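For example, a write command shaped to line up with that template might look like the sketch below; the {...} values mirror the template's variables and are placeholders. The essential part is that the -col spec matches, so the verification read finds all 16 columns instead of failing with "Unrecognized name C1":

```bash
# Illustrative only - parameters mirror confirmation_stress_template and must be
# filled in with the real keyspace, row count, and sequence range.
cassandra-stress write cl=QUORUM n={num_of_rows} \
    -schema 'keyspace={keyspace_name} replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' \
    -mode cql3 native -rate threads=50 \
    -col 'size=FIXED(64) n=FIXED(16)' \
    -pop seq={sequence_start}..{sequence_end}
```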
I see, yeah, there are actually some differences between the way I prepared the data in the write cmd and the way the c-s read is executed. Will work on fixing it, thanks!
Reproduced again in 2024.1.9 https://argus.scylladb.com/test/e3fc93ef-9efc-4ed1-85df-bf913acb2f73/runs?additionalRuns[]=4a0f40dd-b79c-4c74-b68f-355033fe31ac
@mikliapko after the fix was backported and merged to the 2024.1 branch today, the original issue is not reproduced (on 2024.1.10), but this one popped up (during the disrupt_mgmt_restore nemesis):
Command: 'sudo sctool restore -c ee868eb4-785f-4f2b-9bef-800373d5e84d --restore-tables --location s3:manager-backup-tests-permanent-snapshots-us-east-1 --snapshot-tag sm_20240812150753UTC'
Exit code: 1
Stdout:
Stderr:
Error: get resource: create restore target, units and views: init units: get tombstone_gc of system_distributed.dicts: not found
Trace ID: mHhVeTgWSbORLt2BYYpIcg (grep in scylla-manager logs)
Does it look familiar? Maybe it's a side effect of the fix? I'm struggling to understand where the new issue should be reported - against SCT, or maybe against Manager.
We faced the same issue in our Manager tests, but only in the case of Scylla 2024.2.0-rc2, so I'm about to open a ticket against Scylla. It seems like the API call failed on the Scylla side:
19T13:23:57.915Z","N":"cluster.client","M":"HTTP","host":"10.12.3.110:10001","method":"GET","uri":"/agent/node_info","duration":"8ms","status":500,"bytes":102,"dump":"HTTP/1.1 500 Internal Server Error\r\nContent-Length: 102\r\nContent-Type: application/json\r\nDate: Thu, 19 Sep 2024 13:23:57 GMT\r\n\r\n{\"message\":\"unexpected error, consult logs: version info: invalid
But in your case you are running the test with Scylla 2024.1.10. Hm, it was probably a mistake to backport it into 2024.1, since the original backup we restore from was created with Scylla 2024.2.0-rc1, where the desc schema issue is fixed. I'm not sure we have a valid test configuration when trying to restore a backup created with Scylla 2024.2.0-rc1 onto a cluster with 2024.1.10. @Michal-Leszczynski What do you think?
@mikliapko could you describe the test scenario? In general, it's not possible to restore schema from backup <= 5.4 into cluster >= 6.0. After taking a look at the implementation, it is also not possible to do it the other way around - backup >= 6.0 into cluster <= 5.4 - ~although I think that we could make it possible, it's just missing from the implementation~.