Alter the restore nemesis for raft (and tablets) scenarios
Issue description
- [x] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
When raft is enabled, the manager cannot execute a restore schema task on the cluster:
Command: 'sudo sctool restore -c fbb525d8-fe74-4b5a-be5f-04840bba0c72 --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1 --snapshot-tag sm_20230223105105UTC'
Exit code: 1
Stdout:
Stderr:
Error: create restore target, units and views: init target: restore into cluster with given ScyllaDB version and consistent_cluster_management is not supported. See https://manager.docs.scylladb.com/stable/restore/restore-schema.html for a workaround.
As described in https://manager.docs.scylladb.com/stable/restore/restore-schema.html, in order to restore into such a cluster, one must first disable raft, execute the restore, and then re-enable raft afterwards. *If tablets are enabled as well, they must be disabled and re-enabled alongside raft.
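For concreteness, a rough sketch of what that workaround could look like operationally, following the description above rather than the linked docs verbatim; the cluster ID, bucket, snapshot tag, and config handling are all placeholders, and a later comment questions whether raft can be disabled at all once enabled:

```bash
# Hedged sketch only - placeholders throughout; the authoritative procedure is in
# the linked Manager documentation.

# 1. On every node: disable Raft-based schema management (key name taken from the
#    error message above; add the key if it is not present) and restart Scylla.
sudo sed -i 's/consistent_cluster_management: true/consistent_cluster_management: false/' /etc/scylla/scylla.yaml
sudo systemctl restart scylla-server

# 2. Run the schema restore through the Manager (same shape as the failing command above).
sudo sctool restore -c <cluster-id> --restore-schema \
    --location s3:<backup-bucket> --snapshot-tag <snapshot-tag>

# 3. Re-enable Raft (and tablets, if they were disabled as well) on every node and restart again.
sudo sed -i 's/consistent_cluster_management: false/consistent_cluster_management: true/' /etc/scylla/scylla.yaml
sudo systemctl restart scylla-server
```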
Impact
The restore nemesis (or any other restore operation in SCT) cannot run on raft-enabled clusters.
@ShlomiBalalis as far as I know, this is something that will be fixed in Manager 3.3. Manager 3.3 adds support for tablets, so raft + tablets will always be enabled. @karol-kokoszka is my statement correct? @mikliapko - FYI
I can't recall that, but if so - fantastic
@ShlomiBalalis IIRC you can't disable raft once enabled. I think the procedure shows how to upgrade snapshots to 'raft enabled'. We need to fix our snapshots.
@mikliapko Do we know when Manager 3.3 is going to be available? If it's too far out, we need a workaround in SCT to skip the manager restore nemesis until it's available. (We have jobs failing with this error.)
There is no need for a workaround in SCT, I believe. Manager 3.3 is going to be released in the coming days. @karol-kokoszka Please correct me if I'm wrong about the release dates.
Now that Manager 3.3 is out, we need to make sure this works and backport its usage to all branches.
Reproduced in https://argus.scylladb.com/test/379b8fb7-0e42-4453-bc77-b857a8758986/runs?additionalRuns[]=c2048c63-a1d8-44a2-b0d9-8874c6d9b145 (2024-07-22 09:48:43)
I think it's not enough for the restore nemesis, since all of the backups were created with older Scylla and older Manager versions.
Someone would need to recreate the backup with Manager 3.3 for it to work.
Isn't that so, @mikliapko?
> I think it's not enough for the restore nemesis, since all of the backups were created with older Scylla and older Manager versions.
I'm not sure about that, or at least I don't remember it from the original case. It's weird to have such a limitation.
> I think it's not enough for the restore nemesis, since all of the backups were created with older Scylla and older Manager versions.
> Someone would need to recreate the backup with Manager 3.3 for it to work.
> Isn't that so, @mikliapko?

2 questions:
- Shouldn't we test the latest Manager at all times? 3.3.0 is out and soon we'll release 3.3.1.
- The flow that Israel described should be supported, AFAIK. @mikliapko WDYT?
Yes, schema restore (what is actually being performed in the test) is expected to fail. To restore schema, the backup should be created with Manager 3.3 on Scylla 6.0.
At the same time, data restore is expected to work correctly.
@mikliapko - is the original flow covered in one of the SM specific jobs? If so IMO the nemesis should use the latest scylla and SM releases. @fruch wdyt?
Yes, we have coverage for:
- Restore of backup within one version of Scylla and Manager (Manager sanity test);
- Data restore from backup made by previous Manager version (Manager upgrade test).
The kind of test we are missing in SCT:
- Restore of backup made on any previous Scylla version.
> @mikliapko - is the original flow covered in one of the SM specific jobs? If so IMO the nemesis should use the latest scylla and SM releases. @fruch wdyt?
The test is using the latest releases.
The problem is that the restore nemesis builds on top of a backup made a long time ago with older versions of Scylla and Manager.
Someone needs to recreate the backup; until then, all restores will fail.
Also, the data-only restore case is not covered in any nemesis.
Meanwhile we will disable those until this issue is resolved.
It's just wasting multiple people's time.
If this coverage is needed for Manager, someone should be assigned to handle it.
> @mikliapko - is the original flow covered in one of the SM specific jobs? If so IMO the nemesis should use the latest scylla and SM releases. @fruch wdyt?
>
> Yes, we have coverage for:
> - Restore of backup within one version of Scylla and Manager (Manager sanity test);
> - Data restore from backup made by previous Manager version (Manager upgrade test).
- Releases are tested only after the release, and only one release is targeted.
- Those tests are not happening during high utilization of the cluster.
- We need the basic manager actions working as nemeses; for example, the Scylla latency-during-ops tests don't cover backup/restore.
> The kind of test we are missing in SCT:
> - Restore of backup made on any previous Scylla version.
You mean data only?
> The kind of test we are missing in SCT:
> - Restore of backup made on any previous Scylla version.
>
> You mean data only?
I meant both types. But as I understood from your previous message, there are nemesis tests that cover some schema restore cases; I was not aware of them.
> - Those tests are not happening during high utilization of the cluster.
> - We need the basic manager actions working as nemeses; for example, the Scylla latency-during-ops tests don't cover backup/restore.
Actually, I'd agree that these are very good things to test but we don't have them covered by tests. @rayakurl FYI
@mikliapko please use this issue and fix the current nemesis to either restore data only or replace the backup that this nemesis uses with a newer one that supports schema restore.
@mikliapko please restore the schema with a backup that supports this flow.
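For clarity, the two options named above correspond to the two sctool restore modes already seen in this thread; a minimal sketch, with cluster ID, bucket, and snapshot tag as placeholders:

```bash
# Option A: data-only restore (tables), which does not hit the schema/raft limitation:
sudo sctool restore -c <cluster-id> --restore-tables \
    --location s3:<backup-bucket> --snapshot-tag <snapshot-tag>

# Option B: schema restore, which per the comments above requires a backup taken
# with Manager >= 3.3 on Scylla >= 6.0:
sudo sctool restore -c <cluster-id> --restore-schema \
    --location s3:<backup-bucket> --snapshot-tag <snapshot-tag>
```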
@roydahan @fruch @ShlomiBalalis I'm about to prepare new snapshots for restoration instead of the current ones. Just wondering how the previous snapshots were prepared - do we have some auxiliary scripts, jobs, or tests that may help with that, so we don't have to implement them again? Maybe you know something about it?
If it's not documented in those: https://github.com/scylladb/qa-tasks/issues/1128 https://github.com/scylladb/scylla-cluster-tests/pull/6083
you'll have to pick @ShlomiBalalis's brain for the actual information on how they were prepared.
AFAIR, it doesn't really matter, as long as it's not restoring keyspace1 or any keyspace we're using as the main stress target in our longevities.
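For reference, a minimal sketch of how a replacement snapshot could be produced; this is an assumption about the procedure, not a record of how the existing snapshots were made, and the keyspace name, cluster ID, and bucket are placeholders:

```bash
# 1. Populate a dedicated keyspace (anything other than keyspace1 or the main
#    longevity stress keyspaces), e.g. with cassandra-stress.
# 2. Take a Manager backup of that keyspace to the permanent-snapshots bucket:
sudo sctool backup -c <cluster-id> --keyspace '<snapshot_keyspace>' \
    --location s3:<permanent-snapshots-bucket>
# 3. The resulting snapshot tag (sm_<timestamp>UTC) is what the restore nemesis
#    would then reference, presumably via defaults/manager_persistent_snapshots.yaml.
```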
So, I've updated the snapshots and re-executed the longevity-twcs-3h-test.
It performed the restore schema and restore data operations successfully, but failed when running the verification stress:
2024-08-08 13:23:21.822: (CassandraStressEvent Severity.ERROR) period_type=end event_id=2d5a6c50-00b9-431d-8418-4c5ad0d2ca55 duration=9s: node=Node longevity-twcs-3h-2024-2-m-loader-node-663e2012-2 [34.232.109.10 | 10.12.3.244]
stress_cmd=cassandra-stress read cl=QUORUM n=3579200 -schema 'keyspace=10gb_sizetiered_2024_2_0_rc1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=50 -col 'size=FIXED(64) n=FIXED(16)' -pop seq=3579201..7158400
errors:
Stress command completed with bad status 1: Failed to connect over JMX; not collecting these stats
java.lang.RuntimeException: Failed to execute
Argus link
@fruch Could you please help to investigate it?
the nemesis failed:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5213, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2979, in disrupt_mgmt_restore
assert is_passed, (
AssertionError: Data verification stress command, triggered by the 'mgmt_restore' nemesis, has failed
and all of the stress commands failed. Opening the loader logs, we can see these errors:
com.datastax.driver.core.exceptions.InvalidQueryException: Unrecognized name C1
at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:50)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
at com.datastax.driver.core.AbstractSession.prepare(AbstractSession.java:86)
at org.apache.cassandra.stress.util.JavaDriverClient.prepare(JavaDriverClient.java:124)
at org.apache.cassandra.stress.operations.predefined.CqlOperation$JavaDriverWrapper.createPreparedStatement(CqlOperation.java:318)
at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:77)
at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
so I can guess you wrote the data with only one column, while the verification stress command expected 16 columns?
see defaults/manager_persistent_snapshots.yaml:
confirmation_stress_template: "cassandra-stress read cl=QUORUM n={num_of_rows} -schema 'keyspace={keyspace_name} replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=50 -col 'size=FIXED(64) n=FIXED(16)' -pop seq={sequence_start}..{sequence_end}"
I guess your write command should match this command for things to work, or the other way around: the verification command should match the command used for writing the data.
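For example, a write command shaped to line up with that template might look like the sketch below; the {...} values mirror the template's variables and are placeholders. The essential part is that the -col spec matches, so the verification read finds all 16 columns instead of failing with "Unrecognized name C1":

```bash
# Illustrative only - parameters mirror confirmation_stress_template and must be
# filled in with the real keyspace, row count, and sequence range.
cassandra-stress write cl=QUORUM n={num_of_rows} \
    -schema 'keyspace={keyspace_name} replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' \
    -mode cql3 native -rate threads=50 \
    -col 'size=FIXED(64) n=FIXED(16)' \
    -pop seq={sequence_start}..{sequence_end}
```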
I see, yeah, there are actually some differences between the way I prepared the data in the write cmd and the way the c-s read is executed. Will work on fixing it, thanks!
Reproduced again in 2024.1.9 https://argus.scylladb.com/test/e3fc93ef-9efc-4ed1-85df-bf913acb2f73/runs?additionalRuns[]=4a0f40dd-b79c-4c74-b68f-355033fe31ac
@mikliapko after the fix was backported and merged to the 2024.1 branch today, the original issue is not reproduced (on 2024.1.10), but this one popped up (during the disrupt_mgmt_restore nemesis):
Command: 'sudo sctool restore -c ee868eb4-785f-4f2b-9bef-800373d5e84d --restore-tables --location s3:manager-backup-tests-permanent-snapshots-us-east-1 --snapshot-tag sm_20240812150753UTC'
Exit code: 1
Stdout:
Stderr:
Error: get resource: create restore target, units and views: init units: get tombstone_gc of system_distributed.dicts: not found
Trace ID: mHhVeTgWSbORLt2BYYpIcg (grep in scylla-manager logs)
Does it look familiar? Maybe it's a side effect of the fix? I'm struggling to understand where the new issue should be reported - against SCT, or maybe against Manager.
We faced the same issue in our Manager tests, but only in the case of Scylla 2024.2.0-rc2, so I'm about to open a ticket against Scylla. It seems like the API call failed on the Scylla side:
19T13:23:57.915Z","N":"cluster.client","M":"HTTP","host":"10.12.3.110:10001","method":"GET","uri":"/agent/node_info","duration":"8ms","status":500,"bytes":102,"dump":"HTTP/1.1 500 Internal Server Error\r\nContent-Length: 102\r\nContent-Type: application/json\r\nDate: Thu, 19 Sep 2024 13:23:57 GMT\r\n\r\n{\"message\":\"unexpected error, consult logs: version info: invalid
But in your case you are running the test with Scylla 2024.1.10. Hm, it was probably a mistake to backport it into 2024.1, since the original backup we restore from was created with Scylla 2024.2.0-rc1, where the desc schema issue is fixed. I'm not sure we have a valid test configuration when trying to restore a backup created with Scylla 2024.2.0-rc1 onto a cluster with 2024.1.10. @Michal-Leszczynski What do you think?
@mikliapko could you describe the test scenario? In general, it's not possible to restore schema from backup <= 5.4 into cluster >= 6.0. After taking a look at the implementation, it is also not possible to do it the other way around - backup >= 6.0 into cluster <= 5.4 - ~although I think that we could make it possible, it's just missing from the implementation~.