After a successful schema and data restoration *to a different region*, the restored keyspace is completely empty
Issue description
- [ ] This issue is a regression.
- [x] It is unknown if this issue is a regression.
At 2023-08-14 13:27:09,663, we started two restore tasks that use a pre-created snapshot that includes the keyspace 5gb_sizetiered_2022_1.
First, a task to restore the schema:
< t:2023-08-14 13:27:13,826 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool restore -c a92d1307-4ac0-43df-874a-98667733d8ae --restore-schema --location s3:manager-backup-tests-permanent-snapshots-us-east-1 --snapshot-tag sm_20230702201949UTC" finished with status 0
< t:2023-08-14 13:27:13,826 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > sctool output: restore/256d69cd-92e9-49d7-bed5-e82928acf970
The restore task has ended successfully:
< t:2023-08-14 13:28:17,197 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool -c a92d1307-4ac0-43df-874a-98667733d8ae progress restore/256d69cd-92e9-49d7-bed5-e82928acf970" finished with status 0
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > sctool output: Restore progress
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Run: 4779b377-3aa6-11ee-a65d-0afbd2966d0b
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Status: DONE - restart required (see restore docs)
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Start time: 14 Aug 23 13:27:13 UTC
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > End time: 14 Aug 23 13:28:07 UTC
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Duration: 54s
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Progress: 100% | 100%
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Snapshot Tag: sm_20230702201949UTC
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG >
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ╭───────────────┬─────────────┬──────────┬──────────┬────────────┬────────╮
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ Keyspace │ Progress │ Size │ Success │ Downloaded │ Failed │
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ├───────────────┼─────────────┼──────────┼──────────┼────────────┼────────┤
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_schema │ 100% | 100% │ 474.478k │ 474.478k │ 474.478k │ 0 │
< t:2023-08-14 13:28:17,197 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ╰───────────────┴─────────────┴──────────┴──────────┴────────────┴────────╯
At that point, we restarted the Scylla service on all of the nodes in the cluster, one by one:
< t:2023-08-14 13:28:18,164 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl stop scylla-server.service"...
< t:2023-08-14 13:28:18,539 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-08-14T13:28:18+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-1 !NOTICE | sudo[13833]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
< t:2023-08-14 13:29:41,945 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl stop scylla-server.service" finished with status 0
< t:2023-08-14 13:29:42,734 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl start scylla-server.service"...
< t:2023-08-14 13:29:43,110 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-08-14T13:29:43+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-1 !NOTICE | sudo[13875]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl start scylla-server.service
< t:2023-08-14 13:29:47,335 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl start scylla-server.service" finished with status 0
< t:2023-08-14 13:30:49,093 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl stop scylla-server.service"...
< t:2023-08-14 13:30:49,149 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-08-14T13:30:49+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-2 !NOTICE | sudo[11111]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
< t:2023-08-14 13:32:15,063 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl stop scylla-server.service" finished with status 0
< t:2023-08-14 13:32:15,403 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl start scylla-server.service"...
< t:2023-08-14 13:32:15,846 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-08-14T13:32:15+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-2 !NOTICE | sudo[11168]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl start scylla-server.service
< t:2023-08-14 13:32:20,003 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl start scylla-server.service" finished with status 0
< t:2023-08-14 13:33:21,198 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl stop scylla-server.service"...
< t:2023-08-14 13:33:21,638 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-08-14T13:33:21+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-3 !NOTICE | sudo[11148]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
< t:2023-08-14 13:34:46,992 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl stop scylla-server.service" finished with status 0
< t:2023-08-14 13:34:47,310 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl start scylla-server.service"...
< t:2023-08-14 13:34:47,687 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-08-14T13:34:47+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-3 !NOTICE | sudo[11199]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl start scylla-server.service
< t:2023-08-14 13:34:51,981 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl start scylla-server.service" finished with status 0
< t:2023-08-14 13:35:53,665 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl stop scylla-server.service"...
< t:2023-08-14 13:35:54,077 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-08-14T13:35:53+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-4 !NOTICE | sudo[11277]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl stop scylla-server.service
< t:2023-08-14 13:37:09,635 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl stop scylla-server.service" finished with status 0
< t:2023-08-14 13:37:10,549 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl start scylla-server.service"...
< t:2023-08-14 13:37:11,016 f:db_log_reader.py l:114 c:sdcm.db_log_reader p:DEBUG > 2023-08-14T13:37:10+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-4 !NOTICE | sudo[11324]: scyllaadm : TTY=unknown ; PWD=/home/scyllaadm ; USER=root ; COMMAND=/usr/bin/systemctl start scylla-server.service
< t:2023-08-14 13:37:15,151 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo systemctl start scylla-server.service" finished with status 0
Afterwards, we restored the data:
< t:2023-08-14 13:38:20,276 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool restore -c a92d1307-4ac0-43df-874a-98667733d8ae --restore-tables --location s3:manager-backup-tests-permanent-snapshots-us-east-1 --snapshot-tag sm_20230702201949UTC" finished with status 0
< t:2023-08-14 13:38:20,282 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > sctool output: restore/ba67ff65-3170-4aa8-af74-efa2b694d89f
Which also passed:
< t:2023-08-14 13:44:08,432 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool -c a92d1307-4ac0-43df-874a-98667733d8ae progress restore/ba67ff65-3170-4aa8-af74-efa2b694d89f" finished with status 0
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > sctool output: Restore progress
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Run: d4fef9a2-3aa7-11ee-a65e-0afbd2966d0b
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Status: DONE
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Start time: 14 Aug 23 13:38:20 UTC
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > End time: 14 Aug 23 13:43:47 UTC
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Duration: 5m27s
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Progress: 100% | 100%
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Snapshot Tag: sm_20230702201949UTC
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG >
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ╭───────────────────────┬─────────────┬─────────┬─────────┬────────────┬────────╮
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ Keyspace │ Progress │ Size │ Success │ Downloaded │ Failed │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ├───────────────────────┼─────────────┼─────────┼─────────┼────────────┼────────┤
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_traces │ 100% │ 0 │ 0 │ 0 │ 0 │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ 5gb_sizetiered_2022_1 │ 100% | 100% │ 17.133G │ 17.133G │ 17.133G │ 0 │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_auth │ 100% | 100% │ 26.021k │ 26.021k │ 26.021k │ 0 │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_distributed │ 100% │ 0 │ 0 │ 0 │ 0 │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ audit │ 100% │ 0 │ 0 │ 0 │ 0 │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ╰───────────────────────┴─────────────┴─────────┴─────────┴────────────┴────────╯
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG >
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Post-restore repair progress:
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Run: d4fef9a2-3aa7-11ee-a65e-0afbd2966d0b
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Status: DONE
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Start time: 14 Aug 23 13:38:20 UTC
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > End time: 14 Aug 23 13:43:47 UTC
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Duration: 5m27s
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Progress: 100%
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Datacenters:
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > - eu-west
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG >
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ╭────────────────────┬────────────────────────┬──────────┬──────────╮
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ Keyspace │ Table │ Progress │ Duration │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ├────────────────────┼────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_auth │ role_attributes │ 100% │ 4s │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_auth │ role_members │ 100% │ 4s │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_auth │ roles │ 100% │ 4s │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ├────────────────────┼────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_distributed │ service_levels │ 100% │ 6s │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_distributed │ view_build_status │ 100% │ 5s │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ├────────────────────┼────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_traces │ events │ 100% │ 12s │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_traces │ node_slow_log │ 100% │ 4s │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_traces │ node_slow_log_time_idx │ 100% │ 2s │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_traces │ sessions │ 100% │ 2s │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_traces │ sessions_time_idx │ 100% │ 2s │
< t:2023-08-14 13:44:08,432 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ╰────────────────────┴────────────────────────┴──────────┴──────────╯
Afterwards, we also created a general repair task (since this code was not adjusted to the automatic repair just yet):
< t:2023-08-14 13:44:11,117 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool repair -c a92d1307-4ac0-43df-874a-98667733d8ae" finished with status 0
< t:2023-08-14 13:44:11,117 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > sctool output: repair/9d82ba8f-053c-41f8-83dd-798d2e49bf4a
Which passed:
< t:2023-08-14 13:49:24,031 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo sctool -c a92d1307-4ac0-43df-874a-98667733d8ae progress repair/9d82ba8f-053c-41f8-83dd-798d2e49bf4a" finished with status 0
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > sctool output: Run: a5d19efb-3aa8-11ee-a661-0afbd2966d0b
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Status: DONE
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Start time: 14 Aug 23 13:44:10 UTC
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > End time: 14 Aug 23 13:48:58 UTC
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Duration: 4m47s
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Progress: 100%
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > Datacenters:
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > - eu-west
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG >
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ╭───────────────────────────────┬────────────────────────────────┬──────────┬──────────╮
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ Keyspace │ Table │ Progress │ Duration │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ keyspace1 │ standard1 │ 100% │ 3m39s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_auth │ role_attributes │ 100% │ 1s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_auth │ role_members │ 100% │ 1s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_auth │ role_permissions │ 100% │ 1s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_auth │ roles │ 100% │ 1s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_distributed_everywhere │ cdc_generation_descriptions_v2 │ 100% │ 0s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_distributed │ cdc_generation_timestamps │ 100% │ 5s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_distributed │ cdc_streams_descriptions_v2 │ 100% │ 5s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_distributed │ service_levels │ 100% │ 6s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_distributed │ view_build_status │ 100% │ 5s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ├───────────────────────────────┼────────────────────────────────┼──────────┼──────────┤
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_traces │ events │ 100% │ 17s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_traces │ node_slow_log │ 100% │ 5s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_traces │ node_slow_log_time_idx │ 100% │ 3s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_traces │ sessions │ 100% │ 3s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > │ system_traces │ sessions_time_idx │ 100% │ 3s │
< t:2023-08-14 13:49:24,036 f:cli.py l:1122 c:sdcm.mgmt.cli p:DEBUG > ╰───────────────────────────────┴────────────────────────────────┴──────────┴──────────╯
Then, we executed cassandra-stress to validate the data, which failed immediately:
< t:2023-08-14 13:50:12,080 f:stress_thread.py l:287 c:sdcm.stress_thread p:INFO > cassandra-stress read no-warmup cl=QUORUM n=5242880 -schema 'keyspace=5gb_sizetiered_2022_1 replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native user=cassandra password=cassandra -rate threads=50 -col 'size=FIXED(64) n=FIXED(16)' -pop seq=1..5242880 -transport 'truststore=/etc/scylla/ssl_conf/client/cacerts.jks truststore-password=cassandra' -node 10.4.3.146,10.4.0.171,10.4.0.236,10.4.0.248 -errors skip-unsupported-columns
type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb
WARN 13:50:19,052 Not using advanced port-based shard awareness with /10.4.0.171:9042 because we're missing port-based shard awareness port on the server
WARN 13:50:19,222 Not using advanced port-based shard awareness with /10.4.0.236:9042 because we're missing port-based shard awareness port on the server
java.io.IOException: Operation x10 on key(s) [4c4c3637324b38334f30]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
Failed to connect over JMX; not collecting these stats
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
at org.apache.cassandra.stress.Operation.error(Operation.java:141)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
java.io.IOException: Operation x10 on key(s) [343550504e4f30353430]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
at org.apache.cassandra.stress.Operation.error(Operation.java:141)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 0 alive)
at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
Looking into the data folders on the machines as well, it seems that they are completely empty:
scyllaadm@longevity-200gb-48h-verify-limited--db-node-84dfb4de-1:/var/lib/scylla/data$ ll 5gb_sizetiered_2022_1/standard1-e08b7420191411ee8ec98425b74f1f5d/
total 0
drwxr-xr-x 4 scylla scylla 47 Aug 14 13:29 ./
drwxr-xr-x 3 scylla scylla 64 Aug 14 13:29 ../
drwxr-xr-x 2 scylla scylla 10 Aug 14 13:29 staging/
drwxr-xr-x 2 scylla scylla 10 Aug 14 13:41 upload/
Installation details
Kernel Version: 5.15.0-1040-aws
Scylla version (or git commit hash): 2022.2.12-20230727.f4448d5b0265
with build-id a87bfeb65d24abf65d074a3ba2e5b9664692d716
Cluster size: 4 nodes (i3.4xlarge)
Scylla Nodes used in this run:
- longevity-200gb-48h-verify-limited--db-node-84dfb4de-4 (34.253.41.169 | 10.4.0.248) (shards: 14)
- longevity-200gb-48h-verify-limited--db-node-84dfb4de-3 (34.245.91.14 | 10.4.0.236) (shards: 14)
- longevity-200gb-48h-verify-limited--db-node-84dfb4de-2 (34.245.188.222 | 10.4.0.171) (shards: 14)
- longevity-200gb-48h-verify-limited--db-node-84dfb4de-1 (52.214.177.207 | 10.4.3.146) (shards: 14)
OS / Image: ami-0624755b4db06e567
(aws: eu-west-1)
Test: longevity-200gb-48h-test_restore-nemesis
Test id: 84dfb4de-0573-4a01-8806-8b832bcafd91
Test name: scylla-staging/Shlomo/longevity-200gb-48h-test_restore-nemesis
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 84dfb4de-0573-4a01-8806-8b832bcafd91
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 84dfb4de-0573-4a01-8806-8b832bcafd91
Logs:
- db-cluster-84dfb4de.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/84dfb4de-0573-4a01-8806-8b832bcafd91/20230814_140710/db-cluster-84dfb4de.tar.gz
- sct-runner-events-84dfb4de.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/84dfb4de-0573-4a01-8806-8b832bcafd91/20230814_140710/sct-runner-events-84dfb4de.tar.gz
- sct-84dfb4de.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/84dfb4de-0573-4a01-8806-8b832bcafd91/20230814_140710/sct-84dfb4de.log.tar.gz
- loader-set-84dfb4de.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/84dfb4de-0573-4a01-8806-8b832bcafd91/20230814_140710/loader-set-84dfb4de.tar.gz
- monitor-set-84dfb4de.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/84dfb4de-0573-4a01-8806-8b832bcafd91/20230814_140710/monitor-set-84dfb4de.tar.gz
@ShlomiBalalis - where can I find the manager log, so we can see what was restored?
Out of curiosity, why is it using LCS?
2023-08-14T14:03:55+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-1 !INFO | scylla[13934]: [shard 12] LeveledManifest - Leveled compaction strategy is restoring invariant of level 1 by compacting 2 sstables on behalf of keyspace1.standard1
@ShlomiBalalis - where can I find the manager log, so we can see what was restored?
the server is in the monitor tarball, the agents are in the db nodes
Out of curiosity, why is it using LCS?
2023-08-14T14:03:55+00:00 longevity-200gb-48h-verify-limited--db-node-84dfb4de-1 !INFO | scylla[13934]: [shard 12] LeveledManifest - Leveled compaction strategy is restoring invariant of level 1 by compacting 2 sstables on behalf of keyspace1.standard1
This is simply part of the longevity scenario, but this is not the problematic keyspace anyway
So the logs show that some data has actually been downloaded and loaded into the cluster. The problem is that both the automatic and the manual repair (still present in this test scenario) didn't repair the restored table.
So right now I'm checking whether it's a restore or a repair problem (the tested version of SM does not contain the repair refactor, so this is not connected to those changes).
Leading theory: I tried to restore the keyspace manually, the old-fashioned way: downloading the sstables and refreshing. We noticed something funny: at first, I was trying to query the keyspace right after the restore, and it consistently failed:
cassandra@cqlsh:5gb_sizetiered_2022_1> select * from standard1;
NoHostAvailable:
Then, I tried to change the replication factor of the keyspace, and noticed a mismatch: the region of the cluster under test is eu-west:
$ nodetool status
Datacenter: eu-west
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.4.3.146 52.54 GB 256 ? c37bdb3d-7a3b-477a-b4fb-a4a98684a2c5 1a
UN 10.4.0.236 49.08 GB 256 ? 5d5bd234-9aec-4146-b3cb-b8e2e1729fa4 1a
UN 10.4.0.248 44.61 GB 256 ? 2747eaee-4803-490a-ad78-03467dd1f7cc 1a
UN 10.4.0.171 44.33 GB 256 ? 1434be9a-d258-4dd2-9579-2cec850786c1 1a
The keyspace was set to replicate in us-east, which is probably the region of the originally backed up cluster:
cassandra@cqlsh> SELECT * FROM system_schema.keyspaces;
keyspace_name | durable_writes | replication
-------------------------------+----------------+-------------------------------------------------------------------------------------
system_auth | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'eu-west': '4'}
system_schema | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
keyspace1 | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'eu-west': '3'}
system_distributed | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
system | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
audit | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
system_traces | True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}
system_distributed_everywhere | True | {'class': 'org.apache.cassandra.locator.EverywhereStrategy'}
5gb_sizetiered_2022_1 | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3'}
Once I altered the keyspace's replication to the new region, I was able to query it just fine:
cassandra@cqlsh> ALTER KEYSPACE "5gb_sizetiered_2022_1" WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-west': '1'};
cassandra@cqlsh> use "5gb_sizetiered_2022_1";
cassandra@cqlsh:5gb_sizetiered_2022_1> select * from standard1;
key | C0 | C1 | C10 | C11 | C12 | C13 | C14 | C15 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9
------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------
0x343831364b4b33324e30 | 0x88cfad6776a64370624fca6a579b22909dd7d10500537a183446f9f429dea1d13fcb3716aba68f5023a974a1f6fef5fd3e3eb2856c8a08a38fc019de9fe45b90 | 0x68a108ecfcca46efb65d38269f75d23a1f43456a1cbe033baefc0cd2f3edbfa4289874dc57ae085e9cd830ae0644351a3c32c6d49140f81e714d715f2324cb75 | 0xa0a206e6889d81d48edda04842b35a248f3a608bd4588619ca39176f64ca53238913a404fba7ec3e67c071b35e13e2a39773610a5b541dc7a8f32cadc7c7eedf | 0xb7a13a1ad602380b4577dc5bb64865e54922862cf670bf288d3fb9afd69091477c623e9255e1d81068bd0707e01c0680cc306cb3693be8688c0db1c948ea38c3 | 0x7e1119d0cb8f34cd7141ba1ec7cf71eb64f254b0d46fc0f78b31fa3c1fe336eae57412dbaad94d4c728ca51140438d5e2521587f657d7dcfffdeeeb1218b2357 | 0x5dfd6b7923a1025f0085ecf43516aec54c25ced79dc5217267c060ddc927de711b0eec16116eeb2380f184bb1a7d6f9482bcdd1f4c75d7c4cacd42950746e4fa | 0x1ddb930018f516dc7e3ffafddc7ac358df4f3d2352931ae31982c55cc7e0d7dc9ea6067de7218a9e61f735f69ab3eb1ceb3b27e6300deee70c6c455cd20e6a14 | 0x6812d3acf717ba682498373953e77d64792930f029e5ab2b2c4a477098f9b49f0e6d35615e9e65b7736ee992ab3ff027227c73595e71f355b6b89e1ab1c7fb9d | 0x51ae72d57ce76acc6c69c90713f7d4fb9261efcfc833e73b30e383e70eb56aea4e11c2b51053e7479142041df5bd832fd6417e835a851378433e0de71bbdaee0 | 0xa9fa0041d3270b15f1700778bea29b99a7ea7c2172e338157ca41593f99e3a04a6a649c698bc01b888f6038b8740678554f41de84a3fb66390d300328068d204 | 0xec0375be958914ee5a7797c6921dfa0b309d95cf98fc9dd846dbfcc982d2ad0da27a7d17f7b1ff6c6fcca1c816fc47f5b96af5a50e1c28ae9e31351d250b6aab | 0x405d6d52c7782e3b8271a809e4138ece48bb4c0c203c65368008e778c23d1c2fe2a8105b89cf2141ddbb9090b1f69192af21afeba81c05d70880179a6300b745 | 0x3e4ed6b0621aa8ebfdf0035d417727357ef13ccc7e20bb8489f00dce99ce5b3690ccf2ec7759a4f0d5134fa3ac0471dad663a1a934cfc3cafe621f39dcf9c112 | 0xd046c4ed74a6821b7739342f48419f07b1a0d69175c239ccc0a504ddd0c440f02f233c9a898e2d59a3111479e0166cb4b7745b1322f9fddefd3bb197f8c60a34 | 0x553fff6a14fc45ba273ae9549962324615e90d9d79933eb7121eb5741c1c773da5503824f4d8b6584a4407cabbd5d6862f7eeb1bb76690fd14442834a45afd7f | 0xbb9b3a6ac0acfe1e67bb87870b45be7f9119439c691d39c8b69412c4ec8b513bcf6b4965af98e61711cb7da504b252cd716ce29a0d8772c3f1b89d349c467f8f
...
So, the difference in regions is probably the cause of the failures.
So, restore only works in the same region, and a separate procedure is required to restore to a different region? This is acceptable, but it needs to be explicitly documented.
Restoring tables requires the schema to be identical to the one in the backup. The DCs are also a part of the keyspace schema, so the fact that restore does not work when you try to restore data into an empty DC seems logical.
The strange part here is that load&stream does not complain when it has to upload sstables to nodes from an empty DC (we can add manual checks for that in SM). I would suspect that in this scenario the uploaded sstables should be lost, as they don't belong to any node in the cluster, but maybe L&S still stores them somewhere, even though it's impossible to query the data because of the "unavailable replicas" error.
In your example you said that you used nodetool refresh for uploading sstables, but did you use it with the -las option?
I'm curious if a workaround in this case should look like the following (a rough CLI sketch follows the list):
- restore schema
- (change replication of keyspace with non-existing dc - but do we have a guarantee that the restore tables will work when using different keyspace schema?)
- restore tables
- (or maybe here is the right place for changing keyspace replication - perhaps uploaded data is still stored somewhere in the cluster and now it is safe to alter keyspace)
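For concreteness, a rough, untested sketch of that ordering (placeholder cluster ID, bucket, and snapshot tag; the ALTER uses the eu-west DC and RF from this thread; auth and TLS flags must match the target cluster):
# 1. Restore the schema from the backup
$ sudo sctool restore -c <cluster-id> --restore-schema --location s3:<bucket> --snapshot-tag <snapshot-tag>
# 2. Rolling restart of scylla-server on every node, one at a time (required after a schema restore)
$ sudo systemctl stop scylla-server.service && sudo systemctl start scylla-server.service
# 3. Re-point the restored keyspace at the destination DC before loading the data
$ cqlsh -u cassandra -p cassandra -e "ALTER KEYSPACE \"5gb_sizetiered_2022_1\" WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-west': '3'};"
# 4. Restore the table data via load&stream
$ sudo sctool restore -c <cluster-id> --restore-tables --location s3:<bucket> --snapshot-tag <snapshot-tag>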
But at least we know that this issue is not a regression and that IMHO restore works as described in the docs.
Restoring tables requires the schema to be identical to the one in the backup. The DCs are also a part of the keyspace schema, so the fact that restore does not work when you try to restore data into an empty DC seems logical.
The strange part here is that load&stream does not complain when it has to upload sstables to nodes from an empty DC (we can add manual checks for that in SM). I would suspect that in this scenario the uploaded sstables should be lost, as they don't belong to any node in the cluster, but maybe L&S still stores them somewhere, even though it's impossible to query the data because of the "unavailable replicas" error.
In your example you said that you used nodetool refresh for uploading sstables, but did you use it with the -las option?
Nope. a simple nodetool refresh -- 5gb_sizetiered_2022_1 standard1
I'm curious if a workaround in this case should look like:
- restore schema
- (change replication of keyspace with non-existing dc - but do we have a guarantee that the restore tables will work when using different keyspace schema?)
- restore tables
- (or maybe here is the right place for changing keyspace replication - perhaps uploaded data is still stored somewhere in the cluster and now it is safe to alter keyspace)
In my case, I first loaded the data with refresh and only then altered the keyspace, and everything seemed fine afterwards (of course, it was only a preliminary check that the table contains data at all)
Nope. a simple nodetool refresh -- 5gb_sizetiered_2022_1 standard1
That's strange because nodetool refresh docs says:
Scylla node will ignore the partitions in the sstables which are not assigned to this node. For example, if sstable are copied from a different node.
So I would expect that it worked partially / it's not reliable to use it in this way. So the approach with:
- restore schema
- alter restored keyspace replication strategy (change dc names)
- restore data
seems more promising. @asias, do you think that this approach is safe and should work?
Context: We have a backup from some cluster with only dc1. We want to restore it to a different cluster with only dc2. Normally, SM would first restore all schema from the backup (this requires a cluster restart) and then it would proceed with restoring non-schema SSTables via load&stream. The problem is that we restore SSTables into a keyspace replicated only in dc1, and we don't have any nodes from this dc in the restore destination cluster, so even though the restore procedure ends "successfully", the data is not there. Is it safe to use load&stream on SSTables when the backed-up and restore destination clusters have identical table schema but different keyspace schema (the keyspace name is the same, but there are different dc names in the replication strategies)?
Nope. a simple nodetool refresh -- 5gb_sizetiered_2022_1 standard1
That's strange because nodetool refresh docs says:
Scylla node will ignore the partitions in the sstables which are not assigned to this node. For example, if sstable are copied from a different node.
So I would expect that it worked partially / it's not reliable to use it in this way. So the approach with:
- restore schema
- alter restored keyspace replication strategy (change dc names)
- restore data
Yeah, regardless of the fact that it worked (and I agree, it's strange it worked at all), this is probably the correct course of action as far as I can tell.
My local experiments confirm that the approach:
- restore schema
- alter restored keyspace replication strategy (change dc names)
- restore data
works fine, but they are just experiments and not proofs of reliability.
@ShlomiBalalis could we rerun this test scenario with the additional alter keyspace step in the middle of both restores?
@ShlomiBalalis ping
@ShlomiBalalis ?
My local experiments confirm that the approach:
- restore schema
- alter restored keyspace replication strategy (change dc names)
- restore data
works fine, but they are just experiments and not proofs of reliability. @ShlomiBalalis could we rerun this test scenario with the additional alter keyspace step in the middle of both restores?
@Mark-Gurevich can you please take over this? If needed, let's open an issue in SCT to add this as workaround until this issue is fixed.
@Michal-Leszczynski mind taking ownership of this issue?
IIUC we need to add an additional alter keyspace step to the disrupt_mgmt_restore nemesis code, in the middle of both restores?
From a brief look at the code I didn't find where this can be added; it needs a further deep dive.
@mikliapko is this something that you could take care of? I mean validating that procedure described in https://github.com/scylladb/scylla-manager/issues/3525#issuecomment-1693310241 works fine with some proper test. When it's validated, we can add it to SM docs.
Still happens: https://argus.scylladb.com/test/f1ff65fd-8324-4264-8d28-8c7122fca836/runs?additionalRuns[]=5986619f-8479-4267-a92f-19c6b604f84b
@mikliapko is this something that you could take care of? I mean validating that procedure described in #3525 (comment) works fine with some proper test. When it's validated, we can add it to SM docs.
Yep, as it's still happening, I will take a look into it
@mikliapko is this something that you could take care of? I mean validating that procedure described in #3525 (comment) works fine with some proper test. When it's validated, we can add it to SM docs.
Yep, as it's still happening, I will take a look into it
@mikliapko it's happening in a test that disables raft topology; is the schema restore dependent on raft topology?
Packages
Scylla version: 6.3.0~dev-20240927.c17d35371846
with build-id a9b08d0ce1f3cf99eb39d7a8372848fa2840dc1d
Kernel Version: 6.8.0-1016-aws
Installation details
Cluster size: 5 nodes (i4i.8xlarge)
Scylla Nodes used in this run:
- longevity-mv-si-4d-master-db-node-34c4d009-9 (18.201.159.126 | 10.4.8.218) (shards: 30)
- longevity-mv-si-4d-master-db-node-34c4d009-8 (54.170.27.136 | 10.4.11.177) (shards: 30)
- longevity-mv-si-4d-master-db-node-34c4d009-7 (18.202.195.66 | 10.4.8.202) (shards: 30)
- longevity-mv-si-4d-master-db-node-34c4d009-6 (3.252.132.22 | 10.4.9.92) (shards: 30)
- longevity-mv-si-4d-master-db-node-34c4d009-5 (18.201.83.7 | 10.4.9.76) (shards: 30)
- longevity-mv-si-4d-master-db-node-34c4d009-4 (34.244.233.144 | 10.4.8.101) (shards: 30)
- longevity-mv-si-4d-master-db-node-34c4d009-3 (34.246.198.146 | 10.4.9.17) (shards: 30)
- longevity-mv-si-4d-master-db-node-34c4d009-2 (54.216.167.207 | 10.4.11.30) (shards: 30)
- longevity-mv-si-4d-master-db-node-34c4d009-11 (54.154.171.167 | 10.4.11.79) (shards: 30)
- longevity-mv-si-4d-master-db-node-34c4d009-10 (3.249.103.86 | 10.4.9.9) (shards: 30)
- longevity-mv-si-4d-master-db-node-34c4d009-1 (34.245.208.246 | 10.4.11.237) (shards: 30)
OS / Image: ami-087d814d9b6773015
(aws: undefined_region)
Test: longevity-mv-si-4days-streaming-test
Test id: 34c4d009-73b1-490b-83e5-03f6705be5eb
Test name: scylla-master/tier1/longevity-mv-si-4days-streaming-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 34c4d009-73b1-490b-83e5-03f6705be5eb
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 34c4d009-73b1-490b-83e5-03f6705be5eb
Logs:
- longevity-mv-si-4d-master-db-node-34c4d009-4 - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-4-34c4d009.tar.gz
- longevity-mv-si-4d-master-db-node-34c4d009-1 - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-1-34c4d009.tar.gz
- longevity-mv-si-4d-master-db-node-34c4d009-6 - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-6-34c4d009.tar.gz
- longevity-mv-si-4d-master-db-node-34c4d009-8 - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-8-34c4d009.tar.gz
- longevity-mv-si-4d-master-db-node-34c4d009-3 - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-3-34c4d009.tar.gz
- longevity-mv-si-4d-master-db-node-34c4d009-10 - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240928_030950/longevity-mv-si-4d-master-db-node-34c4d009-10-34c4d009.tar.gz
- db-cluster-34c4d009.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240929_050633/db-cluster-34c4d009.tar.gz
- sct-runner-events-34c4d009.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240929_050633/sct-runner-events-34c4d009.tar.gz
- sct-34c4d009.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240929_050633/sct-34c4d009.log.tar.gz
- loader-set-34c4d009.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240929_050633/loader-set-34c4d009.tar.gz
- monitor-set-34c4d009.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/34c4d009-73b1-490b-83e5-03f6705be5eb/20240929_050633/monitor-set-34c4d009.tar.gz
Starting from SM 3.3 and Scylla 6.0, SM restores schema by applying the output of DESC SCHEMA WITH INTERNALS.
The problem is that the keyspace definition contains DC names - that's why this test fails with the following error:
"M":"Run ended with ERROR","task":"restore/09af96b8-68b1-4bf6-928b-7fd01aa266f4","status":"ERROR","cause":"restore data: create \"100gb_sizetiered_6_0\" (\"100gb_sizetiered_6_0\") with CREATE KEYSPACE \"100gb_sizetiered_6_0\" WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3'} AND durable_writes = true: Unrecognized strategy option {us-east} passed to org.apache.cassandra.locator.NetworkTopologyStrategy for keyspace 100gb_sizetiered_6_0","duration":"5.618998928s"
So right now this is a documented limitation, but we should make it possible to restore schema into a different DC setting or make it easier for the user to modify just the DC part of keyspace schema.
Created an issue for that: #4049.