scylla-cluster-tests
fix(upgrade): upgrade with raft topology procedure
After upgrading to the latest master (6.0), the raft topology feature, or the tablets and raft topology features together, will be enabled by default. To switch a cluster from gossip to raft topology, the raft topology upgrade procedure must be executed. It is described here: https://github.com/scylladb/scylladb/blob/c5601a749e21fc710958a7c84316ecdf5943022c/docs/dev/topology-over-raft.md section: Upgrade from legacy topology to raft-based topology
The parameter `enable_tablets_on_upgrade` is used to control upgrade with or without tablets. Tablets are enabled with the scylla parameter `enable_tablets`, which is added to scylla.yaml starting from 6.0. Upon upgrade, this parameter should be set explicitly to true to enable tablets; to keep tablets disabled after upgrade, set it to false or do not add it to scylla.yaml.
Raft topology is enabled by default and can be disabled only with the parameter `force_gossip_topology_changes`. If this parameter is added to scylla.yaml before upgrade, then the raft feature (consistent topology changes) will not be enabled.
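A minimal sketch of how the `enable_tablets_on_upgrade` switch could translate into a scylla.yaml change, assuming the file content is held as a plain dict. The helper name `apply_tablets_setting` is hypothetical and is not taken from the SCT code.

```python
def apply_tablets_setting(scylla_yaml: dict, enable_tablets_on_upgrade: bool) -> dict:
    """Return a copy of the scylla.yaml content adjusted for the upgrade.

    Hypothetical helper: when tablets are requested, enable_tablets is set
    explicitly to True; otherwise the content is left untouched, so tablets
    stay disabled as described above.
    """
    updated = dict(scylla_yaml)  # do not mutate the caller's dict
    if enable_tablets_on_upgrade:
        updated["enable_tablets"] = True
    return updated


print(apply_tablets_setting({"cluster_name": "upgrade-test"}, True))
```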
Testing
run upgrade from 5.4 -> 6.0.
PR pre-checks (self review)
- [ ] I added the relevant `backport` labels
- [ ] I didn't leave commented-out/debugging code
Reminders
- Add New configuration option and document them (in `sdcm/sct_config.py`)
- Add unit tests to cover my changes (under `unit-test/` folder)
- Update the Readme/doc folder relevant to this change (if needed)
blocked by https://github.com/scylladb/scylladb/issues/18098
I'm unassigning myself -- it is blocked by https://github.com/scylladb/scylladb/issues/18098 but that issue does not "belong" to me
Should be no longer -- @aleksbykov please retest
(on nightly build -- the fix is not backported to 6.0 yet)
@soyacz, @yarongilor, @fruch, @roydahan can you review please? Here is a short description of what should be tested and how.
Two major new features were introduced in the 6.0 and 6.1 releases:
- tablets
- raft topology (consistent_topology_changes).
The 'raft topology changes' feature is enabled by default for any new cluster; no parameter related to this feature needs to be set in the scylla.yaml file. To disable the raft topology feature for a new cluster, we need to add a new parameter to scylla.yaml: 'force_gossip_topology_changes: true'. To enable the feature after upgrading from a version where the raft topology feature is missing, or from a version where it was disabled, we need to manually trigger the 'raft topology upgrade procedure'.
It's important to note that once the upgrade procedure for enabling the raft topology has been performed, there is no way to revert back to the gossip topology.
The tablets feature is disabled by default and depends on the raft topology feature. If the raft topology feature is disabled, then tablets cannot be enabled independently.
For new clusters (scylla version >= 6.0), it is enabled by adding 'enable_tablets: true' to scylla.yaml. If a cluster was created with the tablets feature disabled, or was upgraded from a version < 6.0, then tablets are disabled.
We need to support the following upgrade paths because the sct master branch could be used with different versions and to safely backport to 6.0 and enterprise:
- 5.4 -> 6.0,
- 5.4->2024.2.dev,
- 6.0->6.1.dev,
- 2024.1 -> 2024.2.dev
Feature state per version:
- 5.4, 2024.1: no tablets and no raft topology.
- 6.0+: tablets and raft topology can each be enabled or disabled.
Upgrade from versions 5.4, 2024.1 -> 6.0, 2024.2 could be done with the following options:
1. raft topology disabled after upgrade. For that, we need to add the new parameter 'force_gossip_topology_changes: true' and not run the raft topology upgrade procedure after all nodes have been upgraded. In this case, tablets should not be enabled at all. Nothing else needs to be added to scylla.yaml.
2. raft topology enabled after upgrade. Nothing needs to be added to scylla.yaml; run the raft topology upgrade procedure after all nodes have been upgraded.
3. tablets feature not enabled after upgrade. Nothing needs to be added to scylla.yaml (because tablets are disabled by default). The raft topology feature is handled as in points 1 and 2.
4. tablets feature enabled after upgrade. Before the node upgrade, 'enable_tablets: true' should be added to scylla.yaml, and if raft topology was disabled by parameter, that parameter should be removed. After all nodes have been upgraded, run the raft topology upgrade procedure.
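The four options above can be sketched as a small planning function. This only mirrors the options as listed in this comment; `UpgradePlan` and `plan_legacy_upgrade` are hypothetical names, not SCT code.

```python
from dataclasses import dataclass, field


@dataclass
class UpgradePlan:
    scylla_yaml_add: dict = field(default_factory=dict)     # keys to add before upgrade
    scylla_yaml_remove: list = field(default_factory=list)  # keys to drop before upgrade
    run_raft_topology_upgrade: bool = False                 # run procedure after all nodes are upgraded


def plan_legacy_upgrade(raft_topology_after: bool, tablets_after: bool) -> UpgradePlan:
    """Plan an upgrade from 5.4 / 2024.1 following the options listed above."""
    if tablets_after and not raft_topology_after:
        # Tablets depend on the raft topology feature.
        raise ValueError("tablets cannot be enabled without raft topology")
    plan = UpgradePlan()
    if not raft_topology_after:
        # Option 1: stay in gossip mode, never run the procedure.
        plan.scylla_yaml_add["force_gossip_topology_changes"] = True
        return plan
    # Option 2: nothing in scylla.yaml, run the procedure afterwards.
    plan.run_raft_topology_upgrade = True
    if tablets_after:
        # Option 4: enable tablets and drop the gossip override if present.
        plan.scylla_yaml_add["enable_tablets"] = True
        plan.scylla_yaml_remove.append("force_gossip_topology_changes")
    return plan


print(plan_legacy_upgrade(raft_topology_after=True, tablets_after=True))
```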
Upgrade from 6.0 -> 6.1 and enterprise
Default scenarios:
1. raft topology feature already enabled. In this case we can't disable it upon upgrade and should run a regular upgrade.
2. tablets feature enabled. In this case the feature can't be disabled after upgrade and the upgrade should run as is.
3. raft topology feature was disabled and will be enabled after upgrade. force_gossip_topology_changes can be removed from scylla.yaml before upgrade or can stay, and the raft topology upgrade procedure has to be executed after all nodes have been upgraded.
4. raft topology feature was disabled and should stay disabled after upgrade. Nothing should be done with scylla.yaml, and the raft topology upgrade procedure shouldn't be executed after all nodes have been upgraded.
5. tablets feature is disabled and should be enabled after upgrade. scylla.yaml should be updated with 'enable_tablets: true' before upgrade, and the raft topology upgrade procedure should be run after all nodes have been upgraded.
6. tablets feature is disabled and should stay disabled after upgrade. Nothing should be done with scylla.yaml, and nothing after upgrade.
To support all these paths, 2 sct_config parameters will be used:
- enable_force_gossip_topology_changes_on_upgrade. Default value is false. This parameter is used to trigger the raft topology upgrade procedure.
- enable_tablets_on_upgrade. Default value is true. It updates scylla.yaml with 'enable_tablets: true'.
The appropriate jobs should also have the appropriate 'scylla_yaml_append' parameter:
- to disable tablets before upgrade: scylla_yaml_append with 'enable_tablets: false'
- to disable the raft topology feature before upgrade: scylla_yaml_append with 'force_gossip_topology_changes: true' and 'enable_tablets: false'
The combination of these 4 parameters should make it possible to support all possible configurations.
Upgrade from 5.4 -> 6.0. Default upgrade with the raft topology feature and the tablets feature enabled:
- enable_force_gossip_topology_on_upgrade: false
- enable_tablets_on_upgrade: true
- no additional 'append_scylla_yaml' parameters, which is the default for rolling upgrades
The upgrade will be run; after the upgrade, the raft topology upgrade procedure will be executed, so the raft topology feature will be enabled, and the tablets feature will be enabled via scylla.yaml.
Upgrade from 5.4 -> 6.0. Upgrade with the raft topology feature enabled and the tablets feature disabled:
- enable_force_gossip_topology_on_upgrade: false
- enable_tablets_on_upgrade: false
- no additional 'append_scylla_yaml' parameters, which is the default for rolling upgrades
The upgrade will be run; after the upgrade, the raft topology upgrade procedure will be executed, so the raft topology feature will be enabled, and the tablets feature will stay disabled (it is not in scylla.yaml and the default is false).
Upgrade from 6.0 -> 6.1, default path: raft topology feature enabled, tablets enabled in scylla.yaml:
- enable_force_gossip_topology_on_upgrade: false
- enable_tablets_on_upgrade: false
The upgrade will be run with both features enabled.
Upgrade from 6.0 -> 6.1, default path: raft topology feature enabled, tablets disabled in scylla.yaml:
- append_scylla_yaml: enable_tablets: false
- enable_force_gossip_topology_on_upgrade: false
- enable_tablets_on_upgrade: true
The upgrade will be run with the raft topology feature enabled; before the upgrade, scylla.yaml will be updated with 'enable_tablets: true'.
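The four example runs above are consistent with a simple rule: tablets end up enabled if they were already enabled on the base version or if `enable_tablets_on_upgrade` is true. A rough sketch of that reading follows; the `Scenario` type and `expected_tablets_after` helper are made up for illustration and do not come from SCT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    base_tablets_enabled: bool       # tablets state on the base version
    enable_tablets_on_upgrade: bool  # sct parameter described in this comment


def expected_tablets_after(scenario: Scenario) -> bool:
    """Tablets cannot be switched off by the upgrade, only switched on."""
    return scenario.base_tablets_enabled or scenario.enable_tablets_on_upgrade


SCENARIOS = [
    Scenario("5.4 -> 6.0, tablets enabled on upgrade", False, True),
    Scenario("5.4 -> 6.0, tablets left disabled", False, False),
    Scenario("6.0 -> 6.1, tablets already enabled", True, False),
    Scenario("6.0 -> 6.1, tablets enabled on upgrade", False, True),
]

for scenario in SCENARIOS:
    print(f"{scenario.name}: tablets after upgrade = {expected_tablets_after(scenario)}")
```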
@aleksbykov
> 1. raft topology disabled after upgrade. For that, we need to add a new parameter force_gossip_topology_changes:true and not run the raft topology upgrade procedure after all nodes have been upgraded. In this case, tablets should not be enabled at all. No need to add anything to scylla.yaml.
> 2. raft topology enabled after upgrade. No need to add anything to scylla.yaml and run the raft topology upgrade procedure after all nodes have been upgraded.

`force_gossip_topology_changes` does not change anything in an upgraded cluster. `force_gossip_topology_changes` affects only the first node bootstrap in 6.0+. It does nothing in an existing cluster. (cc @patjed41 -- correct me if I'm wrong)
When you upgrade from 5.4/2024.1 to 6.0/2024.2, raft topology is disabled -- and the only way to enable it is to trigger the raft topology upgrade procedure.
> 3. tablets feature is not enabled after upgrade. No need to add anything to scylla.yaml (because tablets are disabled by default). Raft topology feature is default as in points (1,2).
> 4. tablets feature is enabled after upgrade. Before node upgrade, 'enable_tablets: true' should be added to scylla.yaml and if the raft topology were disabled by parameter, it should be removed. And after all nodes have been upgraded, run the raft topology upgrade procedure.

Adding `enable_tablets: true` before version upgrade does not affect anything, I think. The cluster continues running in gossip mode (and hence without tablets) until you trigger the raft topology upgrade procedure.
But you can add it before or after triggering the raft topology upgrade procedure. If you add it before, then I guess tablets should automatically get enabled (whatever that means) in the cluster together with raft topology upgrade. If you add it after, then they will become enabled at that point.
cc @bhalevy -- please confirm
> Upgrade from 6.0 -> to 6.1 and enterprise default scenarios:
> 1. raft topology feature already enabled. in this case we can't disable it upon upgrade and should run regular upgrade
> 2. tablets feature enabled in this case features couldn't be disabled after upgrade and should run as is
> 3. raft topology feature was disabled and it will be enabled after upgrade the force_gossip_topology_changes could be removed from scylla.yaml before upgrade or could stay and raft topology upgrade procedure have to be executed after all nodes has been upgraded
> 4. raft topology feature was disabled and after upgrade it should stay disabled: nothing should be done with scylla.yaml and raft topology upgrade procedure shouldn't be executed after all nodes has been upgraded
> 5. tablets feature is disabled and should be enabled after upgrade scylla.yaml should be updated with 'enable_tablets: true' before upgrade and raft topology upgrade procedure should be run after all node has been upgraded
> 6. tablets feature is disabled and should be disabled after upgrade nothing should be done with scylla.yaml and after upgrade

Basically enabling raft topology or tablets is completely orthogonal to the 6.0 -> 6.1 version upgrade. You can do any permutation of [upgrade to raft topo, enable tablets, upgrade to 6.1] when starting from 6.0.
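Taking the "any permutation" statement at face value, the possible orderings of the three steps can be enumerated as below. This is only an illustration of the claim; which orderings deserve a dedicated test job is a separate decision, and the step labels are just strings.

```python
from itertools import permutations

# The three independent steps named above, starting from a 6.0 cluster.
STEPS = ("upgrade to raft topology", "enable tablets", "upgrade to 6.1")

orderings = list(permutations(STEPS))
for order in orderings:
    print(" -> ".join(order))
```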
> `force_gossip_topology_changes` does not change anything in upgraded cluster. `force_gossip_topology_changes` affects only first node bootstrap in 6.0+. It does nothing in existing cluster. (cc @patjed41 -- correct me if I'm wrong)

Yes, `force_gossip_topology_changes` affects only the first node. Non-first nodes:
- will fail to boot (join or restart) with `force_gossip_topology_changes=true` if the cluster uses the raft topology,
- will boot and use the gossip topology regardless of the `force_gossip_topology_changes` value if the cluster uses the gossip topology.
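The behaviour of non-first nodes described above can be written down as a small decision function. This is a model of the description, not Scylla code; the names are hypothetical, and the last case (raft cluster, flag not set) is an assumption that the thread does not spell out.

```python
from enum import Enum


class BootResult(Enum):
    FAILS = "fails to boot"
    GOSSIP = "boots and uses the gossip topology"
    RAFT = "boots and uses the raft topology"


def non_first_node_boot(cluster_uses_raft_topology: bool,
                        force_gossip_topology_changes: bool) -> BootResult:
    """Model the outcome of a non-first node joining or restarting."""
    if not cluster_uses_raft_topology:
        # Gossip cluster: the flag value is irrelevant.
        return BootResult.GOSSIP
    if force_gossip_topology_changes:
        # Raft cluster with the flag set: the node refuses to boot.
        return BootResult.FAILS
    # Assumption: without the flag the node simply follows the raft cluster.
    return BootResult.RAFT


print(non_first_node_boot(True, True).value)
```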
@kbr-scylla
I think using force gossip for 6.0->6.1.dev isn't a flow that needs to be covered, is it?
Well, in theory, `force_gossip_topology_changes` is only a flag for testing -- so we can continue testing gossip-based topology changes in 6.0.
We don't advertise that we have such a flag to our users and customers; the official message is that raft-topology is enabled in new 6.0 clusters and mandatory.
So maybe indeed it doesn't make sense to test upgrades from "6.0 booted in gossip mode".
cc @mykaul @tzach
So what decision will we make:
- Do we always run the raft topology upgrade procedure after upgrade?
- gossip mode could be enabled by force_gossip_topology_changes in scylla.yaml, but after upgrade to 6.1, do we enable raft topology as in point 1?
- tablets could be enabled/disabled on the base version (6.0+) and should be enabled with the enable_tablets_on_upgrade sct parameter after upgrade?
@kbr-scylla, @fruch, @soyacz, @mykaul
> - Do we always run raft topology upgrade procedure after upgrade?

Yes... We want to move clusters (and customers) to Raft.
> - gossip mode could be enabled by force_gossip_topology_changes in scylla.yaml, but after upgrade to 6.1, enable raft topology with p.1?
> - tablets could be enabled/disabled on base version (6.0+) and should be enabled with enable_tablets_on_upgrade sct parameter after upgrade?

Not sure I understand the question.
@aleksbykov looks like you edited Yaniv's post instead of answering it. I edited it back; posting your question below:
@mykaul, the question is the following: if for some reason a 6.0 / 2024.2 cluster was created in gossip topology mode (the only way for a new cluster is by adding force_gossip_topology_changes:true to scylla.yaml), then we run the upgrade to 6.1 / Enterprise and run the raft topology procedure after upgrade, so the cluster will have raft topology enabled.
So the path 6.0 (gossip topology) -> 6.1 (gossip topology) is not relevant: not a case for testing and not a case for customers???
And regarding tablets, we should support the following upgrade paths for 6.0->6.1 (before upgrade -> after upgrade):
- tablets disabled -> tablets disabled
- tablets disabled -> tablets enabled - default path
- tablets enabled -> tablets enabled - default path
- tablets enabled -> tablets disabled - this is not relevant ???
And my answer to the first part:
> so path 6.0 (gossip topology) -> 6.1(gossip topology) is not relevant and not a case for testing and case for customer???

I think it's not relevant. `force_gossip_topology_changes` is not an officially supported option, so we can assume that new 6.0 clusters use raft topology mode. If you upgraded from 5.4 then the documentation says to perform the upgrade-to-raft-topology after.
If someone decides to stay in gossip mode in 6.0, I would say it's on them. We don't support it.
@mykaul please confirm.
> And regarding tablets. we should support next upgrade paths for 6.0->6.1 (before upgrade -> after upgrade):
> - tablets disabled -> tablets disabled
> - tablets disabled -> tablets enabled - default path
> - tablets enabled -> tablets enabled - default path
> - tablets enabled -> tablets disabled - this is not relevant ???

I don't think we should invest effort in the paths that lead to tablets disabled.
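As a small illustration of that filter, the 6.0->6.1 tablet paths can be enumerated and narrowed down to the ones that end with tablets enabled. The state labels are plain strings chosen for this sketch.

```python
from itertools import product

STATES = ("disabled", "enabled")

# Every (before upgrade, after upgrade) combination of the tablets state.
all_paths = list(product(STATES, repeat=2))

# Keep only the paths that end with tablets enabled.
paths_to_test = [(before, after) for before, after in all_paths if after == "enabled"]

for before, after in paths_to_test:
    print(f"tablets {before} -> tablets {after}")
```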
@soyacz, @yarongilor can you review?
New:
- Remove force-gossip-topology-changes-on-upgrade from config.
- The raft topology feature has to be enabled after upgrade. If any node does not support the topology_consistent_changes feature, the raft upgrade procedure will not be run; if all nodes support the feature after upgrade, the raft upgrade procedure will be executed.
- Tablets are controlled by an sct parameter and a scylla.yaml parameter.
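The gating rule in the second point could look roughly like the check below. It is a sketch under assumptions: the function name, the input shape (a mapping from node name to its supported feature names) and the feature string are placeholders, not the actual SCT or Scylla identifiers.

```python
from typing import Iterable, Mapping

# Placeholder feature name for illustration; the real identifier may differ.
RAFT_TOPOLOGY_FEATURE = "consistent_topology_changes"


def should_run_raft_topology_upgrade(
    features_by_node: Mapping[str, Iterable[str]],
    feature: str = RAFT_TOPOLOGY_FEATURE,
) -> bool:
    """Run the procedure only when every node reports support for the feature."""
    if not features_by_node:
        return False  # no nodes known, nothing to upgrade
    return all(feature in set(features) for features in features_by_node.values())


nodes = {
    "node-1": ["consistent_topology_changes", "tablets"],
    "node-2": ["consistent_topology_changes"],
}
print(should_run_raft_topology_upgrade(nodes))
```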
Jobs passed:
- https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/rolling-upgrade-staging-abykov/job/rolling-upgrade-with-enable-tablets-on-upgrade-test/9
- https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/rolling-upgrade-staging-abykov/job/rolling-upgrade-ami-test/26
Jobs passed:
- https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/rolling-upgrade-staging-abykov/job/rolling-upgrade-with-disabled-raft-topology-on-upgrade-test/11
- https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/rolling-upgrade-staging-abykov/job/rolling-upgrade-ami-test/28
- https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/rolling-upgrade-staging-abykov/job/rolling-upgrade-with-enable-tablets-on-upgrade-test/11
@soyacz @fruch can you take a look?