scylla-cluster-tests

fix(upgrade): upgrade with raft topology procedure

Open aleksbykov opened this issue 11 months ago • 18 comments

After upgrading to the latest master (6.0), the raft topology feature, or tablets plus the raft topology feature, will be enabled by default. To switch a cluster from gossip to raft topology, the raft topology upgrade procedure should be executed. It is described here: https://github.com/scylladb/scylladb/blob/c5601a749e21fc710958a7c84316ecdf5943022c/docs/dev/topology-over-raft.md, section "Upgrade from legacy topology to raft-based topology".

The `enable_tablets_on_upgrade` parameter is used to control whether the upgrade runs with or without tablets. By default, tablets are enabled with the scylla parameter `enable_tablets`, which is added to scylla.yaml starting from 6.0. Upon upgrade, this parameter should be set explicitly to true to enable tablets. To keep tablets disabled after the upgrade, set it to false or do not add it to scylla.yaml.

Raft topology is enabled by default and can be disabled only with the force_gossip_topology_changes parameter. If this parameter is added to scylla.yaml before the upgrade, the raft feature "consistent topology changes" will not be enabled.

Testing

run upgrade from 5.4 -> 6.0.

PR pre-checks (self review)

  • [ ] I added the relevant backport labels
  • [ ] I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

aleksbykov avatar Mar 25 '24 09:03 aleksbykov

blocked by https://github.com/scylladb/scylladb/issues/18098

temichus avatar Apr 24 '24 11:04 temichus

I'm unassigning myself -- it is blocked by https://github.com/scylladb/scylladb/issues/18098 but that issue does not "belong" to me

kbr-scylla avatar May 16 '24 10:05 kbr-scylla

blocked by scylladb/scylladb#18098

Is this still the case?

mykaul avatar May 27 '24 07:05 mykaul

Should be no longer -- @aleksbykov please retest

(on nightly build -- the fix is not backported to 6.0 yet)

kbr-scylla avatar May 27 '24 15:05 kbr-scylla

@soyacz , @yarongilor, @fruch , @roydahan can you please review? Here is a short description of what should be tested and how.

Two major new features were introduced in the 6.0 and 6.1 releases:

  • tablets

  • raft topology (consistent_topology_changes).

    The 'raft topology changes' feature is enabled by default for any new cluster. No parameter related to this feature is needed in the scylla.yaml file to enable it. To disable the raft topology feature for a new cluster, we need to add a new parameter to scylla.yaml: 'force_gossip_topology_changes: true'. To enable the feature after upgrading from a version where the raft topology feature is missing, or from a version where it was disabled, we need to manually trigger the 'raft topology upgrade procedure'.

    It's important to note that once the upgrade procedure for enabling the raft topology has been performed, there is no way to revert back to the gossip topology.

    The tablets feature is disabled by default and depends on the raft topology feature. If the raft topology feature is disabled, then tablets cannot be enabled independently.

    For new clusters (scylla version >= 6.0), it is enabled by adding 'enable_tablets: true' to scylla.yaml. If a cluster was created with the tablets feature disabled, or was upgraded from a version < 6.0, then tablets are disabled.

    We need to support the following upgrade paths, because the sct master branch can be used with different versions and the change must be safe to backport to 6.0 and enterprise:

    1. 5.4 -> 6.0
    2. 5.4 -> 2024.2.dev
    3. 6.0 -> 6.1.dev
    4. 2024.1 -> 2024.2.dev

    Feature state per version:

    • 5.4, 2024.1: no tablets and no raft topology.
    • 6.0+: tablets and raft topology can each be in a different state (enabled/disabled).

    An upgrade from versions 5.4, 2024.1 -> 6.0, 2024.2 can be done with the following options:

    1. Raft topology disabled after upgrade. For that, we need to add the new parameter 'force_gossip_topology_changes: true' and not run the raft topology upgrade procedure after all nodes have been upgraded. In this case, tablets should not be enabled at all. Nothing else needs to be added to scylla.yaml.
    2. Raft topology enabled after upgrade. Nothing needs to be added to scylla.yaml. Run the raft topology upgrade procedure after all nodes have been upgraded.
    3. Tablets feature not enabled after upgrade. Nothing needs to be added to scylla.yaml (because tablets are disabled by default). The raft topology feature is handled as in points (1, 2).
    4. Tablets feature enabled after upgrade. Before the node upgrade, 'enable_tablets: true' should be added to scylla.yaml and, if raft topology was disabled by parameter, that parameter should be removed. After all nodes have been upgraded, run the raft topology upgrade procedure.

    Upgrade from 6.0 -> 6.1 and enterprise

    Default scenarios:

    1. The raft topology feature is already enabled. In this case we can't disable it upon upgrade and should run a regular upgrade.
    2. The tablets feature is enabled. In this case the feature can't be disabled after upgrade and the upgrade should run as is.
    3. The raft topology feature was disabled and will be enabled after upgrade. force_gossip_topology_changes can be removed from scylla.yaml before the upgrade or can stay, and the raft topology upgrade procedure has to be executed after all nodes have been upgraded.
    4. The raft topology feature was disabled and should stay disabled after upgrade. Nothing should be done with scylla.yaml, and the raft topology upgrade procedure shouldn't be executed after all nodes have been upgraded.
    5. The tablets feature is disabled and should be enabled after upgrade. scylla.yaml should be updated with 'enable_tablets: true' before the upgrade, and the raft topology upgrade procedure should be run after all nodes have been upgraded.
    6. The tablets feature is disabled and should stay disabled after upgrade. Nothing should be done with scylla.yaml, before or after the upgrade.

    To support all these paths, 2 sct_config parameters will be used:

    • enable_force_gossip_topology_changes_on_upgrade. Default value is false. This parameter controls whether the raft topology upgrade procedure is triggered.
    • enable_tablets_on_upgrade. Default value is true. When true, scylla.yaml is updated with 'enable_tablets: true'.

    The relevant jobs should also set the appropriate 'scylla_yaml_append' parameter:

    1. to disable tablets before upgrade: scylla_yaml_append: enable_tablets: false
    2. to disable the raft topology feature before upgrade: scylla_yaml_append: force_gossip_topology_changes: true, enable_tablets: false

    A combination of these four parameters should make it possible to support all possible configurations.

    Upgrade from 5.4 -> 6.0. Default upgrade with the raft topology feature and the tablets feature enabled:

    • enable_force_gossip_topology_on_upgrade: false
    • enable_tablets_on_upgrade: true
    • no additional 'append_scylla_yaml' parameters, which is the default for rolling upgrades

    The upgrade will be run; after the upgrade, the raft topology upgrade procedure will be executed, the raft topology feature will be enabled, and the tablets feature will be enabled via scylla.yaml.

    Upgrade from 5.4 -> 6.0. Upgrade with the raft topology feature enabled and the tablets feature disabled:

    • enable_force_gossip_topology_on_upgrade: false
    • enable_tablets_on_upgrade: false
    • no additional 'append_scylla_yaml' parameters, which is the default for rolling upgrades

    The upgrade will be run; after the upgrade, the raft topology upgrade procedure will be executed, the raft topology feature will be enabled, and the tablets feature will be disabled (it is not in scylla.yaml and the default is false).

    Upgrade from 6.0 -> 6.1. Default path: raft topology feature enabled, tablets enabled in scylla.yaml.

    • enable_force_gossip_topology_on_upgrade: false
    • enable_tablets_on_upgrade: false

    The upgrade will be run with both features enabled.

    Upgrade from 6.0 -> 6.1. Default path: raft topology feature enabled, tablets disabled in scylla.yaml.

    • append_scylla_yaml: enable_tablets: false
    • enable_force_gossip_topology_on_upgrade: false
    • enable_tablets_on_upgrade: true

    The upgrade will be run with the raft topology feature enabled; before the upgrade, scylla.yaml will be updated with enable_tablets: true.
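As a rough illustration of the proposal in this comment, the two sct_config parameters could be mapped to a scylla.yaml update and a decision on the raft topology upgrade procedure roughly as follows. This is a minimal sketch of the 5.4 -> 6.0 case only; `UpgradePlan` and `plan_upgrade` are hypothetical names that do not exist in SCT, and the mapping is an assumption based on the description above.

```python
from dataclasses import dataclass, field


@dataclass
class UpgradePlan:
    """Hypothetical summary of what a rolling-upgrade test would do."""
    scylla_yaml_updates: dict = field(default_factory=dict)
    run_raft_topology_upgrade: bool = True


def plan_upgrade(enable_force_gossip_topology_changes_on_upgrade: bool = False,
                 enable_tablets_on_upgrade: bool = True) -> UpgradePlan:
    """Model the 5.4 -> 6.0 options described in the comment (assumed mapping)."""
    if enable_force_gossip_topology_changes_on_upgrade:
        # Gossip topology is kept: the raft procedure is skipped and
        # tablets must not be enabled, since they depend on raft topology.
        return UpgradePlan(
            scylla_yaml_updates={"force_gossip_topology_changes": True},
            run_raft_topology_upgrade=False,
        )
    # Raft topology path: optionally enable tablets via scylla.yaml.
    updates = {"enable_tablets": True} if enable_tablets_on_upgrade else {}
    return UpgradePlan(scylla_yaml_updates=updates, run_raft_topology_upgrade=True)
```

With the defaults, `plan_upgrade()` yields a plan that adds `enable_tablets: true` and runs the raft topology upgrade procedure, which matches the first 5.4 -> 6.0 scenario.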

aleksbykov avatar Jun 26 '24 13:06 aleksbykov

@aleksbykov

@soyacz , @yarongilor, @fruch , @roydahan can you review please. Here is a short description what should be tested and how.

Two major new features were introduced in the 6.0 and 6.1 releases:

* tablets

* raft topology (consistent_topology_changes).
  The 'raft topology changes' feature is enabled by default for any new cluster.
  No parameter related to this feature could be used in the scylla.yaml file.
  To disable the raft topology feature for a new cluster we need to add a new parameter
  to scylla.yaml: 'force_gossip_topology_changes: true'.
  To enable the feature after upgrade from versions where 'raft topology feature' is missed
  or from version where it was disabled, we need manually trigger 'raft topology upgrade procedure'.
  It's important to note that once the upgrade procedure for enabling
  the raft topology has been performed, there is no way to revert back to the gossip topology.
  The tablets feature is disabled by default and depends on the raft topology feature.
  If the raft topology feature is disabled, then tablets cannot be enabled independently.
  For new clusters (scylla version >= 6.0), it is enabled via adding 'enable_tablets: true' to scylla.yaml
  If a cluster was created with the disabled tablets feature or was upgraded from version < 6.0,
  then tablets are disabled.
  We need to support the following upgrade paths because the sct master branch could be used
  with different versions and to safely backport to 6.0 and enterprise:
  
  1. 5.4 -> 6.0,
  2. 5.4->2024.2.dev,
  3. 6.0->6.1.dev,
  4. 2024.1 -> 2024.2.dev
  
  Feature state per versions:
  5.4, 2024.1 - doesn't have tablets and raft topology.
  6.0+ - could have tablets and raft topology in different states: enabled/disabled.
  Upgrade from versions 5.4, 2024.1 -> 6.0, 2024.2 could be done with the following options:
  
  1. raft topology disabled after upgrade.
     For that, we need to add a new parameter force_gossip_topology_changes:true and
     not run the raft topology upgrade procedure after all nodes have been upgraded.
     In this case, tablets should not be enabled at all. No need to add anything to scylla.yaml.
  2. raft topology enabled after upgrade.
     No need to add anything to scylla.yaml and run the raft topology upgrade procedure
     after all nodes have been upgraded.

force_gossip_topology_changes does not change anything in upgraded cluster. force_gossip_topology_changes affects only first node bootstrap in 6.0+. It does nothing in existing cluster. (cc @patjed41 -- correct me if I'm wrong)

When you upgrade from 5.4/2024.1 to 6.0/2024.2, raft topology is disabled -- and the only way to enable it is to trigger the raft topology upgrade procedure.

  3. tablets feature is not enabled after upgrade.
     No need to add anything to scylla.yaml (because tablets are disabled by default).
     Raft topology feature is default as in points (1,2).
  4. tablets feature is enabled after upgrade.
     Before node upgrade, 'enable_tablets: true' should be added to scylla.yaml and
     if the raft topology were disabled by parameter, it should be removed.
     And after all nodes have been upgraded, run the raft topology upgrade procedure.

Adding enable_tablets:true before version upgrade does not affect anything, I think. The cluster continues running in gossip mode (and hence without tablets) until you trigger the raft topology upgrade procedure.

But you can add it before or after triggering the raft topology upgrade procedure. If you add it before, then I guess tablets should automatically get enabled (whatever that means) in the cluster together with raft topology upgrade. If you add it after, then they will become enabled at that point.

cc @bhalevy -- please confirm

  Upgrade from 6.0 -> to 6.1 and enterprise
  default scenarios:
  
  1. raft topology feature already enabled.
     in this case we can't disable it upon upgrade and should run regular upgrade
  2. tablets feature enabled
     in this case features couldn't be disabled after upgrade and should run as is
  3. raft topology feature was disabled and it will be enabled after upgrade
     the force_gossip_topology_changes could be removed from scylla.yaml
     before upgrade or could stay and raft topology upgrade procedure have to be executed
     after all nodes has been upgraded
  4. raft topology feature was disabled and after upgrade it should stay disabled:
     nothing should be done with scylla.yaml and raft topology upgrade procedure shouldn't be executed
     after all nodes has been upgraded
  5. tablets feature is disabled and should be enabled after upgrade
     scylla.yaml should be updated with 'enable_tablets: true' before upgrade and
     raft topology upgrade procedure should be run after all node has been upgraded
  6. tablets feature is disabled and should be disabled after upgrade
     nothing should be done with scylla.yaml and after upgrade

Basically enabling raft topology or tablets is completely orthogonal to the 6.0 -> 6.1 version upgrade. You can do any permutation of [upgrade to raft topo, enable tablets, upgrade to 6.1] when starting from 6.0.
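If the three steps really are independent, a test matrix could simply enumerate every ordering of them. A minimal sketch, where the step labels are only descriptive strings and `upgrade_orderings` is a hypothetical helper, not something that exists in SCT:

```python
from itertools import permutations

# Descriptive labels for the three steps named above (not real SCT identifiers).
STEPS = ("upgrade to raft topology", "enable tablets", "upgrade to 6.1")


def upgrade_orderings(steps=STEPS):
    """Return every ordering of the given steps as a list of tuples."""
    return list(permutations(steps))


for ordering in upgrade_orderings():
    print(" -> ".join(ordering))
```

Three steps give six orderings, so covering all of them in rolling-upgrade jobs may be more than is practical.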

  to support all these paths 2 sct_config parameters will be used:
  
  * enable_force_gossip_topology_changes_on_upgrade. Default value is false.
    This parameter is used to trigger raft topology upgrade procedure
  * enable_tablets_on_upgrade. Default value is true
    update scylla.yaml with 'enable_tablets: true'
  
  and appropriate jobs should have appropriate 'scylla_yaml_append' parameter:
  
  1. to disable tablets before upgrade:
     - scylla_yaml_append:
     enable_tablets: false
  2. to disable raft topology feature before upgrade:
     - scylla_yaml_append:
     force_gossip_topology_changes: true
     enable_tablets: false
     force_gossip_topology_changes: true
  
  combination of these 4th parameters should allow to support all possible configurations.
  Upgrade from 5.4 -> 6.0
  Default upgrade with enable raft topology feature and tablets feature:
  enable_force_gossip_topology_on_upgrade: false
  enable_tablets_on_upgrade: true
  no additional `append_scylla_yaml' parameters which is default for rolling upgrades
  upgrade will be run and after upgrade raft toplogy feature will be enabled, raft topology upgrade procedure will be executed and tablets feature will be enabled via scylla.yaml
  Upgrade from 5.4 -> 6.0
  Default upgrade with enable raft topology feature and tablets feature:
  enable_force_gossip_topology_on_upgrade: false
  enable_tablets_on_upgrade: false
  no additional `append_scylla_yaml' parameters which is default for rolling upgrades
  upgrade will be run and after upgrade raft topology feature will be enabled, raft topology upgrade procedure will be executed and tablets feature will be disabled(no in scylla.yaml and default is false)
  Upgrade from 6.0->6.1
  default path: raft topology feature enabled tablets enabled in scylla.yaml.
  enable_force_gossip_topology_on_upgrade: false
  enable_tablets_on_upgrade: false
  upgrade will be run with both features enabled
  Upgrade from 6.0->6.1
  default path: raft topology feature enabled tablets disabled in scylla.yaml.
  append_scylla_yaml: enable_tablets: false
  enable_force_gossip_topology_on_upgrade: false
  enable_tablets_on_upgrade: true
  upgrade will be run with enabled raft topology feature, before upgrade scylla.yaml will be updated with enable_tablets: true

kbr-scylla avatar Jun 28 '24 12:06 kbr-scylla

force_gossip_topology_changes does not change anything in upgraded cluster. force_gossip_topology_changes affects only first node bootstrap in 6.0+. It does nothing in existing cluster. (cc @patjed41 -- correct me if I'm wrong)

Yes, force_gossip_topology_changes affects only the first node. Non-first nodes:

  • will fail to boot (join or restart) with force_gossip_topology_changes=true if the cluster uses the raft topology,
  • will boot and use the gossip topology regardless of the force_gossip_topology_changes value if the cluster uses the gossip topology.
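The two rules for non-first nodes can be written down as a small decision function. This is only a model of the behaviour described above, not Scylla code; `Topology` and `non_first_node_boot` are hypothetical names, and the raft-cluster case without the flag is assumed to simply follow the cluster.

```python
from enum import Enum


class Topology(Enum):
    GOSSIP = "gossip"
    RAFT = "raft"


def non_first_node_boot(cluster_topology: Topology,
                        force_gossip_topology_changes: bool) -> Topology:
    """Return the topology mode a non-first node ends up using.

    Raises RuntimeError for the combination that is described as failing
    to boot: the flag set to true in a cluster that uses raft topology.
    """
    if cluster_topology is Topology.RAFT and force_gossip_topology_changes:
        raise RuntimeError(
            "node fails to boot: force_gossip_topology_changes=true "
            "in a cluster that uses raft topology"
        )
    # In a gossip cluster the flag is ignored; otherwise follow the cluster.
    return cluster_topology
```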

patjed41 avatar Jul 01 '24 07:07 patjed41

force_gossip_topology_changes does not change anything in upgraded cluster. force_gossip_topology_changes affects only first node bootstrap in 6.0+. It does nothing in existing cluster. (cc @patjed41 -- correct me if I'm wrong)

Yes, force_gossip_topology_changes affects only the first node. Non-first nodes:

  • will fail to boot (join or restart) with force_gossip_topology_changes=true if the cluster uses the raft topology,
  • will boot and use the gossip topology regardless of the force_gossip_topology_changes value if the cluster uses the gossip topology.

@kbr-scylla

I think using force gossip for 6.0->6.1.dev isn't a flow that needs to be covered?

fruch avatar Jul 02 '24 06:07 fruch

Well, in theory, force_gossip_topology_changes is only a flag for testing -- so we can continue testing gossip-based topology changes in 6.0.

We don't advertise that we have such a flag to our users and customers; the official message is that raft-topology is enabled in new 6.0 clusters and mandatory.

So maybe indeed it doesn't make sense to test upgrades from "6.0 booted in gossip mode".

cc @mykaul @tzach

kbr-scylla avatar Jul 02 '24 13:07 kbr-scylla

So what decision will we make?

  1. Do we always run the raft topology upgrade procedure after upgrade?
  2. Gossip mode could be enabled by force_enable_gossip_topology in scylla.yaml, but after the upgrade to 6.1 we enable raft topology as in p.1?
  3. Tablets could be enabled/disabled on the base version (6.0+) and should be enabled with the enable_tablets_on_upgrade sct parameter after upgrade?

@kbr-scylla , @fruch , @soyacz , @mykaul

aleksbykov avatar Jul 03 '24 11:07 aleksbykov

  1. Do we always run raft topology upgrade procedure after upgrade?

Yes... We want to move clusters (and customers) to Raft.

  1. gossip mode could be enabled by force_enable_gossip_topology in scylla.yaml, but after upgrade to 6.1, enable raft topology with p.1?
  2. tablets could be enabled/disabled on base verstion (6.0+) and should be enabled with enable_tablets_on_upgrade sct parameter after upgrade?

Not sure I understand the question.

mykaul avatar Jul 03 '24 13:07 mykaul

@aleksbykov looks like you edited Yaniv's post instead of answering it. I edited it back, posting your question below:

@mykaul , the question is as follows: if for some reason a 6.0 or 2024.2 cluster was created in gossip topology mode (the only way to do that for a new cluster is by adding force_gossip_topology_changes:true to scylla.yaml), then we run the upgrade to 6.1 or Enterprise and run the raft topology procedure after the upgrade, so the cluster will have raft topology enabled.

so path 6.0 (gossip topology) -> 6.1(gossip topology) is not relevant and not a case for testing and case for customer???

And regarding tablets: we should support the following upgrade paths for 6.0->6.1 (before upgrade -> after upgrade):

  • tablets disabled -> tablets disabled
  • tablets disabled -> tablets enabled - default path
  • tablets enabled -> tablets enabled - default path
  • tablets enabled -> tablets disabled - this is not relevant ???

kbr-scylla avatar Jul 04 '24 09:07 kbr-scylla

And my answer to the first part:

so path 6.0 (gossip topology) -> 6.1(gossip topology) is not relevant and not a case for testing and case for customer???

I think it's not relevant. force_gossip_topology_changes is not an officially supported option, so we can assume that new 6.0 clusters use raft topology mode. If you upgraded from 5.4, then the documentation says to perform the upgrade-to-raft-topology procedure afterwards.

If someone decides to stay in gossip mode in 6.0, I would say it's on them. We don't support it.

@mykaul please confirm.

kbr-scylla avatar Jul 04 '24 09:07 kbr-scylla

@aleksbykov looks like you edited Yaniv's post instead of answering to it I edited it back, posting your question below:

@mykaul , the question is next: If for some reason, cluster 6.0, 2024.2 was created with gossip topology mode, ( the only way for new cluster is by adding force_gossip_topology_changes:true to scylla.yaml), then we run upgrade to 6.1, Enterprise and run raft topology procedure after upgrade, so cluster will have raft topology enabled.

so path 6.0 (gossip topology) -> 6.1(gossip topology) is not relevant and not a case for testing and case for customer???

And regarding tablets. we should support next upgrade paths for 6.0->6.1 (before upgrade -> after upgrade):

  • tablets disabled -> tablets disabled
  • tablets disabled -> tablets enabled - default path
  • tablets enabled -> tablets enabled - default path
  • tablets enabled -> tablets disabled - this is not relevant ???

I don't think we should invest effort in the paths that lead to tablets disabled.
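One way to read this is as a filter over the four before/after combinations in the quoted question. A minimal sketch, where each path is a hypothetical `(tablets_before, tablets_after)` pair of booleans and both helper names are made up for illustration:

```python
from itertools import product


def tablets_upgrade_paths():
    """All (tablets_before, tablets_after) combinations for 6.0 -> 6.1."""
    return list(product((False, True), repeat=2))


def paths_worth_testing(paths):
    """Keep only the paths that end with tablets enabled."""
    return [(before, after) for before, after in paths if after]


for before, after in paths_worth_testing(tablets_upgrade_paths()):
    print(f"tablets {'enabled' if before else 'disabled'} -> tablets enabled")
```

Under this reading, two of the four paths remain: disabled -> enabled and enabled -> enabled.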

mykaul avatar Jul 04 '24 10:07 mykaul

@soyacz , @yarongilor can you review? New:

  1. Removed force-gossip-topology-changes-on-upgrade from the config.
  2. The raft topology feature has to be enabled after upgrade. If any node does not support the topology_consistent_changes feature, the raft upgrade procedure will not be run; if all nodes support the feature after the upgrade, the raft upgrade procedure will be executed.
  3. Tablets are controlled by the sct parameter and the scylla.yaml parameter.
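The rule in point 2 amounts to an "all nodes support the feature" check before triggering the procedure. A minimal sketch under that assumption; the function name, the input shape, and the feature string (copied as written in this comment) are hypothetical and not taken from the SCT code:

```python
from typing import Mapping, Set

# Feature name as written in this comment; the exact identifier reported
# by Scylla may differ (assumption).
RAFT_TOPOLOGY_FEATURE = "topology_consistent_changes"


def should_run_raft_topology_upgrade(features_by_node: Mapping[str, Set[str]]) -> bool:
    """Run the raft upgrade procedure only if every node supports the feature."""
    if not features_by_node:
        # No nodes to check: do not trigger the procedure.
        return False
    return all(RAFT_TOPOLOGY_FEATURE in features
               for features in features_by_node.values())
```

For example, a cluster where one node has not been upgraded yet and does not report the feature would return `False`, so the procedure would be skipped until all nodes support it.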

aleksbykov avatar Jul 09 '24 05:07 aleksbykov

Jobs passed:

  • https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/rolling-upgrade-staging-abykov/job/rolling-upgrade-with-enable-tablets-on-upgrade-test/9
  • https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/rolling-upgrade-staging-abykov/job/rolling-upgrade-ami-test/26

aleksbykov avatar Jul 09 '24 05:07 aleksbykov

Jobs passed:

  • https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/rolling-upgrade-staging-abykov/job/rolling-upgrade-with-disabled-raft-topology-on-upgrade-test/11
  • https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/rolling-upgrade-staging-abykov/job/rolling-upgrade-ami-test/28
  • https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/abykov/job/rolling-upgrade-staging-abykov/job/rolling-upgrade-with-enable-tablets-on-upgrade-test/11

aleksbykov avatar Jul 30 '24 14:07 aleksbykov

@soyacz @fruch can you take a look?

aleksbykov avatar Jul 30 '24 14:07 aleksbykov