
Controlled rollout of ZOS on mainnet

Open xmonader opened this issue 1 year ago • 14 comments

We need to implement a controlled rollout for ZOS upgrades, especially on mainnet, to facilitate controlled experiments on different nodes or farms. The primary goal is to allow testing the next version of ZOS on selected nodes or farms without impacting the entire network. This will be crucial for evaluating new features or optimizations before rolling them out to the broader network.

  • We should be able to define the mainnet farms list to be used in the A/B testing
  • We should be able to define the mainnet nodes list to be used in the A/B testing
  • safe_to_upgrade_network, defaulting to false: this flag indicates whether it is safe to proceed with a network-wide upgrade to the latest zos version specified on chain
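
A minimal sketch of what such a rollout config could look like when loaded in Go. Only safe_to_upgrade_network comes from the list above; the JSON keys test_farms and test_nodes, the RolloutConfig type, and the helper names are hypothetical and chosen for illustration only.

```go
package main

import "encoding/json"

// RolloutConfig is a hypothetical shape for the rollout settings.
// Only the safe_to_upgrade_network name comes from the issue text;
// test_farms and test_nodes are assumed key names.
type RolloutConfig struct {
	TestFarms            []uint32 `json:"test_farms"`
	TestNodes            []uint32 `json:"test_nodes"`
	SafeToUpgradeNetwork bool     `json:"safe_to_upgrade_network"`
}

// ParseRolloutConfig decodes the config from JSON. A missing
// safe_to_upgrade_network key leaves the flag at false, which matches
// the "defaulting to false" requirement.
func ParseRolloutConfig(data []byte) (RolloutConfig, error) {
	var cfg RolloutConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

// contains reports whether id is present in ids.
func contains(ids []uint32, id uint32) bool {
	for _, v := range ids {
		if v == id {
			return true
		}
	}
	return false
}

// IsTestFarm reports whether farmID is on the test farms list.
func (c RolloutConfig) IsTestFarm(farmID uint32) bool {
	return contains(c.TestFarms, farmID)
}

// IsTestNode reports whether nodeID is on the test nodes list.
func (c RolloutConfig) IsTestNode(nodeID uint32) bool {
	return contains(c.TestNodes, nodeID)
}
```

Relying on the zero value of the bool means a config that omits the flag can never trigger a network-wide upgrade by accident.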

TODO...

xmonader avatar Sep 02 '24 15:09 xmonader

I don't mean to be picky about the wording.

We need to implement A/B Testing functionality for ZOS upgrades to facilitate controlled experiments on different nodes or farms. The primary goal is to allow for testing the next version of ZOS on selected nodes or farms without impacting the entire network

But I think what you really mean is staging, not A/B testing, @xmonader? The main difference is that with staging we really want the feature/version to be deployed in the end, while in A/B testing we want to choose one of two or more alternatives.

This will be crucial for evaluating new features or optimizations before rolling them out to the broader network.

Other than staging, we can also employ a feature flag/toggle technique:

  • activate the feature for several controlled users/nodes
  • deactivate when things go wrong.

iwanbk avatar Sep 03 '24 02:09 iwanbk

We already have qanet and testnet as staging environments for releases in the pipeline. What is needed is a controlled rollout on a small, defined subset of nodes on mainnet.

xmonader avatar Sep 03 '24 18:09 xmonader

Oh okay, feature flag/toggle then

iwanbk avatar Sep 04 '24 03:09 iwanbk

The flag is already toggled/set as part of the zos upgrade. When we want to upgrade nodes on mainnet, we create a proposal on tfchain that carries the zos version to upgrade the whole network to, and as soon as a node picks up that proposal's approval it starts its upgrade. What is needed is breaking that into two steps:

  • changing the version of zos on chain
  • controlled rollout: allow some of the nodes to upgrade as soon as they become aware of a new version, and require manual approval, given after testing on that first batch of nodes, before the upgrade propagates across the network.
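
The two steps above can be sketched as a single decision function that a node evaluates. This is only an illustration under assumptions: the UpgradeInput type, its field names, and ShouldUpgrade are hypothetical, and the first batch is identified by farm ID, as the thread settles on further down.

```go
package main

// UpgradeInput collects what a node would need to know to decide.
// All names here are hypothetical and not taken from the zos codebase.
type UpgradeInput struct {
	CurrentVersion  string   // version the node is running now
	ChainVersion    string   // version set on chain (step 1)
	FarmID          uint32   // farm this node belongs to
	TestFarms       []uint32 // first batch allowed to upgrade early
	NetworkApproved bool     // manual approval for the whole network (step 2)
}

// ShouldUpgrade returns true when the node may move to ChainVersion.
func ShouldUpgrade(in UpgradeInput) bool {
	// Nothing to do if the node already runs the on-chain version.
	if in.ChainVersion == in.CurrentVersion {
		return false
	}
	// After manual approval every node follows the chain.
	if in.NetworkApproved {
		return true
	}
	// Before approval only nodes in the first batch upgrade.
	for _, id := range in.TestFarms {
		if id == in.FarmID {
			return true
		}
	}
	return false
}
```

With this split, changing the version on chain no longer upgrades the whole network by itself: nodes outside the first batch keep their current version until the approval flag is set.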

xmonader avatar Sep 04 '24 06:09 xmonader

controlled rollout

OK, so it is clear that what we want is a controlled rollout.

iwanbk avatar Sep 04 '24 07:09 iwanbk

Farm IDs can be included in the A/B testing but we can't include node IDs. We can't get the ID of the node before the registration.

rawdaGastan avatar Sep 16 '24 08:09 rawdaGastan

Farm IDs can be included in the A/B testing but we can't include node IDs. We can't get the ID of the node before the registration.

If the node isn't registered, it's not part of the allowed nodes list by design, no?

xmonader avatar Sep 16 '24 08:09 xmonader

I mean the registration/noded module in general, even if the node is registered. That step only runs after the identityd module (which is the one responsible for the upgrade), so the node ID is not yet known at upgrade time.

rawdaGastan avatar Sep 16 '24 08:09 rawdaGastan

Alright, let's remove the nodes list and stick to farm IDs only.

xmonader avatar Sep 16 '24 08:09 xmonader

Do you want to use the node address instead, or will farm IDs alone be enough?

rawdaGastan avatar Sep 16 '24 08:09 rawdaGastan

I think farm IDs are enough; addresses are too cumbersome IMO.

xmonader avatar Sep 16 '24 08:09 xmonader

WIP:

  • waiting for the QA release to test
  • used the zos-config repo for the rollout configs

rawdaGastan avatar Sep 18 '24 13:09 rawdaGastan

Testing in progress

  • will be tested in the next QA release

rawdaGastan avatar Sep 25 '24 11:09 rawdaGastan

The current situation is that we have a config file (https://github.com/threefoldtech/zos-config/blob/main/development-v4.json#L68) which specifies whether the version is safe to upgrade to or not. This introduces inconsistency, since we are adding the version to the chain and on each change on the chain we also need to change this file. A better solution is to include safe_to_upgrade in the version itself, something like {"version": "3.15", "safe_to_upgrade": false}, which would be passed when creating the motion to upgrade the environment. We would keep the farms to be tested first in zos-config. This will require a very small change in ts_client and go_clients.
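
A rough sketch of how a client could decode such a version payload in Go. The two JSON keys are the ones proposed above; the ChainVersion type and ParseChainVersion function are hypothetical names, and rejecting a payload without a version field is an assumption about the desired behavior rather than something decided in this thread.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ChainVersion mirrors the proposed on-chain payload
// {"version": "3.15", "safe_to_upgrade": false}.
// The type name is hypothetical.
type ChainVersion struct {
	Version       string `json:"version"`
	SafeToUpgrade bool   `json:"safe_to_upgrade"`
}

// ParseChainVersion decodes the payload stored on chain.
// A missing safe_to_upgrade key decodes to false, so the safe
// default is "do not upgrade the whole network".
func ParseChainVersion(raw string) (ChainVersion, error) {
	var v ChainVersion
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		return ChainVersion{}, fmt.Errorf("invalid version payload: %w", err)
	}
	// Assumption: a payload without a version is treated as an error.
	if v.Version == "" {
		return ChainVersion{}, errors.New("version payload has no version field")
	}
	return v, nil
}
```

Keeping the flag next to the version means both change in the same motion, which removes the need to keep the config file in sync with the chain.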

ashraffouda avatar Oct 02 '24 09:10 ashraffouda

Verified multiple times

xmonader avatar Nov 06 '24 11:11 xmonader