Controlled rollout of ZOS on mainnet
We need to implement a controlled rollout for ZOS upgrades, especially on mainnet, to facilitate controlled experiments on different nodes or farms. The primary goal is to allow testing the next version of ZOS on selected nodes or farms without impacting the entire network. This will be crucial for evaluating new features or optimizations before rolling them out to the broader network.
- We should be able to define the mainnet farms list to be used in the A/B testing
- We should be able to define the mainnet nodes list to be used in the A/B testing
- safe_to_upgrade_network, defaulted to false: This flag indicates whether it is safe to proceed with network-wide upgrades to the latest zos version specified on chain
TODO...
I don't mean to be picky about the wording.
We need to implement A/B Testing functionality for ZOS upgrades to facilitate controlled experiments on different nodes or farms. The primary goal is to allow for testing the next version of ZOS on selected nodes or farms without impacting the entire network
But I think what you really mean is staging, not A/B testing, @xmonader?
The main difference is that with staging, we really intend for the feature/version to be deployed in the end.
While in A/B testing, we want to choose one between two or more alternatives.
This will be crucial for evaluating new features or optimizations before rolling them out to the broader network.
Other than staging, we can also employ feature flag/toggle technique:
- activate the feature for several controlled users/nodes
- deactivate when things go wrong.
We already have qanet and testnet as staging environments for the release pipeline. What is needed is a controlled rollout on a small, defined subset of nodes on mainnet.
Oh okay, feature flag/toggle then
The flag is already toggled/set as part of the zos upgrade. When we want to upgrade nodes on mainnet, we create a proposal on tfchain that carries the zos version to upgrade the whole network to, and as soon as a node picks up that proposal's approval it starts its upgrade. What is needed is to break that into two steps:
- changing the version of zos on chain
- controlled rollout: some of the nodes upgrade as soon as they become aware of a new version, and a manual approval propagates the upgrade across the network after testing on that first batch of nodes.
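The two steps above can be sketched as a single decision a node makes when it sees a new version on chain. This is only an illustration, not the actual zos code: `RolloutConfig`, `ShouldUpgrade`, and the idea of identifying the first batch by farm ID are assumptions made for the example.

```go
package main

import "fmt"

// RolloutConfig is a hypothetical shape for the rollout settings; it is not
// taken from the zos code base.
type RolloutConfig struct {
	// SafeToUpgrade is the manual approval: false until the first batch
	// of nodes has been tested.
	SafeToUpgrade bool
	// TestFarms lists the farm IDs allowed to upgrade before approval.
	TestFarms []uint32
}

// ShouldUpgrade reports whether a node in farmID, currently running
// `current`, should move to the version found on chain.
func ShouldUpgrade(cfg RolloutConfig, farmID uint32, current, onChain string) bool {
	if current == onChain {
		return false // already on the chain version, nothing to do
	}
	if cfg.SafeToUpgrade {
		return true // approval given: the whole network may upgrade
	}
	for _, id := range cfg.TestFarms {
		if id == farmID {
			return true // first batch upgrades as soon as it sees the version
		}
	}
	return false // everyone else waits for the manual approval
}

func main() {
	cfg := RolloutConfig{SafeToUpgrade: false, TestFarms: []uint32{1, 79}}
	fmt.Println(ShouldUpgrade(cfg, 79, "3.14", "3.15")) // node in a test farm
	fmt.Println(ShouldUpgrade(cfg, 5, "3.14", "3.15"))  // node outside the test farms
	cfg.SafeToUpgrade = true
	fmt.Println(ShouldUpgrade(cfg, 5, "3.14", "3.15")) // after manual approval
}
```

With this split, changing the version on chain has no effect on most nodes until the approval flag is flipped.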
controlled rollout
OK, so it is clear that what we want is a controlled rollout.
Farm IDs can be included in the A/B testing but we can't include node IDs. We can't get the ID of the node before the registration.
If the node isn't registered, it's not part of the allowed nodes list by design, no?
I mean the registration/noded module in general, even if the node is registered. This step runs after the identityd module (which is the one responsible for the upgrade).
Alright, let's remove the nodes list and stick to farm ids only
Do you want to use the node address instead, or will farm IDs alone be enough?
I think farm IDs are enough, addresses are too cumbersome IMO
WIP:
- waiting for the QA release to test
- used the zos-config repo for the rollout configs
Testing in progress
- Will be tested in the next qa release
The current situation is that we have a config file which specifies whether the version is safe to upgrade to or not: https://github.com/threefoldtech/zos-config/blob/main/development-v4.json#L68
This introduces an inconsistency: the version itself is stored on chain, so on each change on chain we also need to change this file. A better solution is to include safe_to_upgrade in the version itself, something like:
{"version": "3.15", "safe_to_upgrade": false}
This will be passed when creating the motion to upgrade the environment.
We keep the list of farms to be tested first in zos-config.
This will require a very small change in ts_client and go_clients.
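On the Go side, the client change could look roughly like the sketch below, which decodes the proposed JSON value. `ChainVersion` and `ParseChainVersion` are hypothetical names for illustration, not the existing go client API. A missing safe_to_upgrade field decodes to false, which matches the intended default.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChainVersion mirrors the proposed on-chain value, e.g.
// {"version": "3.15", "safe_to_upgrade": false}. Hypothetical type.
type ChainVersion struct {
	Version       string `json:"version"`
	SafeToUpgrade bool   `json:"safe_to_upgrade"`
}

// ParseChainVersion decodes the raw value read from chain and rejects
// values that carry no version.
func ParseChainVersion(raw string) (ChainVersion, error) {
	var v ChainVersion
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		return ChainVersion{}, fmt.Errorf("decoding zos version: %w", err)
	}
	if v.Version == "" {
		return ChainVersion{}, fmt.Errorf("zos version is missing in %q", raw)
	}
	return v, nil
}

func main() {
	v, err := ParseChainVersion(`{"version": "3.15", "safe_to_upgrade": false}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(v.Version, v.SafeToUpgrade)
}
```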
Verified multiple times