Discussion: Upgrade triggering mechanism. Enabling EVM.
Motivation
We currently have EVM code that we would like to merge into the master branch and test on betanet. However, it is not clear how we are going to conditionally enable this code and how it will relate to our protocol versions. For example, we might want to not associate this feature with any protocol version for a while. This issue does not seem to be unique to the EVM feature. This issue suggests an approach to guarding and triggering features.
Proposal
Separate guarding and triggering features
Currently our code conflates guarding features and triggering features by having `if PROTOCOL_VERSION >= X` conditions throughout our codebase. The first suggestion by @evgenykuzyakov is to separate guarding features from triggering features. We should replace all our conditions with `if FEATURE_X_ENABLED` and add `assert!(!(FEATURE_X_ENABLED && FEATURE_Y_ENABLED))` for mutually exclusive features. This will allow several things:
- Prevent feature-triggering logic from being scattered across the codebase. Currently, it is possible to accidentally enable a feature inconsistently by having `if PROTOCOL_VERSION >= X` in one place and `if PROTOCOL_VERSION > X` in another.
- Improve readability, since currently readers need to reverse-engineer mutual exclusivity from convoluted code like this one: https://github.com/near/nearcore/blob/8136944b9cefa169a1422842d1c9a7254fddbf31/chain/chain/src/types.rs#L147-L163
One place for triggering features
@evgenykuzyakov suggested we introduce one place in our codebase that holds all the logic for how features can be triggered. This can be a class that is either owned by the `Client` and passed everywhere, or a singleton. The class would allow access to boolean fields like `FEATURE_X_ENABLED`.
Internally, this class will decide which features are enabled based on the current protocol version, enabled compiler flags, and other parameters, e.g. whether 80% of the network has updated to a certain protocol version.
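Such a single triggering point might be sketched as follows (all names here -- `Features`, `ProtocolFeature`, `is_enabled`, and the version thresholds -- are illustrative assumptions, not actual nearcore API):

```rust
// Hypothetical sketch of a single feature-gating struct owned by the Client.
// All names and version numbers are illustrative, not nearcore API.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ProtocolFeature {
    VersionedBlockHeader,
    Evm,
}

struct Features {
    protocol_version: u32,
}

impl Features {
    fn new(protocol_version: u32) -> Self {
        Features { protocol_version }
    }

    // All triggering logic lives in this one place; compiler flags,
    // voting results, etc. could also feed into this decision.
    fn is_enabled(&self, feature: ProtocolFeature) -> bool {
        match feature {
            ProtocolFeature::VersionedBlockHeader => self.protocol_version >= 41,
            ProtocolFeature::Evm => self.protocol_version >= 42,
        }
    }
}

fn main() {
    let features = Features::new(41);
    assert!(features.is_enabled(ProtocolFeature::VersionedBlockHeader));
    assert!(!features.is_enabled(ProtocolFeature::Evm));
    println!("ok");
}
```

Guarding code would then only ever call `features.is_enabled(...)`, keeping all version comparisons in this one file.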
Introduce "nightly" protocol version
Similar to Rust versioning, the `X-nightly` protocol version would include all features enabled by version `X` plus some extra features that we want to test on Betanet/Crashnet. Similarly to Rust, a certain feature can be enabled in `X-nightly`, then disabled in `X+1-nightly`, and then enabled again in `X+2-nightly`, e.g. if we had to turn it off temporarily.
Features enabled in `X+1` are selected from the features enabled in `X-nightly`.
Termination button
@evgenykuzyakov suggested we create a voting contract that validators can use to urgently turn off certain features in our codebase.
EVM
We are going to guard EVM with the feature `evm` and enable it on protocol version `>=39-nightly` (@bowenwang1996 please comment if you think this should be different). Then, once we are satisfied with running EVM for a while, we will enable it on the stable version.
Special purpose Python scripts for Betanet
Since Betanet does not receive transaction traffic towards EVM, we need to write a Python script that exercises EVM continuously. Writing special-purpose Python scripts might become a standard pattern for testing nightly features going forward, since large protocol changes need special conditions to be properly exercised.
Extra conditions
@evgenykuzyakov, @bowenwang1996, @chefsale please comment if there are other extra conditions for EVM to be merged into the `master` branch and tested on Betanet.
@frol , @chefsale you might have prior experience on how to guard, trigger and test such upgrades. Please contribute your ideas.
The problem is that, unlike the Rust compiler, our network is stateful, so what you have enabled before has an impact on what you can enable afterwards. For example, suppose we change the gas price from 100 to 1000 in one protocol upgrade, it doesn't work for some reason, and we have to disable it. We cannot just go back to the old version, because we cannot stop all nodes at precisely the same time: some nodes are likely to be one (or more) blocks behind when they restart with the old version. They then have to process a block produced under the new version with the old code, which will cause them to reject the block, and if enough nodes are in that situation, the network stalls. The problem here is that reverting a version after it has been instantiated (no matter how we instantiate it, through flags or protocol upgrades) is difficult.
> The problem here is that reverting a version after it has been instantiated (no matter how we instantiate it, through flags or protocol upgrades) is difficult.
This seems to be an orthogonal problem to the feature flags; at least the way I see the feature flags implementation.
I propose to have the following setup:
- Implement conditional behavior guarding certain features in our code based on some constant value (e.g. an enum variant), so our code is not polluted with decisions based on the protocol version. Pseudocode:

```rust
if FEATURES.is_enabled(VERSIONED_BLOCK_HEADER) {
    ...
}
```
- In a single place, select the set of features based on the protocol version:

```rust
FEATURES.set(VERSIONED_BLOCK_HEADER, PROTOCOL_VERSION > 10);
```

This way we avoid merge conflicts, and we can skip the second step on initial feature introduction (the implementation is merged first, and given the feature is not enabled, it will just be dead code until later). Another benefit is that we can batch features into a single protocol version update by PRing a simple change:

```rust
FEATURES.set(VERSIONED_BLOCK_HEADER, PROTOCOL_VERSION > 10);
FEATURES.set(VERSIONED_CHUNK_HEADER, PROTOCOL_VERSION > 10);
FEATURES.set(PARALLEL_RUNTIMES, PROTOCOL_VERSION > 10);
```
The community can even vote on which features get enabled at a certain `PROTOCOL_VERSION`.
Implementation-wise, we can use bit-vec (we want a structure that allows quickly checking whether some feature is enabled and, unfortunately, bitflags and enumflags2 are limited to u64 [we would only be able to have 64 features]).
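A minimal sketch of such a feature set backed by `u64` words, so we are not limited to 64 features (hypothetical type and method names; a crate like bit-vec would serve the same purpose):

```rust
// Hypothetical feature-set sketch backed by u64 words. Names
// (FeatureSet, set, is_enabled) are illustrative, not a real API.

struct FeatureSet {
    words: Vec<u64>,
}

impl FeatureSet {
    fn new() -> Self {
        FeatureSet { words: Vec::new() }
    }

    // Enable or disable the feature with the given index.
    fn set(&mut self, feature: usize, enabled: bool) {
        let (word, bit) = (feature / 64, feature % 64);
        if word >= self.words.len() {
            self.words.resize(word + 1, 0);
        }
        if enabled {
            self.words[word] |= 1 << bit;
        } else {
            self.words[word] &= !(1 << bit);
        }
    }

    // O(1) check whether a feature is enabled.
    fn is_enabled(&self, feature: usize) -> bool {
        let (word, bit) = (feature / 64, feature % 64);
        self.words.get(word).map_or(false, |w| w & (1 << bit) != 0)
    }
}

fn main() {
    const VERSIONED_BLOCK_HEADER: usize = 0;
    const PARALLEL_RUNTIMES: usize = 100; // works beyond the u64 limit
    let protocol_version = 11;

    let mut features = FeatureSet::new();
    features.set(VERSIONED_BLOCK_HEADER, protocol_version > 10);
    assert!(features.is_enabled(VERSIONED_BLOCK_HEADER));
    assert!(!features.is_enabled(PARALLEL_RUNTIMES));
    println!("ok");
}
```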
I agree with @frol and like this kind of feature-flags approach. We used a very similar concept in a project I worked on before, and it works very well and is easy to use. Maybe it would be good if we could set up the feature flags in a configuration, or pass them as flags or configuration that can be changed on startup or even at runtime, instead of having to recompile the whole binary every time.
So instead of setting in the code:

```rust
FEATURES.set(VERSIONED_BLOCK_HEADER, PROTOCOL_VERSION > 10);
```

you would have it in the config.json or somewhere as:

```json
{
    "features": {
        "VERSIONED_BLOCK_HEADER": true, // or
        "ENABLED_VERSIONED_BLOCK_HEADER_VERSION": 10
    }
}
```
This also allows us to just change the config and spawn validators or whole networks to see if the behaviour is what we expect, without recompiling the binary.
Also, in the future this would allow us to do more experimentation and host a service for versioning and tweaking configs for different nodes or networks internally when we are testing breaking changes.
This doesn't have to be used just for new breaking changes, but also for changes or optimizations to current code where we aren't sure it really works. We can run the old code alongside the new code and just have a flag:

```rust
if FEATURES.is_enabled(OPTIMIZED_CODE) {
    // new optimized code
} else {
    // old code
}
```

and we can either revert or just test the new changes without really removing the old code, which would otherwise force us to revert the PR.
I was thinking of something like the following for the syntax when @frol mentioned it. When used:

```rust
if feature!(EVM_SUPPORTED) {
    ...
}
```

and then `version.rs` defines the new protocol version:

```rust
protocol_version!(40, vec![EVM_SUPPORT, ..]);
```
@chefsale I don't think we can have the protocol in the config - I don't think there is a good way to go back, or sometimes even to change it, without recompilation.
I like the idea of separating features from protocol versions in all places of the code except one (as @ilblackdragon is suggesting above).
This makes the code much cleaner, but also helps us with the problem being discussed in Chain Sync today. @bowenwang1996 was looking for a way for block producers to roll back a feature which caused some problems by only changing their binaries. If protocol versions are associated with features in the binary, then rolling a feature back would be possible by having everyone change their binary to no longer associate that feature with the protocol version.
Or phrased another way, we re-define what protocol version X means. Initially we were saying that protocol version X means feature Y is enabled, but after some problem occurs we say, nevermind protocol version X does not include feature Y, thus rolling it back.
> This makes the code much cleaner, but also helps us with the problem being discussed in Chain Sync today. @bowenwang1996 was looking for a way for block producers to roll back a feature which caused some problems by only changing their binaries.
@birchmd This problem still exists even with this change, because the issue is not about the features or the version supported by the binary, but the version that is agreed upon with consensus to be used in the chain.
For features that affect how consensus works, what we do today is release a new binary (with the version bumped) that initially runs without that feature until more than 80% of the validators (by stake) claim they are using the new version.
If we find out that the feature/version should be reverted for some reason (some reasons are more problematic than others, e.g. the consensus is stalled), validators need to reach consensus again on using the previous version. For example, when the feature involves a layout change in some chain-related data structures such as blocks, validators must be able to disambiguate at any point which type of block is expected, and which one is invalid.
@mfornet I know we agree on what protocol version is used via an on-chain vote, but I thought the definition of what that protocol version means is left implicit (i.e. there is no on-chain definition as to what each protocol version change entails). So we can still have the chain stay on whatever version was last decided on while rolling back a feature by changing what it means to be at that protocol version.
You are right that changing data structures could make things tricky, but I think those could be addressed on a case-by-case basis. Ultimately we always have the hard-fork escape hatch if we are really stuck, but this type of binary-only change could perhaps still work sometimes.
@frol It seems to me that using feature flags solves the problem of choosing the changes that should go into the next protocol upgrade, which is great. However, it doesn't solve the issue of how we test it -- ideally you want features to be composable so that we can easily disable or enable them on some testing network. Sadly that is not the case, as I have mentioned above, and so far we don't seem to have a good solution to that other than doing hard forks.
> So we can still have the chain stay on whatever version was last decided on while rolling back a feature by changing what it means to be at that protocol version.
Assuming I understand your proposal correctly, it doesn't work, because after the vote passed, some blocks / chunks were already produced with the protocol version voted for and with the feature enabled. If we later change the correspondence of versions to features, such blocks and chunks will become invalid, because the new nodes trying to validate them will validate with the feature disabled.
So if we have to disable some feature, we need to increase the protocol version once again, so that the blocks / chunks produced so far have the version that has the feature enabled, and the blocks / chunks produced later have the version that has the feature disabled.
well I was about to comment but @SkidanovAlex said exactly what I wanted to say :)
Would a DB migration not be sufficient to make sure data produced during the time when the feature was enabled is corrected to match the old protocol version? The DB version is already separate from the protocol version (as it should be), so we can bump that independently.
I think the main issue will be how nodes running the binary with the feature enabled interact with those which have it disabled. But as long as the majority of stake change at roughly the same time this should not be too much of a problem. Having them all change at once introduces a service outage of course, but I thought the premise of this emergency rollback situation was that a critical issue was already causing a service outage in the first place.
> Would a DB migration not be sufficient to make sure data produced during the time when the feature was enabled is corrected to match the old protocol version? The DB version is already separate from the protocol version (as it should be), so we can bump that independently.
I don't think DB migration is necessarily bijective.
> I think the main issue will be how nodes running the binary with the feature enabled interact with those which have it disabled. But as long as the majority of stake change at roughly the same time this should not be too much of a problem. Having them all change at once introduces a service outage of course, but I thought the premise of this emergency rollback situation was that a critical issue was already causing a service outage in the first place.
I think an emergency update is only one of the situations in consideration here. Even if all the nodes agree to stop at exactly the same time, due to time differences on different machines, it is likely that we end up in a situation where between 1/3 and 2/3 stop on one height and the rest stop on some other height, which will cause issues when they restart.
Summarizing the discussions and adding some of my ideas on how to proceed:
- It seems that we all agree on using feature flags to control which features to enable in a protocol upgrade. This gives us flexibility as to which features to include, and we can keep features in the codebase that need more testing.
- The question remains, as I posted above, of how we can test the features, especially when we want to include or drop some new features. One simple idea is to do a hard fork on the testing network every time we want to change the feature set, to avoid the upgrade issue discussed above. As brutal as it may sound, it may not be the end of the world. We have already planned to repurpose betanet to test protocol changes immediately, and we can streamline the testing and hard forks on betanet. When we think the release is ready, we bump the protocol version and publish a release candidate which we run on testnet and, after a short amount of testing, we can say that the new version is ready and publish it as a new stable version. The upshot is that only on betanet will we do hard forks relatively frequently; on testnet, assuming things go well, we only have to do hard forks occasionally when things go wrong.
For clarity, I suggest we do not differentiate between disabling a feature and enabling a feature. Specifically, if we later decide to disable `VERSIONED_BLOCK_HEADER`, we create the feature `DISABLE_VERSIONED_BLOCK_HEADER`.
Also, for clarity, I suggest we differentiate three types of feature rollout:
- Normal -- the network is running normally. Validators update to protocol version `X` that has the feature `VERSIONED_BLOCK_HEADER` enabled, and once 80% of validators have this protocol version, `X` is triggered;
- Stalled network -- the network is stalled because of a bad feature rollout. We communicate through social channels that validators need to restart their nodes with `--disable_feature VERSIONED_BLOCK_HEADER` or `--disable_feature DISABLE_VERSIONED_BLOCK_HEADER`, which resumes the network with a socially amended version `X` (let's call it `X'`). All blocks produced by nodes running under the amended version will include `Amended(VERSIONED_BLOCK_HEADER)` as the first special transaction. Then we release version `X+1` with the feature disabled, which validators upgrade to through the normal process;
- Emergency -- the network is operating incorrectly. We communicate through social channels that validators need to vote through a contract: `emergency_feature_amendment.disable(VERSIONED_BLOCK_HEADER)`. Once all validators observe that 80% of the current validators voted for it, the feature `VERSIONED_BLOCK_HEADER` gets disabled in `version.rs`. Also think of it as a socially amended version `X` (let's call it `X'`). All blocks produced by nodes running under the amended version will include `Amended(VERSIONED_BLOCK_HEADER)` as the first special transaction. Then we release version `X+1` with the feature disabled, which validators upgrade to through the normal process.
@mfornet this seems to address your concern here: https://github.com/nearprotocol/NEPs/issues/126#issuecomment-707863416 and @bowenwang1996 this seems to answer your question how we make all nodes do it together here: https://github.com/nearprotocol/NEPs/issues/126#issuecomment-707440972
To test the feature, we can start all nodes on testnet with `--enable_feature VERSIONED_BLOCK_HEADER`, which will amend the current version `X` if `X` does not include this feature because it is under development. @bowenwang1996 this seems to address your question on how to test it: https://github.com/nearprotocol/NEPs/issues/126#issuecomment-707881183
@bowenwang1996 @birchmd
> You are right that changing data structures could make things tricky, but I think those could be addressed on a case-by-case basis.

Could you give an example of a data-structure-change feature that we cannot disable using the above (1)-(3) methods?
@SkidanovAlex answering your question on how nodes will be able to re-validate past blocks that were produced with an amended version: https://github.com/nearprotocol/NEPs/issues/126#issuecomment-707882421 When a block is validated, if the first transaction is `Amended(VERSIONED_BLOCK_HEADER)`, then the validator knows that they need to disable this feature to be able to validate this block. This works for most features, maybe except changes to the light client.
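A rough sketch of how a validator might act on such a marker (all names here -- `Transaction::Amended`, `amended_features` -- are hypothetical illustrations; `Amended` transactions are not part of the current protocol):

```rust
// Hypothetical sketch: if the first transaction of a block is an Amended
// marker, the validator validates the block with that feature disabled.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Feature {
    VersionedBlockHeader,
}

enum Transaction {
    Amended(Feature),
    Regular,
}

struct Block {
    transactions: Vec<Transaction>,
}

// Returns the features that must be disabled to validate this block.
fn amended_features(block: &Block) -> Vec<Feature> {
    match block.transactions.first() {
        Some(Transaction::Amended(feature)) => vec![*feature],
        _ => vec![],
    }
}

fn main() {
    let block = Block {
        transactions: vec![
            Transaction::Amended(Feature::VersionedBlockHeader),
            Transaction::Regular,
        ],
    };
    assert_eq!(amended_features(&block), vec![Feature::VersionedBlockHeader]);
    println!("ok");
}
```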
> All blocks that were produced by nodes that run under amended version will include `Amended(VERSIONED_BLOCK_HEADER)` as the first special transaction. Then we release version X+1 with disabled feature which validators upgrade to through a normal process;
Are you referring to a transaction included in a block? If so, how do we make sure that the block is the first block that everyone applies?
> We communicate through social channels that validators need to vote through a contract `emergency_feature_amendment.disable(VERSIONED_BLOCK_HEADER)`
Is this a smart contract on NEAR? If the chain is operating incorrectly I don't think we should keep using it for any purpose.
Just to sum up the solution we seem to have arrived at from the implementation point of view:
Use compile-time Rust features, defined in Cargo.toml, to guard nightly implementations, and use `stable-protocol` and `nightly-protocol` feature groups:

```toml
stable-protocol = ["protocol-feature-versioned-block-header"]
nightly-protocol = ["stable-protocol", "protocol-feature-evm"]
```
Later, in some `version.rs` file:

```rust
enum ProtocolFeatures {
    #[feature("protocol-feature-versioned-block-header")]
    VersionedBlockHeader,
    #[feature("protocol-feature-evm")]
    EVM,
}

#[feature("stable-protocol")]
const PROTOCOL_VERSION_FEATURES_MAPPING = {
    41: [ProtocolFeatures::VersionedBlockHeader],
}

#[feature("nightly-protocol")]
const PROTOCOL_VERSION_FEATURES_MAPPING = {
    39: [ProtocolFeatures::VersionedBlockHeader, ...],
    ...
    42: [ProtocolFeatures::EVM, ...],
}
```
This way we can merge half-baked features into master and test them only when compiled with the `nightly-protocol` feature, and we can still build a stable release from the same code-base without fear that nightly code breaks anything on stable (it won't be there).
Using enums, we get extra sanity checks (we don't need to manually keep some counter incremented like we do with the Cols in our storage), and the compiler will catch typos and other human errors.
/cc @SkidanovAlex
Thank you @frol !
@bowenwang1996 could you please lead (or delegate to someone) the implementation of `version.rs` and the rest of the triggering mechanism needed for EVM? We could first do it for EVM and postpone refactoring the current versioning code for later, so that we can roll out EVM faster.
When we introduce a new feature, do we increment the nightly protocol version as well? Basically, if we have

```rust
#[feature("nightly-protocol")]
const PROTOCOL_VERSION_FEATURES_MAPPING = {
    41: [ProtocolFeatures::VersionedShardChunkHeader, ...],
}
```

and when we introduce EVM, do we change it to

```rust
#[feature("nightly-protocol")]
const PROTOCOL_VERSION_FEATURES_MAPPING = {
    41: [ProtocolFeatures::VersionedShardChunkHeader, ...],
    42: [ProtocolFeatures::VersionedShardChunkHeader, ProtocolFeatures::EVM, ...],
}
```

assuming that `ProtocolFeatures::VersionedShardChunkHeader` is not stable yet?
This way it is easier to upgrade on the testing networks. Otherwise we have to do a hard fork or figure out some other solution to upgrade to the new feature.
With the borsh limit, we can only have 255 features. We just need to make sure we don't need to serialize this.
> When we introduce a new feature, do we increment the nightly protocol version as well?
My understanding from @frol is yes: we should always introduce the nightly protocol version first, then bump the stable protocol version to some version <= the latest nightly protocol version.
Hm, given we want to iterate and fix things within one feature over some period of time, this means that nightly will need to hard fork.
Which means I'm not sure we need nightly protocol versions per se -- maybe there is just a "nightly protocol version" which includes whatever features are needed.
> Which means I'm not sure we need nightly protocol versions per se -- maybe there is just a "nightly protocol version" which includes whatever features are needed.
That means we have to do hard forks for every new feature. If we have versions, at least some of them, if not a majority, could be avoided. Basically, we only do hard forks if the new feature goes wrong and there is no way to fix it without changing the protocol.
Are we going to have a separate set of tests for nightly upgradability which covers all edge cases of changes for newly added features? E.g. if the decoding of a single function call in EVM changes, a hard fork is needed, but there are no tests checking this per se, and it's up to the developer to catch it.
E.g. I don't think it makes sense to ever upgrade the nightly chain (unless a lot more work is put into intermediate changes), and doing a hard fork on every release would totally make sense.
> E.g. I don't think it makes sense to ever upgrade the nightly chain (unless a lot more work is put into intermediate changes), and doing a hard fork on every release would totally make sense.
So you are saying that if we change upgradability itself, it just won't get tested?
I vote that we only deploy nightly over stable and don't bother with nightly-to-nightly migrations (dump some stable state and recover from it [hard fork] -- do it on a network that we control, like crashnet). Having granular control over individual features (`protocol_feature_evm`), we may introduce a `nightly_protocol_features` feature listing all of the nightly features, while keeping the `nightly_protocol` feature just a marker feature that switches the protocol version mapping. The features listed in the protocol version mapping are feature-gated, so you can compile a nightly build with only a single nightly feature enabled for testing:

```shell
$ cargo build --features nightly_protocol --features protocol_feature_evm
```

If you want to build all the nightly features:

```shell
$ cargo build --features nightly_protocol_features
```
Cargo.toml:

```toml
protocol_feature_evm = []
nightly_protocol = [] # always empty, just a marker feature
nightly_protocol_features = ["nightly_protocol", "protocol_feature_evm", ...]
```
As an afterthought, I think it makes sense to only extend the stable protocol features, so instead of:

```rust
#[cfg(not(feature = "nightly_protocol"))]
FEATURES_MAPPING = ...

#[cfg(feature = "nightly_protocol")]
FEATURES_MAPPING = ...
```

I suggest we do:

```rust
pub const PROTOCOL_VERSION: ProtocolVersion = {
    40 + if cfg!(feature = "nightly_protocol") { 1000 } else { 0 }
};

#[cfg(feature = "nightly_protocol")]
const NIGHTLY_PROTOCOL_VERSION = PROTOCOL_VERSION;

FEATURES_MAPPING = {
    #[cfg(feature = "protocol_feature_evm")]
    ProtocolFeatures::EVM => NIGHTLY_PROTOCOL_VERSION,
};
```
- Compiling with `protocol_feature_evm` but without `nightly_protocol` will fail to compile (clearly indicating that it is a nightly feature).
- Compiling with `protocol_feature_evm` and `nightly_protocol` enables only this single feature, so we can test it in isolation if needed.
- To promote the feature from nightly to stable, you must add `protocol_feature_evm` to the `default` features, and unless it is assigned a specific protocol version number (instead of `NIGHTLY_PROTOCOL_VERSION`), it will fail to compile (which may prevent human errors).
/cc @bowenwang1996
The problem with always doing a hard fork to compose features is that the upgrade path is never tested, and from my experience, that has been the cause of problems several times in the past.
That makes sense. Let's stick to an independent `FEATURES_MAPPING` then (though I would add some compile-time sanity checks preventing stable from having some features enabled without them being enabled in nightly).
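Such a sanity check could look like the following runtime sketch (the function name and feature strings are hypothetical; in practice this could live in a build script or a test, or be turned into a const assertion):

```rust
// Hypothetical sanity check: every feature enabled on stable must also be
// enabled on nightly. Names and feature strings are illustrative only.

fn check_stable_subset_of_nightly(stable: &[&str], nightly: &[&str]) -> bool {
    // Stable must be a subset of nightly.
    stable.iter().all(|feature| nightly.contains(feature))
}

fn main() {
    let stable = ["protocol_feature_versioned_block_header"];
    let nightly = [
        "protocol_feature_versioned_block_header",
        "protocol_feature_evm",
    ];
    // A stable feature missing from nightly would fail this check.
    assert!(check_stable_subset_of_nightly(&stable, &nightly));
    println!("ok");
}
```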
During a conversation with @evgenykuzyakov, I realized that there has been some miscommunication on how we are going to use nightly protocol features. I thought that they would always be implemented in a backward-compatible way, so that we can easily go from enabling one feature to enabling 3 features without problems. However, @evgenykuzyakov brings up a good point that he wants fast iteration on EVM and therefore needs to be able to make breaking changes easily. This, nonetheless, may lead to a number of issues (assuming that we are testing them on betanet first):
- We need to do a hard fork every time a new feature is introduced. Not only that, we also need to do a hard fork when more breaking changes are made to a feature after it is introduced.
- Since we do hard forks all the time, we still need to test the upgrade process for the features introduced before they are stabilized.
I think this is still achievable if we do the following:
- We increment `PROTOCOL_VERSION` guarded by `nightly_protocol` every time a new feature is introduced or some breaking change is made to an existing feature. We also need to change the corresponding protocol versions in `PROTOCOL_FEATURES_TO_VERSION_MAPPING`. Along with the change comes a hard fork.
- When we are ready to make a release (assuming at this point backward compatibility is addressed), we again increment `PROTOCOL_VERSION` and also the protocol versions in the mapping, without doing a hard fork. This way, after the nodes restart, they will upgrade to the new protocol by going through the normal upgrade process.
What do people think? @frol @SkidanovAlex @nearmax @ilblackdragon
Sounds good to me. My only ask is to have good infrastructure for testing feature-stabilization PRs, making sure the migrations are fine on real data and that older nodes can communicate with the new nodes (we already have some of it, but there are never enough tests).