Discussion: Upgrade triggering mechanism. Enabling EVM.
Motivation
We currently have EVM code that we would like to merge into the master branch and test on betanet. However, it is not clear how we are going to conditionally enable this code and how it will relate to our protocol versions. For example, we might want to not associate this feature with any protocol version for a while. This issue does not seem to be unique to the EVM feature. This issue suggests an approach to guarding and triggering features.
Proposal
Separate guarding and triggering features
Currently our code conflates guarding features and triggering features by having `if PROTOCOL_VERSION >= X` conditions throughout our codebase. The first suggestion by @evgenykuzyakov is to separate guarding features from triggering features. We should replace all our conditions with `if FEATURE_X_ENABLED` and add `assert!(!(FEATURE_X_ENABLED && FEATURE_Y_ENABLED))` for mutually exclusive features. This will allow several things:
- Prevent feature-triggering logic from being scattered across the codebase. Currently, it is possible to accidentally enable a feature inconsistently by having `if PROTOCOL_VERSION >= X` in one place and `if PROTOCOL_VERSION > X` in another.
- Improve readability, since currently readers need to reverse-engineer mutual exclusivity from convoluted code like this one: https://github.com/near/nearcore/blob/8136944b9cefa169a1422842d1c9a7254fddbf31/chain/chain/src/types.rs#L147-L163
One place for triggering features
@evgenykuzyakov suggested we introduce one place in our codebase that holds all the logic for how features can be triggered. This can be a class that is either owned by the `Client` and passed everywhere, or a singleton. The class would allow access to boolean fields like `FEATURE_X_ENABLED`.
Internally, this class will decide which features are enabled based on the current protocol version, enabled compiler flags, and other parameters, e.g. whether 80% of the network has updated to a certain protocol version.
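Such a single triggering point might be sketched as follows (all names here -- `Features`, `ProtocolFeature`, `is_enabled`, and the version thresholds -- are illustrative assumptions, not actual nearcore API):

```rust
// Hypothetical sketch of a single feature-gating struct owned by the Client.
// All names and version numbers are illustrative, not nearcore API.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ProtocolFeature {
    VersionedBlockHeader,
    Evm,
}

struct Features {
    protocol_version: u32,
}

impl Features {
    fn new(protocol_version: u32) -> Self {
        Features { protocol_version }
    }

    // All triggering logic lives in this one place; compiler flags,
    // voting results, etc. could also feed into this decision.
    fn is_enabled(&self, feature: ProtocolFeature) -> bool {
        match feature {
            ProtocolFeature::VersionedBlockHeader => self.protocol_version >= 41,
            ProtocolFeature::Evm => self.protocol_version >= 42,
        }
    }
}

fn main() {
    let features = Features::new(41);
    assert!(features.is_enabled(ProtocolFeature::VersionedBlockHeader));
    assert!(!features.is_enabled(ProtocolFeature::Evm));
    println!("ok");
}
```

Guarding code would then only ever call `features.is_enabled(...)`, keeping all version comparisons in this one file.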
Introduce "nightly" protocol version
Similar to Rust versioning, the `X-nightly` protocol version would include all features enabled by version `X` plus some extra features that we want to test on Betanet/Crashnet. Similarly to Rust, a certain feature can be enabled in `X-nightly`, then disabled in `X+1-nightly`, and then enabled again in `X+2-nightly`, e.g. if we had to turn it off temporarily.
Features enabled in `X+1` are selected from the features enabled in `X-nightly`.
Termination button
@evgenykuzyakov suggested we create a voting contract that validators can use to urgently turn off certain features in our codebase.
EVM
We are going to guard EVM with the feature `evm` and enable it on protocol version `>=39-nightly` (@bowenwang1996 please comment if you think this should be different). Then, once we are satisfied with running EVM for a while, we will enable it on the stable version.
Special purpose Python scripts for Betanet
Since Betanet does not receive transaction traffic towards EVM, we need to write a Python script that exercises EVM continuously. Writing special-purpose Python scripts might become a standard pattern for testing nightly features going forward, since large protocol changes need special conditions to be properly exercised.
Extra conditions
@evgenykuzyakov, @bowenwang1996, @chefsale please comment if there are other extra conditions for EVM to be merged into the `master` branch and tested on Betanet.
@frol , @chefsale you might have prior experience on how to guard, trigger and test such upgrades. Please contribute your ideas.
The problem is that, unlike the Rust compiler, our network is stateful, so what you have enabled before has an impact on what you can enable afterwards. For example, suppose we change the gas price from 100 to 1000 in one protocol upgrade, it doesn't work for some reason, and we have to disable it. We cannot just go back to the old version, because we cannot stop all nodes at precisely the same time: some nodes are likely to be one (or more) blocks behind when they restart with the old version. They then have to process a block produced under the new version with the old code, which will cause them to reject the block, and if enough nodes are in that situation, the network stalls. The problem here is that reverting a version after it has been instantiated (no matter how we instantiate it, through flags or protocol upgrades) is difficult.
> The problem here is that reverting a version after it has been instantiated (no matter how we instantiate it, through flags or protocol upgrades) is difficult.
This seems to be an orthogonal problem to the feature flags; at least the way I see the feature flags implementation.
I propose to have the following setup:
- Implement conditional behavior guarding certain features in our code based on some constant value (e.g. an enum variant), so our code is not polluted with decisions based on the protocol version. Pseudocode:

```rust
if FEATURES.is_enabled(VERSIONED_BLOCK_HEADER) {
    ...
}
```
- In a single place, select the set of features based on the protocol version:

```rust
FEATURES.set(VERSIONED_BLOCK_HEADER, PROTOCOL_VERSION > 10);
```

This way we avoid merge conflicts, and we can skip the second step on initial feature introduction (the implementation is merged first, and given the feature is not enabled, it will just be dead code until later). Another benefit is that we can batch features into a single protocol version update by PRing a simple change:

```rust
FEATURES.set(VERSIONED_BLOCK_HEADER, PROTOCOL_VERSION > 10);
FEATURES.set(VERSIONED_CHUNK_HEADER, PROTOCOL_VERSION > 10);
FEATURES.set(PARALLEL_RUNTIMES, PROTOCOL_VERSION > 10);
```
The community can even vote on which features get enabled at a certain `PROTOCOL_VERSION`.
Implementation-wise, we can use bit-vec (we want a structure that allows quickly checking whether some feature is enabled and, unfortunately, bitflags and enumflags2 are limited to u64 [we would only be able to have 64 features]).
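A minimal sketch of such a feature set backed by `u64` words, so we are not limited to 64 features (hypothetical type and method names; a crate like bit-vec would serve the same purpose):

```rust
// Hypothetical feature-set sketch backed by u64 words. Names
// (FeatureSet, set, is_enabled) are illustrative, not a real API.

struct FeatureSet {
    words: Vec<u64>,
}

impl FeatureSet {
    fn new() -> Self {
        FeatureSet { words: Vec::new() }
    }

    // Enable or disable the feature with the given index.
    fn set(&mut self, feature: usize, enabled: bool) {
        let (word, bit) = (feature / 64, feature % 64);
        if word >= self.words.len() {
            self.words.resize(word + 1, 0);
        }
        if enabled {
            self.words[word] |= 1 << bit;
        } else {
            self.words[word] &= !(1 << bit);
        }
    }

    // O(1) check whether a feature is enabled.
    fn is_enabled(&self, feature: usize) -> bool {
        let (word, bit) = (feature / 64, feature % 64);
        self.words.get(word).map_or(false, |w| w & (1 << bit) != 0)
    }
}

fn main() {
    const VERSIONED_BLOCK_HEADER: usize = 0;
    const PARALLEL_RUNTIMES: usize = 100; // works beyond the u64 limit
    let protocol_version = 11;

    let mut features = FeatureSet::new();
    features.set(VERSIONED_BLOCK_HEADER, protocol_version > 10);
    assert!(features.is_enabled(VERSIONED_BLOCK_HEADER));
    assert!(!features.is_enabled(PARALLEL_RUNTIMES));
    println!("ok");
}
```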
I agree with @frol and like this kind of feature-flags approach. We used a very similar concept in a project I worked on before, and it works very well and is easy to use. Maybe it would be good if we could set up the feature flags in a configuration, or pass them as flags or configuration that can be changed on startup or even at runtime, instead of having to recompile the whole binary every time.
So instead of setting in the code:

```rust
FEATURES.set(VERSIONED_BLOCK_HEADER, PROTOCOL_VERSION > 10);
```

you would have it in the config.json or somewhere as:

```json
{
    "features": {
        "VERSIONED_BLOCK_HEADER": true, // or
        "ENABLED_VERSIONED_BLOCK_HEADER_VERSION": 10
    }
}
```
This also allows us to just change the config and spawn validators or whole networks to see if the behaviour is what we expect, without recompiling the binary.
Also, in the future this would allow us to do more experimentation and host a service for versioning and tweaking configs for different nodes or networks internally when we are testing breaking changes.
This doesn't have to be used just for new breaking changes, but also for changes or optimizations to current code where we aren't sure it really works. We can run the old code alongside the new code and just have a flag:

```rust
if FEATURES.is_enabled(OPTIMIZED_CODE) {
    // new optimized code
} else {
    // old code
}
```

and we can either revert or just test the new changes without really removing the old code, which would otherwise force us to revert the PR.
I was thinking of something like the following for the syntax when @frol mentioned it. When used:

```rust
if feature!(EVM_SUPPORTED) {
    ...
}
```

and then `version.rs` defines the new protocol version:

```rust
protocol_version!(40, vec![EVM_SUPPORT, ..]);
```
@chefsale I don't think we can have the protocol in the config - I don't think there is a good way to go back, or sometimes even to change it, without recompilation.
I like the idea of separating features from protocol versions in all places of the code except one (as @ilblackdragon is suggesting above).
This makes the code much cleaner, but also helps us with the problem being discussed in Chain Sync today. @bowenwang1996 was looking for a way for block producers to roll back a feature which caused some problems by only changing their binaries. If protocol versions are associated with features in the binary, then rolling a feature back would be possible by having everyone change their binary to no longer associate that feature with the protocol version.
Or phrased another way, we re-define what protocol version X means. Initially we were saying that protocol version X means feature Y is enabled, but after some problem occurs we say, nevermind protocol version X does not include feature Y, thus rolling it back.
> This makes the code much cleaner, but also helps us with the problem being discussed in Chain Sync today. @bowenwang1996 was looking for a way for block producers to roll back a feature which caused some problems by only changing their binaries.
@birchmd This problem still exists even with this change, because the issue is not about the features or the version supported by the binary, but the version that is agreed upon with consensus to be used in the chain.
For features that affect how consensus works, what we do today is release a new binary (with the version bumped) that initially runs without that feature until more than 80% of the validators (by stake) claim they are using the new version.
If we find out that the feature/version should be reverted for some reason (some reasons are more problematic than others, e.g. the consensus is stalled), validators need to reach consensus again on using the previous version. For example, when the feature involves a layout change in some chain-related data structures such as blocks, validators must be able to disambiguate at any point which type of block is expected, and which one is invalid.
@mfornet I know we agree on what protocol version is used via an on-chain vote, but I thought the definition of what that protocol version means is left implicit (i.e. there is no on-chain definition as to what each protocol version change entails). So we can still have the chain stay on whatever version was last decided on while rolling back a feature by changing what it means to be at that protocol version.
You are right that changing data structures could make things tricky, but I think those could be addressed on a case-by-case basis. Ultimately we always have the hard-fork escape hatch if we are really stuck, but this type of binary-only change could perhaps still work sometimes.
@frol It seems to me that using feature flags solves the problem of choosing the changes that should go into the next protocol upgrade, which is great. However, it doesn't solve the issue of how we test it -- ideally you want features to be composable so that we can easily disable or enable them on some testing network. Sadly that is not the case, as I have mentioned above, and so far we don't seem to have a good solution to that other than doing hard forks.
> So we can still have the chain stay on whatever version was last decided on while rolling back a feature by changing what it means to be at that protocol version.
Assuming I understand your proposal correctly, it doesn't work, because after the vote passed, some blocks / chunks were already produced with the protocol version voted for and with the feature enabled. If we later change the correspondence of versions to features, such blocks and chunks will become invalid, because the new nodes trying to validate them will validate with the feature disabled.
So if we have to disable some feature, we need to increase the protocol version once again, so that the blocks / chunks produced so far have the version that has the feature enabled, and the blocks / chunks produced later have the version that has the feature disabled.
well I was about to comment but @SkidanovAlex said exactly what I wanted to say :)
Would a DB migration not be sufficient to make sure data produced during the time when the feature was enabled is corrected to match the old protocol version? The DB version is already separate from the protocol version (as it should be), so we can bump that independently.
I think the main issue will be how nodes running the binary with the feature enabled interact with those which have it disabled. But as long as the majority of stake change at roughly the same time this should not be too much of a problem. Having them all change at once introduces a service outage of course, but I thought the premise of this emergency rollback situation was that a critical issue was already causing a service outage in the first place.
> Would a DB migration not be sufficient to make sure data produced during the time when the feature was enabled is corrected to match the old protocol version? The DB version is already separate from the protocol version (as it should be), so we can bump that independently.
I don't think DB migration is necessarily bijective.
> I think the main issue will be how nodes running the binary with the feature enabled interact with those which have it disabled. But as long as the majority of stake change at roughly the same time this should not be too much of a problem. Having them all change at once introduces a service outage of course, but I thought the premise of this emergency rollback situation was that a critical issue was already causing a service outage in the first place.
I think an emergency update is only one of the situations in consideration here. Even if all the nodes agree to stop at exactly the same time, due to time differences on different machines, it is likely that we end up in a situation where between 1/3 and 2/3 stop on one height and the rest stop on some other height, which will cause issues when they restart.
Summarizing the discussions and adding some of my ideas on how to proceed:
- It seems that we all agree on using feature flags to control which features to enable in a protocol upgrade. This gives us flexibility as to which features to include, and we can keep features in the codebase that need more testing.
- The question remains, as I posted above, of how we can test the features, especially when we want to include or drop some new features. One simple idea is to do a hard fork on the testing network every time we want to change the feature set, to avoid the upgrade issue discussed above. As brutal as it may sound, it may not be the end of the world. We have already planned to repurpose betanet to test protocol changes immediately, and we can streamline the testing and hard forks on betanet. When we think the release is ready, we bump the protocol version and publish a release candidate which we run on testnet and, after a short amount of testing, we can say that the new version is ready and publish it as a new stable version. The upshot is that only on betanet will we do hard forks relatively frequently; on testnet, assuming things go well, we only have to do hard forks occasionally when things go wrong.
For clarity, I suggest we do not differentiate between disabling a feature and enabling a feature. Specifically, if we later decide to disable `VERSIONED_BLOCK_HEADER`, we create the feature `DISABLE_VERSIONED_BLOCK_HEADER`.
Also, for clarity, I suggest we differentiate three types of feature rollout:
- Normal -- the network is running normally. Validators update to protocol version `X` that has the feature `VERSIONED_BLOCK_HEADER` enabled, and once 80% of validators have this protocol version, `X` is triggered;
- Stalled network -- the network is stalled because of a bad feature rollout. We communicate through social channels that validators need to restart their nodes with `--disable_feature VERSIONED_BLOCK_HEADER` or `--disable_feature DISABLE_VERSIONED_BLOCK_HEADER`, which resumes the network with a socially amended version `X` (let's call it `X'`). All blocks produced by nodes running under the amended version will include `Amended(VERSIONED_BLOCK_HEADER)` as the first special transaction. Then we release version `X+1` with the feature disabled, which validators upgrade to through the normal process;
- Emergency -- the network is operating incorrectly. We communicate through social channels that validators need to vote through a contract: `emergency_feature_amendment.disable(VERSIONED_BLOCK_HEADER)`. Once all validators observe that 80% of the current validators voted for it, the feature `VERSIONED_BLOCK_HEADER` gets disabled in `version.rs`. Also think of it as a socially amended version `X` (let's call it `X'`). All blocks produced by nodes running under the amended version will include `Amended(VERSIONED_BLOCK_HEADER)` as the first special transaction. Then we release version `X+1` with the feature disabled, which validators upgrade to through the normal process.
@mfornet this seems to address your concern here: https://github.com/nearprotocol/NEPs/issues/126#issuecomment-707863416 and @bowenwang1996 this seems to answer your question how we make all nodes do it together here: https://github.com/nearprotocol/NEPs/issues/126#issuecomment-707440972
To test the feature, we can start all nodes on testnet with `--enable_feature VERSIONED_BLOCK_HEADER`, which will amend the current version `X` if `X` does not include this feature because it is under development. @bowenwang1996 this seems to address your question on how to test it: https://github.com/nearprotocol/NEPs/issues/126#issuecomment-707881183
@bowenwang1996 @birchmd
> You are right that changing data structures could make things tricky, but I think those could be addressed on a case-by-case basis.

Could you give an example of a data-structure-change feature that we cannot disable using the above (1)-(3) methods?
@SkidanovAlex answering your question on how nodes will be able to re-validate past blocks that were produced with an amended version: https://github.com/nearprotocol/NEPs/issues/126#issuecomment-707882421 When a block is validated, if the first transaction is `Amended(VERSIONED_BLOCK_HEADER)`, then the validator knows that they need to disable this feature to be able to validate this block. This works for most features, maybe except changes to the light client.
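A rough sketch of how a validator might act on such a marker (all names here -- `Transaction::Amended`, `amended_features` -- are hypothetical illustrations; `Amended` transactions are not part of the current protocol):

```rust
// Hypothetical sketch: if the first transaction of a block is an Amended
// marker, the validator validates the block with that feature disabled.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Feature {
    VersionedBlockHeader,
}

enum Transaction {
    Amended(Feature),
    Regular,
}

struct Block {
    transactions: Vec<Transaction>,
}

// Returns the features that must be disabled to validate this block.
fn amended_features(block: &Block) -> Vec<Feature> {
    match block.transactions.first() {
        Some(Transaction::Amended(feature)) => vec![*feature],
        _ => vec![],
    }
}

fn main() {
    let block = Block {
        transactions: vec![
            Transaction::Amended(Feature::VersionedBlockHeader),
            Transaction::Regular,
        ],
    };
    assert_eq!(amended_features(&block), vec![Feature::VersionedBlockHeader]);
    println!("ok");
}
```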
> All blocks that were produced by nodes that run under amended version will include `Amended(VERSIONED_BLOCK_HEADER)` as the first special transaction. Then we release version X+1 with disabled feature which validators upgrade to through a normal process;
Are you referring to a transaction included in a block? If so, how do we make sure that the block is the first block that everyone applies?
> We communicate through social channels that validators need to vote through a contract `emergency_feature_amendment.disable(VERSIONED_BLOCK_HEADER)`
Is this a smart contract on NEAR? If the chain is operating incorrectly I don't think we should keep using it for any purpose.
Just to sum up the solution we seem to have arrived at from the implementation point of view:
Use compile-time Rust features, defined in Cargo.toml, to guard nightly implementations, and use `stable-protocol` and `nightly-protocol` feature groups:

```toml
stable-protocol = ["protocol-feature-versioned-block-header"]
nightly-protocol = ["stable-protocol", "protocol-feature-evm"]
```
Later, in some `version.rs` file:

```rust
enum ProtocolFeatures {
    #[feature("protocol-feature-versioned-block-header")]
    VersionedBlockHeader,
    #[feature("protocol-feature-evm")]
    EVM,
}

#[feature("stable-protocol")]
const PROTOCOL_VERSION_FEATURES_MAPPING = {
    41: [ProtocolFeatures::VersionedBlockHeader],
}

#[feature("nightly-protocol")]
const PROTOCOL_VERSION_FEATURES_MAPPING = {
    39: [ProtocolFeatures::VersionedBlockHeader, ...],
    ...
    42: [ProtocolFeatures::EVM, ...],
}
```
This way we can merge half-baked features into master and test them only when compiled with the `nightly-protocol` feature, and we can still build a stable release from the same code-base without fear that nightly code breaks anything on stable (it won't be there).
Using enums, we get extra sanity checks (we don't need to manually keep some counter incremented like we do with the Cols in our storage), and the compiler will catch typos and other human errors.
/cc @SkidanovAlex
Thank you @frol !
@bowenwang1996 could you please lead (or delegate to someone) the implementation of `version.rs` and the rest of the triggering mechanism needed for EVM? We could first do it for EVM and postpone refactoring the current versioning code for later, so that we can roll out EVM faster.
When we introduce a new feature, do we increment the nightly protocol version as well? Basically, if we have

```rust
#[feature("nightly-protocol")]
const PROTOCOL_VERSION_FEATURES_MAPPING = {
    41: [ProtocolFeatures::VersionedShardChunkHeader, ...],
}
```

and when we introduce EVM, do we change it to

```rust
#[feature("nightly-protocol")]
const PROTOCOL_VERSION_FEATURES_MAPPING = {
    41: [ProtocolFeatures::VersionedShardChunkHeader, ...],
    42: [ProtocolFeatures::VersionedShardChunkHeader, ProtocolFeatures::EVM, ...],
}
```

assuming that `ProtocolFeatures::VersionedShardChunkHeader` is not stable yet?
This way it is easier to upgrade on the testing networks. Otherwise we have to do a hard fork or figure out some other solution to upgrade to the new feature.
With the borsh limit, we can only have 255 features. We just need to make sure we don't need to serialize this.
> When we introduce a new feature, do we increment the nightly protocol version as well?
My understanding from @frol is yes: we should always introduce the nightly protocol version first, then bump the stable protocol version to some version <= the latest nightly protocol version.
Hm, given we want to iterate and fix things within one feature over some period of time, this means that nightly will need to hard fork.
Which means I'm not sure we need nightly protocol versions per se -- maybe there is just a "nightly protocol version" which includes whatever features are needed.
> Which means I'm not sure we need nightly protocol versions per se -- maybe there is just a "nightly protocol version" which includes whatever features are needed.
That means we have to do hard forks for every new feature. If we have versions, at least some of them, if not a majority, could be avoided. Basically, we only do hard forks if the new feature goes wrong and there is no way to fix it without changing the protocol.
Are we going to have a separate set of tests for nightly upgradability which covers all edge cases of changes for newly added features? E.g. if the decoding of a single function call in EVM changes, a hard fork is needed, but there are no tests checking this per se, and it's up to the developer to catch it.
E.g. I don't think it makes sense to ever upgrade the nightly chain (unless a lot more work is put into intermediate changes), and doing a hard fork on every release would totally make sense.
> E.g. I don't think it makes sense to ever upgrade the nightly chain (unless a lot more work is put into intermediate changes), and doing a hard fork on every release would totally make sense.
So you are saying that if we change upgradability itself, it just won't get tested?
I vote that we only deploy nightly over stable and don't bother with nightly-to-nightly migrations (dump some stable state and recover from it [hard fork] -- do it on a network that we control, like crashnet). Having granular control over individual features (`protocol_feature_evm`), we may introduce a `nightly_protocol_features` feature listing all of the nightly features, while keeping the `nightly_protocol` feature just a marker feature that switches the protocol version mapping. The features listed in the protocol version mapping are feature-gated, so you can compile a nightly build with only a single nightly feature enabled for testing:

```shell
$ cargo build --features nightly_protocol --features protocol_feature_evm
```

If you want to build all the nightly features:

```shell
$ cargo build --features nightly_protocol_features
```
Cargo.toml:

```toml
protocol_feature_evm = []
nightly_protocol = [] # always empty, just a marker feature
nightly_protocol_features = ["nightly_protocol", "protocol_feature_evm", ...]
```
As an afterthought, I think it makes sense to only extend the stable protocol features, so instead of:

```rust
#[cfg(not(feature = "nightly_protocol"))]
FEATURES_MAPPING = ...

#[cfg(feature = "nightly_protocol")]
FEATURES_MAPPING = ...
```

I suggest we do:

```rust
pub const PROTOCOL_VERSION: ProtocolVersion = {
    40 + if cfg!(feature = "nightly_protocol") { 1000 } else { 0 }
};

#[cfg(feature = "nightly_protocol")]
const NIGHTLY_PROTOCOL_VERSION = PROTOCOL_VERSION;

FEATURES_MAPPING = {
    #[cfg(feature = "protocol_feature_evm")]
    ProtocolFeatures::EVM => NIGHTLY_PROTOCOL_VERSION,
};
```
- Compiling with `protocol_feature_evm` but without `nightly_protocol` will fail to compile (clearly indicating that it is a nightly feature).
- Compiling with `protocol_feature_evm` and `nightly_protocol` enables only this single feature, so we can test it in isolation if needed.
- To promote the feature from nightly to stable, you must add `protocol_feature_evm` to the `default` features, and unless it is assigned a specific protocol version number (instead of `NIGHTLY_PROTOCOL_VERSION`), it will fail to compile (which may prevent human errors).
/cc @bowenwang1996
The problem with always doing a hard fork to compose features is that the upgrade path is never tested, and from my experience, that has been the cause of problems several times in the past.
That makes sense. Let's stick to an independent `FEATURES_MAPPING` then (though I would add some compile-time sanity checks preventing stable from having some features enabled without them being enabled in nightly).
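Such a sanity check could look like the following runtime sketch (the function name and feature strings are hypothetical; in practice this could live in a build script or a test, or be turned into a const assertion):

```rust
// Hypothetical sanity check: every feature enabled on stable must also be
// enabled on nightly. Names and feature strings are illustrative only.

fn check_stable_subset_of_nightly(stable: &[&str], nightly: &[&str]) -> bool {
    // Stable must be a subset of nightly.
    stable.iter().all(|feature| nightly.contains(feature))
}

fn main() {
    let stable = ["protocol_feature_versioned_block_header"];
    let nightly = [
        "protocol_feature_versioned_block_header",
        "protocol_feature_evm",
    ];
    // A stable feature missing from nightly would fail this check.
    assert!(check_stable_subset_of_nightly(&stable, &nightly));
    println!("ok");
}
```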
During a conversation with @evgenykuzyakov, I realized that there has been some miscommunication on how we are going to use nightly protocol features. I thought that they would always be implemented in a backward-compatible way, so that we can easily go from enabling one feature to enabling 3 features without problems. However, @evgenykuzyakov brings up a good point that he wants fast iteration on EVM and therefore needs to be able to make breaking changes easily. This, nonetheless, may lead to a number of issues (assuming that we are testing them on betanet first):
- We need to do a hard fork every time a new feature is introduced. Not only that, we also need to do a hard fork when more breaking changes are made to a feature after it is introduced.
- Since we do hard forks all the time, we still need to test the upgrade process for the features introduced before they are stabilized.
I think this is still achievable if we do the following:
- We increment `PROTOCOL_VERSION` guarded by `nightly_protocol` every time a new feature is introduced or some breaking change is made to an existing feature. We also need to change the corresponding protocol versions in `PROTOCOL_FEATURES_TO_VERSION_MAPPING`. Along with the change comes a hard fork.
- When we are ready to make a release (assuming at this point backward compatibility is addressed), we again increment `PROTOCOL_VERSION` and also the protocol versions in the mapping, without doing a hard fork. This way, after the nodes restart, they will upgrade to the new protocol by going through the normal upgrade process.
What do people think? @frol @SkidanovAlex @nearmax @ilblackdragon
Sounds good to me. My only ask is to have good infrastructure for testing feature-stabilization PRs, making sure the migrations are fine on real data and that older nodes can communicate with the new nodes (we already have some of it, but there are never enough tests).