cosmos-sdk
fix: snapshot recover from exporter error
Description
During snapshots, it is difficult to tell which store was not able to export properly.
In the case of the Dig chain, one store was not fully configured (no nodedb), so the returned exporter should be nil.
Cosmos-sdk can then decide whether to panic or skip the possibly unused store.
Added a print when each store begins and completes an export.
This related update in cosmos/iavl properly returns a nil exporter on error: https://github.com/cosmos/iavl/pull/622/files#diff-9d9ca52939a19f26358e2b5b35ca938679b381be35f4c54379df94c4fa4d1233R44
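For illustration, here is a minimal sketch of the export-loop behavior described above. The names `treeExporter`, `exportFn`, and `exportStores` are hypothetical stand-ins for the real rootmulti/iavl types; the point is only to show the per-store begin/done logging and the nil-exporter handling, not the actual implementation in this PR.

```go
package main

import (
	"errors"
	"log"
)

// treeExporter and exportFn are hypothetical stand-ins for the iavl exporter
// and tree.Export(); after the cosmos/iavl change linked above, a misconfigured
// tree (no nodedb) is expected to yield a nil exporter instead of panicking.
type treeExporter interface {
	Close()
}

type exportFn func() (treeExporter, error)

// exportStores sketches the per-store logging and nil-exporter handling: log
// when each store begins and completes, and surface exactly which store could
// not produce an exporter.
func exportStores(stores map[string]exportFn) {
	for name, export := range stores {
		log.Printf("Snapshot.Export Store Key %s", name)

		exporter, err := export()
		if err != nil || exporter == nil {
			// The caller can decide to skip (possibly unused store) or panic.
			log.Printf("Snapshot.Export Store Key %s exporter nil: %v", name, err)
			continue
		}

		// ... stream the exporter's nodes into the snapshot here ...
		exporter.Close()

		log.Printf("Snapshot.Export Store Key Done %s", name)
	}
}

func main() {
	exportStores(map[string]exportFn{
		"icahost": func() (treeExporter, error) { return nil, errors.New("tree.ndb is nil") },
	})
}
```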
Logging for rootmulti was previously hardcoded to NewNopLogger, which did NOT log.
Now developers/users can set an environment variable to turn on logging for this package:
LOG_ROOTMULTI=1 daemond start
Author Checklist
- [X] included the correct type prefix in the PR title
- [X] targeted the correct branch (see PR Targeting)
- [X] provided a link to the relevant issue or specification
- [ ] followed the guidelines for building modules
- [ ] included the necessary unit and integration tests
- [ ] added a changelog entry to CHANGELOG.md
- [X] included comments for documenting Go code
- [X] updated the relevant documentation or specification
- [X] reviewed "Files changed" and left comments if necessary
- [ ] confirmed all CI checks have passed
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add your handle next to the items reviewed if you only reviewed selected items.
I have...
- [ ] confirmed the correct type prefix in the PR title
- [ ] confirmed `!` in the type prefix if API or client breaking change
- [ ] confirmed all author checklist items have been addressed
- [ ] reviewed state machine logic
- [ ] reviewed API design and naming
- [ ] reviewed documentation is accurate
- [ ] reviewed tests and test coverage
- [ ] manually tested (if applicable)
Example output which is very helpful for debugging snapshots:
12:00PM INF creating state snapshot height=4412550
12:00PM INF indexed block exents height=4412550 module=txindex
Snapshot.Export Store Key acc
Snapshot.Export Store Key Done acc
Snapshot.Export Store Key authz
Snapshot.Export Store Key Done authz
Snapshot.Export Store Key bank
Snapshot.Export Store Key Done bank
Snapshot.Export Store Key capability
Snapshot.Export Store Key Done capability
Snapshot.Export Store Key distribution
Snapshot.Export Store Key Done distribution
Snapshot.Export Store Key evidence
Snapshot.Export Store Key Done evidence
Snapshot.Export Store Key feegrant
Snapshot.Export Store Key Done feegrant
Snapshot.Export Store Key gov
Snapshot.Export Store Key Done gov
Snapshot.Export Store Key ibc
Snapshot.Export Store Key Done ibc
Snapshot.Export Store Key icahost
iavl/export newExporter failed to create because tree.ndb is nil <-- this is helpful
Snapshot.Export Store Key icahost exporter nil
Snapshot.Export Store Key mint
Snapshot.Export Store Key Done mint
Snapshot.Export Store Key params
Snapshot.Export Store Key Done params
Snapshot.Export Store Key slashing
Snapshot.Export Store Key Done slashing
Snapshot.Export Store Key staking
Snapshot.Export Store Key Done staking
Snapshot.Export Store Key transfer
Snapshot.Export Store Key Done transfer
Snapshot.Export Store Key upgrade
Snapshot.Export Store Key Done upgrade
Snapshot.Export Store Key wasm
Snapshot.Export Store Key Done wasm
12:05PM INF completed state snapshot format=2 height=4412550
It could be argued that a program should crash when a store doesn't have a nodedb since that could indicate a deeper problem. If that is desired let me know.
The idea here is to make it clear exactly which misconfigured store is crashing the daemon, rather than panicking in a location which is called many times under different conditions.
In the case of Dig Chain, snapshots without the icahost store restored fine and validator nodes are signing with the proper apphash values. So it is likely an unused store that was mistakenly added, or mistakenly left unconfigured.
Changed all fmt.Printf calls to rs.logger.
Found out that the original logger was a NopLogger which DID NOT respect the CLI --log_level flag.
The code path does not make it obvious how to attach to the global app logger, nor does it make sense to turn on consensus debugging when trying to debug rootmulti.
Thus we allow a runtime setting of environment variable LOG_ROOTMULTI to turn on logging for this package.
Example usage:
LOG_ROOTMULTI=1 daemond start
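A minimal sketch of the LOG_ROOTMULTI gate, assuming Tendermint's libs/log package as the logger; the function name `newRootMultiLogger` and the exact construction in this PR may differ:

```go
package rootmulti // illustrative placement; the actual file layout may differ

import (
	"os"

	"github.com/tendermint/tendermint/libs/log"
)

// newRootMultiLogger (hypothetical name) returns a NopLogger by default, which
// matches the previous hardcoded behavior, and a real stdout logger when the
// LOG_ROOTMULTI environment variable is set.
func newRootMultiLogger() log.Logger {
	if os.Getenv("LOG_ROOTMULTI") != "" {
		return log.NewTMLogger(log.NewSyncWriter(os.Stdout))
	}
	return log.NewNopLogger()
}
```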
https://github.com/cosmos/cosmos-sdk/issues/12228
Do not merge, found a bug (noted above)
Here's an update on this PR.
Previously the effort was focused on preventing panic/nil/crashes during a snapshot.
After reviewing chains with the current crashing problem, we have determined that chains introducing a new store without a keeper and a call to `SetStoreLoader(upgradetypes.UpgradeStoreLoader(upgradeInfo.Height, ...))` will introduce the new store at version 0.
This new store will then continue to increment its version by 1 on each block, as enforced by cosmos/iavl.
The crash then appears when rootmulti must select every store at the current block height.
However, the improperly introduced store does not have a "version" matching the current height, so any attempt to snapshot fails.
For example, if your rootmulti has stores at height 1234:
bank - version 1234
slashing - version 1234
Then a chain adds a module such as authz.
This results in a commit with stores in rootmulti like:
bank - version 1234
slashing - version 1234
authz - version 0
A proper snapshot now needs to consider the actual referenced version of each store during the commit. This PR now loads the commit, determines the actual version of each store, and places that information into the snapshot.
The snapshot protobuf format had to be modified because the restore function can no longer assume that all heights match. Since IAVL checksums include the store version, nodes will hit an AppHash error on the first modification to that store, because the new checksum will include the mismatched version.
For example, all other nodes on the network 100 blocks after the upgrade will have:
bank - version 1334
slashing - version 1334
authz - version 100
But a restored node that assumes all restored stores have the same height will have:
bank - version 1334
slashing - version 1334
authz - version 1334
To avoid this, we utilize the newly added Version field to ensure that authz is restored to version 100, matching all other validator nodes.
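As a sketch of the restore-side change (illustrative names only; `snapshotStoreItem`, `storeImporter`, and `restoreStores` are stand-ins for the real snapshot/iavl types), the restorer uses the per-store version recorded in the snapshot instead of assuming every store sits at the snapshot height:

```go
package snapshots // illustrative package; not the exact file in this PR

// snapshotStoreItem stands in for the protobuf store item extended by this PR
// with a per-store Version field.
type snapshotStoreItem struct {
	Name    string
	Version int64
}

// storeImporter mirrors the idea of iavl's import entry point
// (MutableTree.Import(version)) without depending on the real type.
type storeImporter interface {
	Import(version int64) error
}

// restoreStores imports each store at its recorded version, falling back to
// the snapshot height for older formats that carry no per-store version.
func restoreStores(snapshotHeight int64, items []snapshotStoreItem, importers map[string]storeImporter) error {
	for _, item := range items {
		version := item.Version
		if version == 0 {
			version = snapshotHeight
		}
		if err := importers[item.Name].Import(version); err != nil {
			return err
		}
	}
	return nil
}
```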
This avoids the situation where a team deploying an upgrade in an imperfect fashion immediately penalizes all validators by breaking snapshot creation and statesync restore. The fact that a team is not perfect should not manifest as a panic, especially since the chain works otherwise.
Please note, we encountered issues using the protobuf generation tooling and would appreciate it if another member could run the protobuf generation and see if the results match or if additional corrections are needed.
The output below documents why help is needed to properly regenerate the snapshot protobuf pb.go files.
This error is current as of cosmos/cosmos-sdk main branch at commit 25449b5812c42e8528daee054120f2deddb3e717
% make proto-format
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: clang-format -i ./tx/textual/internal/testpb/1.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: clang-format -i ./proto/cosmos/base/store/snapshots/v1beta1/snapshot.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: clang-format -i ./proto/cosmos/base/store/v1beta1/listening.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: clang-format -i ./proto/cosmos/evidence/v1beta1/tx.proto
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: clang-format -i ./proto/cosmos/evidence/v1beta1/evidence.proto
% make proto-gen
Generating Protobuf files
Generating gogo proto code
cp: can't stat 'github.com/cosmos/cosmos-sdk/*': No such file or directory
make: *** [Makefile:400: proto-gen] Error 1
With a manual run inside the docker image:
/workspace/proto $ buf generate --template buf.gen.gogo.yaml ./cosmos/base/store/snapshots/v1beta1/snapshot.proto
Does not produce an output file. No errors are reported.
/workspace/proto $ buf --version
1.9.0
Thank you for the review @alexanderbez
Noted above is that protobuf-to-Go code generation appears broken. Is code generation working for your team? It would be best to have matching .proto and .go code in the final commit.
@julienrbrt any idea why this is happening, I can't reproduce it
> @julienrbrt any idea why this is happening, I can't reproduce it
I cannot reproduce it either. It works for me in GitHub Codespaces as well. @chillyvee can you retry in Codespaces too, and what is your OS? @likhita-809 I believe you had something similar; have you discovered why?
It works in GitHub Codespaces but not on Ubuntu 20.04 x64.
This is the OS I'm using for the x64 dev environment
NAME="Ubuntu"
VERSION="20.04.4 LTS (Focal Fossa)"
REPOSITORY TAG IMAGE ID CREATED SIZE
ghcr.io/cosmos/proto-builder 0.11.2 32668103e373 4 weeks ago 1.24GB
I will delete the docker image I have and pull it again to see what happens - it still does not work:
# docker rmi ghcr.io/cosmos/proto-builder:0.11.2
# docker pull ghcr.io/cosmos/proto-builder:0.11.2
0.11.2: Pulling from cosmos/proto-builder
213ec9aee27d: Already exists
4583459ba037: Pull complete
93c1e223e6f2: Pull complete
cbef1b99e503: Pull complete
f99825e86aac: Pull complete
be55fa1995db: Pull complete
b05bebbcb2df: Pull complete
54962337969b: Pull complete
5aff3b65794d: Pull complete
530c7ee0eccf: Pull complete
d68ed598193d: Pull complete
4e4099510e6a: Pull complete
94369a9c35c8: Pull complete
Digest: sha256:538f334c1069f2dd544e9f9997ada75c7fb03da21ae2816d8029de172d9c2365
Status: Downloaded newer image for ghcr.io/cosmos/proto-builder:0.11.2
ghcr.io/cosmos/proto-builder:0.11.2
% make proto-gen
Generating Protobuf files
Generating gogo proto code
cp: can't stat 'github.com/cosmos/cosmos-sdk/*': No such file or directory
make: *** [Makefile:403: proto-gen] Error 1
REPOSITORY TAG IMAGE ID CREATED SIZE
ghcr.io/cosmos/proto-builder 0.11.2 32668103e373 4 weeks ago 1.24GB
In codespaces
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
ghcr.io/cosmos/proto-builder 0.11.2 32668103e373 4 weeks ago 1.24GB
And the codespaces output is much more verbose
make proto-gen
Generating Protobuf files
Generating gogo proto code
go: downloading github.com/cosmos/gogoproto v1.4.3
go: downloading github.com/grpc-ecosystem/go-grpc-middleware v1.3.0
...
Looks like something is preventing the go build/mod process from completing properly.
Same docker image. Must just be something very strange with that particular release of Ubuntu+Docker?
Will set up alternate dev environment/use codespaces if that's all it takes to fix. Might not be worth the time to debug.
> @julienrbrt any idea why this is happening, I can't reproduce it
> I cannot reproduce it either. It works for me in GitHub Codespaces as well. @chillyvee can you retry in Codespaces too, and what is your OS? @likhita-809 I believe you had something similar; have you discovered why?
I believe it has something to do with the Ubuntu+Docker release, as mentioned by @chillyvee. I tried debugging in several different ways and nothing seems to work locally, but it works fine in Codespaces with the same docker image.
If a module has a different version than the other modules, this would be considered a bug from my understanding. If so, the chain should fix the issue with a halt and restart. Is this what has been happening in the wild, @chillyvee?
@chillyvee passing of the logger was added here: https://github.com/cosmos/cosmos-sdk/pull/14438. We can remove the usage of the env variable.
@tac0turtle - Passing logger - good
Regarding the different heights: the chains in the wild have different heights because new stores are not properly added using migrations. New stores start at height 1(?) and do not match other stores.
Affected Chains documented here: Dig, Konstellation, Akash, Stride
https://github.com/cosmos/cosmos-sdk/issues/13933
We were also lucky enough to prevent a problem on the defund testnet by detecting/patching just hours before an upgrade took place. Not sure we'll be able to catch this for all upgrades for all chains.
Fixing "the bug" is also hindered because the implementation prevents incrementing store heights by more than 1. The only time a height can be incremented to the current height is during a migration where the set height is not an increment.
The known paths to address this bug are:
- Use an exporter, as in this PR, which snapshots the actual data as committed/stored rather than assuming matching heights
- Change underlying APIs: Remove checks to allow arbitrary height changes
- Destroy the store and all data within it
- Move data to a new store and delete the existing one (no tools currently exist for this)
> Regarding the different heights: the chains in the wild have different heights because new stores are not properly added using migrations. New stores start at height 1(?) and do not match other stores.
I think this is the issue we should try to fix. We should prevent this footgun; everything else is a band-aid for a larger issue.
We should add a check that if a store is added at version 1 on a chain that has more than 1 version, it should not start. Then people would find the issue during testing and compiling.
> We should add a check that if a store is added at version 1 on a chain that has more than 1 version, it should not start. Then people would find the issue during testing and compiling.
This seems like a sensible check to add. @chillyvee does or can this PR tackle that too?
Will check to see if there's a "reasonable" way to add such a check. Last time I checked, I didn't see an obvious place to do it beautifully. It's 2023 now, so maybe a new year will reveal a great location.
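For discussion, a rough sketch of what such a startup check might look like; `checkStoreVersions` is a hypothetical helper, not code in this PR:

```go
package rootmulti // illustrative placement

import "fmt"

// checkStoreVersions (hypothetical) refuses to start when a store sits at
// version 0 or 1 while the multistore is already far ahead, which is the
// signature of a store added without a proper upgrade migration.
func checkStoreVersions(latestVersion int64, storeVersions map[string]int64) error {
	for name, v := range storeVersions {
		if latestVersion > 1 && v <= 1 {
			return fmt.Errorf("store %q is at version %d while the chain is at version %d; "+
				"it was likely added without a proper upgrade migration", name, v, latestVersion)
		}
	}
	return nil
}
```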
As a side note, Lum Network likely ran into this same issue today during their upgrade and is currently losing consensus quite often due to nodes panicking on various block heights (depending on pruning settings).
@chillyvee there's been quite a bit of discussion on this PR. Is it in a ready state to review and merge or do we need to address more things?
@tac0turtle Let's come to an agreement on what will be accepted as the goal of this PR:
1. Logging - It seems this will be allowed to merge?
2. Snapshot Creation - Should we assume heights to match across stores and panic upon mismatch? Or use snapshot format 4 as provided to ensure snapshots are valid even if heights mismatch? https://github.com/cosmos/cosmos-sdk/pull/13935/files#diff-2bc7a86169d4c2f891d7ae74b21ca4bcc5787d10114e2eee215fe1ddba5c999eR816
As a validator, we prefer to have an enhanced snapshot format which is less prone to breakage. We've hotfixed 4-6 chains in recent memory with regard to this problem.
The more robust snapshot version 4 with height information reflects the actual state of the chain and comes at the expense of slightly more complex snapshot code. However, it is no more complex than the code used to load commits to start running a chain, so it is not a wild increase in complexity.
Is there a case where breaking statesync is the preferred outcome of a slight chain misconfiguration?
> 2. Snapshot Creation - Should we assume heights to match across stores and panic upon mismatch? Or use snapshot format 4 as provided to ensure snapshots are valid even if heights mismatch?
I would say the latter. Different store versions aren't supported within the store package today. With the PR from Notional it should be mitigated, so in the future chains shouldn't run into this problem. We will also work on the lack of documentation related to upgrading to assist in avoiding this.
> The more robust snapshot version 4 with height information reflects the actual state of the chain and comes at the expense of slightly more complex snapshot code. However, it is no more complex than the code used to load commits to start running a chain, so it is not a wild increase in complexity.
How does the height help at the item level when it exists at the snapshot level? https://github.com/cosmos/cosmos-sdk/blob/eb86b68caea0d2e5909512bd3163aab192f2aa27/store/snapshots/types/snapshot.pb.go#L28
> How does the height help at the item level when it exists at the snapshot level?
IAVL Merkle tree roots change when store heights do not match. When the actual chain has mismatched heights but the snapshot restore assumes all heights match, the restored node immediately panics with an AppHash error.
Having the heights ensures that snapshots are restored as expected by current consensus.
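As a simplified illustration only (this is a toy encoding, not iavl's real node hashing, and `leafHashSketch` is a made-up name): the version is part of the hashed preimage, so a store restored at the wrong version yields a different root hash and therefore an AppHash mismatch at the next write to that store.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// leafHashSketch is a toy stand-in for an IAVL leaf hash. The real encoding is
// different, but the key point holds: the node version participates in the
// hashed preimage.
func leafHashSketch(version int64, key, value []byte) [32]byte {
	valueHash := sha256.Sum256(value)
	preimage := fmt.Sprintf("leaf|version=%d|key=%x|valueHash=%x", version, key, valueHash)
	return sha256.Sum256([]byte(preimage))
}

func main() {
	k, v := []byte("balance"), []byte("100")
	fmt.Printf("version 100:  %x\n", leafHashSketch(100, k, v))
	fmt.Printf("version 1334: %x\n", leafHashSketch(1334, k, v)) // different hash for the same data
}
```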
> Different store versions aren't supported within the store package today.
Yes, different store heights are not supported. The trouble most chains faced was that they were able to operate and reach consensus even without that support.
We can use the approach stated above and assume the problem will not occur again if that's the preferred solution.
With your feedback, we'll adjust the PR to address the desired points. Thank you @tac0turtle!
Oh sorry, I confused height with version. I think it makes more sense to call height "version" in the snapshot; it's the nomenclature in IAVL. Sorry for the confusion.
> Oh sorry, I confused height with version. I think it makes more sense to call height "version" in the snapshot; it's the nomenclature in IAVL. Sorry for the confusion.
Pretty sure you are right. Shall we include the version in the snapshot format to ensure statesync works even when the chain is not ideally configured, or remove it, assume there are no problems, and allow statesync to break?
As a note, this will also fix snapshots and statesync for chains that are currently broken and allow statesync to function until a formal fix is issued (sometimes taking two upgrades: one to remove the misconfigured store and another to add the configured store back; this may involve data loss since there is no guidance on how to migrate data between two instances of the same store at different versions during a migration).
@chillyvee what's the latest with the PR? I notice a handful of the comments I left are still unaddressed. Can we push this over the finish line?
Will try to find some time later this week to finish up.
@alexanderbez - Your comments overall are good (even though I have not individually responded to each of them). Will clean up the code as you suggest.
If we assume store versions/heights always need to be the same, do we still need this PR? Maybe the debugging part only, not the snapshot change?
If we have agreement on the above I'll knock out the rest.
> If we assume store versions/heights always need to be the same, do we still need this PR? Maybe the debugging part only, not the snapshot change?
> If we have agreement on the above I'll knock out the rest.
Correct, we will adjust this PR to be "debug info" only to work along with the "guaranteed height match". We should have a moment in the next few days to update the PR and test it now that Juno testnet is stable.
Amazing, let me know if you need any help.