substrate CI friendly fast mode for try state checks

Reviving https://github.com/paritytech/substrate/pull/13286 Should resolve https://github.com/paritytech/polkadot-sdk/issues/234

Adds a new fast mode to TryStateSelect which skips slow running state checks.

Adds two new options [fast-all | fast-try-state] to on-runtime-upgrade --checks.

Example usage:

 ./polkadot \
     try-runtime \
     --runtime runtime-try-runtime.wasm \
     -lruntime=debug \
     on-runtime-upgrade \
     --checks=fast-all \
     live \
     --uri ws://localhost:9944

Feb 13 '23 22:02 Ank4n

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

Apr 17 '23 21:04 stale[bot]

Updates here?

Apr 20 '23 08:04 kianenigma

Updates here?

Repeat 😁

May 16 '23 09:05 kianenigma

Updates here?

Repeat 😁

I am working on refreshing this PR.

May 26 '23 10:05 Ank4n

@liamaharon Thanks for raising those points. Its still all open to discussion but I will write how I was thinking about this.

This is a cool idea, but I'd like to raise the possibility that it may be better to completely turn off try-state checks in the CI than to have them run partially.

Currently the try-state checks are disabled on CI (it only runs pre- and post-upgrade checks). So in a way what you are suggesting is what is happening currently.

Run partially, the green CI check loses its meaning. It no longer would signify that try-state hooks are passing, rather just that they may be passing. To be sure they're actually passing, they would need to be run again somewhere else anyway, so there seems to be little value in running them in the CI.

Another way to think about it is, this would allow every pallet to write very thorough try state tests. Some of these tests may take really long time (probably hours) and it would be great to run all of them on CI but we also don't want CI to be stuck for hours. The pallet developer still believes other tests would ensure 99.9% scenarios are covered and this slow running test is something they run occasionally outside CI to ensure all storage items are consistent to the expectations.

To give a more concrete example (which is actually the reason why we wanted to introduce the fast mode) we can look at this try-state check in staking pallet. It iterates over all nominators (~ 44k) and then for each nominator, it fetches all active validators (~ 300), gets their exposure, iterates over all of its stakers (could be upto 512) and checks if the nominator (in the first loop) is only present once in a validator's exposure. This is cubic time complexity and takes more than 2 hours to run currently.

Also, most of the times, a failing try-state checks may not be introduced by the current PR but through a sneaky bug that we only found out about later. I think we might even want to keep these tests to allow_failure and these are just indications to investigate and not blockers to a PR.

To summarise, I think what this PR is trying to do is find a middle ground where we can run try state checks in the CI that would cover most of the inconsistent state scenario but still enable pallet developer write some more thorough but expensive tests that they can run outside CI if they want to. Its also important to emphasise try-state check should only be our 3rd or 4th line of defence.

At most, they could alert the dev to some failing hooks. But, the dev would still need to run the full set of try-state hooks somewhere else anyway.

The green check may provide a false sense of security to devs who're unaware that the full set of hooks are not run in the CI.

The feature comes with a cost: it adds extra configuration to the cli, and introduces cognitive overhead for developers working on the try-state hooks needing to decide whether a check should be 'fast mode' or not.

I agree we should look to improve this, make it more intuitive and better documented. By default every test should be marked as fast test unless we notice they are taking really long to run in the CI. I do think though that running 99% of the try state checks on CI and ignoring one is better alternative than not running anything on CI. If there are complex changes to a pallet logic, a developer should always run full suite (may be we can have a bot command that runs the full suite on demand).

May 29 '23 10:05 Ank4n

bot rebase

Jun 12 '23 22:06 Ank4n

Rebased

Jun 12 '23 22:06 paritytech-processbot[bot]

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

Jul 12 '23 23:07 stale[bot]

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

Yes, will be reworking on this.

Jul 13 '23 10:07 Ank4n

substrate substrate copied to clipboard

CI friendly fast mode for try state checks

substrate
substrate copied to clipboard