substrate icon indicating copy to clipboard operation
substrate copied to clipboard

`zombienet`: warp-sync tests for validators

Open michalkucharczyk opened this issue 2 years ago • 13 comments

Sketch of warp-sync test for validators. Not tested.

warp-sync is now failing with babe-skip-epochs, the reason is here: https://github.com/paritytech/substrate/pull/12751#issuecomment-1328032781 Follow-up of: #12769

michalkucharczyk avatar Jan 05 '23 14:01 michalkucharczyk

Dumb question. Can't we merge this test (0002) with 0001 by warp syncing two nodes there? I.e. by adding a warp sync for a validator in 0001?

davxy avatar Jan 06 '23 09:01 davxy

Dumb question. Can't we merge this test (0002) with 0001 by warp syncing two nodes there? I.e. by adding a warp sync for a validator in 0001?

For me those scenarios are slightly different, and different failures are observed. I would prefer to keep them separate. Just to summarize:

  1. 0001: full nodes only: three synced from db, one warp-synced,
  2. 0002: two validators warp-synced, three full nodes synced from db
  3. 0003: validators synced from db, one full node synced from db, full node warp-synced

michalkucharczyk avatar Jan 06 '23 11:01 michalkucharczyk

Yeah. Your are right, better to separate the scenarios.


So as we've discussed in the chat we have 2 issues to resolve.

  1. (test case 0002) : validator fails after being warp synced
  2. (test case 0003) : full node fails after being warp synced

Both the cases are due to epoch skipping feature not handling two edge cases.

Case 3 has been already deeply discussed https://github.com/paritytech/substrate/pull/12751#issuecomment-1328032781 and we need a fix for the gap. I may attempt to relax the gap checks when searching for epoch data.


Case 1.

**TL;DR: the issue here is that the oracle is_major_syncing() returns false (sync state is Idle) but the blocks were not all imported yet by the backend **

Long explanation:

Here the failure is due to a validator trying to build a new block after being warp synced.

This is the execution stack triggering the failure (less depth first)

  • next_slot() is called by the block authoring worker when is time to attempt block production
  • create_inherent_data_provider() closure called with the last best block (i.e. the one the sync leaves us with) here
  • create_uncles_inherent_data_provider() called here
  • Client::uncles() called here
  • ...
  • Starting from current best block (0x5448..., i.e. the point where the warp sync positioned us) we begin collecting uncles here.
    • (hash=0x5448..., height=12133) => OK (THIS IS THE BEST BLOCK according to warp sync)
    • (hash=0x9e94..., height=12132) => OK
    • (hash=0xaea8..., height=12131) => OK (THIS IS THE LAST FINALIZED BLOCK according to warp sync)
    • (hash=0x2e3b..., height=12130) => FAIL (block header not found error)

That is a bit strange since when the oracle returns is_major_syncing() == false the headers should have been downloaded (i.e. we don't attempt block authoring if we've not downloaded the blocks)

So here instead of return I've inserted a continue.

And indeed it fails 2 times and then at the third attempt it works.

If we print the logs of babe we see that when the next_slot() is invoked the block_import for old blocks has not finished to import all the blocks yet.

When all the blocks are imported, 0x2e3b is now present.

(Note that the blocks from last finalized (12131) to best (12133) are imported first... then the import begins from 1, 2 ,3 ...)

To wrap-up: blocks-downloaded != blocks-imported In next_slot() is not sufficient to check if the network has finished downloading the blocks. We also need to check if we imported the one below the last finalized.

davxy avatar Jan 06 '23 19:01 davxy

Test case 3 (https://github.com/paritytech/substrate/pull/13078#issuecomment-1374021848) should be fixed by ~https://github.com/paritytech/substrate/pull/13106~ https://github.com/paritytech/substrate/pull/13127

davxy avatar Jan 09 '23 18:01 davxy

bot rebase

michalkucharczyk avatar Jan 13 '23 11:01 michalkucharczyk

Rebased

Test case 2 (https://github.com/paritytech/substrate/pull/13078#issuecomment-1374021848) should be fixed by https://github.com/paritytech/substrate/pull/13155

davxy avatar Jan 16 '23 15:01 davxy

That is a bit strange since when the oracle returns is_major_syncing() == false the headers should have been downloaded (i.e. we don't attempt block authoring if we've not downloaded the blocks)

Dumb question: why cannot is_major_syncing be changed to reflect the state of import?

michalkucharczyk avatar Jan 17 '23 09:01 michalkucharczyk

bot rebase

michalkucharczyk avatar Jan 18 '23 10:01 michalkucharczyk

Rebased

Requires https://github.com/paritytech/substrate/pull/13135

davxy avatar Jan 18 '23 13:01 davxy

bot rebase

michalkucharczyk avatar Jan 25 '23 11:01 michalkucharczyk

Rebased

Test were not passing due to this: https://github.com/paritytech/zombienet/issues/700

michalkucharczyk avatar Jan 26 '23 19:01 michalkucharczyk