test: rejoin test with slow catch-up

Open mraszyk opened this issue 1 month ago • 0 comments

This PR adds a node rejoin test with a slow catch-up provoked by long DSM rounds:

  • creating many canisters (100,000) - so that iterating over all canisters is slow;
  • deploying a few "busy" canisters - so that executing those canisters is slow.

The test can be run using the following command:

ict t //rs/tests/message_routing:rejoin_test_long_rounds

Runbook:

  • set up a testnet of 3f + 1 nodes with f = 4, i.e. 13 nodes (like on mainnet);
  • pick a random node and install 4 "seed" canisters through it (the state sync test canister is used as "seed");
  • create 100,000 canisters via the "seed" canisters (in parallel);
  • deploy 8 "busy" canisters (universal canister with heartbeats executing 1.8B instructions);
  • pick another random node and kill that node;
  • wait for the subnet to produce a CUP;
  • start the killed node.
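
The sizes implied by the runbook can be worked out with a little arithmetic. The following is only a rough sketch, not code from the test itself: the constant and function names are hypothetical, and the even split of canisters across the "seed" canisters and the "one heartbeat per busy canister per round" figure are assumptions made for illustration.

```python
# Parameters taken from the runbook above; all names are illustrative only.
F = 4                                   # tolerated faulty nodes
NUM_SEED_CANISTERS = 4                  # "seed" canisters installed via a random node
NUM_CANISTERS = 100_000                 # canisters created via the seeds
NUM_BUSY_CANISTERS = 8                  # "busy" universal canisters
HEARTBEAT_INSTRUCTIONS = 1_800_000_000  # 1.8B instructions per heartbeat


def subnet_size(f: int) -> int:
    """Number of nodes in a subnet of 3f + 1 nodes."""
    return 3 * f + 1


def canisters_per_seed(total: int, seeds: int) -> int:
    """Canisters each seed creates, ASSUMING an even split across seeds."""
    return total // seeds


def busy_instructions_per_round(busy: int, per_heartbeat: int) -> int:
    """Heartbeat instructions per round, ASSUMING every busy canister
    runs exactly one heartbeat in each round."""
    return busy * per_heartbeat


print(subnet_size(F))                                                # 13
print(canisters_per_seed(NUM_CANISTERS, NUM_SEED_CANISTERS))         # 25000
print(busy_instructions_per_round(NUM_BUSY_CANISTERS,
                                  HEARTBEAT_INSTRUCTIONS))           # 14400000000
```

Under these assumptions, each round has to iterate over 100,000 canisters and execute on the order of 14.4B heartbeat instructions, which is what makes the rounds long.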

Success: the restarted node catches up w.r.t. its certified height and becomes healthy until the next CUP.
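
One possible reading of this success criterion can be sketched as a predicate over observed heights. This is not the check used by the test: the function name, the sample format, and the `max_lag` tolerance are all hypothetical, and treating the next CUP height as a deadline is an assumption about what "until the next CUP" means.

```python
from typing import Iterable, Tuple


def caught_up_before_next_cup(
    samples: Iterable[Tuple[int, int]],
    next_cup_height: int,
    max_lag: int = 0,
) -> bool:
    """Return True if the restarted node closes its certified-height lag
    before the subnet passes `next_cup_height`.

    `samples` is a time-ordered sequence of
    (subnet_certified_height, node_certified_height) pairs (hypothetical format).
    """
    for subnet_height, node_height in samples:
        if subnet_height > next_cup_height:
            # The subnet moved past the next CUP before the node caught up.
            return False
        if subnet_height - node_height <= max_lag:
            # The node's certified height is within the allowed lag.
            return True
    # Samples ran out without the node ever catching up.
    return False


observed = [(100, 0), (150, 90), (200, 198), (250, 250)]
print(caught_up_before_next_cup(observed, next_cup_height=300))  # True
print(caught_up_before_next_cup(observed, next_cup_height=200))  # False
```

In the second call the node only catches up at subnet height 250, after the assumed deadline of 200, so the predicate reports failure.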

In the attached screenshot showing how much the restarted node is lagging behind, we see that the restarted node is catching up only very slowly at the moment.

[Image: Screenshot from 2025-12-14 23-34-39]

mraszyk, Dec 08 '25 11:12