
Timeout during new block creation

Open x3c41a opened this issue 9 months ago • 4 comments

I'm running a simple script - https://github.com/paritytech/ahm-dryrun/pull/17/files#diff-dcdc3e0b3362edb8fec2a51d3fa51f8fb8af8f70247e06d9887fa934834c9122

It checks the state of the migration and keeps creating new blocks on the Relay Chain and Asset Hub. However, I keep getting RPC timeouts from both the remote endpoint (usually after 2-4 blocks, once it got to 13) and the local network (usually after 5-7 blocks).

Such a small number of blocks created before the failure looks suspicious. However, the script is simple and I don't see a reason why it's failing, other than https://github.com/paritytech/polkadot-sdk/issues/8035. That said, await dev.newBlock() waits for the new block to be created before sending subsequent requests... Some details: the Relay Chain withdraws around 700 accounts within one block, splits them into 2 XCM Transact calls and sends them to the Asset Hub. The Asset Hub processes the XCMs one by one and sends confirmations back to the Relay Chain. Once the Relay Chain receives all the confirmations, it proceeds with the next batch of accounts.
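
For context, the core of the loop is roughly this (a simplified sketch, not the exact script; endpoints, ports and overrides are placeholders, and the stage check mirrors what the script queries):

import { setupNetworks } from '@acala-network/chopsticks-testing'

// Simplified sketch: config details omitted for brevity.
const { polkadot, assetHub } = await setupNetworks({
    polkadot: { endpoint: 'ws://localhost:9945', port: 8000 },
    assetHub: { endpoint: 'ws://localhost:9944', port: 8001 },
})

let stage = (await polkadot.api.query.rcMigrator.rcMigrationStage()).toHuman();
let stageName = stage ? Object.keys(stage)[0] : null;

while (stageName == 'AccountsMigrationOngoing') {
    // Each dev.newBlock() call goes over ws to the chopsticks instance and is
    // awaited before the next request is sent; these are the calls that time out.
    await polkadot.dev.newBlock();
    await assetHub.dev.newBlock();

    stage = (await polkadot.api.query.rcMigrator.rcMigrationStage()).toHuman();
    stageName = stage ? Object.keys(stage)[0] : null;
}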

Do you have an idea how to avoid RPC timeouts?

x3c41a avatar Mar 31 '25 19:03 x3c41a

@xlc @ermalkaleci

x3c41a avatar Mar 31 '25 19:03 x3c41a

what RPC node are you using? if it is a local one, what's the CPU usage?

can you run with LOG_LEVEL=trace and share the logs?

maybe you need to enable some logs on the RPC node side and open an issue against polkadot-sdk. it is not really a Chopsticks issue when the RPC times out.

xlc avatar Mar 31 '25 23:03 xlc

what RPC node are you using? if it is a local one, what's the CPU usage?

I used both remote and local. For the local one, I monitored resource usage with htop: CPU usage was between 0.8% and 1.5% and memory between 27% and 31%, which seems reasonable.

can you run with LOG_LEVEL=trace and share the logs?

https://gist.github.com/x3c41a/6f44974c5427f90c6778ed1b658984f8

maybe you need to enable some logs on the RPC node side and open an issue against polkadot-sdk.

I was running omni-node with -lruntime=trace but it did not reveal any useful information.

I noticed a spike in block times while running the tests. However, it might just be noise because it happens in roughly 1 in 5 runs.

x3c41a avatar Apr 01 '25 09:04 x3c41a

UPD: I was also experimenting with subway connected to both RC and AH and it did not help either

x3c41a avatar Apr 01 '25 10:04 x3c41a

My colleague mentioned that he had a similar problem with subalfred and that a fix resolved it. The problem was that some values were too big and were timing out, so he reduced the page/batch size to 1 and increased the message & frame sizes. Decreasing the batch size to 1 creates another problem: the migration becomes extremely slow. I tried setting max_batch_size in subway to 1 and running the migration (just to make sure that it helps), but it failed with the same timeout...

I remember overriding Chopsticks' batch size to smaller values, but it threw an error saying that the batch size can't be smaller than the page size (or something like that; I remember it was strongly coupled to the page size), and page_size was a function parameter coming from multiple places.

Given that it's a matter of fetching big storage values, can't we just increase the ws timeouts in Chopsticks? Decreasing batch_size would be another option, but it would create yet another problem with a slow migration. @xlc what do you think?

x3c41a avatar Apr 01 '25 16:04 x3c41a

made 1.0.4 release with rpc-timeout config
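
e.g. in the network config, something like this (just a sketch; the exact key placement and units are my assumption about the new option, value is only an example):

const { polkadot } = await setupNetworks({
    polkadot: {
        endpoint: 'wss://polkadot.rpc.permanence.io',
        port: 8000,
        // assumed key name for the new option; intended to raise the timeout
        // for chopsticks' own rpc requests to the upstream endpoint
        'rpc-timeout': 600_000,
    },
})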

xlc avatar Apr 02 '25 01:04 xlc

cc: @serban300

x3c41a avatar Apr 02 '25 06:04 x3c41a

I set the timeout to 10 mins and added some logs in the rpc-provider.

The RPC method that is failing is dev_newBlock:

Failing request: dev_newBlock with params:
501 |             const handler = this.__internal__handlers[ids[i]];
502 |             console.log(`Request in progress: ${handler.method}. Running time: ${(now - handler.start) / 1000} seconds. Params: ${handler.params}.`);
503 |             if ((now - handler.start) > this.__internal__timeout) {
504 |                 try {
505 |                     console.log(`Failing request: ${handler.method} with params: ${handler.params}`);
506 |                     handler.callback(new Error(`No response received from RPC endpoint in ${this.__internal__timeout / 1000}s`), undefined);
                                           ^
error: No response received from RPC endpoint in 600s
      at __internal__timeoutHandlers (/media/serban/data/workplace/sources/chopsticks/node_modules/@polkadot/rpc-provider/ws/index.js:506:38)
✗ test migration [651678.47ms]

From what I understand, this is a method defined in chopsticks, so I don't think it's a node RPC server error. Also attaching the entire log.

@xlc @x3c41a Can this be a chopsticks issue? Or a migration issue, maybe?

serban300 avatar Apr 02 '25 15:04 serban300

I am checking whether it's a migration issue by running zombie-bite. It just got fixed today, so I am still working through it...

x3c41a avatar Apr 02 '25 15:04 x3c41a

from the logs, there is no response to the last dev_newBlock call. that's unexpected. what's your testing script?

xlc avatar Apr 02 '25 20:04 xlc

Yes, this can totally happen, but it's not a chopsticks timeout; it's the WsProvider connected to chopsticks calling dev_newBlock while chopsticks is still working to build the new block (maybe because it is reading the entire state).

ermalkaleci avatar Apr 03 '25 04:04 ermalkaleci

Now the whole thing makes sense: it's your WsProvider connected to chopsticks which is timing out. Obviously, because it will take chopsticks a long, long time to go through the entire state.

ermalkaleci avatar Apr 03 '25 04:04 ermalkaleci

If I remember correctly, Serban was using this script and also added timeout: 60000 (not the newly added rpc-timeout) as part of the Relay and AH config.

x3c41a avatar Apr 03 '25 06:04 x3c41a

Now the whole thing makes sense: it's your WsProvider connected to chopsticks which is timing out. Obviously, because it will take chopsticks a long, long time to go through the entire state.

You mean the ws provider that is created as part of the setupNetworks call? If I understood the chopsticks implementation correctly:

  • chopsticks itself spawns multiple ws connections to the endpoint provided in the config (the local node in this case),
  • it returns a ws connection to the chopsticks instance.

The latter is what's failing, so we should keep increasing that (timeout, not rpc-timeout) until we find some magic value that works, right?
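
For example, something along these lines (just a sketch with an arbitrary value, to illustrate which timeout I mean):

const { polkadot, assetHub } = await setupNetworks({
    // the `timeout` here is for the test-side ws provider talking to the
    // chopsticks instance, as opposed to the chopsticks-side rpc-timeout
    polkadot: { endpoint: 'ws://localhost:9945', port: 8000, timeout: 600000 },
    assetHub: { endpoint: 'ws://localhost:9944', port: 8001, timeout: 600000 },
})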

x3c41a avatar Apr 03 '25 06:04 x3c41a

from the provided log, dev_newBlock was called 14 times and only 13 calls returned, so the last one never returned. also there aren't any logs until the timeout, which suggests the chopsticks instance was busy

it is possible there is some bug in chopsticks somewhere that somehow makes some promise never resolve

I need to be able to reproduce this issue to investigate further

xlc avatar Apr 03 '25 06:04 xlc

Thank you for looking into this! These are the steps for reproducing the issue:

clone the ahm-dryrun repo and check out the issue branch

git clone https://github.com/paritytech/ahm-dryrun.git
cd ahm-dryrun
git checkout issue

The test that leads to the failure is in index.ts. You need a polkadot-parachain node running, with the Polkadot RPC port on 9945 and the Polkadot Asset Hub RPC port on 9944. I think it's also OK to connect to remote endpoints; if you want to do that, you need to edit index.ts.

I pushed the wasm runtimes into the issue branch, so now you should just be able to do

just run

or

bun test ./index.ts --timeout=22000000

This will eventually fail. If the timeout is 10 min, it will probably fail in about 15 min.

If you want to build the wasms yourself, you can do:

git submodule update --init
cd runtimes
git checkout dev-asset-hub-migration
cd ../
just build-polkadot

serban300 avatar Apr 03 '25 07:04 serban300

You need a polkadot-parachain node running.

Small nits:

You can easily start the node from the ahm-dryrun repo by running:

just build-omni-node
just run-chain

It will use ports 9944 and 9955 for the Relay Chain and AH respectively.

If you want to use remote endpoints, just change this and this line; we already have everything you need in .env.

and if you want to build the wasms yourself, you can also use just build-polkadot; it will put the wasms under the runtime_wasm directory

x3c41a avatar Apr 03 '25 09:04 x3c41a

I tried the script; after making some changes it seems to run fine.

TL;DR: add a block number so that the db cache can be reused, and use network.chain.newBlock instead of network.dev.newBlock.

The dev.* methods call the dev_* rpc via ws. This means they can time out if the new-block action takes too long. The chain.* methods invoke the Chopsticks instance directly and won't have that issue.

import { setupNetworks } from '@acala-network/chopsticks-testing'

const {polkadot, assetHub} = await setupNetworks({
    polkadot: {
        endpoint: 'wss://polkadot.rpc.permanence.io',
        // resumes from the highest block.
        // resume: true,
        'wasm-override': 'polkadot_runtime.compact.compressed.wasm',
        // 'prefetch-storages': ['0x'],
        'runtime-log-level': 5,
        db: './dbs/polkadot.sqlite',
        port: 8000,
        timeout: 60000,
        block: 25465700
    },
    assetHub: {
        endpoint: 'wss://polkadot-asset-hub-rpc.polkadot.io',
        // resumes from the highest block.
        // resume: true,
        'wasm-override': 'asset_hub_polkadot_runtime.compact.compressed.wasm',
        // 'prefetch-storages': ['0x'],
        'runtime-log-level': 5,
        db: './dbs/polkadot-asset-hub.sqlite',
        port: 8001,
        timeout: 60000,
        block: 8579088
    },
});

console.log('Before setStorage');
await polkadot.dev.setStorage({
    rcMigrator: {
        rcMigrationStage: 'AccountsMigrationInit',
    },
})

await assetHub.dev.setStorage({
    ahMigrator: {
        ahMigrationStage: 'DataMigrationOngoing',
    },
});

await polkadot.chain.newBlock(); // Init -> Ongoing
await assetHub.chain.newBlock();


let rcMigrationStage = (await polkadot.api.query.rcMigrator.rcMigrationStage()).toHuman();
let rcStageName = rcMigrationStage ? Object.keys(rcMigrationStage)[0] : null;

console.log('11111111 rcMigrationStage before: ', rcMigrationStage);
while (rcStageName == 'AccountsMigrationOngoing') {
    console.log('11111111 before polkadot chain new block')
    await polkadot.chain.newBlock();
    console.log('11111111 after polkadot chain new block')
    await assetHub.chain.newBlock();
    console.log('11111111 after assetHub chain new block')

    rcMigrationStage = (await polkadot.api.query.rcMigrator.rcMigrationStage()).toHuman();
    rcStageName = rcMigrationStage ? Object.keys(rcMigrationStage)[0] : null;
    console.log('11111111 rcMigrationStage: ', rcMigrationStage);
    // asset hub's stage stays the same, no need to verify.
}

console.log('migration has finished');

xlc avatar Apr 07 '25 01:04 xlc

ok that's the culprit (ignoring the timeout, which only adds confusion): as I previously identified, one of the block builds started but never finished.

On the 13th block at AH, 8579101, the finalization hook never completes. Most likely an infinite loop. I guess I could add some timeout logic to prevent an infinite loop in wasm, but because storage fetching can take a very long time, it is not really possible to determine a reasonable timeout value.

So there isn't much that can be done here on the Chopsticks side.

xlc avatar Apr 07 '25 02:04 xlc

Hey @xlc

I was able to start the migration with zombie-bite (a tool similar to chopsticks, but it produces blocks at a regular 6s cadence rather than on-demand like chopsticks does). There I was able to start and run the migration (over 200 blocks so far) without any issue. It makes me think the issue is on chopsticks' side. Will you have some spare cycles to look deeper into it? If I can help in any way, just let me know!

x3c41a avatar Apr 08 '25 14:04 x3c41a