Timeout during new block creation
I'm running a simple script: https://github.com/paritytech/ahm-dryrun/pull/17/files#diff-dcdc3e0b3362edb8fec2a51d3fa51f8fb8af8f70247e06d9887fa934834c9122
It checks the state of the migration and keeps creating new blocks on the Relay Chain and Asset Hub. However, I keep getting RPC timeouts both from a remote endpoint (usually after 2-4 blocks, once it got to 13) and from a local network (usually after 5-7 blocks).
Such a small number of created blocks before the failure looks suspicious. However, the script is simple and I don't see a reason why it's failing, other than https://github.com/paritytech/polkadot-sdk/issues/8035. Though, await dev.newBlock() waits for the new block to be created before sending subsequent requests...
Some details: the Relay Chain withdraws around 700 accounts within one block, splits them into 2 XCM Transact calls and sends them to Asset Hub. Asset Hub processes the XCMs one by one and sends confirmations to the Relay Chain. Once the relay receives all the confirmations, it proceeds with the next batch of accounts.
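Roughly, the loop in the script looks like this (a simplified sketch, not the exact code from the PR; wasm overrides, ports and the setStorage setup are omitted, and the stage handling is approximate):

import { setupNetworks } from '@acala-network/chopsticks-testing'

// Endpoints as used later in this thread; the runtimes need the RC/AH migrator pallets.
const { polkadot, assetHub } = await setupNetworks({
  polkadot: { endpoint: 'wss://polkadot.rpc.permanence.io' },
  assetHub: { endpoint: 'wss://polkadot-asset-hub-rpc.polkadot.io' },
})

const rcStage = async () => {
  const stage = (await polkadot.api.query.rcMigrator.rcMigrationStage()).toHuman()
  return stage && typeof stage === 'object' ? Object.keys(stage)[0] : stage
}

// Keep building blocks on both chains while the RC migrator is in the accounts stage.
let stageName = await rcStage()
while (stageName === 'AccountsMigrationOngoing') {
  await polkadot.dev.newBlock()   // RC withdraws ~700 accounts and sends 2 XCM Transacts to AH
  await assetHub.dev.newBlock()   // AH processes the XCMs and sends confirmations back to the RC
  stageName = await rcStage()
}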
Do you have an idea how to avoid RPC timeouts?
@xlc @ermalkaleci
what RPC node are you using? if it is a local one, what's the CPU usage?
can you run with LOG_LEVEL=trace and share the logs?
maybe you need to enable some logs on the RPC node side and open an issue to polkadot-sdk. it is not really a Chopsticks issue when the RPC times out.
what RPC node are you using? if it is a local one, what's the CPU usage?
I used both remote and local. For the local one, I monitored resource usage with htop: CPU usage was between 0.8% and 1.5% and memory was between 27% and 31%, which seems reasonable.
can you run with LOG_LEVEL=trace and share the logs?
https://gist.github.com/x3c41a/6f44974c5427f90c6778ed1b658984f8
maybe you need to enable some logs on the RPC node side and open an issue to polkadot-sdk.
I was running omni-node with -lruntime=trace but it did not reveal any useful information.
I noticed a spike in block times while running the tests. However, it might just be noise because it happens in roughly 1 out of 5 runs.
UPD: I was also experimenting with subway connected to both RC and AH, and it did not help either.
My colleague mentioned that he had a similar problem with subalfred and that the fix resolved the problem.
The problem was that some values were too big and were timing out, so he reduced the page/batch size to 1 and increased the message & frame sizes. Decreasing the batch size to 1 creates another problem: the migration becomes extremely slow. I tried setting max_batch_size in subway to 1 and running the migration (just to make sure that it helps), but it failed with the same timeout...
I remember overriding chopsticks' batch size to smaller values, but it threw an error saying that the batch size can't be smaller than the page size (or something like that, but I remember it was strongly coupled to the page size), and page_size was a function parameter that was coming from multiple places.
Given that it's a matter of fetching big storage values, can't we just increase the ws timeouts in chopsticks? Decreasing batch_size would be another option, but it would create yet another problem with a slow migration. @xlc what do you think?
made 1.0.4 release with rpc-timeout config
cc: @serban300
I set the timeout to 10 mins and added some logs in the rpc-provider.
The RPC method that is failing is dev_newBlock:
Failing request: dev_newBlock with params:
501 | const handler = this.__internal__handlers[ids[i]];
502 | console.log(`Request in progress: ${handler.method}. Running time: ${(now - handler.start) / 1000} seconds. Params: ${handler.params}.`);
503 | if ((now - handler.start) > this.__internal__timeout) {
504 | try {
505 | console.log(`Failing request: ${handler.method} with params: ${handler.params}`);
506 | handler.callback(new Error(`No response received from RPC endpoint in ${this.__internal__timeout / 1000}s`), undefined);
^
error: No response received from RPC endpoint in 600s
at __internal__timeoutHandlers (/media/serban/data/workplace/sources/chopsticks/node_modules/@polkadot/rpc-provider/ws/index.js:506:38)
✗ test migration [651678.47ms]
From what I understand this is a method defined in chopsticks, so I don't think it's a node RPC server error. Also attaching the entire log.
@xlc @x3c41a Can this be a chopsticks issue? Or maybe a migration issue?
I am checking whether it's a migration issue by running zombie-bite. It just got fixed today, so I am still working through it...
from the logs, there is no response to the last dev_newBlock call. that's unexpected. what's your testing script?
Yes, this can totally happen, but it's not a chopsticks timeout; it's the WsProvider connected to chopsticks calling dev_newBlock while chopsticks is still working on building the new block (maybe because it is reading the entire state).
Now the whole thing makes sense: it's your WsProvider connected to chopsticks that is timing out, obviously because it takes chopsticks a long, long time to go through the entire state.
If I remember correctly, Serban was using this script and also added
timeout: 60000 (not the newly added rpc-timeout) as part of the Relay and AH config.
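For reference, the two settings would sit next to each other in the per-chain config passed to setupNetworks; a minimal sketch (values are illustrative, and I'm assuming rpc-timeout takes milliseconds just like timeout):

import { setupNetworks } from '@acala-network/chopsticks-testing'

const { polkadot } = await setupNetworks({
  polkadot: {
    endpoint: 'wss://polkadot.rpc.permanence.io',
    timeout: 60000,         // the existing option in Serban's Relay/AH config
    'rpc-timeout': 600000,  // the option added in the chopsticks 1.0.4 release mentioned above
  },
})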
Now the whole thing makes sense: it's your WsProvider connected to chopsticks that is timing out, obviously because it takes chopsticks a long, long time to go through the entire state.
You mean the ws provider that is being created as part of the setupNetworks call?
If I understood correctly from the chopsticks implementation:
- chopsticks itself spawns multiple ws connections to the endpoint provided in the config (the local node in this case),
- and it returns a ws connection to the chopsticks instance.
The latter is what's failing, so we should keep increasing that (timeout, not rpc-timeout) until we find some magic value that works, right?
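For context, a sketch of the connection in question: it is the one the test harness opens against the local chopsticks instance, not the connections chopsticks opens to the upstream node. The port and the assumption that the fourth WsProvider argument is the per-request timeout (in milliseconds) are mine:

import { WsProvider } from '@polkadot/api'

// The test talks to the local chopsticks instance (e.g. ws://localhost:8000).
// A per-request timeout that is too small for a slow block build is what produces
// the "No response received from RPC endpoint" error shown above.
const provider = new WsProvider('ws://localhost:8000', 2500, {}, 600000)
await provider.isReady

// dev_newBlock travels over this connection, so it is subject to that timeout,
// regardless of the timeout chopsticks itself uses for its upstream fetches.
await provider.send('dev_newBlock', [])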
from the provided log, dev_newBlock was called 14 times and only 13 calls returned, so the last one never returned. also there aren't any logs before the timeout indicating that the chopsticks instance was busy
it is possible there is some bug in chopsticks that somehow makes some promise never resolve
I need to be able to reproduce this issue to investigate further
Thank you for looking into this! These are the steps to reproduce the issue:
clone the ahm-dryrun repo and checkout the issue branch
git clone https://github.com/paritytech/ahm-dryrun.git
cd ahm-dryrun
git checkout issue
The test that leads to the failure is in index.ts. You need a polkadot-parachain node running, with the Polkadot RPC port on 9945 and the Polkadot Asset Hub RPC port on 9944. Or I think it's also OK to connect to remote endpoints; if you want to do that, you need to edit index.ts.
I pushed the wasm runtimes into the issue branch, so now you should just be able to do
just run
or
bun test ./index.ts --timeout=22000000
And this will eventually fail. If the timeout is 10 min, it will probably fail in about 15 min.
If you want to build the wasms yourself you can do:
git submodule update --init
cd runtimes
git checkout dev-asset-hub-migration
cd ../
just build-polkadot
You need a polkadot-parachain node running.
Small nits:
You can easily start the node from the ahm-dryrun repo by running:
just build-omni-node
just run-chain
It will use ports 9944 and 9955 for the Relay and AH respectively.
If you want to use remote endpoints, just change this line and this one; we already have everything you need in .env.
and if you want to build wasms yourself you can also use
just build-polkadot
it will put the wasms under the runtime_wasm directory
I tried the script, made some changes, and it seems to be running fine.
TLDR: add a block number so that the db cache can be reused, and use network.chain.newBlock instead of network.dev.newBlock.
The dev.* methods call the dev_* rpc via ws. This means they can time out if the new-block action takes too long.
The chain.* methods invoke the Chopsticks instance directly and won't have such an issue.
import { setupNetworks } from '@acala-network/chopsticks-testing'

const { polkadot, assetHub } = await setupNetworks({
  polkadot: {
    endpoint: 'wss://polkadot.rpc.permanence.io',
    // resumes from the highest block.
    // resume: true,
    'wasm-override': 'polkadot_runtime.compact.compressed.wasm',
    // 'prefetch-storages': ['0x'],
    'runtime-log-level': 5,
    db: './dbs/polkadot.sqlite',
    port: 8000,
    timeout: 60000,
    block: 25465700
  },
  assetHub: {
    endpoint: 'wss://polkadot-asset-hub-rpc.polkadot.io',
    // resumes from the highest block.
    // resume: true,
    'wasm-override': 'asset_hub_polkadot_runtime.compact.compressed.wasm',
    // 'prefetch-storages': ['0x'],
    'runtime-log-level': 5,
    db: './dbs/polkadot-asset-hub.sqlite',
    port: 8001,
    timeout: 60000,
    block: 8579088
  },
});

console.log('Before setStorage');

await polkadot.dev.setStorage({
  rcMigrator: {
    rcMigrationStage: 'AccountsMigrationInit',
  },
})
await assetHub.dev.setStorage({
  ahMigrator: {
    ahMigrationStage: 'DataMigrationOngoing',
  },
});

await polkadot.chain.newBlock(); // Init -> Ongoing
await assetHub.chain.newBlock();

let rcMigrationStage = (await polkadot.api.query.rcMigrator.rcMigrationStage()).toHuman();
let rcStageName = rcMigrationStage ? Object.keys(rcMigrationStage)[0] : null;
console.log('11111111 rcMigrationStage before: ', rcMigrationStage);

while (rcStageName == 'AccountsMigrationOngoing') {
  console.log('11111111 before polkadot chain new block')
  await polkadot.chain.newBlock();
  console.log('11111111 after polkadot chain new block')
  await assetHub.chain.newBlock();
  console.log('11111111 after assetHub chain new block')
  rcMigrationStage = (await polkadot.api.query.rcMigrator.rcMigrationStage()).toHuman();
  rcStageName = rcMigrationStage ? Object.keys(rcMigrationStage)[0] : null;
  console.log('11111111 rcMigrationStage: ', rcMigrationStage);
  // asset hub's stage stays the same, no need to verify.
}

console.log('migration has finished');
ok that's the culprit (ignoring the timeout, which only adds confusion): as I previously identified, one of the block builds started but never finished.
On the 13th block at AH, 8579101, the finalization hook never completes. Most likely an infinite loop. I guess I could add some timeout logic to prevent an infinite loop in the wasm, but because storage fetching can take a very long time, it is not really possible to determine a reasonable timeout value.
So there isn't much that can be done here on the Chopsticks side.
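A deadline could also be added on the test side so that a hung block build fails loudly instead of hanging until the provider times out; a rough sketch (the helper and the 10-minute budget are mine, not part of chopsticks, and it only surfaces the hang rather than stopping the underlying block build):

// Hypothetical helper (not part of chopsticks): reject if a block build never completes.
async function newBlockWithDeadline(chain: { newBlock: () => Promise<unknown> }, ms = 600000) {
  let timer: ReturnType<typeof setTimeout> | undefined
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`newBlock did not complete within ${ms / 1000}s`)), ms)
  })
  try {
    return await Promise.race([chain.newBlock(), deadline])
  } finally {
    clearTimeout(timer)
  }
}

// Usage inside the migration loop, e.g.:
//   await newBlockWithDeadline(assetHub.chain)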
Hey @xlc
I was able to start the migration with zombie-bite (a tool similar to chopsticks, but it produces blocks at a regular 6s cadence rather than on-demand as chopsticks does). There I was able to start and run the migration (over 200 blocks so far) without any issue. It makes me think the issue is on chopsticks' side. Will you have some spare cycles to look deeper into it? If I can help in any way, just let me know!