cumulus
Parachain gateway stuck on transaction revalidation
We ran into a problem during transaction throughput testing on our Kusama parachain: transactions on the gateway nodes appear to get stuck in a revalidation loop.
Context
Our network consists of 7 nodes:
3 collators - parachain-collator-swe🇸🇪, parachain-collator-deu🇩🇪, parachain-collator-ita🇮🇹
3 gateways - parachain-gateway-kor🇰🇷, parachain-gateway-deu🇩🇪, parachain-gateway-usa🇺🇸
1 extra archive node - parachain-archive
Only gateways can have peerings with collators, so there is no direct connectivity between parachain-archive and collators.
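The peering rule can be written down as a small model, which makes it easy to check which direct connections are expected. This is only a sketch: the node names come from the list above, while `mayPeer` and `allowedPeers` are hypothetical helpers, and it assumes collators may also peer with each other.

```typescript
type Role = "collator" | "gateway" | "archive";

interface NodeSpec {
  name: string;
  role: Role;
}

// Node names as listed in the post (country flags omitted).
const nodes: NodeSpec[] = [
  { name: "parachain-collator-swe", role: "collator" },
  { name: "parachain-collator-deu", role: "collator" },
  { name: "parachain-collator-ita", role: "collator" },
  { name: "parachain-gateway-kor", role: "gateway" },
  { name: "parachain-gateway-deu", role: "gateway" },
  { name: "parachain-gateway-usa", role: "gateway" },
  { name: "parachain-archive", role: "archive" },
];

// Assumed rule: every pair may peer except archive <-> collator.
function mayPeer(a: Role, b: Role): boolean {
  const archiveToCollator =
    (a === "archive" && b === "collator") ||
    (a === "collator" && b === "archive");
  return !archiveToCollator;
}

// Names of all nodes that `node` is allowed to peer with directly.
function allowedPeers(node: NodeSpec, all: NodeSpec[]): string[] {
  return all
    .filter((other) => other.name !== node.name && mayPeer(node.role, other.role))
    .map((other) => other.name);
}
```

Under this model, `allowedPeers` for parachain-archive returns only the three gateways, so anything it submits has to be relayed through a gateway before a collator sees it.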
The numbers in parentheses below refer to points on a graph of the txpool_validations_scheduled node metric.

During our testing, we spam one gateway at a time with a large number of balances.transfer calls. The client was located in Europe for every step below except (3). (Load script code: benchmark.ts · GitHub)
We started with parachain-gateway-usa🇺🇸 as our first target, and everything went smoothly; collators handled every transaction (first spike on the graph, 1).
Then we proceeded with parachain-gateway-kor🇰🇷, and everything went well (second spike, 2).
Then we retried with parachain-gateway-usa🇺🇸 again, but with the client located in North America, resulting in success (third spike, 3). The sender location doesn’t make a difference.
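For readers without access to the linked script, the general shape of this kind of load can be sketched as follows. This is not the benchmark.ts from the post: `floodGateway` and the `Submit` callback are hypothetical names. In a real script, `submit` would wrap a signed balance transfer sent to one gateway's RPC endpoint.

```typescript
// Hypothetical callback: submits one transaction carrying the given nonce
// and resolves once the node has accepted it into its pool.
type Submit = (nonce: number) => Promise<void>;

interface LoadResult {
  accepted: number;
  rejected: number;
}

// Fires `count` submissions with consecutive nonces, starting at `startNonce`,
// without waiting for block inclusion between them, so the pool fills up
// faster than the collators can drain it.
async function floodGateway(
  submit: Submit,
  startNonce: number,
  count: number,
): Promise<LoadResult> {
  const pending: Promise<boolean>[] = [];
  for (let i = 0; i < count; i++) {
    // Map each outcome to true/false so one rejection does not abort the batch.
    pending.push(submit(startNonce + i).then(() => true, () => false));
  }
  const outcomes = await Promise.all(pending);
  const accepted = outcomes.filter((ok) => ok).length;
  return { accepted, rejected: count - accepted };
}
```

Assigning nonces explicitly on the client matters for this kind of test: all transactions from one sender form a chain, and a single missing nonce leaves every later transaction waiting in the pool.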
But then, interesting things started to happen.

While testing parachain-gateway-deu🇩🇪 (orange line on the graph), we filled the transaction pool (4), and the collators executed a few of the transactions. The rest then got stuck in a loop on the gateway, being revalidated and moving from the Ready state to the Future state and back again (spiky orange line on the graph, 5).
Then we restarted parachain-gateway-kor🇰🇷 (6), collators executed another part of the initial batch of transactions, and the rest were stuck in the same loop again. Now there are two nodes in this loop: parachain-gateway-deu🇩🇪 and parachain-gateway-kor🇰🇷. (7)
Then we restarted parachain-gateway-usa🇺🇸, with the same result as for parachain-gateway-kor🇰🇷 (8).
Then we restarted parachain-archive (9), and no transactions were executed at all. So restarting a gateway seems to work like a pump: the restarted gateway gathers part of the transaction pool, manages to send some of those transactions to a collator, and then stops for some reason.
And finally, we restarted parachain-collator-deu🇩🇪 (10), and every transaction was finally processed (11).
This behaviour is reproducible, and we have found that adding 100 ms of latency to the parachain-gateway-deu🇩🇪 network connection makes the issue go away.
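To make the Ready/Future flapping in step (5) easier to reason about, here is a simplified model of how a nonce-ordered pool sorts one account's pending transactions. It is a sketch, not Substrate's pool code: `classify` and `PoolView` are hypothetical names, and it assumes unique nonces from a single sender. Whether this mechanism explains the loop above is not established.

```typescript
interface PoolView {
  ready: number[];  // nonces that can be included next, in order
  future: number[]; // nonces waiting for an earlier nonce to arrive
  stale: number[];  // nonces already consumed on chain
}

// Classifies pending nonces against the account nonce seen at the block
// the pool is validating against.
function classify(pending: number[], chainNonce: number): PoolView {
  const view: PoolView = { ready: [], future: [], stale: [] };
  let next = chainNonce; // nonce the next ready transaction must carry
  for (const nonce of pending.slice().sort((a, b) => a - b)) {
    if (nonce < chainNonce) {
      view.stale.push(nonce);
    } else if (nonce === next) {
      view.ready.push(nonce);
      next += 1;
    } else {
      view.future.push(nonce); // gap before this nonce
    }
  }
  return view;
}

// Same pending set, validated against two different views of the chain.
const pendingNonces = [10, 11, 12, 13];
const viewAtNonce10 = classify(pendingNonces, 10); // all four are ready
const viewAtNonce9 = classify(pendingNonces, 9);   // all four are future
```

In this model, the same set of transactions is entirely Ready or entirely Future depending on which account nonce the node validates against. A node whose view of the best block keeps changing would therefore keep moving them between the two queues.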
Summary
parachain-gateway-deu🇩🇪 gets stuck under load, and only a collator restart resolves the issue. We tested with transactions sent from both Europe and North America (to parachain-gateway-deu🇩🇪 in both cases) with the same result: the sender location doesn’t make a difference, and only transactions sent to parachain-gateway-deu🇩🇪 get stuck.
What might be the cause of this behaviour? What can we do to prevent a malicious actor from putting our gateways into this state?
Is it possible to reproduce this locally?
Are all gateways connected to each other? And is each collator connected only to its own gateway?
This can be reproduced locally, but less consistently. It is crucial to have a gateway (= non-authoring node with archive pruning) in the local parachain network and to send the transactions to it; if the benchmark targets a collator node, no freeze occurs.
All nodes peer with each other, except that parachain-archive has no peerings with the collators.
Can you provide your script to run this benchmark? And maybe some kind of docker compose file?