Parachain gateway stuck on transaction revalidation

Open CertainLach opened this issue 2 years ago • 3 comments

We stumbled upon a problem during transaction throughput testing on our Kusama parachain: transactions get stuck in a revalidation loop on gateway nodes (?).

Context

Our network consists of 7 nodes:

3 collators - parachain-collator-swe🇸🇪, parachain-collator-deu🇩🇪, parachain-collator-ita🇮🇹
3 gateways - parachain-gateway-kor🇰🇷, parachain-gateway-deu🇩🇪, parachain-gateway-usa🇺🇸
1 extra archive node - parachain-archive

Only gateways can have peerings with collators, so there is no direct connectivity between parachain-archive and collators.
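
The peering rules can be sketched as a small model. This is only an illustration: the node names come from the list above, while the `nodes` map and the helpers `canPeer` and `hopsToNearestCollator` are hypothetical. It assumes, as clarified later in this thread, that every pair of nodes may peer except parachain-archive and the collators.

```typescript
// Hypothetical model of the topology; not taken from any node configuration.
type Role = "collator" | "gateway" | "archive";

const nodes: Record<string, Role> = {
  "parachain-collator-swe": "collator",
  "parachain-collator-deu": "collator",
  "parachain-collator-ita": "collator",
  "parachain-gateway-kor": "gateway",
  "parachain-gateway-deu": "gateway",
  "parachain-gateway-usa": "gateway",
  "parachain-archive": "archive",
};

// Assumption: the only forbidden pairing is archive <-> collator.
function canPeer(a: string, b: string): boolean {
  const ra = nodes[a];
  const rb = nodes[b];
  if (a === b || ra === undefined || rb === undefined) return false;
  const archiveWithCollator =
    (ra === "archive" && rb === "collator") ||
    (ra === "collator" && rb === "archive");
  return !archiveWithCollator;
}

// Breadth-first search: number of peering hops from `start` to the nearest
// collator, or -1 if none is reachable.
function hopsToNearestCollator(start: string): number {
  const names = Object.keys(nodes);
  const visited: Record<string, boolean> = {};
  const queue: Array<{ name: string; hops: number }> = [{ name: start, hops: 0 }];
  visited[start] = true;
  while (queue.length > 0) {
    const item = queue.shift();
    if (item === undefined) break;
    if (nodes[item.name] === "collator") return item.hops;
    for (const next of names) {
      if (canPeer(item.name, next) && visited[next] !== true) {
        visited[next] = true;
        queue.push({ name: next, hops: item.hops + 1 });
      }
    }
  }
  return -1;
}
```

In this model a transaction submitted to a gateway is one hop from a collator, while one submitted to parachain-archive has to be relayed through a gateway first.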

Here is a graph of the txpool_validations_scheduled node metric:

graph, described below

During our testing, we spam one of the gateways with a large number of balance.transfer calls. The client was located in Europe for all of the following steps except (3). (Load script code: benchmark.ts · GitHub)

We started with parachain-gateway-usa🇺🇸 as our first target, and everything went smoothly; collators handled every transaction (first spike on the graph, 1).

Then we proceeded with parachain-gateway-kor🇰🇷, and everything went well (second spike, 2).

Then we retried with parachain-gateway-usa🇺🇸 again, but with the client located in North America, resulting in success (third spike, 3). The sender location doesn’t make a difference.

But then, interesting things started to happen.

another graph, zoomed version of first, described below

During parachain-gateway-deu🇩🇪 testing (orange line on the graph), we filled the transaction pool with transactions (4), and the collators executed a couple of them. The rest then got stuck in a loop on the gateway, being revalidated and moving from the Ready state to the Future state and vice versa (spiky orange line on the graph, 5).
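
To watch for the same pattern without a dashboard, the metric can be read from the node's Prometheus text output and compared between polls. The following is a minimal sketch, not part of our tooling: `readMetric` and `increments` are hypothetical helpers, and it assumes the node exports a monotonically increasing counter whose name ends in `txpool_validations_scheduled` (the exact prefix may differ between builds).

```typescript
// Returns the first sample whose metric name ends with `nameSuffix` from
// Prometheus text exposition output, or undefined if there is none.
// Simplification: label values that contain "}" are not handled.
function readMetric(exposition: string, nameSuffix: string): number | undefined {
  for (const rawLine of exposition.split("\n")) {
    const line = rawLine.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comment lines.
    if (line === "" || line.charAt(0) === "#") continue;
    const match = /^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)/.exec(line);
    if (match === null) continue;
    const name = match[1] ?? "";
    const isTarget =
      name.length >= nameSuffix.length &&
      name.slice(name.length - nameSuffix.length) === nameSuffix;
    if (!isTarget) continue;
    const value = Number(match[3]);
    if (!isNaN(value)) return value;
  }
  return undefined;
}

// Differences between consecutive polls of a counter.
function increments(samples: number[]): number[] {
  const out: number[] = [];
  let prev: number | undefined = undefined;
  for (const sample of samples) {
    if (prev !== undefined) out.push(sample - prev);
    prev = sample;
  }
  return out;
}
```

If the increments stay well above zero long after the load script has finished, that would correspond to the spiky line at (5): validations keep being scheduled although no new transactions arrive.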

Then we restarted parachain-gateway-kor🇰🇷 (6), collators executed another part of the initial batch of transactions, and the rest were stuck in the same loop again. Now there are two nodes in this loop: parachain-gateway-deu🇩🇪 and parachain-gateway-kor🇰🇷. (7)

Then we restarted parachain-gateway-usa🇺🇸, with the same result as for parachain-gateway-kor🇰🇷 (8).

Then we restarted parachain-archive (9), and no transactions were executed at all. So restarting a gateway works as a pump: the restarted gateway gathers part of the transaction pool, manages to send some of it to the collators, and then stops for some reason?

And finally, we restarted parachain-collator-deu🇩🇪 (10), and every transaction was finally processed (11).

This behaviour is reproducible, and we have found that adding 100 ms of latency to the parachain-gateway-deu🇩🇪 network makes the issue go away.

Summary

parachain-gateway-deu🇩🇪 gets stuck under load, and only a collator restart resolves the issue. We also tested with transactions sent from Europe and from North America (to parachain-gateway-deu🇩🇪 in both cases) with the same result: the sender location doesn’t make a difference, and only transactions sent to parachain-gateway-deu🇩🇪 get stuck.

What might be the cause of this behaviour? What can we do to prevent a malicious actor from getting our gateways stuck in this state?

CertainLach avatar Mar 06 '23 13:03 CertainLach

Is it possible to reproduce this locally?

Are all gateways connected to each other? And is each collator always connected only to its own gateway?

bkchr avatar Mar 06 '23 21:03 bkchr

This can be reproduced locally, but less consistently. It is crucial to have a gateway (= non-authoring node with archive pruning) in the local parachain network and to send the transactions to it; if the benchmark targets a collator node, no freeze occurs.

All nodes have peerings with each other, except that parachain-archive has no peerings with the collators.

CertainLach avatar Mar 07 '23 11:03 CertainLach

Can you provide your script to run this benchmark? And maybe some kind of docker compose file?

bkchr avatar Mar 11 '23 16:03 bkchr