
Parachain block time degradation since v0.9.16 on Kusama

Open crystalin opened this issue 3 years ago • 42 comments

We are observing a pretty impactful block time degradation on Moonriver since the deployment of client v0.9.16 on Kusama. Additionally, a few days later we upgraded Moonriver to support v0.9.16, but that had no impact.

We have been able to reproduce this on our public Alphanet (20 controlled validators, on v0.9.16). Here are the logs for a given block not being included, #1690751: alphanet-block-degradation-block-1690751.log

Here is the graph showing the average block time over the last 7 days: image (the spike is only a restart of our nodes; ignore it)

crystalin avatar Feb 11 '22 21:02 crystalin

On our Alphanet, where we also observe the bug, we can see some important changes in the Validator <-> Collator peer system: image

crystalin avatar Feb 12 '22 15:02 crystalin

Looking more deeply into the connections, here are 2 of the validators (modified to have a collation peer set of 500). image The surprising part is that each establishes 100+ connections almost instantly and then drops them.

crystalin avatar Feb 12 '22 17:02 crystalin

I've done some tests, and I suspect the nodes are getting overloaded with connections at critical times. Using an unoptimized binary makes the results worse as well. That would explain why the Kusama network is doing worse than Polkadot at the moment.

crystalin avatar Feb 12 '22 18:02 crystalin

It might be related to some optimizations @eskimor has done around collation networking

crystalin avatar Feb 12 '22 18:02 crystalin

Yeah, we should try a branch with those commits removed.

5b74841ea76195fb804a02c0336d05d9a2d3091b , 3ae71a552256273c6a82c3824451fb73423b2232, 7311a3aa9f83730a079b0899c34bb813dd96c4e1 all seem like viable culprits.

rphmeier avatar Feb 13 '22 01:02 rphmeier

https://github.com/paritytech/polkadot/pull/4640/files#r805223564 Looks suspicious.

I pushed a branch reverting this fix to https://github.com/paritytech/polkadot/tree/rh-revert-connectivity-fix if you'd like to try it out.

rphmeier avatar Feb 13 '22 01:02 rphmeier

@rphmeier Ok, I'm trying it now. I'll see if this can be deployed now or will have to wait until tomorrow morning.

crystalin avatar Feb 13 '22 02:02 crystalin

@rphmeier Can it be applied only to validators, or do the collators also need it?

crystalin avatar Feb 13 '22 02:02 crystalin

@crystalin Specifically the collators need it. This should only affect collator logic.

That might be an argument against this being the underlying cause, because AFAIK it began even before the collators were upgraded.

rphmeier avatar Feb 13 '22 02:02 rphmeier

@rphmeier Yes, it began while we were still on 0.9.13 collators. It also began precisely when the Kusama validators started to use v0.9.16. Also, we can't easily upgrade all our collators to a custom version; Purestake only controls 8 collators out of 80.

crystalin avatar Feb 13 '22 02:02 crystalin

We noticed increased block times since the Kusama nodes upgraded to 0.9.16. We released Karura with 0.9.16 two days ago and saw block times increase further, but we are unsure whether that is related to our upgrade or to more Kusama nodes having upgraded.

xlc avatar Feb 13 '22 02:02 xlc

Thank you @xlc. We also observed a small increase on some of our networks when the collators upgraded to support v0.9.16, but it was very small compared to the increase in block time since Kusama upgraded to v0.9.16.

image (This is on Moonbeam/Polkadot, where some collators upgraded to support v0.9.16.) The spike is the upgrade; there is a small increase in the average after it.

crystalin avatar Feb 13 '22 03:02 crystalin

@rphmeier I tried to look more at the charts. I confirm the connection spikes happen when the validator gets into the parachain core set. I initially had the idea that it could have been triggered by other parachains migrating to v0.9.16 on Kusama, but I don't think that is the case anymore. One thing I'd like to verify, but don't have the full data for, is whether the validator/collator established connections started to rise/spike when the validators were upgraded on Kusama (to confirm whether it is a relay or parachain issue). Let me know if you can check that.

crystalin avatar Feb 14 '22 00:02 crystalin

We can confirm that we observe a similar issue in Calamari.

stechu avatar Feb 14 '22 17:02 stechu

https://github.com/paritytech/substrate/compare/polkadot-v0.9.15...polkadot-v0.9.16

These are all Substrate changes from that period

https://github.com/paritytech/polkadot/compare/v0.9.15...v0.9.16

It doesn't look like libp2p changed at all; both versions' Cargo.lock point to v0.40.0.

These are all Polkadot changes.

It seems that we're looking for stuff that is relevant to the validator nodes and excluding the runtime. That narrows things down a decent bit.

  • There are a few changes to PVF execution that could be investigated.
  • There was a sync change to better support the disputes fork-choice rule; that might have had unintended side effects.
  • Changes to peer-set management for full/light nodes.
  • Changes to substrate-executor stack guards.
  • Sync changes relevant to fast/warp sync, possibly introducing bugs.
  • Maybe other stuff I'm missing.

rphmeier avatar Feb 14 '22 23:02 rphmeier

@rphmeier I've double-checked again today: on our Alphanet, the degradation didn't start at the relay client upgrade to v0.9.16 but the day after, when we upgraded our collators to v0.9.16. This contradicts what was observed on Moonriver/Kusama (where the degradation happened a week before we upgraded the collators to v0.9.16).

I'm not sure what could trigger those 2 scenarios. Perhaps some networking condition was triggered earlier on Kusama due to its size, and was then reinforced when the parachain clients upgraded to v0.9.16 (which would be what triggered it on Alphanet).

@rphmeier is it safe to revert validators to v0.9.15 on Alphanet with runtime 9160? (I know there is a new host, but I'm not sure if/when it is actually used)

crystalin avatar Feb 15 '22 00:02 crystalin

is it safe to revert validators to v0.9.15 on Alphanet with runtime 9160? (I know there is a new host, but I'm not sure if/when it is actually used)

I don't think so.

rphmeier avatar Feb 15 '22 00:02 rphmeier

@rphmeier Also, are there any logs (we could enable debug for all modules for a short amount of time) on our validators that would help identify the issue?

crystalin avatar Feb 15 '22 00:02 crystalin

Can we get a graph on Grafana indicating the amount of reorgs?

rphmeier avatar Feb 15 '22 01:02 rphmeier

   1511 - 07
   1720 - 08
   1846 - 09
   1776 - 10
   1739 - 11
   1736 - 12
   1690 - 13
   1716 - 14

Those are the re-org counts per day for Alphanet (count, then day of February). The relay v0.9.16 was deployed on Feb 09, and the parachain v0.9.16 on Feb 10.

Graph of all re-orgs from the collators on Alphanet: image Nothing seems unusual.

crystalin avatar Feb 15 '22 01:02 crystalin

@rphmeier I realize that the re-org message actually only applies in some cases; the problematic one, where multiple blocks of the same height (best) are received, does not trigger a re-org:

Feb 15 01:50:49 moonzone alphanet[715669]: 2022-02-15 01:50:49 [🌗] 💤 Idle (50 peers), best: #1703428 (0x7662…d0a3), finalized #1703428 (0x7662…d0a3), ⬇ 2.6kiB/s ⬆ 3.4kiB/s
Feb 15 01:50:49 moonzone alphanet[715669]: 2022-02-15 01:50:49 [Relaychain] 💤 Idle (49 peers), best: #3765514 (0x6a92…ce67), finalized #3765511 (0xc386…804b), ⬇ 45.5kiB/s ⬆ 42.0kiB/s
Feb 15 01:50:50 moonzone alphanet[715669]: 2022-02-15 01:50:50 [🌗] ✨ Imported #1703429 (0x45fe…8833)
Feb 15 01:50:51 moonzone alphanet[715669]: 2022-02-15 01:50:51 [🌗] ✨ Imported #1703429 (0x8031…1937)
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [Relaychain] ✨ Imported #3765515 (0xecaf…b642)
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [Relaychain] Could not retrieve session info from rolling session window relay_parent=0xecaf63d126f6be576673de500912b82f8838505294d4acaf1d6097af218cb642
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [🌗] 💤 Idle (50 peers), best: #1703428 (0x7662…d0a3), finalized #1703428 (0x7662…d0a3), ⬇ 293.7kiB/s ⬆ 1.2MiB/s
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [Relaychain] 💤 Idle (49 peers), best: #3765515 (0xecaf…b642), finalized #3765512 (0x13e4…e778), ⬇ 63.1kiB/s ⬆ 30.1kiB/s

As you can see here, there are two Imported #1703429 entries with different hashes, which usually prevents the parachain from getting its block produced/included in time. I'll see if I can count them.

EDIT: I computed the number of relay imports per day and it didn't change either:

  15413 07
  18184 08
  18283 09
  18374 10
  18329 11
  18261 12
  18244 13
  18228 14
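
For reference, here is a minimal sketch (not the tooling actually used in this thread) of how per-day import counts and the competing imports at the same height could be extracted from journald-style logs like the ones above. The log format, regex, and file path are assumptions; adjust them to your own setup.

```python
import re
import sys
from collections import Counter, defaultdict

# Matches journald-style lines such as:
# Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [Relaychain] ✨ Imported #3765515 (0xecaf…b642)
IMPORT_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2} \[Relaychain\] .*Imported #(\d+) \((0x[0-9a-f…]+)\)"
)

imports_per_day = Counter()           # date -> number of relay imports
hashes_per_height = defaultdict(set)  # height -> distinct (truncated) block hashes

with open(sys.argv[1], encoding="utf-8") as log:
    for line in log:
        m = IMPORT_RE.search(line)
        if not m:
            continue
        day, height, block_hash = m.group(1), int(m.group(2)), m.group(3)
        imports_per_day[day] += 1
        hashes_per_height[height].add(block_hash)

for day, count in sorted(imports_per_day.items()):
    print(f"{count:>7} {day}")

# Heights imported with more than one distinct hash, i.e. the
# "two Imported #N" pattern discussed above (competing relay blocks).
competing = sum(1 for hashes in hashes_per_height.values() if len(hashes) > 1)
print(f"heights with competing imports: {competing}")
```

Run against a saved log file, e.g. `python count_imports.py relay.log` (the script name is only illustrative).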

crystalin avatar Feb 15 '22 01:02 crystalin

Here are the block times for Statemine. On the left is before the upgrade of the relay chain and Statemine runtimes, and on the right is after. It's pretty clear that block production has taken a hit: Screenshot(29)

(There's probably a better graph out there but this is the one I have to hand)

gilescope avatar Feb 15 '22 09:02 gilescope

We'd like to include https://github.com/paritytech/polkadot/pull/4914 into the next release (and collators as well), even though it doesn't explain all of the issues we're observing here.

ordian avatar Feb 15 '22 14:02 ordian

Adding additional information.

We can't reproduce it with small networks (<20 collators). Here are the selected collator counts for the affected networks (there are more collators, but these are the ones producing blocks):

  • Alphanet: 100 (internal relay, 21 validators)
  • Moonriver: 64 (Kusama)
  • Moonbeam: 53 (Polkadot)

(Each network has 300+ nodes.)

Here is the Moonbeam block time over the last few days. It has also degraded: image (spikes are just node reboots, to be ignored)

It matches with the collators deploying v0.9.16 support. We are asking them to revert to the previous version (v0.9.13) to see if it has an impact (until runtime 9160 is deployed on Polkadot).

crystalin avatar Feb 15 '22 14:02 crystalin

Reverting many (not all) collators to v0.19.2 (using Substrate v0.9.13) had an initial impact, but it came back to the same degradation level as before: image (The vertical bar is when our nodes downgraded.)

We are testing reducing the active collators from 100 to 30 on Alphanet to see if there is an impact.

crystalin avatar Feb 15 '22 18:02 crystalin

Waiting a bit more. This is the Moonbeam data: image

crystalin avatar Feb 15 '22 20:02 crystalin

Reducing the Alphanet selected collators from 100 to 30 didn't change the block time degradation.

crystalin avatar Feb 16 '22 01:02 crystalin

We've tweaked one of our internal networks (stagenet) to have 50 collators (with v0.9.16 support) but with only 8 selected to produce blocks (always the same). (The relay is 0.9.16 with the Westend 9160 runtime, composed of 8 validators.)

The block time increased: image (Block time. The spikes are simply the deployment of all 46 added collators.)

That means that even when collators are not producing blocks, they are impacting the block time (@rphmeier).

crystalin avatar Feb 16 '22 02:02 crystalin

@crystalin Just to make absolutely sure we're measuring the same thing, can you please share the Grafana query you're using to produce that "Block Time [60m]" graph?

koute avatar Feb 16 '22 07:02 koute
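
For what it's worth, a "Block Time [60m]" panel presumably plots a trailing 60-minute average of the interval between consecutive parachain blocks. Below is a minimal Python sketch of that computation, assuming you already have (block_number, unix_timestamp) pairs for the chain; how those pairs are obtained, and the exact Grafana query, are not specified in this thread.

```python
from collections import deque

def rolling_block_time(blocks, window_secs=3600):
    """Average seconds between consecutive blocks over a trailing window.

    `blocks` is an iterable of (block_number, unix_timestamp) pairs in
    ascending order; yields (block_number, average_block_time) pairs.
    """
    window = deque()  # (end_timestamp, interval) pairs inside the trailing window
    prev_ts = None
    for number, ts in blocks:
        if prev_ts is not None:
            window.append((ts, ts - prev_ts))
        prev_ts = ts
        # Drop intervals that ended before the start of the trailing window
        while window and window[0][0] < ts - window_secs:
            window.popleft()
        if window:
            yield number, sum(delta for _, delta in window) / len(window)

# Example: a 12-second target block time with one 36-second gap at the end
sample = [(n, n * 12) for n in range(1, 300)] + [(300, 299 * 12 + 36)]
for number, avg in rolling_block_time(sample):
    if number in (100, 200, 300):
        print(number, round(avg, 2))
```

The single long gap only nudges the 60-minute average slightly (from 12.0 to roughly 12.1 seconds in the example), which is why sustained degradation shows up clearly on such a panel while isolated missed slots barely register.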

Screenshot(31)

It looks like the rollback of that PR improves block times on Westmint (unless I have messed up the query). This is looking at the last 7 days on Westmint. The PR rolls back a noisy warning as well, so the right-hand warning graph makes it clear when the rolled-back PR was introduced.

gilescope avatar Feb 16 '22 08:02 gilescope