
High scale testing MVP


We're looking at writing an integration test suite that focuses on performance testing, more specifically on a list of key indicators that are covered in https://github.com/paritytech/polkadot-sdk/issues/874. The current design of Zombienet's configuration and DSL makes it easy to write tests for single-digit-sized networks and provides very explicit primitives for testing metrics and logs (alice: parachain 100 block height is at least 10 within 200 seconds). I'll focus on what I think we need to implement to make writing tests easy for scenarios at least an order of magnitude larger.

I'm breaking everything down into two parts: test configuration and the DSL.

Test configuration

In the context of higher scale, the goal is to enable the configuration to be defined in bulk, so that we don't need to describe individual validators and their configuration (binary and args), which is cumbersome for, say, 100 validators.

Where we are at

  • The number of validator nodes we can spawn is currently constrained by the built-in identities we can use: Alice, Bob, Charlie, Dave, Ferdie, One, Two.
  • All relay chain nodes need to be defined individually under the relaychain section.
  • Parachains and collators need to be individually defined as well (a rough sketch of today's shape follows this list).
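
For illustration, this is roughly what the per-node style looks like today, expressed as a JS object. The real Zombienet config is TOML/JSON and the key names may differ slightly, so treat this as a sketch rather than the exact schema.

```js
// Approximate shape of today's per-node configuration, written as a JS object
// for illustration; actual Zombienet key names/format (TOML or JSON) may differ.
const currentConfig = {
  relaychain: {
    chain: "rococo-local",
    nodes: [
      { name: "alice", command: "polkadot", args: [] },
      { name: "bob", command: "polkadot", args: [] },
      // ...one entry per validator, which gets cumbersome at 100 nodes
    ],
  },
  parachains: [
    { id: 100, collator: { name: "collator01", command: "polkadot-collator" } },
  ],
};
```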

Where we want to be

  • We can create as many authority key pairs as we want and include them in the genesis block, so we can spawn as many validators as needed.
  • We can define relay chain nodes in bulk using a three-step approach (sketched after this list):
    • The total validator count is specified in the DSL file, not in the configuration file. The idea is that this lets us reuse a configuration file for differently sized networks. It would require us to define group sizes as percentages rather than absolute values, and it might also require a different way of asserting metrics, one that is percentile based as well (P50, P90, P99, etc.). More details on how metric values can be tested in this scenario are in the DSL section below.
    • A relay chain node configuration, which specifies the binary and args.
    • A relay chain node group, which defines how many nodes are to be spawned in that group and which node configuration to use. The group size can be given either as a number of nodes or as a percentage of the total.
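
A rough sketch of what the grouped definition could look like, again as a JS object. None of these keys (nodeConfigs, nodeGroups, percentage) exist in Zombienet today; they only illustrate the proposed split between a reusable node configuration and a sized group.

```js
// Hypothetical grouped configuration; the keys below are illustrative only
// and are not part of the existing Zombienet schema.
const proposedConfig = {
  relaychain: {
    chain: "rococo-local",
    // Reusable node configurations: binary + args, no per-node identities.
    nodeConfigs: {
      default: { command: "polkadot", args: [] },
      verbose: { command: "polkadot", args: ["-lparachain=debug"] },
    },
    // Groups reference a node configuration and a size, given either as an
    // absolute count or as a percentage of the total validator count
    // (the total itself would live in the DSL file, not here).
    nodeGroups: [
      { name: "group1", config: "default", percentage: 90 },
      { name: "group2", config: "verbose", percentage: 10 },
    ],
  },
};
```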

Test scenario (DSL)

The goal is to enable writing test assertions that look at groups of validators rather than a single one.

Where we are at

  • Metrics and logs can be asserted on using natural language that includes a node target, a condition and a timeout.
  • Code upgrades can be triggered.
  • Backchannel functionality can pull information from inside the nodes (I haven't actually tried this yet).
  • External custom JavaScript scripts can be run in the context of a validator connection; scripts implement connect/run callbacks (a rough sketch follows this list).
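
A minimal sketch of such a script. The issue only says scripts implement connect/run callbacks, so the parameter and return shapes below (a websocket endpoint handed to connect, its return value handed to run) are assumptions, not the actual Zombienet script API.

```js
// Hypothetical external test script; connect/run callback names come from the
// issue text, but the exact parameters/return values are assumptions.
const { ApiPromise, WsProvider } = require("@polkadot/api");

module.exports = {
  // Assumed to receive the node's websocket endpoint and return a connection.
  async connect(wsEndpoint) {
    return ApiPromise.create({ provider: new WsProvider(wsEndpoint) });
  },
  // Assumed to receive whatever connect() returned; the returned boolean
  // decides whether the assertion passes.
  async run(api) {
    const header = await api.rpc.chain.getHeader();
    return header.number.toNumber() > 0;
  },
};
```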

Where we want to be

  • Separate DSL sections for initialization/sanity checks.
  • Ability to write loops, so that we don't duplicate test code.
  • A catch-all matcher: All is up.
  • Validator group matching of metrics - ValidatorGroup1: parachain 100 block height is at least 10 within 300 seconds.
  • Percentage-based assertion of metrics - ValidatorGroup1(P90): reports polkadot_parachain_disputes_finality_lag is at most 1 within 300 seconds. This example would ensure that at least 90% of the validators in the group report a dispute finality lag of at most 1 block (a sketch of how such an assertion could be evaluated follows this list).
  • Logs get the same matching features as metrics.
  • JS script improvements - we could embed JS directly in the test file to make things more readable, while keeping support for external scripts, perhaps as dependencies of the test file rather than a line in it. Scripts should also get some overall context at the test level (essentially test-global variables covering the network topology, RPC call helpers, logs and metrics clients, and user-defined variables) plus access to the Zombienet API. This would reduce the need to complicate the natural language with things that can easily be done in JavaScript.
  • I'm on the fence about whether and how backchannel-related features would work in this scenario. It would be great to hear some input on this from @pepoviola.
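
As a rough sketch of the evaluation behind such a percentage-based assertion, assuming a metrics client has already produced one value per node in the group (all names below are illustrative, not part of the current DSL):

```js
// Sketch of evaluating a percentage-based group assertion: the check passes
// if at least `percentage` percent of the nodes satisfy the predicate.
function groupAssert(valuesByNode, percentage, predicate) {
  const nodes = Object.keys(valuesByNode);
  const passing = nodes.filter((name) => predicate(valuesByNode[name])).length;
  return passing / nodes.length >= percentage / 100;
}

// "ValidatorGroup1(P90): polkadot_parachain_disputes_finality_lag is at most 1"
// would then translate to something like:
const lagByNode = { "validator-1": 0, "validator-2": 1, "validator-3": 4 };
const ok = groupAssert(lagByNode, 90, (lag) => lag <= 1); // false: only 2 of 3 pass
```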

Issues and other improvements

I've stumbled upon some issues or missing functionality:

  • I could not match on any collator metrics or logs. The node name is collator01, but the actual name I have to reference in the test is collator01-1, and even that still fails.
  • My test failed to parse alice: reports polkadot_pvf_preparation_time_bucket{le="1"} is at least 1.
  • Full metrics parsing support - working with histograms requires inspecting two buckets and diffing them, since bucket values are cumulative (a small illustration follows this list). It would be great not to have to do this by hand.
  • Ability to spawn nodes in parallel rather than sequentially. This would speed up test ramp-up time (the time it takes for everybody to "be up" and the actual test to start), especially for higher scale tests.
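
To illustrate the histogram point, the bucket values below are hard-coded; in practice they would come from the node's Prometheus endpoint.

```js
// Prometheus histogram buckets are cumulative, so "how many observations fell
// between 1s and 2s" requires diffing two buckets.
const pvfPreparationTimeBuckets = {
  "0.5": 10, // observations taking <= 0.5s
  "1": 25,   // observations taking <= 1s (includes the 10 above)
  "2": 40,   // observations taking <= 2s (includes the 25 above)
};

// Observations strictly between 1s and 2s: diff the two cumulative counts.
const between1and2 =
  pvfPreparationTimeBuckets["2"] - pvfPreparationTimeBuckets["1"]; // 15

// The ask in this issue is for the DSL to support this directly, so tests
// don't have to do the subtraction by hand in a JS script.
```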

CI integration

It doesn't seem like a good idea to run these tests as part of the per-PR pipeline, because of their long duration and the high cost of scaling the Kubernetes cluster. My proposal is to run a subset of small-scale variants of the tests on the PR pipeline and run the high scale tests at release checkpoints or on an as-needed basis.

That being said, it looks like a lot of work, and at the same time we want to run these high scale tests sooner rather than later. My proposal is to build this incrementally, starting with what I consider to be the MVP:

  • [ ] Add support to scale validators and collators/parachains (https://github.com/paritytech/zombienet/issues/78) - this should include the ability to generate as many authority key pairs as needed, plus grouping.
  • [ ] Full metrics support (including resolving some of the issues described above) along with group matching.
  • [ ] Group log matching support
  • [ ] Javascript improvements described above
  • [ ] CI: Separate PR and Release pipelines

Link to a branch with a sample test and some comments to add more context: TBD.

sandreim avatar Jan 25 '22 11:01 sandreim

I think we need to split this up into a PR pipeline and a release pipeline. All open issues should point to issues that add the additional context required for implementation. Percentage-based logs are a nice-to-have, since those tests should be rather deterministic, so this is a bit of a longer shot, but it goes hand in hand with scaling up the number of validators and grouping.

drahnr avatar Jan 25 '22 12:01 drahnr

Hi @sandreim, thanks for the feedback. I think there are several things to work on in this issue, but the priority is to add support for scaling the network easily, right? The validation groups sound great. Let me start working on the syntax for supporting this and we can use that as a starting point.

Thanks!

pepoviola avatar Jan 25 '22 13:01 pepoviola


Yes, validator groups and being able to spin up many validators (not just the limited set we have now) and parachains. Other than that, launching them in parallel would also help us iterate faster during development, so that would be good to start with too.

sandreim avatar Jan 25 '22 13:01 sandreim