flow-go
Loader - Variable TPS
Problem Definition
In order to better understand the limits of the network's TPS, we want the loader to dynamically increase/decrease its transaction load rather than having to run a separate test and manually update our TPS targets.
Proposed Solution
The loader starts at a small TPS and, as long as the network handles the load gracefully, periodically increases its load until we start seeing failures, either failed transactions or transactions taking longer than a certain amount of time. The loader then temporarily decreases its load until we find the balancing point where our input TPS and output TPS match.
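A minimal sketch of what that feedback loop could look like; the loaderStats type, its fields, and the step sizes are illustrative assumptions, not existing flow-go loader APIs:

```go
package main

import (
	"fmt"
	"time"
)

// loaderStats is a hypothetical summary of one measurement window.
type loaderStats struct {
	inputTPS     float64
	outputTPS    float64
	failedTxs    int
	maxExecDelay time.Duration
}

// healthy reports whether the window was handled gracefully: no failed
// transactions and no transaction taking longer than maxDelay from
// arrival to execution.
func (s loaderStats) healthy(maxDelay time.Duration) bool {
	return s.failedTxs == 0 && s.maxExecDelay <= maxDelay
}

// adjustTPS is the feedback step: ramp up while the network keeps up,
// back off when failures appear or output TPS falls noticeably behind
// input TPS, so the loader settles where the two roughly match.
func adjustTPS(current float64, s loaderStats, step float64, maxDelay time.Duration) float64 {
	if !s.healthy(maxDelay) || s.outputTPS < s.inputTPS-step {
		return current - step
	}
	return current + step
}

func main() {
	target := 50.0 // start small, e.g. 50 TPS
	window := loaderStats{inputTPS: 50, outputTPS: 50, maxExecDelay: 4 * time.Second}
	target = adjustTPS(target, window, 10, 10*time.Second)
	fmt.Println("next target TPS:", target)
}
```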
Definition of Done
The loader should be able to run a CI test where it starts at a low value (e.g. 50 TPS) and increases that value over the course of a test run. We should receive our normal BigQuery output that can be uploaded and then graphed in Grafana.
Some comments based on my own experience running the loader for localnet load tests:
Somehow this needs to be calibrated to the variance in output TPS between identically configured runs. For example, say I run an identically configured localnet on the same 48-core box, 10 times in a row, at 400 input TPS. Each run has no failures, but the output TPS after 25 minutes of load varies between 301 and 320 TPS. I tried the same setup with hyper-threading off and running on a RAM drive, and each run varies between 383 and 393 TPS. So higher TPS, but still a variance of roughly 10 TPS over a series of runs.
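For concreteness, that run-to-run variance could be quantified with a simple mean and sample standard deviation over repeated runs; the sample values below are made up, only chosen to be in the ballpark of the RAM-drive figures above:

```go
package main

import (
	"fmt"
	"math"
)

// meanStd returns the mean and sample standard deviation of the measured
// output TPS across identically configured runs.
func meanStd(samples []float64) (mean, std float64) {
	for _, s := range samples {
		mean += s
	}
	mean /= float64(len(samples))
	for _, s := range samples {
		std += (s - mean) * (s - mean)
	}
	std = math.Sqrt(std / float64(len(samples)-1))
	return mean, std
}

func main() {
	// Made-up output TPS values for 10 identically configured runs.
	runs := []float64{383, 385, 387, 390, 393, 384, 389, 391, 386, 388}
	mean, std := meanStd(runs)
	fmt.Printf("output TPS: mean %.1f, stddev %.1f\n", mean, std)
}
```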
But I think this may be the most important thing to help: "or transactions taking over a certain amount of time". In a clean run, the time between a TX arriving and its execution has always been under 10 seconds in the load tests I have run so far, so there will be a report like this after the load test:
- tx 2 execution duration 7.x seconds occurred 149607 times
- tx 2 execution duration 8.x seconds occurred 424761 times
- tx 2 execution duration 9.x seconds occurred 26075 times
However, if Flow cannot keep up with the TPS rate, then these times grow the longer the load test runs, because the excess TXs are queued and get executed ever later after they arrive.
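A hypothetical check along those lines: bucket the time-to-execution of each transaction into whole seconds (mirroring the report above) and flag the run if anything exceeds the 10-second limit. The function names and the limit are assumptions, not existing loader code:

```go
package main

import (
	"fmt"
	"time"
)

// durationBuckets counts how many TXs fell into each whole-second
// time-to-execution bucket, mirroring the "execution duration 7.x seconds
// occurred N times" report above.
func durationBuckets(delays []time.Duration) map[int]int {
	buckets := make(map[int]int)
	for _, d := range delays {
		buckets[int(d.Seconds())]++
	}
	return buckets
}

// exceedsLimit reports whether any TX took longer than limit from arrival
// to execution, which is the sign that excess TXs are being queued.
func exceedsLimit(delays []time.Duration, limit time.Duration) bool {
	for _, d := range delays {
		if d > limit {
			return true
		}
	}
	return false
}

func main() {
	delays := []time.Duration{
		7*time.Second + 300*time.Millisecond,
		8 * time.Second,
		9*time.Second + 500*time.Millisecond,
	}
	fmt.Println(durationBuckets(delays))                                 // map[7:1 8:1 9:1]
	fmt.Println("over 10s limit:", exceedsLimit(delays, 10*time.Second)) // false
}
```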
Another thing to consider is that the collection nodes only reach 'full CPU usage' about 20 minutes in, which is why I started running load tests for at least 25 minutes instead of 5 to 15 minutes. The idea is that the first 20 minutes are burn-in, during which the collection node CPU usage continually increases. From 20 to 25 minutes in, the CPU usage has stabilized and presumably the TPS is more realistic.
Regarding "the loader will temporarily decrease its load": is this step necessary? If you overshoot the TPS, it is double the work to go to a lower TPS, because you have to go lower, first work off any excess queued TXs, and then, once they are worked off, figure out whether you are back at the status quo.
The best solution I can think of is a two-step process:
- Increase the TPS in steps of the known variance, e.g. 10 TPS. This way sometimes only 380 TPS might result and sometimes 390 TPS, but if you get 370 TPS or less then something may have been broken by a recent PR.
- Consider running a periodic set of load tests (e.g. every night or every weekend) for the same flow version to discover the variance. That way we can see whether a recent PR has introduced more variance or tightened it up.
This way you never have to decrease the TPS and go through the recovery described above (working off the queued TXs and re-establishing the status quo); you can just increase the TPS until it becomes unstable. For example, it could run like this (sketched in code below):
Run 20 minutes at a known working input TPS well inside the range, e.g. 360. Then, if there are no errors, no TX execution errors, and time to execution is < 10 seconds, do 2 minutes at 370 TPS. If that still holds, do 2 minutes at 380 TPS, then 390 TPS, and so on.
In this case the variance is 10 TPS, and the 2 minutes (or ~12 blocks) is based on what @Kay-Zee has suggested in the past as a minimum guideline for the period over which to collect an average output TPS.
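A sketch of that ramp schedule, assuming a runWindow helper that drives the loader at a target TPS for a fixed duration and reports back; all names and the simulated failure point are placeholders, not real loader APIs:

```go
package main

import (
	"fmt"
	"time"
)

// windowResult is a hypothetical summary of one measurement window.
type windowResult struct {
	errors       int
	maxExecDelay time.Duration
}

func (w windowResult) ok(maxDelay time.Duration) bool {
	return w.errors == 0 && w.maxExecDelay <= maxDelay
}

// runWindow stands in for driving the loader at targetTPS for the given
// duration and collecting results; here it just pretends things degrade
// once the target passes 390 TPS.
func runWindow(targetTPS float64, d time.Duration) windowResult {
	if targetTPS > 390 {
		return windowResult{errors: 3, maxExecDelay: 25 * time.Second}
	}
	return windowResult{maxExecDelay: 8 * time.Second}
}

func main() {
	const (
		baseTPS  = 360.0            // known working input TPS, well inside range
		step     = 10.0             // equal to the known run-to-run variance
		burnIn   = 20 * time.Minute // collection node CPU stabilizes around here
		window   = 2 * time.Minute  // ~12 blocks, per the guideline above
		maxDelay = 10 * time.Second
	)

	tps := baseTPS
	if !runWindow(tps, burnIn).ok(maxDelay) {
		fmt.Println("base rate already unstable at", tps, "TPS")
		return
	}
	for {
		tps += step
		if !runWindow(tps, window).ok(maxDelay) {
			fmt.Println("became unstable at", tps, "TPS; last stable:", tps-step)
			return
		}
	}
}
```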
There could then be an ongoing investigation to tighten up the variance and make the TPS monitoring more accurate. For example, one source of variance could be GCP disk I/O throttling, which is likely why running on the RAM drive results in higher TPS with less variance...
Hope these comments help :-)