
Localnet Inclusion Fees measurements


Problem Definition

This builds on https://github.com/onflow/flow-go/issues/2785.

Make some preliminary measurements of the transaction saturation point depending on the following 5 parameters:

  • Script byte size
  • Total Arguments byte size
  • Authorizers count
  • Payload signatures count
  • Envelope signatures count

By doing this on localnet we can see whether the saturation point depends on these parameters at all, and whether the relation is approximately linear.

Definition of Done

  1. Fill the following table (the row pattern is also sketched in code after this list):
| Script byte size | Total Arguments byte size | Authorizers count | Payload signatures count | Envelope signatures count | Max TPS |
| --- | --- | --- | --- | --- | --- |
| min | min | min | min | min | |
| max | min | min | min | min | |
| min | max | min | min | min | |
| min | min | max | min | min | |
| min | min | min | max | min | |
| min | min | min | min | max | |
| max/2 | min | min | min | min | |
| min | max/2 | min | min | min | |
| min | min | max/2 | min | min | |
| min | min | min | max/2 | min | |
| min | min | min | min | max/2 | |
| max/2 | max/2 | min | min | min | |
| min | max/2 | max/2 | min | min | |
| min | min | max/2 | max/2 | min | |
| min | min | min | max/2 | max/2 | |
| max/2 | min | min | min | max/2 | |
  2. Also record the total byte size of the transaction at each measurement.
  3. OPTIONAL: Observe whether the bottleneck is always the same part of the system, and which part it is.
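
A small sketch (Python, purely illustrative and not part of the issue) that reproduces the row pattern of the table above: the all-min baseline, each parameter alone at max and at max/2, and adjacent max/2 pairs (wrapping around):

```python
# Illustrative generator for the measurement matrix above; the parameter names
# are informal labels for the 5 transaction dimensions, not loader flags.
PARAMS = ["script_bytes", "args_bytes", "authorizers", "payload_sigs", "envelope_sigs"]

def measurement_rows():
    n = len(PARAMS)
    rows = [["min"] * n]                  # all-min baseline
    for level in ("max", "max/2"):        # one parameter at a time
        for i in range(n):
            row = ["min"] * n
            row[i] = level
            rows.append(row)
    for i in range(n):                    # adjacent max/2 pairs, wrapping around
        row = ["min"] * n
        row[i] = row[(i + 1) % n] = "max/2"
        rows.append(row)
    return rows

for r in measurement_rows():
    print(dict(zip(PARAMS, r)))
```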

janezpodhostnik, Jul 25 '22

Collected localnet metrics with preset tx sizes to find the stable peak TPS; here are some early observations.

Test Machine

Apple M1 Pro, 16 GB memory

Metrics Collection Procedure

  1. Close as many unrelated processes as possible.
  2. Remove stale state: rm -rf data profile trie bootstrap under flow-go/integration/localnet.
  3. make init, then make start.
  4. Launch the loader with const-exec as the load type, setting different max tx sizes, e.g. go run --tags relic ../loader -log-level info -tps 100 -tps-durations 800s -load-type const-exec -const-exec-max-tx-size 750
  5. Check the sealed tx rate (on a local Grafana mirror of https://dapperlabs.grafana.net/d/PkvVJj4Mz/mainnet-general?orgId=1&refresh=10s&viewPanel=162) and record the stable peak TPS.

Scope

  • Only total tx size is considered for now (other parameters like tx argument size, # of authorizers, etc. will be explored next)
  • Q3 above (observing whether the bottleneck is always the same part of the system, and which part it is) will be explored next

Metrics data collected: https://docs.google.com/spreadsheets/d/1eCH67Gmf9bfOHpyIghyCf8yR72aWt8D2j_OEzB8jOB4

Observations

  • Peak TPS possible on localnet on the test machine is roughly 90
  • The max tx size that localnet can handle is about 150,000 B; larger txs will crash the localnet processes
  • For a given total tx size, an overly high TPS from the loader can decrease the peak TPS observed on localnet. Excessive tx input appears to occupy some localnet resources (memory?), which impacts localnet performance.
  • Total tx size vs. stable peak TPS does not appear to be linearly correlated; it mostly looks like an inverse relation. This aligns with the intuition of (fixed available computation resources / avg tx size). A curve fit with scipy.optimize.curve_fit is shown in the attached plot (localnet_tps_curve_fit_with_eq); a reproduction sketch follows this list.
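
A minimal sketch of how such an inverse fit can be reproduced with scipy.optimize.curve_fit, using the raw size/TPS numbers posted later in this thread; the model form and starting values here are assumptions, not the exact equation from the attached plot:

```python
import numpy as np
from scipy.optimize import curve_fit

# Raw localnet measurements: total tx size (bytes) vs. stable sealed TPS
sizes = np.array([390, 500, 750, 1000, 2000, 5000, 7500, 10000,
                  20000, 50000, 75000, 100000, 125000, 150000], dtype=float)
tps = np.array([90, 90, 88, 85, 75, 40, 25, 20, 15, 6, 3, 2, 2, 2], dtype=float)

def inverse_model(size, a, b):
    # "fixed available computation resources / tx size" intuition,
    # with an offset b so the curve flattens out for small transactions
    return a / (size + b)

(a, b), _ = curve_fit(inverse_model, sizes, tps, p0=(2e5, 2e3))
print(f"fitted: TPS ~ {a:.0f} / (size + {b:.0f})")
```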

Misc Thoughts

It seems tx comments are passed along from the Access Nodes all the way to the Execution Nodes. A potential optimization that could save unnecessary network traffic is to strip tx comments at the Access Node. To be discussed.

Tonix517, Jul 27 '22

Collected TPS metrics varying 1) the # of authorizers; 2) the # of payer keys; 3) the tx argument size. Raw data is in https://docs.google.com/spreadsheets/d/1eCH67Gmf9bfOHpyIghyCf8yR72aWt8D2j_OEzB8jOB4/edit#gid=793558270

Observations:

  1. Tx argument size does not impact TPS. This makes sense because, as long as the script/tx code is empty, the tx is a no-op.
  2. The higher the # of authorizers, the lower the sealed tx TPS. The same holds for the # of payer keys, but the impact of the # of authorizers is bigger than that of the # of payer keys.

Tonix517, Jul 27 '22

Update: discussed the above information with @janezpodhostnik:

  1. We will need to collect similar metrics from Benchnet, which is closer to a real production environment.
  2. After the above is done, we will think about the math equation to implement in code (an illustrative sketch follows this list).
  3. For the tx/script compaction optimization: it is not viable, at least for now, because the tx signature depends on the original tx/script code, and users will also need to read their own code after publishing.
  4. For observability: follow @simonhf's instructions to get Jaeger spans on localnet. A good-to-have long-term solution would be a Grafana dashboard showing the actual tx time-cost breakdown.
  5. Mid/long-term goal: make the coefficients of the equation self-adjusting.
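
A purely illustrative sketch of what such an equation could look like, based only on the factors measured in this thread (total tx byte size, # of authorizers, # of payer keys); the structure and coefficients below are hypothetical placeholders to be calibrated from the collected metrics, not values from the FLIP:

```python
# Hypothetical linear inclusion-effort model; the coefficient values are
# placeholders for illustration only.
COEF_BYTE = 1.0          # effort per transaction byte
COEF_AUTHORIZER = 100.0  # effort per authorizer
COEF_PAYER_KEY = 50.0    # effort per payer key

def inclusion_effort(tx_byte_size: int, num_authorizers: int, num_payer_keys: int) -> float:
    """Byte size dominates, with smaller roughly-linear contributions
    from the authorizer and payer-key counts."""
    return (COEF_BYTE * tx_byte_size
            + COEF_AUTHORIZER * num_authorizers
            + COEF_PAYER_KEY * num_payer_keys)

# Example: a 5000-byte tx with 10 authorizers and 2 payer keys
print(inclusion_effort(5000, 10, 2))  # 5000 + 1000 + 100 = 6100.0
```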

Tonix517, Jul 28 '22

@Tonix517 any chance to make raw data public?

bluesign, Jul 29 '22

@bluesign let me put the raw metrics here:

| Total Tx Size (bytes) | Sealed tx rate (TPS) |
| --- | --- |
| 390 | 90 |
| 500 | 90 |
| 750 | 88 |
| 1000 | 85 |
| 2000 | 75 |
| 5000 | 40 |
| 7500 | 25 |
| 10000 | 20 |
| 20000 | 15 |
| 50000 | 6 |
| 75000 | 3 |
| 100000 | 2 |
| 125000 | 2 |
| 150000 | 2 |

| Config (total tx size fixed at 5000 B) | Sealed tx rate (TPS) |
| --- | --- |
| auth1-key2-arg1000 | 40 |
| auth1-key2-arg2000 | 40 |
| auth1-key2-arg4000 | 40 |
| auth1-key2-arg1000 | 40 |
| auth10-key2-arg1000 | 37 |
| auth20-key2-arg1000 | 35 |
| auth30-key2-arg1000 | 32 |
| auth40-key2-arg1000 | 30 |
| auth46-key2-arg1000 | 27 |
| auth1-key2-arg1000 | 40 |
| auth1-key25-arg1000 | 37 |
| auth1-key50-arg1000 | 35 |
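
For a rough quantification of the "roughly linear" impact of the authorizer count seen in these numbers, a minimal least-squares sketch over the authorizer sweep above (assuming numpy; not part of the loader tooling):

```python
import numpy as np

# Authorizer sweep at a fixed total tx size of ~5000 B, from the table above
authorizers = np.array([1, 10, 20, 30, 40, 46])
sealed_tps = np.array([40, 37, 35, 32, 30, 27])

slope, intercept = np.polyfit(authorizers, sealed_tps, 1)
print(f"sealed TPS ~ {intercept:.1f} {slope:+.2f} * authorizers")
# Roughly -0.27 TPS per additional authorizer at this tx size
```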

Tonix517, Jul 29 '22

Updates on benchnet metrics collection:

I've been following the runbook here to set up and run benchnet for the const-exec type of load testing, hoping to collect meaningful metrics, but unfortunately there seems to be too much noise with our current benchnet default settings mentioned in the doc. For example, tx sizes of 1000 B and 5000 B both had TPS wiggling between 15 and 20. I tried different GCP instance types (highcpu, highmem, etc.), plus running the loader from my local laptop, and all gave similarly unstable metrics.

It might be that some benchnet configuration needs tuning, but that requires extra time and effort. Since all we need is metrics closer to mainnet, instead of spending more time figuring out benchnet settings, I'd recommend starting to collect the related metrics (tx size, # of authorizers, # of payer signatures, etc.) from mainnet, to get real numbers. The concern is the turnaround time of doing this, considering our spork practices.

@janezpodhostnik

Tonix517, Aug 01 '22

In terms of the instability of TPS metrics from benchnet, there are more observations: https://www.notion.so/dapperlabs/2022-Aug-9-v0-27-1-Benchmark-30165199234c462ba699da42ecccf72f#03fad5854ae641f4a029523f979dfa36 by @zhangchiqing and https://axiomzen.slack.com/archives/C015G65HR2P/p1660707233633889 by @simonhf

Tonix517, Aug 17 '22

Collected the same set of metrics from canary and ran a couple of BigQuery queries to get more insights.

TLDR:

  1. The metrics data (stable peak sealed TPS) from canary is aligned with what we found on localnet: the metric is inversely correlated with the total size of a tx. Raw metrics data: see the attached screenshot (Screen Shot 2022-08-22 at 1 53 17 PM).

  2. The number of authorizers, or the number of payer keys, did have a (roughly linear) impact on the metric at a fixed tx size. However, according to our BigQuery results (see the attached screenshot, Screen Shot 2022-08-22 at 1 40 49 PM), the majority of txs have <=10 authorizers. So it might be appropriate to drop the number of authorizers from our equation (to be discussed).

Tonix517, Aug 22 '22

FLIP merged at https://github.com/onflow/flips/pull/12

Tonix517, Oct 13 '22