firefly
firefly copied to clipboard
Hardening: FireFly Hardening & Performance Testing - umbrella issue
Part of the 1.0 milestone https://github.com/hyperledger/firefly/issues/117 includes a number of performance tests on FireFly core. They are:
- Throughput and latency measurement
- Bottleneck analysis & triage
- Multi-day soak testing
- HA fail-over testing
This issue will track the ongoing efforts to complete these per protocol.
Ethereum Throughput and Latency Measurement
Goals:
- Hit FireFly with a sustained, moderate (TBD) amount of load to be confident there are no issues processing transactions.
- Initial performance benchmarks e.g. throughput, latency
Please note the initial goal of this exercise is not to slam FireFly with as much RPS as possible. It's much more important for us to be confident in FireFly's performance under somewhat normal conditions.
Interfaces targeted:
- [x] Broadcasts
- [x] Private Messages
- [x] Minting tokens
Desirable metrics:
- E2E latency
- Success ratio
- .....
Test Scenarios:
- Send pinned (broadcasts, private message, token mints) with data
- Verify each broadcast was successfully sent, transaction processed, and received.
Tools under consideration:
- https://github.com/tsenart/vegeta
- https://github.com/locustio/locust
- https://github.com/apache/jmeter
Here's an overview of the tooling being used so far.
perf.go is a small script that utilizes vegeta to simulate load. Currently, it focuses on two categories of metrics:
- The time between the last message sent and 0 pending transactions.
- How FireFly responds to load e.g. latency, failure rate, sustained RPS (Requests Per Second).
For reference, all testing is currently being done on a 2 node FireFly network, using postgres as the DB layer, generated by FireFly CLI.
Here are the some example outputs from perf.go
with extremely low RPS:
10 RPS:
Requests [total, rate, throughput] 600, 10.02, 10.01
Duration [total, attack, wait] 59.917s, 59.898s, 19.263ms
Latencies [min, mean, 50, 90, 95, 99, max] 5.517ms, 23.619ms, 12.784ms, 53.92ms, 67.832ms, 100.943ms, 159.004ms
Bytes In [total, mean] 355740, 592.90
Bytes Out [total, mean] 74400, 124.00
Success [ratio] 100.00%
Status Codes [code:count] 202:600
Error Set:
Elapsed time between last sent message and 0 pending transactions: 8.011959383s
20 RPS:
Requests [total, rate, throughput] 1200, 20.02, 19.97
Duration [total, attack, wait] 1m0s, 59.95s, 153.045ms
Latencies [min, mean, 50, 90, 95, 99, max] 5.326ms, 36.946ms, 30.048ms, 77.708ms, 91.476ms, 130.316ms, 197.948ms
Bytes In [total, mean] 711476, 592.90
Bytes Out [total, mean] 148800, 124.00
Success [ratio] 100.00%
Status Codes [code:count] 202:1200
Error Set:
Elapsed time between last sent message and 0 pending transactions: 10.016299835s
30 RPS:
Requests [total, rate, throughput] 1800, 30.02, 29.96
Duration [total, attack, wait] 1m0s, 59.966s, 114.93ms
Latencies [min, mean, 50, 90, 95, 99, max] 5.155ms, 77.104ms, 55.077ms, 171.714ms, 230.911ms, 353.897ms, 619.305ms
Bytes In [total, mean] 1067202, 592.89
Bytes Out [total, mean] 223200, 124.00
Success [ratio] 100.00%
Status Codes [code:count] 202:1800
Error Set:
Elapsed time between last sent message and 0 pending transactions: 35.023299044s
40 RPS:
Requests [total, rate, throughput] 2400, 40.02, 39.98
Duration [total, attack, wait] 1m0s, 59.977s, 54.962ms
Latencies [min, mean, 50, 90, 95, 99, max] 4.643ms, 48.976ms, 39.996ms, 100.161ms, 119.713ms, 152.931ms, 211.914ms
Bytes In [total, mean] 1422923, 592.88
Bytes Out [total, mean] 297600, 124.00
Success [ratio] 100.00%
Status Codes [code:count] 202:2400
Error Set:
Elapsed time between last sent message and 0 pending transactions: 52.081265919s
50 RPS:
Requests [total, rate, throughput] 3000, 50.02, 49.98
Duration [total, attack, wait] 1m0s, 59.979s, 41.133ms
Latencies [min, mean, 50, 90, 95, 99, max] 11.924ms, 54.866ms, 28.605ms, 72.405ms, 111.312ms, 798.671ms, 1.515s
Bytes In [total, mean] 1775629, 591.88
Bytes Out [total, mean] 372000, 124.00
Success [ratio] 100.00%
Status Codes [code:count] 202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 1m21.009812988s
The current goal is identifying what is causing the high time between the last message and 0 pending transactions. Below is a diagram of the lifecycle of a pinned message in FireFly:
data:image/s3,"s3://crabby-images/af5a5/af5a57dcc4a3b40d23219fa34a4054d357bdc5b1" alt="Screen Shot 2021-11-15 at 4 17 05 PM"
The starred box, "FireFly - Batch Pin Complete", is where batches from confirmed transactions are sequentially processed in FireFly. In the 50 RPS test, a batch was taking ~4 seconds to process, with ~120 messages per batch. The bulk of those 4 seconds was making DB queries for each message, where each message was taking roughly 25ms to process.
@peterbroadhurst is investigating a way to reduce processing time by limiting the number of DB queries per message.
With @peterbroadhurst's changes in https://github.com/hyperledger/firefly/pull/325, batch processing has been improved 4x. Taking roughly 1second, down from 4seconds, to fully process each batch.
Here are some comparison results for context. All benchmarks were ran on my MBP with a CLI generated 2 node firefly environment, using postgres as the DB:
Before #325, we were consistently seeing > 1 minute latency between the last sent message and 0 pending transactions.
Requests [total, rate, throughput] 3000, 50.02, 49.98
Duration [total, attack, wait] 1m0s, 59.979s, 41.133ms
Latencies [min, mean, 50, 90, 95, 99, max] 11.924ms, 54.866ms, 28.605ms, 72.405ms, 111.312ms, 798.671ms, 1.515s
Bytes In [total, mean] 1775629, 591.88
Bytes Out [total, mean] 372000, 124.00
Success [ratio] 100.00%
Status Codes [code:count] 202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 1m21.009812988s
Requests [total, rate, throughput] 3000, 50.02, 49.76
Duration [total, attack, wait] 1m0s, 59.98s, 303.142ms
Latencies [min, mean, 50, 90, 95, 99, max] 13.387ms, 175.692ms, 85.487ms, 292.843ms, 1.018s, 1.437s, 1.947s
Bytes In [total, mean] 1775678, 591.89
Bytes Out [total, mean] 372000, 124.00
Success [ratio] 100.00%
Status Codes [code:count] 202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 1m59.015816118s
With #325:
Duration [total, attack, wait] 59.991s, 59.977s, 14.409ms
Latencies [min, mean, 50, 90, 95, 99, max] 9.681ms, 42.796ms, 21.667ms, 62.05ms, 105.421ms, 650.269ms, 1.123s
Bytes In [total, mean] 1775694, 591.90
Bytes Out [total, mean] 372000, 124.00
Success [ratio] 100.00%
Status Codes [code:count] 202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 15.009402704s
Requests [total, rate, throughput] 3000, 50.02, 49.96
Duration [total, attack, wait] 1m0s, 59.977s, 64.636ms
Latencies [min, mean, 50, 90, 95, 99, max] 9.85ms, 22.499ms, 19.238ms, 36.131ms, 46.965ms, 71.564ms, 167.495ms
Bytes In [total, mean] 1775656, 591.89
Bytes Out [total, mean] 372000, 124.00
Success [ratio] 100.00%
Status Codes [code:count] 202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 10.005748929s
Requests [total, rate, throughput] 3000, 50.02, 50.01
Duration [total, attack, wait] 59.994s, 59.977s, 16.459ms
Latencies [min, mean, 50, 90, 95, 99, max] 9.671ms, 32.417ms, 18.36ms, 33.254ms, 45.816ms, 597.197ms, 1.161s
Bytes In [total, mean] 1775683, 591.89
Bytes Out [total, mean] 372000, 124.00
Success [ratio] 100.00%
Status Codes [code:count] 202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 10.007021619s
Batch processing time was consistently reduced to <= 1, down from ~4.
[2021-11-17T10:29:40.179-05:00] INFO -> BatchPinComplete batch=f805835c-bc87-41ba-9b4b-e1ce2056369d txn=0xcbc2e712857072b33e50ef40bcd5394699dc4c05645800a49fd0a89a49bd078e signingIdentity=0x447dd9c1170480748ab4fe9717bdc5e96667e55d pid=57135 role=event-manager
......
[2021-11-17T10:29:40.391-05:00] INFO <- BatchPinComplete batch=f805835c-bc87-41ba-9b4b-e1ce2056369d txn=0xcbc2e712857072b33e50ef40bcd5394699dc4c05645800a49fd0a89a49bd078e signingIdentity=0x447dd9c1170480748ab4fe9717bdc5e96667e55d pid=57135 role=event-manager
Closing out initial broadcast tests here.
Seeing some good results with pinned private messages
Requests [total, rate, throughput] 3000, 50.02, 49.99
Duration [total, attack, wait] 1m0s, 59.978s, 28.515ms
Latencies [min, mean, 50, 90, 95, 99, max] 19.938ms, 35.958ms, 27.828ms, 59.743ms, 70.459ms, 98.082ms, 173.507ms
Bytes In [total, mean] 1994674, 664.89
Bytes Out [total, mean] 615000, 205.00
Success [ratio] 100.00%
Status Codes [code:count] 202:3000
Error Set:
Elapsed time between last message and 0 pending transactions: 1.016668037s
Requests [total, rate, throughput] 3000, 50.02, 49.98
Duration [total, attack, wait] 1m0s, 59.977s, 26.722ms
Latencies [min, mean, 50, 90, 95, 99, max] 19.739ms, 31.028ms, 27.131ms, 44.154ms, 56.449ms, 89.647ms, 158.406ms
Bytes In [total, mean] 1994131, 664.71
Bytes Out [total, mean] 615000, 205.00
Success [ratio] 99.97%
Status Codes [code:count] 202:2999 500:1
Error Set:
500 Internal Server Error
Elapsed time between last message and 0 pending transactions: 1.01232381s
Now, the focus is on testing tokens. After discussion with @awrichar, we've decided on the following order of tests:
- Mint without message
- Mint with message
- Transfer
- Burn
Currently, tokens does not batch transactions before submitting to ethconnect, so the results at high RPS should be interesting.
The above mentioned perf.go has turned into an official CLI tool🎉 . firefly-perf-cli is the HTTP load testing tool we are using for performance testing & hardening.
Since the creation of firefly-perf-cli, improvements have been made to performance testing mechanics. These include utilizing goroutines to kick off vegeta
jobs, websockets to confirm transactions, running tests in an EC2 instance rather than locally, and tracking and viewing results with Prometheus and Grafana.
Architecture
Grafana Dashboard - Broadcasts (50 tx/sec)
data:image/s3,"s3://crabby-images/81a6d/81a6d455b247d8ec973f28db0d15c4dd49f5bc87" alt="grafana"
Would be great to put this architecture diagram into the README when you get a chance 🚀 🤖
https://github.com/hyperledger/firefly/issues/376 - Database change event queue gets exhausted (Closed)
#380 - Database begin transaction failed: pq: too many clients already (Closed)
#382 - Error from data exchange - request entity too large
#383 - Token transfer has already been recorded
#406 - Mismatched nexthash or author (Fabric)