FireFly 1.3 Performance Testing
Doing performance testing ahead of the 1.3 release, following the example set in https://github.com/hyperledger/firefly/issues/563#issuecomment-1123969295. It's worth noting that we don't have an RC at the moment, so this is very preliminary testing, but it's good to get some figures on the board.
Setup is the same as in the older performance testing issue:
- 2 FireFly nodes on one virtual server (EC2 m4.xlarge)
- Entire FireFly stack is local to the server (i.e. both blockchains, Postgres databases, etc.)
- Single geth node with 2 instances of ethconnect
- Maximum time to confirm before considering failure = 1 minute
The Plan
Here's the plan: we don't yet have an RC of 1.3, but it's not worth holding up the performance testing for that. If there's a major slowdown, it's worth getting an idea of it now so we can start to work on the fixes required before we begin the release process.
With that in mind, we need to do a performance run of FireFly v1.2.0 and then a run of the current main branch to see whether there's a performance difference. Given this is super preliminary, it should be OK to use a shorter time frame to get a feel for how quickly we're getting through transactions.
FireFly v1.2.0
Using the configuration below, we get the following results for an hour window of running the tests:
Expand for test configuration
nohup ./start.sh &> ffperf.log &
core-config.yml
log:
  level: debug
broadcast:
  batch:
    size: 200
    timeout: 1s
privatemessaging:
  batch:
    size: 200
    timeout: 1s
message:
  writer:
    count: 5
download:
  worker:
    count: 100
publicstorage:
  ipfs:
    api:
      requestTimeout: 2s
    gateway:
      requestTimeout: 2s
ethconnect.yml
rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000
instances.yml
stackJSONPath: /home/ubuntu/.firefly/stacks/perf-1/stack.json
wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s
instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x69eb9c6b573415d3a8ed2fd38c0f9685a36c8455"}
FireFly git commit:
dc86491bdf1adc9f25cdb410eacfcbf0e1fc23ba
Measured throughput according to the logs is hovering around ~40.26 TPS.
FireFly (what will be) v1.3.0
Using the configuration below, we get the following results for an hour window of running the tests:
Expand for test configuration
nohup ./start.sh &> ffperf.log &
core-config.yml
log:
  level: debug
broadcast:
  batch:
    size: 200
    timeout: 1s
privatemessaging:
  batch:
    size: 200
    timeout: 1s
message:
  writer:
    count: 5
download:
  worker:
    count: 100
publicstorage:
  ipfs:
    api:
      requestTimeout: 2s
    gateway:
      requestTimeout: 2s
ethconnect.yml
rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000
instances.yml
stackJSONPath: /home/ubuntu/.firefly/stacks/perf-2/stack.json
wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s
instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x457e54a0e07aae6332ffdeea59e3c5580810ba0a"}
FireFly git commit:
8b6b604d6f17396d4ba045b0221fc6e31408dcb3
Measured throughput according to the logs is hovering around ~37.92 TPS.
Looks like there's a ~6% drop in measured TPS between the two runs (40.26 vs 37.92). It's worth noting that the second run ran longer than the first, so even though the measurement window is an hour, the totals will still be higher. We'll need to dig into specific metrics to understand where we're losing time.
Down the rabbit hole 🐰
NOTE: I'll be referring to '1.3' as what's currently in main, even though there isn't an RC
After performing 2 new runs of the performance test using the normal long-running configuration over the course of an hour, we get the following figures for inbound/outbound API response times:
| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://ipfs_0:5001/api/v0/add | 10.44ms | 114.1ms | 11.68ms | 98.03ms | N/A |
| http://dataexchange_0:3000/api/v1/blobs/default/ | 14.23ms | 183.9ms | 37.91ms | 213.0ms | 1.2 |
| http://dataexchange_0:3000/api/v1/transfers | 27.22ms | 76.45ms | 22.74ms | 95.65ms | N/A |
| http://dataexchange_0:3000/api/v1/messages | 9.67ms | 75.15ms | 5.32ms | 68.73ms | N/A |
| http://evmconnect_0:5008/ | 9.74ms | 108.8ms | 21.74ms | 151.9ms | 1.2 |
| http://tokens_0_0:3000/api/v1/mint | 21.94ms | 98.93ms | 32.28ms | 137.8ms | 1.2 |

| Route (Inbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| /api/v1/namespaces/default/messages/broadcast | 51.85ms | 328.1ms | 37.21ms | 301.3ms | 1.3! |
| /api/v1/namespaces/default/messages/private | 54.05ms | 452.3ms | 38.02ms | 273.6ms | 1.3! |
| /api/v1/namespaces/default/data | 63.31ms | 407.5ms | 66.87ms | 385.1ms | N/A |
| /api/v1/namespaces/default/contracts/invoke | 28.42ms | 226.9ms | 44.34ms | 228.4ms | 1.2 |
| /api/v1/namespaces/default/tokens/mint | 78.77ms | 376.8ms | 77.63ms | 287.7ms | N/A |
Preliminary conclusions
- Calls outbound to the DataExchange BLOBs API are significantly faster on `v1.2.0` than on `v1.3.0`
- Calls outbound to the EVM Connector are significantly faster on `v1.2.0` than on `v1.3.0`
- Calls outbound to the Token Connector are faster on `v1.2.0` than on `v1.3.0`
- `v1.3.0` is faster than `v1.2.0` on all inbound API requests apart from `/api/v1/namespaces/default/contracts/invoke` (though this could be due to token connector slow down)
*1- It's worth noting that these runs were performed for an hour, so the general trend is probably reliable, but the specific figures should not be taken as gospel.
*2- Statistics are scraped from logs from FireFly core using a custom tool which I'll contribute after some cleanup.
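As a rough illustration of the approach (this is not the tool mentioned in *2, just a minimal sketch), the scraping boils down to reading the docker JSON log lines and matching the `<-- <METHOD> <route> [<status>] (<N>ms)` response lines that FireFly core writes, as in the log excerpts quoted later in this issue. The regex and field handling below are assumptions based on those excerpts; the outbound-call lines would need their own pattern.

```go
// logstats.go - sketch of scraping per-route response times from FireFly core's
// docker JSON logs. Reads log lines on stdin, pulls out the "<-- METHOD route
// [status] (Nms)" response lines, and prints average/max per route.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// dockerLine is the subset of the docker json-file log format we care about.
type dockerLine struct {
	Log string `json:"log"`
}

// Matches the "response complete" lines, capturing method, route and duration in ms.
var respRe = regexp.MustCompile(`<-- (\w+) (\S+) \[\d+\] \(([\d.]+)ms\)`)

type stats struct {
	count int
	total float64
	max   float64
}

func main() {
	perRoute := map[string]*stats{}
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // log lines can be long
	for scanner.Scan() {
		var dl dockerLine
		if err := json.Unmarshal(scanner.Bytes(), &dl); err != nil {
			continue // not a docker JSON line
		}
		m := respRe.FindStringSubmatch(dl.Log)
		if m == nil {
			continue // not a response-complete line
		}
		route := m[1] + " " + m[2]
		ms, err := strconv.ParseFloat(m[3], 64)
		if err != nil {
			continue
		}
		s, ok := perRoute[route]
		if !ok {
			s = &stats{}
			perRoute[route] = s
		}
		s.count++
		s.total += ms
		if ms > s.max {
			s.max = ms
		}
	}
	for route, s := range perRoute {
		fmt.Printf("%-60s avg=%.2fms max=%.2fms n=%d\n", route, s.total/float64(s.count), s.max, s.count)
	}
}
```

Fed with the core container's log file (e.g. `go run logstats.go < firefly_core_0.log`, filename assumed), this produces per-route averages of the same shape as the tables above.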
Given these figures it seems logical that the next areas for investigation are:
- Data Exchange BLOBs API
- EVM Connector transactions
- Tokens Connector mint operations
- Contract invoke (though this could be due to token connector slow down)
Investigation
Going through each of the areas of investigation in no particular order.
(Inbound) Contract Invoke API
| Route (Inbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| /api/v1/namespaces/default/contracts/invoke | 28.42ms | 226.9ms | 44.34ms | 228.4ms | 1.2 |
Right, so here we've gone from an average of `28.42ms` to `44.34ms`, which is roughly a 56% increase in execution time - concerning. From the logs of each of the runs I conducted to produce the initial set of results, I've picked a transaction from both that shows the difference in performance.
From `v1.2.0` I have `httpreq=qZ1KfQdQ`, which took `21.64ms`. From `v1.3.0` I have `httpreq=9UDOTaTZ`, which took `83.89ms`.
1.2 logs
{"log":"[2024-02-19T19:59:15.382Z] INFO --\u003e POST /api/v1/namespaces/default/contracts/invoke httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383664552Z"}
{"log":"[2024-02-19T19:59:15.382Z] DEBUG SQL-\u003e begin dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383688129Z"}
{"log":"[2024-02-19T19:59:15.383Z] DEBUG SQL\u003c- begin dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383696645Z"}
{"log":"[2024-02-19T19:59:15.384Z] DEBUG SQL-\u003e insert transactions dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.38455654Z"}
{"log":"[2024-02-19T19:59:15.385Z] DEBUG SQL\u003c- insert transactions sequences=[47918] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.385256478Z"}
{"log":"[2024-02-19T19:59:15.385Z] DEBUG SQL-\u003e insert operations dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.385463333Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL\u003c- insert operations sequences=[138572] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.386316466Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL-\u003e lock default dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.38660824Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL\u003c- lock default dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.387084482Z"}
{"log":"[2024-02-19T19:59:15.387Z] DEBUG SQL-\u003e insert events dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.387300098Z"}
{"log":"[2024-02-19T19:59:15.390Z] DEBUG SQL\u003c- insert events sequences=[372131] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.390809911Z"}
{"log":"[2024-02-19T19:59:15.390Z] DEBUG SQL-\u003e commit dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.390826751Z"}
{"log":"[2024-02-19T19:59:15.392Z] DEBUG SQL\u003c- commit dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.392218175Z"}
{"log":"[2024-02-19T19:59:15.392Z] INFO Emitted transaction_submitted event 314edbfc-f82d-4a32-a58e-317c13e09842 for default:7eb9f31f-48a0-4979-85f4-008b3a63d901 (correlator=,topic=contract_invoke) dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.392479498Z"}
{"log":"[2024-02-19T19:59:15.392Z] INFO Executing blockchain_invoke operation 1c2433e4-efd5-49ea-8cd5-6a8d496f8baf via handler ContractManager httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.39296641Z"}
{"log":"[2024-02-19T19:59:15.404Z] INFO \u003c-- POST /api/v1/namespaces/default/contracts/invoke [202] (21.64ms) httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.406246777Z"}
1.3 logs
{"log":"[2024-02-19T13:57:59.874Z] INFO --\u003e POST /api/v1/namespaces/default/contracts/invoke httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.876572405Z"}
{"log":"[2024-02-19T13:57:59.876Z] DEBUG Validating method 'methodhash_set_27c24ee2991b7efee363af9e49f8cfda448be98036b314746c9bbeb81ebbc2d4' (cacheMiss=false) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.877197117Z"}
{"log":"[2024-02-19T13:57:59.893Z] INFO Executing blockchain_invoke operation ec41af8e-51d7-454c-a3a7-e7d95d4549ae via handler ContractManager httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894025396Z"}
{"log":"[2024-02-19T13:57:59.893Z] DEBUG Validating method 'methodhash_set_27c24ee2991b7efee363af9e49f8cfda448be98036b314746c9bbeb81ebbc2d4' (cacheMiss=false) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894051915Z"}
{"log":"[2024-02-19T13:57:59.894Z] DEBUG EVMConnectorBody: {\"from\":\"0x4c765d85923692812d4a9c72e599b5565e49d943\",\"headers\":{\"type\":\"SendTransaction\",\"id\":\"default:ec41af8e-51d7-454c-a3a7-e7d95d4549ae\"},\"method\":{\"type\":\"function\",\"name\":\"set\",\"inputs\":[{\"name\":\"newValue\",\"type\":\"uint256\"}],\"outputs\":[]},\"params\":[164],\"to\":\"0x457e54a0e07aae6332ffdeea59e3c5580810ba0a\"} httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894158763Z"}
{"log":"[2024-02-19T13:57:59.960Z] INFO \u003c-- POST /api/v1/namespaces/default/contracts/invoke [202] (83.89ms) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.962258998Z"}
The big difference between the two log traces shows there have definitely been some implementation changes here for us to investigate... Time to dig into the code...
Bingpot! - There have been changes in this area within the last 8 months, which trace back to this PR: https://github.com/hyperledger/firefly/pull/1354, curiously labelled 'Further optimize blockchain transaction inserts to DB'. Think there's more to the story here to uncover...
(Outbound) Token Connector API Mint
| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://tokens_0_0:3000/api/v1/mint | 21.94ms | 98.93ms | 32.28ms | 137.8ms | 1.2 |
Looks like a ~10ms increase in the average; need to dig into the token connector to figure out what has changed recently. A walk through the code shows some changes to how we submit transactions onto the blockchain, which is the prime suspect at the moment: https://github.com/hyperledger/firefly-tokens-erc20-erc721/pull/128
The suspected PR implements retry logic for blockchain calls, which opens up the possibility that some of the calls made during the test failed and were retried, where previously we would have let them fail. We might need to conduct another run and collect logs from the tokens connector to figure out whether we're actually retrying some calls.
Given I can't (yet) isolate a specific reason for the slowdown, I'm going to run some quick tests focused exclusively on token mint operations to see whether I can pick out a specific issue here. I'm hoping that running these in isolation will show very similar results.
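To make the retry suspicion above concrete, here's a toy sketch in Go (the token connector itself is TypeScript; this is not its code) of why retry-with-backoff inflates the average response time even if only a small fraction of calls fail: a retried request pays the failed attempt, the backoff delay, and then the successful attempt. The timings are made-up numbers purely for illustration.

```go
// retrylatency.go - toy illustration of how retrying failed blockchain calls
// inflates average response time.
package main

import (
	"errors"
	"fmt"
	"time"
)

// callWithRetry runs fn up to maxAttempts times, sleeping backoff between
// attempts, and reports the total wall-clock time the caller waited.
func callWithRetry(fn func() error, maxAttempts int, backoff time.Duration) (time.Duration, error) {
	start := time.Now()
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return time.Since(start), nil
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
		}
	}
	return time.Since(start), err
}

func main() {
	// Made-up numbers purely for illustration.
	callTime := 20 * time.Millisecond
	backoff := 100 * time.Millisecond

	// A call that succeeds first time.
	succeed := func() error { time.Sleep(callTime); return nil }

	// A call that fails once with a transient error, then succeeds.
	attempts := 0
	failOnce := func() error {
		attempts++
		time.Sleep(callTime)
		if attempts == 1 {
			return errors.New("transient blockchain error")
		}
		return nil
	}

	fast, _ := callWithRetry(succeed, 3, backoff)
	slow, _ := callWithRetry(failOnce, 3, backoff)
	fmt.Println("no retry needed:", fast) // ~20ms
	fmt.Println("one retry:      ", slow) // ~20ms + 100ms backoff + ~20ms
	// Even a small fraction of retried mints pulls the average up; the
	// connector's own logs would confirm whether retries actually happened.
}
```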
(Outbound) POST EVM Connector
| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://evmconnect_0:5008/ | 9.74ms | 108.8ms | 21.74ms | 151.9ms | 1.2 |
Into the EVM Connector code - this might be a bit tricky since, to my knowledge, it doesn't use the same API structure as the rest of the applications...
After going through the code, it looks like we're making a lot of calls for gas estimation (which makes sense really). The only PR in that space that has gone in since 1.2 looks to be https://github.com/hyperledger/firefly-evmconnect/pull/85, but there's nothing immediately obvious as the cause of the slowdown.
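For context on what those gas-estimation round trips involve, here's a hedged sketch (Go, not the firefly-evmconnect implementation) of a single `eth_estimateGas` JSON-RPC call with the result scaled up by a safety factor - my assumption being that this is roughly what a setting like `gasEstimationFactor: 2.0` controls. The node URL, addresses, and calldata below are placeholders.

```go
// gasestimate.go - illustrative sketch of the extra round trip a transaction
// submission pays when gas is estimated first: one eth_estimateGas JSON-RPC
// call, with the result scaled by a safety factor before use as the gas limit.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"math/big"
	"net/http"
)

type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	ID      int           `json:"id"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
}

type rpcResponse struct {
	Result string `json:"result"` // hex-encoded gas estimate
	Error  *struct {
		Message string `json:"message"`
	} `json:"error"`
}

// estimateGas asks the node how much gas a call would use, then scales it.
func estimateGas(rpcURL, from, to, data string, factor float64) (*big.Int, error) {
	reqBody, err := json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		ID:      1,
		Method:  "eth_estimateGas",
		Params:  []interface{}{map[string]string{"from": from, "to": to, "data": data}},
	})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(rpcURL, "application/json", bytes.NewReader(reqBody))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var rpcResp rpcResponse
	if err := json.NewDecoder(resp.Body).Decode(&rpcResp); err != nil {
		return nil, err
	}
	if rpcResp.Error != nil {
		return nil, fmt.Errorf("eth_estimateGas failed: %s", rpcResp.Error.Message)
	}
	estimate, ok := new(big.Int).SetString(rpcResp.Result, 0) // result is "0x..." hex
	if !ok {
		return nil, fmt.Errorf("unexpected result %q", rpcResp.Result)
	}
	// Scale the node's estimate by the safety factor before using it as the gas limit.
	scaled, _ := new(big.Float).Mul(new(big.Float).SetInt(estimate), big.NewFloat(factor)).Int(nil)
	return scaled, nil
}

func main() {
	// Placeholder node URL, addresses and calldata - illustration only.
	gas, err := estimateGas("http://127.0.0.1:8545",
		"0x0000000000000000000000000000000000000001",
		"0x0000000000000000000000000000000000000002",
		"0x", 2.0)
	if err != nil {
		fmt.Println("estimate failed:", err)
		return
	}
	fmt.Println("gas limit to submit with:", gas)
}
```

The point is simply that every submission pays at least one extra round trip to the node before the transaction is even sent, so any change in how often or when estimation happens would show up in the outbound latency above.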
(Outbound) Data Exchange BLOBs API
| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://dataexchange_0:3000/api/v1/blobs/default/ | 14.23ms | 183.9ms | 37.91ms | 213.0ms | 1.2 |
The average slowdown looks to be ~23ms, which is significant. This one is a bit weird though, because it looks like the repo itself hasn't had any changes for a long period of time: https://github.com/hyperledger/firefly-dataexchange-https. Looking into the code confirms that nothing in there has changed since 1.2. I don't understand enough right now to be able to comment on what the source of the slowdown could be, but that'll be the next thing to dig into.
After this cursory investigation of each area, I think there might be a legitimate slowdown in the `/contracts/invoke` API, but for the other APIs it doesn't make a whole lot of sense why there's a discrepancy in the results. So I think we're at the point where we should run a much longer test and then observe the results at the end of those runs. While those runs are going, it should be possible to investigate a bit further what's going on with the contracts API.
Long-Running Test
Previously, tests have been running for a couple of hours at most; we're now at the point of needing to run longer tests to observe what transaction speed/throughput looks like. We're going to run the same suite as was run for 1.2 for at least 4-5 days and then compare results.
Configuration
nohup ./start.sh &> ffperf.log &
core-config.yml
log:
  level: debug
broadcast:
  batch:
    size: 200
    timeout: 1s
privatemessaging:
  batch:
    size: 200
    timeout: 1s
message:
  writer:
    count: 5
download:
  worker:
    count: 100
publicstorage:
  ipfs:
    api:
      requestTimeout: 2s
    gateway:
      requestTimeout: 2s
ethconnect.yml
rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000
instances.yml
stackJSONPath: /home/ubuntu/.firefly/stacks/perf-test/stack.json
wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s
instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x3f80fa2a99168245a54fabb774ec3ddc7e71419e"}
FireFly git commit:
8b6b604d6f17396d4ba045b0221fc6e31408dcb3
Picking this up - I see a similar behaviour where the pending count for broadcasts just keeps getting bigger and bigger, because the batch pins get stuck behind a previous one and are never confirmed. For some reason the performance CLI thinks they are confirmed - and reports an even higher number confirmed than were sent!! I think it's because on rewind it receives some sort of message_confirmed by accident... Digging into this.
I've posted the results from a run on the Hardening 1.3 release issue. The problem with the confirmed count exceeding the submitted count was due to the metrics not being added correctly, and should be fixed by https://github.com/hyperledger/firefly/pull/1490
This is done