Add benchmarking test suite.
Currently we don't have good benchmarking tools.
Let's write a script that does the following.
- ~~For each set of mainnet VM rules, build a chain of X empty blocks.~~ ✅
  - ~~one using `mine_block`~~ ✅
  - ~~another using `import_block`~~ ✅
- ~~For each set of mainnet VM rules, build a chain of X blocks, each full of simple value transfer transactions.~~ ✅
  - ~~`sender` should be static across all tests~~ ✅
  - ~~one run using empty `to` accounts.~~ ✅
  - ~~one run using existing `to` accounts.~~ ✅
- For each set of mainnet VM rules, benchmark using the open zeppelin ERC20 token contract.
  - one run with X calls to `transfer`
  - one run with X calls to `approve`
  - one run with X calls to `transferFrom`
- ~~For each set of mainnet VM rules, benchmark deployment of the open zeppelin ERC20 token contract.~~ ✅
- For each set of mainnet VM rules, benchmark DOS style transaction spam.
  - Make X calls to a contract which uses all available gas to create empty contracts.
  - Make X calls to a contract which uses all available gas on SSTORE operations.
  - Two more runs, same as above, but which revert at the end of contract execution.
- ~~For each set of mainnet VM rules, benchmark gas per second for block import.~~ ✅
  - ~~Measure for pure value transfer transactions.~~ ✅
  - ~~Measure for representative EVM execution (which will have to be an approximation; probably worth starting with just ERC20 contracts and transfers and then expanding from there).~~ ✅
- Ensure blocks use pre-generated fixtures with real blocks that satisfy the PoW, so that we don't skip benchmarking the PoW validation/verification overhead.
- Have at least one benchmark use `LevelDB` rather than `MemoryDB` (ideally all of them, but that may be too big of a performance hit).
These benchmark scripts should produce detailed output to the console and be added as a stand-alone test run in our CI environment. Any failing benchmark scripts should cause CI to fail.
Stats worth tracking:
- time to run
- total DB size
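As a rough sketch of how those two stats could be captured around each benchmark run (the `report` helper and its arguments here are illustrative placeholders, not an existing py-evm API):

```python
import os
import time


def directory_size(path):
    # Sum the on-disk size of all files under a (LevelDB) database directory.
    total = 0
    for root, _, files in os.walk(path):
        total += sum(os.path.getsize(os.path.join(root, name)) for name in files)
    return total


def report(name, fn, db_path=None):
    # Time a single benchmark callable and print the stats listed above.
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    db_size = directory_size(db_path) if db_path else 0
    print("{}: {:.3f}s elapsed, {} bytes of DB".format(name, elapsed, db_size))
```

Each scenario in the checklist above would then just be a callable handed to such a helper.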
> Any failing benchmark scripts should cause CI to fail.
Where passing presumably means both correctness and within some tolerance of a performance target.
> Where passing presumably means both correctness and within some tolerance of a performance target.
I don't think so. That statement was more to ensure that the benchmarks remain functional through code changes and merges so they don't rot. I don't think it's reasonable to try and code in performance targets because there are going to be big differences depending on whether things are running in Travis, on your local machine, etc.
Just to make sure I got it right:
- These would most likely be standalone scripts not using `pytest`.
- The benchmarking aspect has nothing to do with the runtime performance but is really all about ensuring certain test scenarios still work.

Questioning my own question: what's the reason these would not just be a couple of new pytest tests? The detailed console output? If so, I think pytest could be run with `-s` to see all regular output.
I think that would make a good ticket for me to work on.
> What's the reason these would not just be a couple of new pytest tests? The detailed console output?
There are a number of parameters you might want to tweak when running a benchmark, like how many iterations to run, or which database to use. It seems unnecessarily complicated to pass those parameters in through pytest.
But (additionally) running the benchmark in pytest is valuable to make sure that the benchmark is kept up to date with any trinity API changes. I'm imagining a test as simple as:
```python
def test_benchmarking_is_stable():
    run_benchmark(iterations=3)
    # yay, didn't crash
```
> The benchmarking aspect has nothing to do with the runtime performance but is really all about ensuring certain test scenarios still work.
The test is to make sure the benchmark completes without errors. The benchmarking is about runtime performance.
But also this is my loose interpretation of where Piper was going, so all of this is only IMO.
Maybe clearer to talk about this in two separate components.
- A benchmarking script (maybe in `./scripts/benchmark.py`) that could be run with `python scripts/benchmark.py` and which outputs benchmarking results to stdout. Doesn't need to be fancy, doesn't need CLI args or anything like that really for MVP.
- A new test suite which imports and runs the benchmarks as part of CI to ensure that the benchmarking scripts are maintained in working order.
Benchmarks from the script are a noisy signal that we can use to make rough comparisons of performance changes as well as a really noisy data set for performance over time since we can pull data from historical CI runs.
I think this is equivalent to what @carver wrote but in my own words.
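A minimal sketch of that split, assuming a `run_benchmarks` entry point in `scripts/benchmark.py` (the function, the `iterations` parameter, and the scenario names are illustrative placeholders, not existing code):

```python
# scripts/benchmark.py (illustrative sketch; names are assumptions)

def bench_empty_blocks(iterations):
    # Placeholder for the "chain of X empty blocks" scenario from the checklist.
    pass


BENCHMARK_SCENARIOS = {
    'empty-blocks': bench_empty_blocks,
}


def run_benchmarks(iterations=100):
    # Imported and called by the CI test suite; running the module directly
    # prints results to stdout.
    for name, scenario in BENCHMARK_SCENARIOS.items():
        print('running benchmark:', name)
        scenario(iterations)


if __name__ == '__main__':
    run_benchmarks()
```

The CI side would then be essentially the `test_benchmarking_is_stable` example above, importing and calling `run_benchmarks` with a small iteration count.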
Got it. I'd be happy to give that a try. The stuff I'll learn from putting the pieces together will greatly help me for the documentation tasks. Killing two birds with one stone.
Let me play the blockchain noob again. When you say
> For each set of mainnet VM rules, build a chain of X empty blocks
What different kinds of sets are we talking about? Isn't the only right set, as of now, the following:
https://github.com/ethereum/py-evm/blob/00e494c00841e4c670fa9041cce72271c4d82355/evm/chains/mainnet/__init__.py#L22-L28
Do you mean we need to build chains as they were back at the Frontier fork, at the Homestead fork, etc.? If that is what is meant, my follow-up question would be: why would we care about standalone configs for these past forks if the current mainnet chain incorporates these VMs anyway?
```python
class OnlyFrontierChain(Chain):
    vm_configuration = (
        (0, FrontierVM),
    )
```
Simply a chain with only a single VM in its configuration.
> Why would we care about standalone configs for these past forks if the current mainnet chain incorporates these VMs anyway?
Also worth pointing out that this is the reason for the Configurable API.
Instead of having to define all of these classes just for testing, you can do this:
```python
@pytest.fixture(params=[FrontierVM, HomesteadVM, ...])
def chain_for_testing(request):
    return Chain.configure(vm_configuration=((0, request.param),))
```
This allows dynamically creating alternately configured chain classes for dynamic scenarios like testing.
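The same pattern works outside of pytest, e.g. for the benchmark script's loop over forks; a minimal sketch, assuming the VM classes are importable from `evm.vm.forks` and `Chain` from `evm.chains.base` (import paths may differ across py-evm versions):

```python
from evm.chains.base import Chain            # import paths assumed
from evm.vm.forks import FrontierVM, HomesteadVM

# Build one dynamically configured chain class per fork; each one can then
# be driven through the same benchmark scenario.
for vm_class in (FrontierVM, HomesteadVM):
    chain_class = Chain.configure(vm_configuration=((0, vm_class),))
    print(vm_class.__name__, chain_class)
```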
> Why would we care about standalone configs for these past forks if the current mainnet chain incorporates these VMs anyway?
Ah, we're benchmarking any subtle differences in the VM rules and how they affect performance. Each VM may have different logic for importing and finalizing blocks, so we want to be able to see if one set of rules is majorly affecting performance. Without testing each VM in isolation we'd likely only be hitting either the earliest or latest VM rules, not realizing that some recent change caused the HomesteadVM to take a 10x performance hit.
> Without testing each VM in isolation we'd likely only be hitting either the earliest or latest VM rules, not realizing that some recent change caused the HomesteadVM to take a 10x performance hit.
Nails it. Thanks!
@cburgdorf I added one more suite, which might be the most important one: measuring gas per second for importing blocks.
@pipermerriam makes perfect sense. I initially planned to have a super early draft PR open this week but didn't manage in time. Now I'll be gone from Monday so I have to delay.
@pipermerriam I started hacking something together but it's really just a dirty pile of :hankey: at this point. Just opening the PR for the illusion of progress ;)
https://github.com/ethereum/py-evm/pull/650
That is some good looking 💩
I re-labeled this as "Good First Issue". The groundwork for this is done and it's a matter of adding further benchmarks here.
I'm happy to provide guidance as best I can.
@williambannas Might also be interesting to you?
@cburgdorf I'd love to help out with this issue, thank you.
> Ensure blocks use pre-generated fixtures with real blocks that satisfy the PoW, so that we don't skip benchmarking the PoW validation/verification overhead.
Is this already satisfied? In my understanding it is, because `chain.mine_block()` is called, which would do the validation/verification (I think, after going through the code).
Or should there be another benchmark that mines empty blocks with PoW, including the nonce and the `mix_hash`?
@williambannas
> Is this already satisfied?
It is not. We currently disable verification as seen here:
https://github.com/ethereum/py-evm/blob/8335f32888a1429f39787e236ead27e23ae92e0d/scripts/benchmark/utils/chain_plumbing.py#L87
This allows us to call `mine_block()` to basically anchor a block in the chain that, under real-world conditions, would not be accepted because it doesn't satisfy the proof-of-work algorithm.
The idea would be to create fixture files with real blocks that do satisfy the PoW so that we don't need to disable the verification.
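As a rough sketch of what such a bypass looks like with the `Configurable` API (the `validate_seal` hook name is an assumption; the actual code behind the link above may differ):

```python
# Illustrative only: configure a VM class whose seal check is a no-op, so
# mine_block() will accept blocks that don't meet the PoW difficulty.
from evm.chains.base import Chain            # import paths assumed
from evm.vm.forks import FrontierVM

NoPowFrontierVM = FrontierVM.configure(
    validate_seal=classmethod(lambda *args: None),  # assumed hook name
)

NoPowChain = Chain.configure(vm_configuration=((0, NoPowFrontierVM),))
```

With pre-generated fixture blocks that really do satisfy the PoW, this override would no longer be necessary and the full verification path could be benchmarked.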
I think we'd benefit from benchmarking both the fast sync and state sync operations. I haven't spent much time thinking about what exactly we should benchmark here, but I think it would be valuable for us to have a good understanding of what the theoretical maximum throughput of those services can be.
cc @gsalgado
Off the top of my head:
- State sync
  - For each of 1-n peer connections, all of which have the desired state trie, how fast can we sync as measured by trie-nodes-per-second?
- Fast sync
  - For each of 1-n peer connections, all of which have the chain data we want, how fast can we sync as measured by blocks/s and transactions/s, using heavy blocks full of basic value transfer transactions?
This is probably done; just need to run through all the proposed benchmarks to see if any are missing (and split them out into new issues).