Add benchmarking test suite.
Currently we don't have good benchmarking tools.
Let's write a script that does the following.
- ~~For each set of mainnet VM rules, build a chain of X empty blocks.~~ ✅
  - ~~one using `mine_block`~~ ✅
  - ~~another using `import_block`~~ ✅
- ~~For each set of mainnet VM rules, build a chain of X blocks, each full of simple value transfer transactions.~~ ✅
  - ~~`sender` should be static across all tests~~ ✅
  - ~~one run using empty `to` accounts.~~ ✅
  - ~~one run using existing `to` accounts.~~ ✅
- For each set of mainnet VM rules, benchmark using the open zeppelin ERC20 token contract.
  - one run with X calls to `transfer`
  - one run with X calls to `approve`
  - one run with X calls to `transferFrom`
- ~~For each set of mainnet VM rules, benchmark deployment of the open zeppelin ERC20 token contract.~~ ✅
- For each set of mainnet VM rules, benchmark DOS style transaction spam.
  - Make X calls to a contract which uses all available gas to create empty contracts.
  - Make X calls to a contract which uses all available gas on SSTORE operations.
  - Two more runs, same as above, but which revert at the end of contract execution.
- ~~For each set of mainnet VM rules, benchmark gas per second for block import.~~ ✅
  - ~~Measure for pure value transfer transactions.~~ ✅
  - ~~Measure for representative EVM execution (which will have to be an approximation; probably worth starting with just ERC20 contracts and transfers and then expanding from there).~~ ✅
- Ensure blocks use pre-generated fixtures with real blocks that satisfy the PoW, so that we don't skip benchmarking the PoW validation/verification overhead.
- Have at least one benchmark use `LevelDB` rather than `MemoryDB` (ideally all of them, but that may be too big of a performance hit).
These benchmark scripts should produce detailed output to the console and be added as a stand-alone test run in our CI environment. Any failing benchmark scripts should cause CI to fail.
Stats worth tracking:
- time to run
- total DB size
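As a rough sketch of how those two stats could be captured around each benchmark run (the `report` helper and its arguments here are illustrative placeholders, not an existing py-evm API):

```python
import os
import time


def directory_size(path):
    # Sum the on-disk size of all files under a (LevelDB) database directory.
    total = 0
    for root, _, files in os.walk(path):
        total += sum(os.path.getsize(os.path.join(root, name)) for name in files)
    return total


def report(name, fn, db_path=None):
    # Time a single benchmark callable and print the stats listed above.
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    db_size = directory_size(db_path) if db_path else 0
    print("{}: {:.3f}s elapsed, {} bytes of DB".format(name, elapsed, db_size))
```

Each scenario in the checklist above would then just be a callable handed to such a helper.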
> Any failing benchmark scripts should cause CI to fail.
Where passing presumably means both correctness and within some tolerance of a performance target.
> Where passing presumably means both correctness and within some tolerance of a performance target.
I don't think so. That statement was more to ensure that the benchmarks remain functional through code changes and merges so they don't rot. I don't think it's reasonable to try and code in performance targets because there are going to be big differences depending on whether things are running in Travis, on your local machine, etc.
Just to make sure I got it right:
- These would most likely be standalone scripts not using `pytest`.
- The benchmarking aspect has nothing to do with the runtime performance but is really all about ensuring certain test scenarios still work.

Questioning my own question: what's the reason these would not just be a couple of new pytest tests? The detailed console output? If so, I think pytest could be run with `-s` to see all regular output.
I think that would make a good ticket for me to work on.
> What's the reason these would not just be a couple of new pytest tests? The detailed console output?
There are a number of parameters you might want to tweak when running a benchmark, like how many iterations to run, or which database to use. It seems unnecessarily complicated to pass those parameters in through pytest.
But (additionally) running the benchmark in pytest is valuable to make sure that the benchmark is kept up to date with any trinity API changes. I'm imagining a test as simple as:
```python
def test_benchmarking_is_stable():
    run_benchmark(iterations=3)
    # yay, didn't crash
```
> The benchmarking aspect has nothing to do with the runtime performance but is really all about ensuring certain test scenarios still work.
The test is to make sure the benchmark completes without errors. The benchmarking is about runtime performance.
But also this is my loose interpretation of where Piper was going, so all of this is only IMO.
Maybe clearer to talk about this in two separate components.
- A benchmarking script (maybe in `./scripts/benchmark.py`) that could be run with `python scripts/benchmark.py` and which outputs benchmarking results to stdout. Doesn't need to be fancy, doesn't need CLI args or anything like that really for MVP.
- A new test suite which imports and runs the benchmarks as part of CI to ensure that the benchmarking scripts are maintained in working order.
Benchmarks from the script are a noisy signal that we can use to make rough comparisons of performance changes as well as a really noisy data set for performance over time since we can pull data from historical CI runs.
I think this is equivalent to what @carver wrote but in my own words.
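A minimal sketch of that split, assuming a `run_benchmarks` entry point in `scripts/benchmark.py` (the function, the `iterations` parameter, and the scenario names are illustrative placeholders, not existing code):

```python
# scripts/benchmark.py (illustrative sketch; names are assumptions)

def bench_empty_blocks(iterations):
    # Placeholder for the "chain of X empty blocks" scenario from the checklist.
    pass


BENCHMARK_SCENARIOS = {
    'empty-blocks': bench_empty_blocks,
}


def run_benchmarks(iterations=100):
    # Imported and called by the CI test suite; running the module directly
    # prints results to stdout.
    for name, scenario in BENCHMARK_SCENARIOS.items():
        print('running benchmark:', name)
        scenario(iterations)


if __name__ == '__main__':
    run_benchmarks()
```

The CI side would then be essentially the `test_benchmarking_is_stable` example above, importing and calling `run_benchmarks` with a small iteration count.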
Got it. I'd be happy to give that a try. The stuff I'll learn from putting the pieces together will greatly help me for the documentation tasks. Killing two birds with one stone.
Let me play the blockchain noob again. When you say
> For each set of mainnet VM rules, build a chain of X empty blocks
What different kinds of sets are we talking about? Isn't the only right set, as of now, the following:
https://github.com/ethereum/py-evm/blob/00e494c00841e4c670fa9041cce72271c4d82355/evm/chains/mainnet/__init__.py#L22-L28
Do you mean we need to build chains as they were back at the Frontier fork, at the Homestead fork, etc.? If that is what is meant, my follow-up question would be: why would we care about standalone configs for these past forks if the current mainnet chain incorporates these VMs anyway?
```python
class OnlyFrontierChain(Chain):
    vm_configuration = (
        (0, FrontierVM),
    )
```
Simply a chain with only a single VM in its configuration.
> Why would we care about standalone configs for these past forks if the current mainnet chain incorporates these VMs anyway?
Also worth pointing out that this is the reason for the Configurable API.
Instead of having to define all of these classes just for testing, you can do this:
```python
@pytest.fixture(params=[FrontierVM, HomesteadVM, ...])
def chain_for_testing(request):
    return Chain.configure(vm_configuration=((0, request.param),))
```
This allows dynamically creating alternately configured chain classes for dynamic scenarios like testing.
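The same pattern works outside of pytest, e.g. for the benchmark script's loop over forks; a minimal sketch, assuming the VM classes are importable from `evm.vm.forks` and `Chain` from `evm.chains.base` (import paths may differ across py-evm versions):

```python
from evm.chains.base import Chain            # import paths assumed
from evm.vm.forks import FrontierVM, HomesteadVM

# Build one dynamically configured chain class per fork; each one can then
# be driven through the same benchmark scenario.
for vm_class in (FrontierVM, HomesteadVM):
    chain_class = Chain.configure(vm_configuration=((0, vm_class),))
    print(vm_class.__name__, chain_class)
```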
> Why would we care about standalone configs for these past forks if the current mainnet chain incorporates these VMs anyway?
Ah, we're benchmarking any subtle differences in the VM rules and how they affect performance. Each VM may have different logic for importing and finalizing blocks, so we want to be able to see if one set of rules is majorly affecting performance. Without testing each VM in isolation we'd likely only be hitting either the earliest or latest VM rules, not realizing that some recent change caused the HomesteadVM to take a 10x performance hit.
> Without testing each VM in isolation we'd likely only be hitting either the earliest or latest VM rules, not realizing that some recent change caused the HomesteadVM to take a 10x performance hit.
Nails it. Thanks!
@cburgdorf I added one more suite, which might be the most important one: measuring gas per second for importing blocks.
@pipermerriam makes perfect sense. I initially planned to have a super early draft PR open this week but didn't manage in time. Now I'll be gone from Monday so I have to delay.
@pipermerriam I started hacking something together but it's really just a dirty pile of :hankey: at this point. Just opening the PR for the illusion of progress ;)
https://github.com/ethereum/py-evm/pull/650
That is some good looking 💩
I re-labeled this as "Good First Issue". The groundwork for this is done and it's a matter of adding further benchmarks here.
I'm happy to provide guidance as best I can.
@williambannas Might also be interesting to you?
@cburgdorf I'd love to help out with this issue, thank you.
> Ensure blocks use pre-generated fixtures with real blocks that satisfy the PoW, so that we don't skip benchmarking the PoW validation/verification overhead.
Is this already satisfied? In my understanding it is, because `chain.mine_block()` is called, which would do the validation/verification (I think, after going through the code).
Or should there be another benchmark that mines empty blocks with PoW, including the nonce and the `mix_hash`?
@williambannas
> Is this already satisfied?
It is not. We currently disable verification as seen here:
https://github.com/ethereum/py-evm/blob/8335f32888a1429f39787e236ead27e23ae92e0d/scripts/benchmark/utils/chain_plumbing.py#L87
This allows us to call `mine_block()` to basically anchor a block in the chain that, under real-world conditions, would not be accepted because it doesn't satisfy the proof-of-work algorithm.
The idea would be to create fixture files with real blocks that do satisfy the PoW so that we don't need to disable the verification.
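As a rough sketch of what such a bypass looks like with the `Configurable` API (the `validate_seal` hook name is an assumption; the actual code behind the link above may differ):

```python
# Illustrative only: configure a VM class whose seal check is a no-op, so
# mine_block() will accept blocks that don't meet the PoW difficulty.
from evm.chains.base import Chain            # import paths assumed
from evm.vm.forks import FrontierVM

NoPowFrontierVM = FrontierVM.configure(
    validate_seal=classmethod(lambda *args: None),  # assumed hook name
)

NoPowChain = Chain.configure(vm_configuration=((0, NoPowFrontierVM),))
```

With pre-generated fixture blocks that really do satisfy the PoW, this override would no longer be necessary and the full verification path could be benchmarked.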
I think we'd benefit from benchmarking both the fast sync and state sync operations. I haven't spent much time thinking about what exactly we should benchmark here, but I think it would be valuable for us to have a good understanding of what the theoretical maximum throughput of those services can be.
cc @gsalgado
Off the top of my head:
- State sync
  - For each of 1-n peer connections, all of which have the desired state trie, how fast can we sync as measured by trie-nodes-per-second?
- Fast sync
  - For each of 1-n peer connections, all of which have the chain data we want, how fast can we sync as measured by blocks/s and transactions/s, using heavy blocks full of basic value transfer transactions?
This is probably done; just need to run through all the proposed benchmarks to see if any are missing (and split them out into new issues).