py-evm
feat: improve py-evm performance
What was wrong?
How was it fixed?
grab bag of optimizations. profiling (using pypy as the runtime) showed that hotspots included:

- functools.__call__ (presumably from functools.partial and functools.wraps)
- apply_computation
- expand_memory
- stack operations
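For context, the thread doesn't include the exact profiling invocation; a generic pattern for surfacing hotspots like these (an assumption about methodology, not the author's exact setup) is to run cProfile over a representative workload under whichever runtime you're targeting:

```python
# run under pypy3 or cpython; substitute a real workload such as the
# benchmark scripts further down this thread
import cProfile
import pstats

def workload() -> None:
    # stand-in for repeated EVM message application
    sum(i * i for i in range(1_000_000))

cProfile.run("workload()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)
```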
to that end, I applied the following optimizations:
- write more inlineable wrappers for DUP/SWAP/PUSH instructions
- write more inlineable ceil_XX implementations, replacing them with a branchless implementation (see the sketch after this list)
- optimize the Stack data structure: instead of storing (type, value) pairs in the list, simply store the values (the type is already tagged in the python runtime anyway)
- optimize the apply_computation event loop
- remove usage of the CodeStream iterator, inlining the pc logic
- remove __enter__ and __exit__, replacing them with an inlined try/finally
- a minor peephole optimization in expand_memory (not sure if it helped much)
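For illustration, a minimal sketch of the branchless ceil_XX idea (my reconstruction; not necessarily the exact code in the PR):

```python
def ceil32_branchy(value: int) -> int:
    # original shape: branch on whether value is already a multiple of 32
    remainder = value % 32
    return value if remainder == 0 else value + 32 - remainder

def ceil32_branchless(value: int) -> int:
    # branchless: round up to the next multiple of 32 via masking
    return (value + 31) & ~31

# the two agree on all non-negative inputs
assert all(ceil32_branchy(n) == ceil32_branchless(n) for n in range(4096))
```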
some of these optimizations are simple rewrites (e.g. ceil32 rewrite and stack data structure), but others (e.g. inlining CodeStream.__iter__) may break some abstractions.
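To make the "inline pc logic" point concrete, here is a toy dispatch loop (deliberately not py-evm's actual API) that indexes raw bytecode and advances the program counter itself, instead of going through an iterator abstraction like CodeStream.__iter__:

```python
from typing import List

# toy opcodes, just enough to show the shape of the loop
STOP, ADD, PUSH1 = 0x00, 0x01, 0x60

def run(raw_code: bytes) -> List[int]:
    stack: List[int] = []
    pc = 0
    while pc < len(raw_code):
        op = raw_code[pc]
        pc += 1  # pc advanced inline, no iterator protocol overhead
        if op == PUSH1:
            stack.append(raw_code[pc])  # read the immediate, advance again
            pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == STOP:
            break
    return stack

assert run(bytes([PUSH1, 2, PUSH1, 3, ADD, STOP])) == [5]
```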
execution time came down 25% in local timings; i will try to make the timings reproducible.
```console
$ pypy3 --version
Python 3.8.13 (7.3.9+dfsg-1, Apr 01 2022, 21:41:47)
[PyPy 7.3.9 with GCC 11.2.0]
```
final thoughts -- according to profiling, this covers most of the low-hanging fruit. better performance would probably need to come from deeper redesigns, including how opcodes are constructed / looked up and analyzed (analysis of opcodes for runtime hotspots could be very fruitful, for instance)
Todo:
- [ ] Clean up commit history
- [ ] reproduce timings
- [ ] record profiling results
- [ ] Add entry to the release notes
Cute Animal Picture
paging @kclowes and @fselmo for initial feedback on the approach. i was thinking we will probably want to eventually split this into two PRs: one with the changes that are relatively simple, and one with the changes that may break abstractions / downstream code.
Hey @charles-cooper, thanks for the submission! 👍. I haven't had time to poke around too much in here yet but I see some failing core tests. Did you have any issues there? On a quick scan, looks like there's a method instance being returned in some cases where an integer result is expected. If you're expecting some tests to fail let us know too but otherwise it would be nice to address those first. I can try to take a peek here soon. Thanks again!
> profiling (using pypy as the runtime)

Also, it will be important to profile using standard cpython, because it's possible that an optimization in pypy creates a slowdown in cpython.
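One minimal pattern for that kind of cross-runtime check (my suggestion, not something from this thread) is to run the identical micro-benchmark under both interpreters and compare:

```python
# run this file twice, e.g. `python3 bench.py` and `pypy3 bench.py`
import platform
import timeit

def ceil32_branchy(n: int) -> int:
    return n if n % 32 == 0 else n + 32 - (n % 32)

def ceil32_branchless(n: int) -> int:
    return (n + 31) & ~31

for fn in (ceil32_branchy, ceil32_branchless):
    elapsed = timeit.timeit(lambda: fn(1_000_001), number=1_000_000)
    print(platform.python_implementation(), fn.__name__, f"{elapsed:.3f}s")
```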
i reverted the changes to the Computation.apply_message routines (which involved breaking some abstractions) and just left the other improvements which didn't involve architectural changes.
@charles-cooper, do you have examples of the benchmarking you used anywhere? I'm not seeing anything significant on the test suite (pypy or cpython) but I'm also not sure that running pytest would be the best measure. It would be nice to have whatever you used to benchmark as a reference here. Or even more context on how you may have measured the optimizations.
With some quick testing on my side I'm not seeing much, negative or positive, on the cpython side of things... so if it indeed does optimize for pypy then these might still be some good changes to bring in.
@fselmo here's a benchmark using titanoboa (I used v0.1.6, because v0.1.7 and master require python>=3.10, and pypy has not yet released support for python 3.10):
```console
(pypy3.9) ~/titanoboa $ python --version
Python 3.9.16 (7.3.11+dfsg-1~ppa1~ubuntu22.04, Dec 30 2022, 13:49:46)
[PyPy 7.3.11 with GCC 11.3.0]
(pypy3.9) ~/titanoboa $ python time_optimizations.py
overall took 1.0565147399902344s
average op time: 36.388879933534284us
(pypy3.9) ~/titanoboa $ PYTHONPATH=~/src-references/py-evm/ python time_optimizations.py  # optimizations branch
overall took 0.9218192100524902s
average op time: 31.749645589739277us
```
```python
# time_optimizations.py
#!/usr/bin/env python
import boa
import time

test_code = """
@external
def foo() -> uint256:
    x: uint256 = 0
    for i in range(1_000):
        x += i
    return x
"""

c = boa.loads(test_code)

t0 = time.time()
NUM_RUNS = 100
for i in range(NUM_RUNS):
    c.foo()
t1 = time.time()

print(f"overall took {t1 - t0}s")

opcode_trace = c._computation.code._trace
avg_op_time = (t1 - t0) / len(opcode_trace)
avg_op_time_us = avg_op_time * 1e6
print(f"average op time: {avg_op_time_us}us")
```
after removing the changes to the apply_computation() event loop, and applying some changes required by mypy (e.g. _FastOpcode needing to implement OpcodeAPI, which caused a 1-3% perf hit), the net improvement in this PR is around 14%.
i can also benchmark with python3.11, where i think there will be some gains too, but there is a dependency conflict here (titanoboa==0.1.6 depends on eth-abi==3.0.1, which itself does not work with python3.11), so i may have to construct a different benchmark.

with python3.11 the gain looks smaller, around 2%, but it is stable across several runs. here's the output of one run:
```console
(python3.11) ~/titanoboa $ python time_optimizations2.py
overall took 4.019573926925659s
(python3.11) ~/titanoboa $ PYTHONPATH=~/src-references/py-evm/ python time_optimizations2.py
overall took 3.9480371475219727s
```
```python
# time_optimizations2.py
import time

from eth.chains.mainnet import MainnetChain
import eth.tools.builder.chain as chain
from eth.db.atomic import AtomicDB
import eth.constants as constants
from eth.vm.transaction_context import BaseTransactionContext
from eth.vm.message import Message

GENESIS_PARAMS = {"difficulty": constants.GENESIS_DIFFICULTY, "gas_limit": int(1e8)}
_Chain = chain.build(MainnetChain, chain.latest_mainnet_at(1)).from_genesis(
    AtomicDB(), GENESIS_PARAMS
)
vm = _Chain.get_vm()

tx_ctx = BaseTransactionContext(origin=constants.ZERO_ADDRESS, gas_price=0)
msg = Message(
    sender=constants.ZERO_ADDRESS,
    to=constants.ZERO_ADDRESS,
    gas=int(1e7),
    value=0,
    # compiled from:
    # @external
    # def foo() -> uint256:
    #     x: uint256 = 0
    #     for i in range(1_000):
    #         x += i
    #     return x
    code=b'`\x036\x11a\x00\x0cWa\x00ZV[_5`\xe0\x1c4a\x00^Wc\xc2\x98Ux\x81\x18a\x00XW_`@R_a\x03\xe8\x90[\x80``R`@Q``Q\x80\x82\x01\x82\x81\x10a\x00^W\x90P\x90P`@R`\x01\x01\x81\x81\x18a\x00+WPP``@\xf3[P[__\xfd[_\x80\xfd\xa1evyper\x83\x00\x03\t\x00\x0b',
    data=b'\xc2\x98Ux',  # method_id("foo()")
)

t0 = time.time()
NUM_RUNS = 100
for i in range(NUM_RUNS):
    vm.state.computation_class.apply_message(vm.state, msg, tx_ctx)
t1 = time.time()

print(f"overall took {t1 - t0}s")
```
https://github.com/ethereum/py-evm/pull/2076/commits/a5f6af9010bcb15d06c44e0efbee13ee2d617bc9 fixed a hotspot in memory.write, where it was looping and writing one byte of memory at a time - the commit uses native slicing notation to batch the write. running the above benchmarks shows an additional 5% improvement(!) for both pypy and cpython
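A minimal sketch of that change (assumed shape; py-evm's actual Memory class differs in detail):

```python
class Memory:
    """Toy EVM-style memory backed by a bytearray."""

    def __init__(self, size: int = 1024) -> None:
        self._bytes = bytearray(size)

    def write_slow(self, start: int, value: bytes) -> None:
        # hotspot shape: one Python-level assignment per byte
        for idx, byte in enumerate(value):
            self._bytes[start + idx] = byte

    def write(self, start: int, value: bytes) -> None:
        # batched: a single slice assignment copies at C speed
        self._bytes[start : start + len(value)] = value

mem = Memory()
mem.write(0, b"\xde\xad\xbe\xef")
assert bytes(mem._bytes[:4]) == b"\xde\xad\xbe\xef"
```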