Improving representative benchmarks for typing ecosystem
Due to the current lack of representative macrobenchmarks, it is very difficult to decide whether complex accelerators for some parts of typing are worth implementing in the future. Hence, I'm trying to upstream some benchmarks into pyperformance.
IMO, there are three main areas:
- Performance of static type checkers implemented in Python (e.g. mypy). (Fixed by #102)
- Performance of programs using types at runtime (e.g. pydantic, attrs, etc.).
- Runtime overhead of typed code vs fully untyped code.
For case 2, I plan to use one of pydantic's benchmarks here: https://github.com/samuelcolvin/pydantic/tree/master/benchmarks, installed without compiled binaries.
Case 3 is very tricky because there are so many ways to use typing. I don't know how often people use certain features, whether they type-hint inside tight loops, etc. So I'm struggling to find a good benchmark. One idea that may work: grab one of the existing pyperformance benchmarks, fully type-hint it, then compare the performance delta.
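For illustration, a minimal sketch (my own, not an existing benchmark) of what such a comparison could look like using pyperf, the harness pyperformance benchmarks are built on. Since annotations are evaluated once at definition time, the per-call numbers would likely be near-identical; the interesting deltas would come from import/definition time and from runtime uses of typing:

```python
# Hypothetical sketch: timing an annotated vs. unannotated function with pyperf.
import pyperf

def untyped_fib(n):
    return n if n < 2 else untyped_fib(n - 1) + untyped_fib(n - 2)

def typed_fib(n: int) -> int:
    return n if n < 2 else typed_fib(n - 1) + typed_fib(n - 2)

runner = pyperf.Runner()
runner.bench_func("untyped fib(12)", untyped_fib, 12)
runner.bench_func("typed fib(12)", typed_fib, 12)
```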
CC @JelleZijlstra, I would greatly appreciate hearing your opinion on this (especially for case 3). Maybe I can post this on typing-sig too if I need more help.
Afterword: all three cases benefit from general CPython optimizations, but usually only case 3 benefits greatly from typing-module-only optimizations (case 1 may not improve much, if at all, depending on the implementation).
Here are some thoughts:
- mypy itself is actually also a good example of a fully typed Python codebase that's friendly to benchmarking. Comparing its performance with all types stripped out (but still in pure Python mode, no mypyc) could be interesting.
- I think there's a lot of variation in how people use typing. For example, my company's codebase very heavily uses NewTypes as annotations but barely uses generics, whereas mypy doesn't use NewTypes internally but uses a lot of generics.
- Here are some aspects of static typing that could plausibly affect runtime performance (a couple of illustrative sketches follow this list):
  - Import time cost (in both speed and memory) of evaluating lots of annotations. This was a major part of the motivation for PEP 563 and PEP 649. The latter is still under consideration by the SC, so benchmarks could help inform a decision.
  - Instantiation of generic classes. I remember this was especially an issue early on but I think Ivan (?) made some fixes later in the 3.x series. (So `class X(Generic[T]): ...` made `X()` slow.) This would be a good benchmark to add.
  - `cast()` and `NewType()`, two of the few parts of typing you'd actually execute at runtime. These are identity functions though, so there's not too much to optimize other than implementing them in C. I guess we could implement `typing.cast` in C now that we did the same for `NewType.__call__`. `cast()` is fairly common in mypy's codebase.
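On the first point, a rough sketch (my own, assuming Python 3.9+ for the built-in generic syntax) of how the definition-time cost of eagerly evaluated annotations could be measured against PEP 563-style deferred annotations:

```python
# Hypothetical micro-benchmark: cost of evaluating annotations at function
# definition time, eagerly vs. deferred via `from __future__ import annotations`.
import timeit

SRC = "def f(a: int, b: list[dict[str, int]]) -> dict[str, int]: ...\n"
eager = compile(SRC, "<eager>", "exec")
deferred = compile("from __future__ import annotations\n" + SRC, "<deferred>", "exec")

# With the future import, annotations are stored as strings and never evaluated.
print("eager:   ", timeit.timeit(lambda: exec(eager, {}), number=50_000))
print("deferred:", timeit.timeit(lambda: exec(deferred, {}), number=50_000))
```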
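And on the second point, a minimal sketch of a generic-instantiation micro-benchmark (again my own, not from this thread):

```python
# Hypothetical micro-benchmark: instantiating a plain class vs. a Generic subclass.
import timeit
from typing import Generic, TypeVar

T = TypeVar("T")

class Plain:
    pass

class Gen(Generic[T]):
    pass

GenInt = Gen[int]  # build the parameterized alias once; calling it wraps Gen()

print("plain:        ", timeit.timeit(Plain, number=1_000_000))
print("generic:      ", timeit.timeit(Gen, number=1_000_000))
# Calling the alias goes through typing's __call__ wrapper,
# which also sets __orig_class__ on the new instance.
print("parameterized:", timeit.timeit(GenInt, number=1_000_000))
```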
Thanks! I realized a few holes in my own ideas and your comments gave me a lot of food for thought.
> mypy itself is actually also a good example of a fully typed Python codebase that's friendly to benchmarking. Comparing its performance with all types stripped out (but still in pure Python mode, no mypyc) could be interesting.
I contemplated stripping all annotations from mypy, but I'm unsure of how to get rid of the non-annotation stuff too (like `cast`, `NewType`, `Protocol`, etc.). My goal is to bench a clean vanilla program vs. fully type-hinted code (annotations plus the other typing constructs that won't improve much due to PEP 563/649), so that we can have a good guide on how much slower a piece of code with thorough typing will run.
> I'm unsure of how to get rid of the non-annotation stuff too (like `cast`, `NewType`, `Protocol`, etc.).
I haven't tried this, but it may be feasible to do this with something like a LibCST codemod: `cast(a, b)` gets replaced with `b`, for example.
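As a rough illustration of that idea (my own sketch, assuming `cast` is imported by its bare name rather than used as `typing.cast`):

```python
# Hypothetical LibCST transformer that rewrites `cast(T, expr)` to `expr`.
import libcst as cst

class StripCast(cst.CSTTransformer):
    def leave_Call(
        self, original_node: cst.Call, updated_node: cst.Call
    ) -> cst.BaseExpression:
        func = updated_node.func
        if (
            isinstance(func, cst.Name)
            and func.value == "cast"
            and len(updated_node.args) == 2
        ):
            # Keep only the second argument: the runtime value.
            return updated_node.args[1].value
        return updated_node

source = "from typing import cast\nx = cast(int, get_value())\n"
print(cst.parse_module(source).visit(StripCast()).code)
# x = cast(int, get_value())  ->  x = get_value()
# (removing the now-unused import would be a separate pass)
```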
One area you wouldn't be able to get rid of would be `isinstance` checks on `@runtime_checkable` protocols. That's another good area to benchmark.
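For example, a quick sketch of what such a benchmark could measure (names are illustrative): a structural protocol check vs. an ordinary class check.

```python
# Hypothetical micro-benchmark: isinstance() against a runtime_checkable
# Protocol (a structural check) vs. against a concrete class.
import timeit
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsClose(Protocol):
    def close(self) -> None: ...

class File:
    def close(self) -> None:
        pass

f = File()
print("protocol:", timeit.timeit(lambda: isinstance(f, SupportsClose), number=100_000))
print("class:   ", timeit.timeit(lambda: isinstance(f, File), number=100_000))
```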
> so that we can have a good guide on how much slower a piece of code with thorough typing will run
I'm not sure that's really a realistic idea. At runtime, typing mostly does nothing, so the performance effect of adding types is really going to depend on which pieces of typing you use.
FYI, I've merged my PR that allows running benchmarks that aren't part of pyperformance. I also have a PR up against the Pyston benchmarks repo to allow them to be run using pyperformance: https://github.com/pyston/python-macrobenchmarks/pull/3.