example generation speed regression between `6.47.0` -> `6.103.1`
Hello, we're interested in using the latest version of Hypothesis, but noticed a significant increase in example generation time between these versions.
I'm filing an issue because I've seen a similar regression reported in the past, and it was addressed with code fixes that sped performance back up: https://github.com/HypothesisWorks/hypothesis/issues/2308
pinning to 6.47.0:
batman@workstation-shane:~/openpilot/selfdrive/car/tests$ pytest -n8 test_car_interfaces.py
/home/batman/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Benchmarks are automatically disabled because xdist plugin is active.Benchmarks cannot be performed reliably in a parallelized environment.
warner(PytestBenchmarkWarning(text))
Test session starts (platform: linux, Python 3.11.4, pytest 8.2.1, pytest-sugar 1.0.0)
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=4244670443
rootdir: /home/batman/openpilot
configfile: pyproject.toml
plugins: timeout-2.3.1, xdist-3.6.1, cpp-2.5.0, cov-5.0.0, mock-3.14.0, forked-1.6.0, hypothesis-6.47.0, dash-2.11.1, benchmark-4.0.0, sugar-1.0.0, randomly-3.15.0, subtests-0.12.1, flaky-3.8.1, asyncio-0.23.7, anyio-4.4.0, nbmake-1.5.3, repeat-0.9.3
asyncio: mode=Mode.STRICT
8 workers [209 items] collecting ...
selfdrive/car/tests/test_car_interfaces.py ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 56% █████▋
✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 100% ██████████
============================================================================ slowest 10 durations =============================================================================
1.16s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_028_FORD_F_150_LIGHTNING_MK1
1.08s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_030_FORD_MAVERICK_MK1
1.06s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_024_FORD_BRONCO_SPORT_MK1
1.03s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_025_FORD_ESCAPE_MK4
1.03s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_031_FORD_MUSTANG_MACH_E_MK1
1.02s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_029_FORD_F_150_MK14
0.99s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_026_FORD_EXPLORER_MK6
0.98s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_074_HYUNDAI_IONIQ_HEV_2022
0.96s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_027_FORD_FOCUS_MK4
0.95s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_151_SKODA_OCTAVIA_MK3
Results (22.66s):
209 passed
on latest 6.103.1:
batman@workstation-shane:~/openpilot/selfdrive/car/tests$ pytest -n8 test_car_interfaces.py
/home/batman/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Benchmarks are automatically disabled because xdist plugin is active.Benchmarks cannot be performed reliably in a parallelized environment.
warner(PytestBenchmarkWarning(text))
Test session starts (platform: linux, Python 3.11.4, pytest 8.2.1, pytest-sugar 1.0.0)
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=264861645
rootdir: /home/batman/openpilot
configfile: pyproject.toml
plugins: timeout-2.3.1, xdist-3.6.1, cpp-2.5.0, cov-5.0.0, mock-3.14.0, forked-1.6.0, dash-2.11.1, benchmark-4.0.0, hypothesis-6.103.1, sugar-1.0.0, randomly-3.15.0, subtests-0.12.1, flaky-3.8.1, asyncio-0.23.7, anyio-4.4.0, nbmake-1.5.3, repeat-0.9.3
asyncio: mode=Mode.STRICT
8 workers [209 items] collecting ...
selfdrive/car/tests/test_car_interfaces.py ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 56% █████▋
✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 100% ██████████
============================================================================ slowest 10 durations =============================================================================
1.70s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_029_FORD_F_150_MK14
1.65s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_027_FORD_FOCUS_MK4
1.62s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_025_FORD_ESCAPE_MK4
1.61s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_026_FORD_EXPLORER_MK6
1.56s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_028_FORD_F_150_LIGHTNING_MK1
1.56s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_108_KIA_NIRO_PHEV
1.51s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_030_FORD_MAVERICK_MK1
1.51s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_151_SKODA_OCTAVIA_MK3
1.50s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_150_SKODA_KODIAQ_MK1
1.49s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_149_SKODA_KAROQ_MK1
Results (36.67s):
209 passed
Thanks for the report! It would be great if you could bisect the versions here and narrow it down to a smaller version range (ideally, a single version). We've made quite a few core engine changes starting at around 6.89.1 that could have caused this.
Looking through #2308 I see this comment about generate_new_examples https://github.com/HypothesisWorks/hypothesis/issues/2308#issuecomment-584111017, which may turn out to be prescient as we just touched that in #4007.
@tybug
There are small speed regressions here and there from changes that improve the randomness of generated integers and floats (e.g. trying to hit boundary values), but the big speed regression comes from tracking the IRTree in ConjectureData, introduced in this commit: https://github.com/HypothesisWorks/hypothesis/commit/1e76ce2e52e450d54470ed09b9c65fb1b598fb5c.
Can you explain what this IRTree tracking does? I'm not very familiar with the codebase. And is it possible to leave the observe condition disabled, so that IRTree tracking is skipped, without sacrificing the quality of the generated tests?
IRTree is a critical part of our new intermediate representation (#3921), which is powering improvements like faster shrinking (#3962), so we can't just turn it off! "big speed regression" after #3903 doesn't sound great, though. Thanks for narrowing this down; I'll see if I can reproduce.
import sys
import os
import time

import hypothesis.strategies as st
from hypothesis import Phase, given, settings

MAX_EXAMPLES = int(os.getenv("MAX_EXAMPLES", "60"))


def gen_empty_fingerprint():
    return {i: {} for i in range(8)}


@settings(max_examples=MAX_EXAMPLES, deadline=None,
          phases=(Phase.reuse, Phase.generate, Phase.shrink))
@given(data=st.data())
def get_fuzzy_car_interfaces(data):
    fingerprint_strategy = st.fixed_dictionaries(
        {key: st.dictionaries(st.integers(min_value=0, max_value=0x800),
                              st.integers(min_value=0, max_value=64))
         for key in gen_empty_fingerprint()})
    # TODO: Add another st.fixed_dictionaries()
    data.draw(fingerprint_strategy)


if __name__ == "__main__":
    # first bad commit: 5de1fe84252051594fdc6879d4920c357a6d1368 - more likely to generate boundary values
    # from 3.5 - 3.8s: 6e2f394a253761677cdcc0990a32df54a62f079a
    # - better after revert: 6e2f394a253761677cdcc0990a32df54a62f079a
    # from 4s - >5s: 1e76ce2e52e450d54470ed09b9c65fb1b598fb5c - track IRTree in ConjectureData
    lower = float(sys.argv[1])
    higher = float(sys.argv[2])
    num_iterations = 20
    start = time.monotonic()
    for _ in range(num_iterations):
        get_fuzzy_car_interfaces()
    print(f"{num_iterations} iterations takes {time.monotonic() - start}s")
    time_taken = time.monotonic() - start
    sys.exit(int(lower <= time_taken <= higher))
Here's a quick script I've been using for testing: 6.47.0 is >2x faster than 6.103.1 on it.
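For narrowing down a version range, a minimal sketch of driving that script across pip-installed releases could look like the following; the bench.py filename and the sampled version list are assumptions for illustration, not something from this thread.

# bisect_versions.py - rough sketch: time the benchmark across pip-installed
# Hypothesis releases to narrow down where the slowdown appears.
# Assumes the script above is saved as bench.py next to this file.
import subprocess
import sys
import time

VERSIONS = ["6.47.0", "6.89.1", "6.98.0", "6.103.1"]  # hypothetical sample points

for version in VERSIONS:
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--quiet", f"hypothesis=={version}"],
        check=True,
    )
    start = time.monotonic()
    # Run the benchmark in a fresh interpreter so each version is imported cleanly.
    # The two threshold arguments ("0 0") only affect bench.py's exit code, which we ignore.
    subprocess.run([sys.executable, "bench.py", "0", "0"], check=False)
    print(f"hypothesis {version}: {time.monotonic() - start:.2f}s")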
Still looking into this, but some initial diagnosis: LRUReusedCache.__balance is surprisingly expensive (about 8% of total runtime), and we hit it often via ConjectureData._pooled_kwargs. ConjectureRunner._cache_key_ir may be another hotspot via example computation.
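A rough sketch of how one might reproduce that diagnosis by profiling the benchmark with cProfile; the bench module name is assumed, and the output is just the standard pstats report.

# profile_bench.py - rough sketch: profile the benchmark with cProfile and
# print the functions with the highest cumulative time.
# Assumes the reporter's script above is saved as bench.py (hypothetical name).
import cProfile
import pstats

from bench import get_fuzzy_car_interfaces

profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    get_fuzzy_car_interfaces()
profiler.disable()

# Internals like LRUReusedCache.__balance should show up near the top
# of this report if they dominate the runtime.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(20)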
@bongbui321 could you give 6.108.8 a try on your production workload? On the microbenchmark, I'm down from 5s to 3.5s. I think this is still worse than it used to be, but I'm out of low-hanging fruit for the moment. I've filed https://github.com/HypothesisWorks/hypothesis/issues/4066 to follow up.
Great! I'll try it out sometime this week, and try to identify some more speed regressions post 6.47.0. Thanks for the quick fix!
FYI @tybug
The performance of the st.dictionaries element (with the parameters above) went down a lot with 3f82f86b3; I don't know if that's known/expected?
yeah, I think that's https://github.com/HypothesisWorks/hypothesis/issues/4066 as that commit changed mutator_groups to compute ir_starts_and_ends as an intermediary. Let's say "known now, but unexpected at the time" 😄
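A minimal sketch for isolating just that st.dictionaries draw cost, so it can be timed under different Hypothesis versions; the example count of 500 is an arbitrary choice for illustration.

# dict_bench.py - rough sketch: time drawing only the st.dictionaries strategy
# discussed above, isolated from the rest of the fingerprint benchmark.
import time

import hypothesis.strategies as st
from hypothesis import given, settings

dict_strategy = st.dictionaries(st.integers(min_value=0, max_value=0x800),
                                st.integers(min_value=0, max_value=64))


@settings(max_examples=500, deadline=None)
@given(dict_strategy)
def draw_dicts(d):
    pass


if __name__ == "__main__":
    start = time.monotonic()
    draw_dicts()  # run under one installed version, then compare across versions
    print(f"500 examples in {time.monotonic() - start:.2f}s")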
@tybug I see an improvement with the new 6.108.9 compared to 6.103.1 using the example above. But when testing with my actual test, the performance is similar: around 27s on both 6.103.0 and 6.108.9, while 6.47.0 takes around 16s for that test.
I wonder whether running under pytest-xdist keeps the recent improvements from taking effect.
@bongbui321 thanks for checking. The graph in #4076 indicates we have ~recent systemic performance problems, so this may be a 'wait and see'. Sorry I couldn't be more helpful for the moment.
@bongbui321 Recent work on the engine has improved performance - I'm now seeing 2.6s on the above benchmark on 6.124.7, down from 3.4s for 6.108.9 and 4.6s for 6.100.0. 2.6s is roughly my measurement for 6.47.0 as well. I'd be curious to see if these gains translate into measurable improvements in your CI time!
You may also want to keep an eye on https://github.com/HypothesisWorks/hypothesis/issues/3074, our tracking issue for compiling Hypothesis with mypyc. We expect this to give a ~1.5x speedup in the Hypothesis portion of the code.
Closing this because I think it's been fixed!