example generation speed regression between `6.47.0` -> `6.103.1`
Hello, we're interested in using the latest version of Hypothesis, but noticed a significant increase in example generation time between these versions.
I'm filing an issue because I've seen a similar regression reported in the past, and it was addressed with code fixes that sped performance back up: https://github.com/HypothesisWorks/hypothesis/issues/2308
pinning to 6.47.0:
batman@workstation-shane:~/openpilot/selfdrive/car/tests$ pytest -n8 test_car_interfaces.py
/home/batman/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Benchmarks are automatically disabled because xdist plugin is active.Benchmarks cannot be performed reliably in a parallelized environment.
warner(PytestBenchmarkWarning(text))
Test session starts (platform: linux, Python 3.11.4, pytest 8.2.1, pytest-sugar 1.0.0)
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=4244670443
rootdir: /home/batman/openpilot
configfile: pyproject.toml
plugins: timeout-2.3.1, xdist-3.6.1, cpp-2.5.0, cov-5.0.0, mock-3.14.0, forked-1.6.0, hypothesis-6.47.0, dash-2.11.1, benchmark-4.0.0, sugar-1.0.0, randomly-3.15.0, subtests-0.12.1, flaky-3.8.1, asyncio-0.23.7, anyio-4.4.0, nbmake-1.5.3, repeat-0.9.3
asyncio: mode=Mode.STRICT
8 workers [209 items] collecting ...
selfdrive/car/tests/test_car_interfaces.py ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 56% █████▋
✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 100% ██████████
============================================================================ slowest 10 durations =============================================================================
1.16s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_028_FORD_F_150_LIGHTNING_MK1
1.08s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_030_FORD_MAVERICK_MK1
1.06s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_024_FORD_BRONCO_SPORT_MK1
1.03s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_025_FORD_ESCAPE_MK4
1.03s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_031_FORD_MUSTANG_MACH_E_MK1
1.02s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_029_FORD_F_150_MK14
0.99s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_026_FORD_EXPLORER_MK6
0.98s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_074_HYUNDAI_IONIQ_HEV_2022
0.96s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_027_FORD_FOCUS_MK4
0.95s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_151_SKODA_OCTAVIA_MK3
Results (22.66s):
209 passed
on latest 6.103.1:
batman@workstation-shane:~/openpilot/selfdrive/car/tests$ pytest -n8 test_car_interfaces.py
/home/batman/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Benchmarks are automatically disabled because xdist plugin is active.Benchmarks cannot be performed reliably in a parallelized environment.
warner(PytestBenchmarkWarning(text))
Test session starts (platform: linux, Python 3.11.4, pytest 8.2.1, pytest-sugar 1.0.0)
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=264861645
rootdir: /home/batman/openpilot
configfile: pyproject.toml
plugins: timeout-2.3.1, xdist-3.6.1, cpp-2.5.0, cov-5.0.0, mock-3.14.0, forked-1.6.0, dash-2.11.1, benchmark-4.0.0, hypothesis-6.103.1, sugar-1.0.0, randomly-3.15.0, subtests-0.12.1, flaky-3.8.1, asyncio-0.23.7, anyio-4.4.0, nbmake-1.5.3, repeat-0.9.3
asyncio: mode=Mode.STRICT
8 workers [209 items] collecting ...
selfdrive/car/tests/test_car_interfaces.py ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 56% █████▋
✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 100% ██████████
============================================================================ slowest 10 durations =============================================================================
1.70s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_029_FORD_F_150_MK14
1.65s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_027_FORD_FOCUS_MK4
1.62s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_025_FORD_ESCAPE_MK4
1.61s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_026_FORD_EXPLORER_MK6
1.56s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_028_FORD_F_150_LIGHTNING_MK1
1.56s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_108_KIA_NIRO_PHEV
1.51s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_030_FORD_MAVERICK_MK1
1.51s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_151_SKODA_OCTAVIA_MK3
1.50s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_150_SKODA_KODIAQ_MK1
1.49s call selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_149_SKODA_KAROQ_MK1
Results (36.67s):
209 passed
Thanks for the report! It would be great if you could bisect the versions here and narrow it down to a smaller version range (ideally, a single version). We've made quite a few core engine changes starting at around 6.89.1 that could have caused this.
Looking through #2308 I see this comment about generate_new_examples https://github.com/HypothesisWorks/hypothesis/issues/2308#issuecomment-584111017, which may turn out to be prescient as we just touched that in #4007.
@tybug
There are small speed regressions here and there from changes that improve the randomness of generated integers and floats (e.g. trying to hit boundary values), but the big speed regression comes from tracking the IRTree in ConjectureData, introduced in this commit: https://github.com/HypothesisWorks/hypothesis/commit/1e76ce2e52e450d54470ed09b9c65fb1b598fb5c.
Can you explain what this IRTree tracking does? I'm not very familiar with the codebase. And is it possible to leave the observe condition disabled, so that IRTree tracking is skipped, without sacrificing the quality of the generated tests?
IRTree is a critical part of our new intermediate representation (#3921), which is powering improvements like faster shrinking (#3962), so we can't just turn it off! "big speed regression" after #3903 doesn't sound great, though. Thanks for narrowing this down; I'll see if I can reproduce.
import sys
import os
import time

import hypothesis.strategies as st
from hypothesis import Phase, given, settings

MAX_EXAMPLES = int(os.getenv("MAX_EXAMPLES", "60"))


def gen_empty_fingerprint():
    return {i: {} for i in range(8)}


@settings(max_examples=MAX_EXAMPLES, deadline=None,
          phases=(Phase.reuse, Phase.generate, Phase.shrink))
@given(data=st.data())
def get_fuzzy_car_interfaces(data):
    fingerprint_strategy = st.fixed_dictionaries(
        {key: st.dictionaries(st.integers(min_value=0, max_value=0x800),
                              st.integers(min_value=0, max_value=64))
         for key in gen_empty_fingerprint()})
    # TODO: Add another st.fixed_dictionaries()
    data.draw(fingerprint_strategy)


if __name__ == "__main__":
    # first bad commit: 5de1fe84252051594fdc6879d4920c357a6d1368 - more likely to generate boundary values
    # from 3.5 - 3.8s: 6e2f394a253761677cdcc0990a32df54a62f079a
    # - better after revert: 6e2f394a253761677cdcc0990a32df54a62f079a
    # from 4s - >5s: 1e76ce2e52e450d54470ed09b9c65fb1b598fb5c - track IRTree in ConjectureData
    lower = float(sys.argv[1])
    higher = float(sys.argv[2])
    num_iterations = 20
    start = time.monotonic()
    for _ in range(num_iterations):
        get_fuzzy_car_interfaces()
    print(f"{num_iterations} iterations takes {time.monotonic() - start}s")
    time_taken = time.monotonic() - start
    sys.exit(int(lower <= time_taken <= higher))
Here's a quick script I've been using for testing: 6.47.0 is >2x faster than 6.103.1 on it.
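For narrowing down a version range, a minimal sketch of driving that script across pip-installed releases could look like the following; the bench.py filename and the sampled version list are assumptions for illustration, not something from this thread.

# bisect_versions.py - rough sketch: time the benchmark across pip-installed
# Hypothesis releases to narrow down where the slowdown appears.
# Assumes the script above is saved as bench.py next to this file.
import subprocess
import sys
import time

VERSIONS = ["6.47.0", "6.89.1", "6.98.0", "6.103.1"]  # hypothetical sample points

for version in VERSIONS:
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--quiet", f"hypothesis=={version}"],
        check=True,
    )
    start = time.monotonic()
    # Run the benchmark in a fresh interpreter so each version is imported cleanly.
    # The two threshold arguments ("0 0") only affect bench.py's exit code, which we ignore.
    subprocess.run([sys.executable, "bench.py", "0", "0"], check=False)
    print(f"hypothesis {version}: {time.monotonic() - start:.2f}s")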
Still looking into this, but some initial diagnosis: LRUReusedCache.__balance is surprisingly expensive (about 8% of total runtime), and we hit it often via ConjectureData._pooled_kwargs. ConjectureRunner._cache_key_ir may be another hotspot via example computation.
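A rough sketch of how one might reproduce that diagnosis by profiling the benchmark with cProfile; the bench module name is assumed, and the output is just the standard pstats report.

# profile_bench.py - rough sketch: profile the benchmark with cProfile and
# print the functions with the highest cumulative time.
# Assumes the reporter's script above is saved as bench.py (hypothetical name).
import cProfile
import pstats

from bench import get_fuzzy_car_interfaces

profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    get_fuzzy_car_interfaces()
profiler.disable()

# Internals like LRUReusedCache.__balance should show up near the top
# of this report if they dominate the runtime.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(20)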
@bongbui321 could you give 6.108.8 a try on your production workload? On the microbenchmark, I'm down from 5s to 3.5s. I think this is still worse than it used to be, but I'm out of low-hanging fruit for the moment. I've filed https://github.com/HypothesisWorks/hypothesis/issues/4066 to follow up.
Great! I'll try it out sometime this week, and try to identify some more speed regressions post 6.47.0. Thanks for the quick fix!
FYI @tybug
The performance of the st.dictionaries element (with the parameters above) went down a lot with 3f82f86b3; I don't know if that's known/expected?
yeah, I think that's https://github.com/HypothesisWorks/hypothesis/issues/4066 as that commit changed mutator_groups to compute ir_starts_and_ends as an intermediary. Let's say "known now, but unexpected at the time" 😄
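A minimal sketch for isolating just that st.dictionaries draw cost, so it can be timed under different Hypothesis versions; the example count of 500 is an arbitrary choice for illustration.

# dict_bench.py - rough sketch: time drawing only the st.dictionaries strategy
# discussed above, isolated from the rest of the fingerprint benchmark.
import time

import hypothesis.strategies as st
from hypothesis import given, settings

dict_strategy = st.dictionaries(st.integers(min_value=0, max_value=0x800),
                                st.integers(min_value=0, max_value=64))


@settings(max_examples=500, deadline=None)
@given(dict_strategy)
def draw_dicts(d):
    pass


if __name__ == "__main__":
    start = time.monotonic()
    draw_dicts()  # run under one installed version, then compare across versions
    print(f"500 examples in {time.monotonic() - start:.2f}s")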
@tybug I see an improvement with the new 6.108.9 compared to 6.103.1 using the example above. But when testing with my actual test, the performance is similar: around 27s on both 6.103.0 and 6.108.9, while 6.47.0 takes around 16s for that test.
I wonder whether running under pytest-xdist keeps the recent improvements from taking effect.
@bongbui321 thanks for checking. The graph in #4076 indicates we have ~recent systemic performance problems, so this may be a 'wait and see'. Sorry I couldn't be more helpful for the moment.
@bongbui321 Recent work on the engine has improved performance - I'm now seeing 2.6s on the above benchmark on 6.124.7, down from 3.4s for 6.108.9 and 4.6s for 6.100.0. 2.6s is roughly my measurement for 6.47.0 as well. I'd be curious to see if these gains translate into measurable improvements in your CI time!
You may also want to keep an eye on https://github.com/HypothesisWorks/hypothesis/issues/3074, our tracking issue for compiling Hypothesis with mypyc. We expect this to give a ~1.5x speedup in the Hypothesis portion of the code.
Closing this because I think it's been fixed!