openpilot Determine cause of hypothesis slow example generation in newer versions

Related to https://github.com/commaai/openpilot/issues/32536. This bounty is for determining why newer versions of hypothesis (6.103.1 vs our pinned 6.47) are so much slower at example generation for test_car_interfaces.py.

Bounty

Bounty is awarded for determining the exact cause and a solution for optimizing/fixing either the hypothesis library itself, or openpilot to improve the example generation times.

On AMD Ryzen Threadripper PRO 3955WX 16-Cores on hypothesis==6.47.0:

batman@workstation-shane:~/openpilot/selfdrive/car/tests$ pytest -n8 test_car_interfaces.py
/home/batman/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Benchmarks are automatically disabled because xdist plugin is active.Benchmarks cannot be performed reliably in a parallelized environment.
  warner(PytestBenchmarkWarning(text))
Test session starts (platform: linux, Python 3.11.4, pytest 8.2.1, pytest-sugar 1.0.0)
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=4244670443
rootdir: /home/batman/openpilot
configfile: pyproject.toml
plugins: timeout-2.3.1, xdist-3.6.1, cpp-2.5.0, cov-5.0.0, mock-3.14.0, forked-1.6.0, hypothesis-6.47.0, dash-2.11.1, benchmark-4.0.0, sugar-1.0.0, randomly-3.15.0, subtests-0.12.1, flaky-3.8.1, asyncio-0.23.7, anyio-4.4.0, nbmake-1.5.3, repeat-0.9.3
asyncio: mode=Mode.STRICT
8 workers [209 items]   collecting ... 

 selfdrive/car/tests/test_car_interfaces.py ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 56% █████▋    
                                            ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓                       100% ██████████
============================================================================ slowest 10 durations =============================================================================
1.16s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_028_FORD_F_150_LIGHTNING_MK1
1.08s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_030_FORD_MAVERICK_MK1
1.06s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_024_FORD_BRONCO_SPORT_MK1
1.03s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_025_FORD_ESCAPE_MK4
1.03s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_031_FORD_MUSTANG_MACH_E_MK1
1.02s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_029_FORD_F_150_MK14
0.99s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_026_FORD_EXPLORER_MK6
0.98s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_074_HYUNDAI_IONIQ_HEV_2022
0.96s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_027_FORD_FOCUS_MK4
0.95s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_151_SKODA_OCTAVIA_MK3

Results (22.66s):
     209 passed

and on latest hypothesis==6.103.1

batman@workstation-shane:~/openpilot/selfdrive/car/tests$ pytest -n8 test_car_interfaces.py
/home/batman/.pyenv/versions/3.11.4/lib/python3.11/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Benchmarks are automatically disabled because xdist plugin is active.Benchmarks cannot be performed reliably in a parallelized environment.
  warner(PytestBenchmarkWarning(text))
Test session starts (platform: linux, Python 3.11.4, pytest 8.2.1, pytest-sugar 1.0.0)
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
Using --randomly-seed=264861645
rootdir: /home/batman/openpilot
configfile: pyproject.toml
plugins: timeout-2.3.1, xdist-3.6.1, cpp-2.5.0, cov-5.0.0, mock-3.14.0, forked-1.6.0, dash-2.11.1, benchmark-4.0.0, hypothesis-6.103.1, sugar-1.0.0, randomly-3.15.0, subtests-0.12.1, flaky-3.8.1, asyncio-0.23.7, anyio-4.4.0, nbmake-1.5.3, repeat-0.9.3
asyncio: mode=Mode.STRICT
8 workers [209 items]   collecting ... 

 selfdrive/car/tests/test_car_interfaces.py ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓ 56% █████▋    
                                            ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓                       100% ██████████
============================================================================ slowest 10 durations =============================================================================
1.70s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_029_FORD_F_150_MK14
1.65s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_027_FORD_FOCUS_MK4
1.62s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_025_FORD_ESCAPE_MK4
1.61s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_026_FORD_EXPLORER_MK6
1.56s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_028_FORD_F_150_LIGHTNING_MK1
1.56s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_108_KIA_NIRO_PHEV
1.51s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_030_FORD_MAVERICK_MK1
1.51s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_151_SKODA_OCTAVIA_MK3
1.50s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_150_SKODA_KODIAQ_MK1
1.49s call     selfdrive/car/tests/test_car_interfaces.py::TestCarInterfaces::test_car_interfaces_149_SKODA_KAROQ_MK1

Results (36.67s):
     209 passed

Jun 10 '24 23:06 sshane

@deanlee @BBBmau in case you guys are interested!

Jun 10 '24 23:06 sshane

Narrowed this down to a specific commit that slows down the example generation by 30% in my test (comment).

Figured out a few other slowdowns

  # first bad commit: 5de1fe84252051594fdc6879d4920c357a6d1368 - more likely to generate boundary values
  # from 3.5 - 3.8s:  e66c88d99d61c0eca0d8aed59543d35e462fef89
  #     - better after revert: 6e2f394a253761677cdcc0990a32df54a62f079a
  # from 4s - >5s: 1e76ce2e52e450d54470ed09b9c65fb1b598fb5c - trackIRTree in ConjectureData

Most of the slowdowns are for better example generation (higher chance of generating boundary values, better repro of failing tests, faster shrinker,...).To resolve this, probably need some restructuring in how hypothesis track generated cases.

Jul 29 '24 23:07 bongbui321

@sshane is anybody working on this? I would like to take it up

Apr 18 '25 14:04 raf-fonseca

this should be looked into once more after some improvements, they've closed the issue I opened that pointed out the performance regression

Apr 20 '25 21:04 BBBmau

@sshane is this still relevant?

Aug 01 '25 17:08 adeebshihadeh

Investigation Summary: Hypothesis Performance Regression (6.47.0 → 6.103.1)

I've analyzed the issue and reviewed the linked Hypothesis issue #4014 and @bongbui321's investigation. Here are the key findings:

Root Cause Analysis

The ~30% slowdown (22.66s → 36.67s, ~62% increase) in test execution time is primarily caused by commit 1e76ce2e52e450d54470ed09b9c65fb1b598fb5c which introduced IRTree tracking in ConjectureData.

Additional contributing commits identified by @bongbui321:

5de1fe84252051594fdc6879d4920c357a6d1368: Increased boundary value generation (improves test quality but adds overhead)
e66c88d99d61c0eca0d8aed59543d35e462fef89: Additional slowdown from 3.5s to 3.8s

Why This Happened

The Hypothesis maintainers intentionally added these features to improve:

Better example generation - Higher probability of finding edge cases
Improved test failure reproduction - More reliable shrinking
Enhanced debugging - Better tracking of generated values

These improvements come at a performance cost, especially for tests with complex strategies like test_car_interfaces.py.

Proposed Solutions

Option 1: Optimize Test Strategy (Recommended for openpilot)

Rather than waiting for Hypothesis changes, openpilot could optimize test strategies:

Reduce example count for CI: Use @settings(max_examples=50) instead of default 100
Implement test profiling: Identify and optimize the slowest strategy compositions
Split complex tests: Break down test_car_interfaces.py into smaller, parallelizable units
Use database caching: Ensure Hypothesis database is properly cached to avoid regenerating examples

Option 2: Selective Hypothesis Features

Add configuration to disable IRTree tracking for performance-critical tests:

from hypothesis import settings, Phase

@settings(phases=[Phase.generate, Phase.target])  # Skip shrink phase if acceptable
def test_car_interfaces(...):
    ...

Option 3: Upstream Contribution

Contribute to Hypothesis to add a performance mode that conditionally disables tracking features. However, given that the Hypothesis issue was closed as "working as intended," this may not be feasible.

Immediate Action Items

Benchmark different configurations to find the optimal balance between test coverage and speed
Profile specific car tests to identify if certain car interfaces are disproportionately slow
Consider test sharding to distribute the load more effectively with pytest-xdist

Would you like me to prepare a PR implementing Option 1 with benchmarking results?

Nov 02 '25 12:11 Just0DJ