Quadratic runtime - O(n*n), n = number of tests
While making some changes to a test report generation plugin, I was generating tests of various kinds from a script and running them through pytest. I noticed that the runtime was non-linear. When the number of failing tests was close to 0, a couple of thousand tests were fine (<5 sec). If, on the other hand, about 10% of the tests failed, the runtime was close to half a minute, and after some experimenting I could see that the runtime grew quadratically with the number of tests. Given that we have quite a few failing tests at times (and several thousand tests), this was quite noticeable.
I also ruled out that this was a function of the number of failing tests: keeping the total the same but doubling the percentage of failures only shows a linear increase.
I removed the contents of conftest.py, so the example below should be a pretty good way of reproducing it.
- [x] a detailed description of the bug or problem you are having
- [x] output of pip list from the virtual environment you are using
- [x] pytest and operating system versions
- [x] minimal example if possible

- Pytest 8.3.4
- Ubuntu 24.04.2 LTS (on WSL2, Windows 11)
pip list
attrs 25.1.0
certifi 2025.1.31
charset-normalizer 3.4.1
grpcio 1.70.0
grpcio-tools 1.70.0
idna 3.10
iniconfig 2.0.0
isodate 0.7.2
line-profiler 4.2.0
lxml 5.3.1
markdown 3.7
nodeenv 1.9.1
nodejs-wheel-binaries 22.14.0
packaging 24.2
platformdirs 4.3.6
pluggy 1.5.0
protobuf 5.29.3
pyright 1.1.394
pytest 8.3.4
pytz 2025.1
requests 2.32.3
requests-file 2.1.0
requests-toolbelt 1.0.0
ruff 0.9.7
setuptools 75.8.0
typing-extensions 4.12.2
urllib3 2.3.0
zeep 4.3.1
Minimal example
run.sh
Run like FAILING=10 TOTAL=1600 ./run.sh (replace TOTAL with 200,400,600,800,1000,1200, ...)
#!/bin/bash
./generate_tests.py --failing ${FAILING:-10} --total ${TOTAL:-200} --filename generated_tests.py
pytest ./generated_tests.py
generate_tests.py
#!/usr/bin/env python3
import argparse
import random
import re


def generate_test_case(name: str, should_fail: bool, marks: list[str]) -> str:
    decorators = "".join([f"@pytest.mark.{mark}\n" for mark in marks])
    if should_fail:
        return f"{decorators}def test_{name}():\n    pytest.fail(\"bogus reason\")\n"
    else:
        return f"{decorators}def test_{name}():\n    pass\n"


def parse_mark_args(mark_arg: str, total: int) -> dict:
    mark_distribution = {}
    if not mark_arg:
        return mark_distribution
    pattern = re.compile(r"(\w+):(\d+)")
    for match in mark_arg.split(','):
        m = pattern.match(match.strip())
        if m:
            mark = m.group(1)
            percentage = int(m.group(2))
            count = round(total * (percentage / 100))
            mark_distribution[mark] = count
    return mark_distribution


def assign_marks(total: int, mark_distribution: dict) -> list[list[str]]:
    mark_lists = [[] for _ in range(total)]
    for mark, count in mark_distribution.items():
        indices = random.sample(range(total), count)
        for idx in indices:
            mark_lists[idx].append(mark)
    return mark_lists


def main():
    parser = argparse.ArgumentParser(description="Generate bogus pytest test cases.")
    parser.add_argument("--failing", type=int, default=30, help="Percentage of tests that should fail (default 30%)")
    parser.add_argument("--total", type=int, required=True, help="Total number of test cases")
    parser.add_argument("--filename", type=str, required=True, help="Output filename for test cases")
    parser.add_argument("--mark", type=str, help="Comma-separated list of marks and percentages, e.g. foo:20,bar:10")
    args = parser.parse_args()

    total = args.total
    failing_percentage = args.failing
    failing_count = round(total * (failing_percentage / 100))
    passing_count = total - failing_count
    test_cases = [True] * failing_count + [False] * passing_count
    random.shuffle(test_cases)

    mark_distribution = parse_mark_args(args.mark, total)
    mark_lists = assign_marks(total, mark_distribution)

    print("Generated test suite with bogus test cases")
    with open(args.filename, "w") as f:
        f.write("\nimport pytest\n\n")
        for i, should_fail in enumerate(test_cases):
            marks = mark_lists[i]
            test_code = generate_test_case(f"case_{i+1}", should_fail, marks)
            f.write(test_code + "\n")


if __name__ == "__main__":
    main()
Hi, I don't have time to do the analysis myself, but I always find issues like this fun.
The first step to understanding the slowdown is to take a profile. You can do that like this: python -m cProfile --sort=cumtime -m pytest ./generated_tests.py. It then outputs a large table. Looking at the top 100 or so entries usually gives a good idea of where the quadratic time is spent. If you feel like doing this for your example, that'd be great!
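If the printed table gets unwieldy, the profile can also be written to a file and browsed afterwards with the standard-library pstats module. A small sketch, assuming the profile was saved to a file named rapport.prof (that filename is made up here):

# Assumes the profile was saved first with:
#   python -m cProfile -o rapport.prof -m pytest ./generated_tests.py
import pstats

stats = pstats.Stats("rapport.prof")
# show the top ~100 entries sorted by cumulative time
stats.sort_stats("cumulative").print_stats(100)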
Thanks for the tip. I generated two sets of 1000 and 2000 tests (with 10% failing) and captured the reports. When doubling the number of tests, the runtime went from 6.8 seconds to 27 seconds. The thing that stood out was that the calls to walk the AST quadrupled when the number of tests had only doubled. All of these had 4x the number of calls when 2x-ing the input:
- ast.py:380(walk)
- ast.py:267(iter_child_nodes)
- ast.py:255(iter_fields)
As far as I can tell, this accounts for about 10 seconds of the increased runtime, if I read the chart correctly.
Most other function call counts were linear, as expected, though some functions took much longer per call (getstatementrange_ast went from 0.012 to 0.037, roughly a tripling in time).
An observation: cutting the number of failing tests by 3/4 (from 20% to 5%) cuts total runtime by about 3/4 as well (a linear correlation). So the quadratic runtime increase is related to the total number of tests, and I only use the percentage of failing tests to exaggerate the absolute differences.
Not quite sure where to go from here, but at least it might be something for someone 😄
rapport.1000.txt rapport.2000.txt
As you just want to synthesize results, try with pytest --assert=plain --tb=native as a starting point.
Seems like you are onto something, @RonnyPfannschmidt.
Adding --assert=plain --tb=native gives near-linear increases:
✦ ❯ time FAILING=50 TOTAL=400 ./generate-tests-and-report.sh
Generated test suite with bogus test cases
real    0m0,994s
user    0m0,950s
sys     0m0,041s

✦ ❯ time FAILING=50 TOTAL=800 ./generate-tests-and-report.sh
Generated test suite with bogus test cases
real    0m1,738s
user    0m1,666s
sys     0m0,071s

✦ ❯ time FAILING=50 TOTAL=1600 ./generate-tests-and-report.sh
Generated test suite with bogus test cases
real    0m3,612s
user    0m3,481s
sys     0m0,131s

✦ ❯ time FAILING=50 TOTAL=3200 ./generate-tests-and-report.sh
Generated test suite with bogus test cases
real    0m8,256s
user    0m8,022s
sys     0m0,225s

✦ ❯ time FAILING=50 TOTAL=6400 ./generate-tests-and-report.sh
Generated test suite with bogus test cases
real    0m22,397s
user    0m21,972s
sys     0m0,410s
plain assert disables the ast rewrite
native tracebacks disable pytest's chosen statement ranges
so pytest no longer does any ast-related work
in your specific case i'd recommend creating your own collector + item type to completely sidestep the python-related costs in pytest - no more python files, no more python functions - just directly parsing the data and turning it into test reports
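For reference, a minimal sketch of what such a collector + item pair could look like, loosely modelled on pytest's "non-python tests" example. The .results file format and every name here (ResultFile, ResultItem, ResultError) are made up for illustration:

# conftest.py - collect pre-computed results directly instead of python test functions
import pytest


def pytest_collect_file(parent, file_path):
    # pick up hypothetical ".results" files instead of python modules
    if file_path.suffix == ".results":
        return ResultFile.from_parent(parent, path=file_path)


class ResultFile(pytest.File):
    def collect(self):
        # one "name,passed" line per case in this made-up format
        for line in self.path.read_text().splitlines():
            name, passed = line.split(",")
            yield ResultItem.from_parent(self, name=name, passed=passed == "true")


class ResultItem(pytest.Item):
    def __init__(self, *, passed, **kwargs):
        super().__init__(**kwargs)
        self.passed = passed

    def runtest(self):
        if not self.passed:
            raise ResultError(self.name)

    def repr_failure(self, excinfo):
        # full control over the failure text: no assertion rewriting,
        # no ast parsing, no statement-range lookup
        if isinstance(excinfo.value, ResultError):
            return f"{self.name}: recorded as failed"
        return super().repr_failure(excinfo)

    def reportinfo(self):
        return self.path, 0, f"result: {self.name}"


class ResultError(Exception):
    pass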
So ... not quite sure what - if anything - is implied here. To me, this looks like a performance bug. I cannot understand why traversing the AST of one test should be impacted by how many tests there are. Sure, I got a good tip on overcoming the issue by avoiding it altogether, but the underlying issue still remains, no? Do you mean I should close it, @RonnyPfannschmidt, since a workaround exists?
To be clear, I am not directly bothered by it, I just thought it would be interesting to investigate it a bit and report the findings, as they seemed peculiar and point to some underlying issue.
the key problem here is that we don't share the ast cache for statement ranges between test items - so every single failing test item re-parses the ast for the traceback to get the statement range
this is usually not a big problem, as there aren't that many failures and multiple files are involved
however if it happens excessively, there's a real cost
a potential different workaround in your case is pytest.fail(msg, pytrace=False)
a proper fix in pytest would be to add lookup caching in a more complete manner (but that's tricky to get right)
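To illustrate the shape of the problem, a simplified sketch (not pytest's actual internals; the function names and the lookup logic are made up): the test module grows with the total number of tests, and each failing test currently triggers its own parse of that module, so the combined cost grows roughly quadratically. Caching the parsed tree per source would make the parsing cost linear again:

# simplified cost model, not pytest code: `source` is the whole test module,
# whose size grows with the total number of tests
import ast
from functools import lru_cache


def statement_node_uncached(source: str, lineno: int) -> ast.stmt:
    # what effectively happens today: a full re-parse per failing test,
    # so total parse work ~ (failures) x (module size) ~ n * n
    tree = ast.parse(source)
    return next(node for node in ast.walk(tree)
                if isinstance(node, ast.stmt) and getattr(node, "lineno", None) == lineno)


@lru_cache(maxsize=None)
def _parse(source: str) -> ast.Module:
    return ast.parse(source)


def statement_node_cached(source: str, lineno: int) -> ast.stmt:
    # caching the parsed tree per source makes the parse cost linear again;
    # only the (cheaper) walk remains per failure
    tree = _parse(source)
    return next(node for node in ast.walk(tree)
                if isinstance(node, ast.stmt) and getattr(node, "lineno", None) == lineno)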