
Quadratic runtime - O(n*n), n = number of tests

Open · fatso83 opened this issue 7 months ago • 8 comments

While making some changes to a test report generation plugin, I was generating tests of various kinds from a script and running them through pytest. I noticed that the runtime was non-linear. When the number of failing tests was close to 0, a couple of thousand tests were fine (<5 sec). If, on the other hand, about 10% of the tests failed, the runtime would be near half a minute, and after some experimenting I could see that the runtime grew quadratically with the number of tests. Given that we have quite a few failing tests at times (and several thousand tests), this was quite noticeable.

[Image: plot of runtime against total number of tests, showing super-linear growth]

I also ruled out that this was purely a function of the number of failing tests: keeping the total the same but doubling the failing percentage only gives a linear increase:

[Image: plot of runtime against percentage of failing tests, showing linear growth]

I removed the contents of conftest.py, so I think the example below should be a pretty good way of recreating it.

  • [x] a detailed description of the bug or problem you are having

  • [x] output of pip list from the virtual environment you are using

  • [x] pytest and operating system versions

  • [x] minimal example if possible

  • Pytest 8.3.4

  • Ubuntu 24.04.2 LTS (on WSL2 Windows 11)

pip list
attrs                 25.1.0
certifi               2025.1.31
charset-normalizer    3.4.1
grpcio                1.70.0
grpcio-tools          1.70.0
idna                  3.10
iniconfig             2.0.0
isodate               0.7.2
line-profiler         4.2.0
lxml                  5.3.1
markdown              3.7
nodeenv               1.9.1
nodejs-wheel-binaries 22.14.0
packaging             24.2
platformdirs          4.3.6
pluggy                1.5.0
protobuf              5.29.3
pyright               1.1.394
pytest                8.3.4
pytz                  2025.1
requests              2.32.3
requests-file         2.1.0
requests-toolbelt     1.0.0
ruff                  0.9.7
setuptools            75.8.0
typing-extensions     4.12.2
urllib3               2.3.0
zeep                  4.3.1

Minimal example

run.sh

Run like FAILING=10 TOTAL=1600 ./run.sh (replace TOTAL with 200,400,600,800,1000,1200, ...)

#!/bin/bash
./generate_tests.py --failing ${FAILING:-10} --total ${TOTAL:-200} --filename generated_tests.py
pytest ./generated_tests.py

generate_tests.py

#!/usr/bin/env python3

import argparse
import random
import re


def generate_test_case(name: str, should_fail: bool, marks: list[str]) -> str:
    decorators = "".join([f"@pytest.mark.{mark}\n" for mark in marks])
    if should_fail:
        return f"{decorators}def test_{name}():\n    pytest.fail(\"bogus reason\")\n"
    else:
        return f"{decorators}def test_{name}():\n    pass\n"


def parse_mark_args(mark_arg: str, total: int) -> dict:
    mark_distribution = {}
    if not mark_arg:
        return mark_distribution
    pattern = re.compile(r"(\w+):(\d+)")
    for match in mark_arg.split(','):
        m = pattern.match(match.strip())
        if m:
            mark = m.group(1)
            percentage = int(m.group(2))
            count = round(total * (percentage / 100))
            mark_distribution[mark] = count
    return mark_distribution


def assign_marks(total: int, mark_distribution: dict) -> list[list[str]]:
    mark_lists = [[] for _ in range(total)]
    for mark, count in mark_distribution.items():
        indices = random.sample(range(total), count)
        for idx in indices:
            mark_lists[idx].append(mark)
    return mark_lists


def main():
    parser = argparse.ArgumentParser(description="Generate bogus pytest test cases.")
    parser.add_argument("--failing", type=int, default=30, help="Percentage of tests that should fail (default 30%)")
    parser.add_argument("--total", type=int, required=True, help="Total number of test cases")
    parser.add_argument("--filename", type=str, required=True, help="Output filename for test cases")
    parser.add_argument("--mark", type=str, help="Comma-separated list of marks and percentages, e.g. foo:20,bar:10")

    args = parser.parse_args()

    total = args.total
    failing_percentage = args.failing
    failing_count = round(total * (failing_percentage / 100))
    passing_count = total - failing_count

    test_cases = [True] * failing_count + [False] * passing_count
    random.shuffle(test_cases)

    mark_distribution = parse_mark_args(args.mark, total)
    mark_lists = assign_marks(total, mark_distribution)

    print("Generated test suite with bogus test cases")

    with open(args.filename, "w") as f:
        f.write("\nimport pytest\n\n")
        for i, should_fail in enumerate(test_cases):
            marks = mark_lists[i]
            test_code = generate_test_case(f"case_{i+1}", should_fail, marks)
            f.write(test_code + "\n")


if __name__ == "__main__":
    main()

fatso83 · Apr 07 '25 13:04

Hi, I don't have time to do the analysis myself, but I always find issues like this fun.

The first step to understanding the slowdown is to take a profile. You can do that like this: python -m cProfile --sort=cumtime -m pytest ./generated_tests.py. It then outputs a large table. Looking at the top 100 or so entries usually gives a good idea of where the quadratic time is being spent. If you feel like doing this for your example, that'd be great!
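
For reference, a minimal sketch of how the profile could be written to a file and summarized with the standard-library pstats module (profile.out is just an example filename):

# Capture the profile as a binary dump instead of printing the table directly:
#   python -m cProfile -o profile.out -m pytest ./generated_tests.py
import pstats

stats = pstats.Stats("profile.out")
# Print the ~100 entries with the highest cumulative time
stats.sort_stats("cumulative").print_stats(100)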

bluetech · Apr 07 '25 18:04

Thanks for the tip. I generated two sets of 1000 and 2000 tests (with 10% failing) and captured the reports. The runtime went from 6.8 seconds to 27 seconds when doubling the number of tests. The one thing that stood out was that the number of calls walking the AST quadrupled when the number of tests had merely doubled. All of these showed 4x the number of calls when 2x-ing the input:

  • ast.py:380(walk)
  • ast.py:267(iter_child_nodes)
  • ast.py:255(iter_fields)

As far as I can tell, this accounts for about 10 seconds of the increased runtime, if I read the chart correctly.

Most other function call counts grew linearly, as expected, though some of the calls took much longer per call (getstatementrange_ast went from 0.012 to 0.037, roughly a tripling in time).

An observation: cutting the number of failing tests by 3/4 (from 20% to 5%) cuts total runtime by roughly 3/4 as well (a linear correlation). So the quadratic runtime increase is related to the total number of tests; I only use the percentage of failing tests to exaggerate the absolute differences.

Not quite sure where to go from here, but at least it might be something for someone 😄

Attachments: rapport.1000.txt, rapport.2000.txt

[Image: chart comparing profiler call counts and times for the 1000- and 2000-test runs]
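
A small sketch of how the two runs could be compared directly with pstats, assuming the profiles are re-captured as binary dumps (the .prof filenames are assumptions; the attached .txt reports are formatted text and cannot be loaded by pstats):

import pstats

# Compare the ast.py entries between the 1000- and 2000-test profile dumps
# (binary dumps produced with `python -m cProfile -o rapport.N.prof ...`).
for dump in ("rapport.1000.prof", "rapport.2000.prof"):
    print(f"--- {dump} ---")
    stats = pstats.Stats(dump)
    # Restrict the printed table to entries from ast.py, sorted by cumulative time
    stats.sort_stats("cumulative").print_stats(r"ast\.py")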

fatso83 · Apr 08 '25 08:04

As you just want to synthesize results, try pytest --assert=plain --tb=native as a starting point.
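
If those flags do the trick, they could be made the project default; a minimal sketch using a pytest.ini (the same addopts setting also works in pyproject.toml under [tool.pytest.ini_options] or in setup.cfg):

# pytest.ini (sketch)
[pytest]
addopts = --assert=plain --tb=native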

RonnyPfannschmidt · Apr 08 '25 08:04

Seems like you are onto something, @RonnyPfannschmidt.

Adding --assert=plain --tb=native gives near-linear increases:
✦ ❯ time FAILING=50 TOTAL=400 ./generate-tests-and-report.sh
Generated test suite with bogus test cases

real    0m0,994s
user    0m0,950s
sys     0m0,041s

✦ ❯ time FAILING=50 TOTAL=800 ./generate-tests-and-report.sh
Generated test suite with bogus test cases

real    0m1,738s
user    0m1,666s
sys     0m0,071s

✦ ❯ time FAILING=50 TOTAL=1600 ./generate-tests-and-report.sh
Generated test suite with bogus test cases

real    0m3,612s
user    0m3,481s
sys     0m0,131s

✦ ❯ time FAILING=50 TOTAL=3200 ./generate-tests-and-report.sh
Generated test suite with bogus test cases

real    0m8,256s
user    0m8,022s
sys     0m0,225s

✦ ❯ time FAILING=50 TOTAL=6400 ./generate-tests-and-report.sh
Generated test suite with bogus test cases

real    0m22,397s
user    0m21,972s
sys     0m0,410s

fatso83 · Apr 08 '25 09:04

plain assert disables AST-based assertion rewriting

native tracebacks disable pytest's chosen statement ranges

so pytest no longer does any AST-related work
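
As an aside, assertion rewriting can also be disabled per module rather than globally; a small sketch of how the generated file could opt out via the documented PYTEST_DONT_REWRITE docstring marker (this does not affect traceback formatting, only the rewrite step):

"""Generated bogus test cases. PYTEST_DONT_REWRITE"""
# The PYTEST_DONT_REWRITE marker in the module docstring tells pytest not to
# rewrite asserts in this module when it is imported.
import pytest


def test_case_1():
    pytest.fail("bogus reason")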

RonnyPfannschmidt · Apr 08 '25 09:04

in your specific case i'd recommend creating your own collector + item type to completely sidestep the Python-related costs in pytest - no more Python files, no more Python functions - just directly parsing the data and turning it into test reports
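
A rough sketch of what such a conftest.py could look like, modeled on pytest's documented example for collecting non-Python tests; the .results file extension and its one-line-per-test format are made-up placeholders:

# conftest.py - sketch only; assumes result data lives in *.results files with
# one line per test of the form "<name> PASS" or "<name> FAIL <message>".
import pytest


class RecordedFailure(Exception):
    """Raised for a test whose recorded outcome is FAIL."""


def pytest_collect_file(parent, file_path):
    # Collect *.results files instead of Python test modules
    if file_path.suffix == ".results":
        return ResultsFile.from_parent(parent, path=file_path)


class ResultsFile(pytest.File):
    def collect(self):
        for line in self.path.read_text().splitlines():
            name, _, outcome = line.partition(" ")
            yield ResultsItem.from_parent(self, name=name, outcome=outcome)


class ResultsItem(pytest.Item):
    def __init__(self, *, outcome, **kwargs):
        super().__init__(**kwargs)
        self.outcome = outcome

    def runtest(self):
        if self.outcome.startswith("FAIL"):
            raise RecordedFailure(self.outcome)

    def repr_failure(self, excinfo):
        # Returning a plain string sidesteps pytest's source/AST based
        # traceback formatting entirely.
        if isinstance(excinfo.value, RecordedFailure):
            return str(excinfo.value)
        return super().repr_failure(excinfo)

    def reportinfo(self):
        return self.path, 0, self.name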

RonnyPfannschmidt · Apr 08 '25 09:04

So ... not quite sure what - if anything - is implied here. To me, this looks like a performance bug. I cannot understand why traversing the AST of one test should be impacted by how many tests there are. Sure, I got a good tip on overcoming the issue by avoiding it altogether, but the underlying issue still remains, no? Do you mean I should close it, @RonnyPfannschmidt, since a workaround exists?

To be clear, I am not directly bothered by it, I just thought it would be interesting to investigate it a bit and report the findings, as they seemed peculiar and point to some underlying issue.

fatso83 · Apr 10 '25 09:04

the key problem here is that we don't share the AST cache for statement ranges between test items - so every single test item re-parses the AST for the traceback to get the statement range

this is usually not a big problem, as there aren't that many failures and multiple files are involved

however, if it happens excessively, there's a related cost

a different potential workaround in your case is pytest.fail(msg, pytrace=False)

a fix in pytest would implement the lookup caching in a more complete manner (but that's tricky to get right)
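
For reference, a minimal sketch of the generator's helper with that workaround applied (pytrace=False makes pytest skip its traceback and statement-range processing for these synthetic failures):

def generate_test_case(name: str, should_fail: bool, marks: list[str]) -> str:
    decorators = "".join(f"@pytest.mark.{mark}\n" for mark in marks)
    # pytrace=False: report only the message, no pytest-formatted traceback
    body = 'pytest.fail("bogus reason", pytrace=False)' if should_fail else "pass"
    return f"{decorators}def test_{name}():\n    {body}\n"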

RonnyPfannschmidt · Apr 10 '25 09:04