
Performance regression for loops in 3.12 vs 3.11

Open stephanGarland opened this issue 1 year ago • 3 comments

Description

I believe I've found a performance regression for large loops in Python 3.12 vs. 3.11. The effect is more pronounced when the value being stored is non-constant: with the list comprehension changed to [0 for x in range(chunk_size)], the relative difference dropped to 1.03x (still in favor of 3.11). Additionally, if the number of rows being generated is small, e.g. 1000, the difference disappears and the two versions are dead even. The results shown below are from runs with 5,000,000 rows.
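For reference, the two comprehension variants compared above can be sketched with a quick timeit check (the chunk size here is illustrative, matching the script's default below):

```python
import timeit

chunk_size = 250_000  # matches the script's default chunk size

# Non-constant value: each element is the loop variable itself.
t_var = timeit.timeit(lambda: [x for x in range(chunk_size)], number=20)

# Constant value: every element is the literal 0.
t_const = timeit.timeit(lambda: [0 for x in range(chunk_size)], number=20)

print(f"non-constant: {t_var:.3f}s  constant: {t_const:.3f}s")
```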

Interestingly, the array.array() performance difference was massive, but only on macOS. On Linux, it was approximately the same as with a list. I'm not sure whether this is due to the Linux installations being compiled with optimizations, hardware differences, OS differences, etc.

Environment

  • macOS Sonoma 14.6.1 on an Apple MacBook Air (M1)
  • Debian 11 (Bullseye), kernel 5.10.0-32-amd64, on a Xeon E5-2650 v2
  • Python 3.11.9, 3.12.5
    • Installed via Homebrew on Mac
    • Built from source on Linux with --enable-optimizations --with-lto=full --with-pkg-config=yes

Results

Mac

List

❯ hyperfine -w 20 -r 100 "python3.11 test_loops.py --num-rows 5000000" "python3.12 test_loops.py --num-rows 5000000"
Benchmark 1: python3.11 test_loops.py --num-rows 5000000
  Time (mean ± σ):     136.3 ms ±   1.3 ms    [User: 109.2 ms, System: 25.8 ms]
  Range (min … max):   133.7 ms … 142.3 ms    100 runs

Benchmark 2: python3.12 test_loops.py --num-rows 5000000
  Time (mean ± σ):     144.3 ms ±   4.2 ms    [User: 119.1 ms, System: 23.7 ms]
  Range (min … max):   138.6 ms … 170.5 ms    100 runs

Summary
  python3.11 test_loops.py --num-rows 5000000 ran
    1.06 ± 0.03 times faster than python3.12 test_loops.py --num-rows 5000000

Array

❯ hyperfine -w 20 -r 100 "python3.11 test_loops.py --num-rows 5000000" "python3.12 test_loops.py --num-rows 5000000"
Benchmark 1: python3.11 test_loops.py --num-rows 5000000
  Time (mean ± σ):     176.5 ms ±   1.3 ms    [User: 169.8 ms, System: 5.4 ms]
  Range (min … max):   174.7 ms … 185.0 ms    100 runs

Benchmark 2: python3.12 test_loops.py --num-rows 5000000
  Time (mean ± σ):     277.4 ms ±   1.6 ms    [User: 270.7 ms, System: 5.4 ms]
  Range (min … max):   274.2 ms … 283.9 ms    100 runs

Summary
  python3.11 test_loops.py --num-rows 5000000 ran
    1.57 ± 0.01 times faster than python3.12 test_loops.py --num-rows 5000000

Linux

List

❯ hyperfine -w 20 -r 100 "python3.11 test_loops.py --num-rows 5000000" "python3.12 test_loops.py --num-rows 5000000"
Benchmark 1: python3.11 test_loops.py --num-rows 5000000
  Time (mean ± σ):     378.5 ms ±  22.1 ms    [User: 260.1 ms, System: 118.3 ms]
  Range (min … max):   356.1 ms … 484.0 ms    100 runs

Benchmark 2: python3.12 test_loops.py --num-rows 5000000
  Time (mean ± σ):     405.0 ms ±  27.1 ms    [User: 283.1 ms, System: 121.9 ms]
  Range (min … max):   379.2 ms … 527.1 ms    100 runs

Summary
  python3.11 test_loops.py --num-rows 5000000 ran
    1.07 ± 0.10 times faster than python3.12 test_loops.py --num-rows 5000000

Array

❯ hyperfine -w 20 -r 100 "python3.11 test_loops.py --num-rows 5000000" "python3.12 test_loops.py --num-rows 5000000"

Benchmark 1: python3.11 test_loops.py --num-rows 5000000
  Time (mean ± σ):     387.4 ms ±  26.1 ms    [User: 263.8 ms, System: 123.5 ms]
  Range (min … max):   356.0 ms … 478.5 ms    100 runs

Benchmark 2: python3.12 test_loops.py --num-rows 5000000
  Time (mean ± σ):     408.3 ms ±  27.9 ms    [User: 284.6 ms, System: 123.6 ms]
  Range (min … max):   371.0 ms … 540.5 ms    100 runs

Summary
  python3.11 test_loops.py --num-rows 5000000 ran
    1.05 ± 0.10 times faster than python3.12 test_loops.py --num-rows 5000000

Code

from array import array
import argparse
from typing import Iterable, List


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--chunk-size", type=int, default=250_000)
    parser.add_argument("--num-rows", type=int, default=1_000_000)

    return parser.parse_args()


def generate(num_rows: int, chunk_size: int) -> Iterable[List]:
    for i in range(0, num_rows, chunk_size):
        chunk_size = min(chunk_size, num_rows - i)

        yield generate_chunk(chunk_size)


def generate_chunk(chunk_size: int) -> List[int]:
    return [x for x in range(chunk_size)]
    # alternate return type for testing
    # return array("I", (x for x in range(chunk_size)))


if __name__ == "__main__":
    args = get_args()

    # optionally remove the list creation and discard the results
    lst: List[List[int]] = []

    for chunk in generate(
        num_rows=args.num_rows,
        chunk_size=args.chunk_size,
    ):
        # replace the append with `pass` to discard the generator's results
        lst.append(chunk)

Linked Issues:

  • https://github.com/faster-cpython/ideas/discussions/319

stephanGarland avatar Aug 31 '24 21:08 stephanGarland

Are you saying that the Python-only version shows the problem? If so, could you simplify your example to remove the C version? And then simplify the Python-only version down to the smallest amount of code that shows the problem.

Maybe I'm misreading, but if so, please clarify why both the C and Python versions are needed.

Thanks.

ericvsmith avatar Sep 02 '24 18:09 ericvsmith

Are you saying that the Python-only version shows the problem? If so, could you simplify your example to remove the C version? And then simplify the Python-only version down to the smallest amount of code that shows the problem.

Yes, this exists in pure Python. I've changed the example and extended the benchmarks to 100 runs each for consistency. I also ran additional tests using array.array("I") instead of list, since the linked issue discusses its use in benchmarking. Arrays showed a surprisingly large performance difference on macOS.

Maybe I'm misreading, but if so, please clarify why both the C and Python versions are needed.

They weren't; I wasn't sure if there was some inherent difference in list access or memory allocation, since in the C version the lists were being filled from strides. That version has now been removed.

Please let me know if there is anything else you'd like me to test or show, including compiling from source on MacOS.

stephanGarland avatar Sep 02 '24 22:09 stephanGarland

Related: this is a project of mine that makes heavy use of large loops (it also uses multiprocessing; this isn't intended as any kind of meaningful comparison to the above, just a related tangent). Performance seems to have continued to regress in 3.13. I also tried compiling 3.13 with the JIT, and testing both with and without the GIL, but in all cases it was worse.

❯ hyperfine "python3.11 prototype.py --num 1000000 --filename test.csv" "python3.12 prototype.py --num 1000000 --filename test.csv" "python3.13 prototype.py --num 1000000 --filename test.csv"
Benchmark 1: python3.11 prototype.py --num 1000000 --filename test.csv
  Time (mean ± σ):      7.774 s ±  0.032 s    [User: 12.329 s, System: 1.139 s]
  Range (min … max):    7.717 s …  7.816 s    10 runs

Benchmark 2: python3.12 prototype.py --num 1000000 --filename test.csv
  Time (mean ± σ):      8.526 s ±  0.050 s    [User: 14.004 s, System: 1.292 s]
  Range (min … max):    8.437 s …  8.612 s    10 runs

Benchmark 3: python3.13 prototype.py --num 1000000 --filename test.csv
  Time (mean ± σ):      8.802 s ±  0.057 s    [User: 14.979 s, System: 1.367 s]
  Range (min … max):    8.727 s …  8.888 s    10 runs

Summary
  python3.11 prototype.py --num 1000000 --filename test.csv ran
    1.10 ± 0.01 times faster than python3.12 prototype.py --num 1000000 --filename test.csv
    1.13 ± 0.01 times faster than python3.13 prototype.py --num 1000000 --filename test.csv

I did test the minimal reproducible code listed above with 3.13, though, and it's dead even with 3.12, so my problems above stem from something else. This was on the same Mac as before.

❯ hyperfine -w 20 -r 100 "python3.11 test_loops.py --num-rows 5000000" "python3.12 test_loops.py --num-rows 5000000" "python3.13 test_loops.py --num-rows 5000000"
Benchmark 1: python3.11 test_loops.py --num-rows 5000000
  Time (mean ± σ):     133.8 ms ±   1.1 ms    [User: 108.9 ms, System: 23.6 ms]
  Range (min … max):   130.2 ms … 138.7 ms    100 runs

Benchmark 2: python3.12 test_loops.py --num-rows 5000000
  Time (mean ± σ):     146.1 ms ±   0.9 ms    [User: 121.0 ms, System: 23.7 ms]
  Range (min … max):   143.8 ms … 149.1 ms    100 runs

Benchmark 3: python3.13 test_loops.py --num-rows 5000000
  Time (mean ± σ):     145.9 ms ±   0.9 ms    [User: 121.2 ms, System: 23.3 ms]
  Range (min … max):   143.7 ms … 148.0 ms    100 runs

Summary
  python3.11 test_loops.py --num-rows 5000000 ran
    1.09 ± 0.01 times faster than python3.13 test_loops.py --num-rows 5000000
    1.09 ± 0.01 times faster than python3.12 test_loops.py --num-rows 5000000

stephanGarland avatar Oct 24 '24 02:10 stephanGarland

We're observing the same performance regression in our internal benchmark. Is this being worked on?

Here is a thread where someone digs into the root cause; it appears to be related to constant object allocation, which makes range the usual trigger for the regression: https://stackoverflow.com/questions/77230983/why-does-it-take-longer-to-execute-a-simple-for-loop-in-python-3-12-than-in-pyth
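For context, the shape of loop that thread examines is simply range driving a trivial body, along these lines (a minimal sketch, not the benchmark code from this issue):

```python
import time

def spin(n: int) -> int:
    # range supplies the iterator; the body does minimal work
    total = 0
    for i in range(n):
        total += i
    return total

start = time.perf_counter()
result = spin(5_000_000)
elapsed = time.perf_counter() - start
print(f"sum={result}, elapsed={elapsed:.3f}s")
```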

harrydoan-db avatar Oct 24 '24 16:10 harrydoan-db

@harrydoan-db that seems to be correct! I modified generate_chunk() to avoid range:

def generate_chunk(chunk_size: int):
    count: int = 0
    lst: List[int] = []
    while count < chunk_size:
        lst.append(count)
        count += 1
    return lst

Results are below (I also tested against a self-built 3.13t free-threaded, JIT-enabled version), again on an M1 MacBook Air. Interestingly, while the relative ordering shifted in favor of 3.12, the absolute speed is quite a bit slower than with range(), though that could be due to the lack of pre-allocation for the list.

❯ hyperfine -w 20 -r 100 "python3.11 test_loops.py --num-rows 5000000" "python3.12 test_loops.py --num-rows 5000000" "python3.13 test_loops.py --num-rows 5000000" "python3.13t test_loops.py --num-rows 5000000"
Benchmark 1: python3.11 test_loops.py --num-rows 5000000
  Time (mean ± σ):     224.4 ms ±   1.4 ms    [User: 197.4 ms, System: 25.6 ms]
  Range (min … max):   220.4 ms … 226.9 ms    100 runs

Benchmark 2: python3.12 test_loops.py --num-rows 5000000
  Time (mean ± σ):     202.8 ms ±   0.9 ms    [User: 176.1 ms, System: 25.3 ms]
  Range (min … max):   200.7 ms … 206.0 ms    100 runs

Benchmark 3: python3.13 test_loops.py --num-rows 5000000
  Time (mean ± σ):     254.9 ms ±   0.8 ms    [User: 228.1 ms, System: 25.3 ms]
  Range (min … max):   252.9 ms … 258.1 ms    100 runs

Benchmark 4: python3.13t test_loops.py --num-rows 5000000
  Time (mean ± σ):     373.1 ms ±   1.8 ms    [User: 340.8 ms, System: 29.8 ms]
  Range (min … max):   370.0 ms … 378.9 ms    100 runs

Summary
  python3.12 test_loops.py --num-rows 5000000 ran
    1.11 ± 0.01 times faster than python3.11 test_loops.py --num-rows 5000000
    1.26 ± 0.01 times faster than python3.13 test_loops.py --num-rows 5000000
    1.84 ± 0.01 times faster than python3.13t test_loops.py --num-rows 5000000
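To isolate the pre-allocation question mentioned above, one possible follow-up (a hypothetical variant, not something I've benchmarked here) would fill a preallocated list instead of appending:

```python
from typing import List

def generate_chunk_prealloc(chunk_size: int) -> List[int]:
    # Preallocate so list resizing during append() is taken out of the picture;
    # any remaining slowdown would then be loop overhead rather than realloc cost.
    lst: List[int] = [0] * chunk_size
    count = 0
    while count < chunk_size:
        lst[count] = count
        count += 1
    return lst
```

Comparing this against the append() version under each interpreter should show how much of the absolute slowdown is list growth versus loop overhead.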

stephanGarland avatar Oct 26 '24 14:10 stephanGarland