cpython icon indicating copy to clipboard operation
cpython copied to clipboard

gh-97912: Avoid quadratic behavior when adding LOAD_FAST_CHECK

Open sweeneyde opened this issue 3 years ago • 3 comments

  • Issue: gh-97912

sweeneyde avatar Oct 06 '22 06:10 sweeneyde

A benchmark:

from time import perf_counter
from pathlib import Path
import sympy.integrals.rubi.rules.sine as sine
text = Path(sine.__file__).read_text("utf-8")
t0 = perf_counter()
for _ in range(10):
    compile(text, "sine", "exec")
t1 = perf_counter()
print((t1 - t0) / 10)

On my machine, this goes from about 0.96 seconds before this PR to 0.32 seconds after, both on windows without PGO.

Benchmarking 3.11 (no PGO) in the same way, I also get roughly 0.32 seconds.

sweeneyde avatar Oct 06 '22 06:10 sweeneyde

The sympy file in the benchmark can be downloaded from https://raw.githubusercontent.com/sympy/sympy/master/sympy/integrals/rubi/rules/sine.py

I can confirm the benchmark goes much faster compared to main. In fact, my benchmark results are more dramatic. That's a PGO build on macOS arm64.

Main: 1.9035797520901543 With the patch: 0.4364458600000944

Benchmark at 63 locals looks like this:

Main: 0.00048459595802705736 With the patch: 0.00043200499995145946

Benchmark at 64 locals looks like this:

Main: 0.0004902609169948846 With the patch: 0.00043480941600864754

The patch is still faster in those cases.

ambv avatar Oct 07 '22 18:10 ambv

I could measure no compilation performance difference from adding fast_scan_many_locals, but it does restore some of the functionality. I ran this:

import sympy.integrals.rubi.rules.sine as sine
from collections import Counter
import dis
counts = Counter(i.opname for i in dis.get_instructions(sine.sine))
print(counts)

Before this PR:

Counter({'EXTENDED_ARG': 39840, 'LOAD_GLOBAL': 34145, 'CALL': 26331, 'LOAD_CONST': 19502, 'BINARY_OP': 14087, 'LOAD_FAST': 13861, 'STORE_FAST': 2836, 'LIST_APPEND': 1234, 'IMPORT_FROM': 368, 'RESUME': 1, 'IMPORT_NAME': 1, 'POP_TOP': 1, 'BUILD_LIST': 1, 'RETURN_VALUE': 1})

Intermediate: without fast_scan_many_locals

Counter({'EXTENDED_ARG': 39840, 'LOAD_GLOBAL': 34145, 'CALL': 26331, 'LOAD_CONST': 19502, 'BINARY_OP': 14087, 'LOAD_FAST': 8756, 'LOAD_FAST_CHECK': 5105, 'STORE_FAST': 2836, 'LIST_APPEND': 1234, 'IMPORT_FROM': 368, 'RESUME': 1, 'IMPORT_NAME': 1, 'POP_TOP': 1, 'BUILD_LIST': 1, 'RETURN_VALUE': 1})

After this PR, including fast_scan_many_locals: the same counts as before this PR. Since the sine() function has very few basicblocks (one?), fast_scan_many_locals was effective.

sweeneyde avatar Oct 07 '22 22:10 sweeneyde

:robot: New build scheduled with the buildbot fleet by @sweeneyde for commit b07183c6252988dfac0fe176101bb3b1522a2ef8 :robot:

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

bedevere-bot avatar Oct 20 '22 02:10 bedevere-bot

@iritkatriel or @markshannon, are there any objections to adding basicblock->b_unsafe_locals_mask? If not, I think this is ready.

sweeneyde avatar Oct 20 '22 18:10 sweeneyde

@iritkatriel or @markshannon, are there any objections to adding basicblock->b_unsafe_locals_mask? If not, I think this is ready.

No objections. I wondered if we can have this called from optimize_cfg (so that it can be accessed from unit tests). It would probably just require passing in the nparams and nlocals rather than the compiler (c). We can make that change later though.

iritkatriel avatar Oct 20 '22 18:10 iritkatriel