cpython
gh-94603: micro optimize list.pop
This PR optimizes `list.pop`.
Benchmark results
x=[1,2]; x.pop(0): Mean +- std dev: [m2] 81.8 ns +- 2.5 ns -> [pr_special] 68.9 ns +- 2.0 ns: 1.19x faster
x=[1,2]; x.pop(1): Mean +- std dev: [m2] 65.4 ns +- 2.1 ns -> [pr_special] 64.4 ns +- 0.6 ns: 1.02x faster
x=[]; x.insert(0,1); x.pop(0): Mean +- std dev: [m2] 111 ns +- 3 ns -> [pr_special] 80.2 ns +- 1.1 ns: 1.38x faster
list(0): insert(0, None) + pop(0): Mean +- std dev: [m2] 78.4 ns +- 0.8 ns -> [pr_special] 62.1 ns +- 1.3 ns: 1.26x faster
list(1): insert(0, None) + pop(1): Mean +- std dev: [m2] 59.7 ns +- 0.4 ns -> [pr_special] 57.7 ns +- 0.9 ns: 1.04x faster
list(2): insert(0, None) + pop(1): Mean +- std dev: [m2] 66.9 ns +- 0.6 ns -> [pr_special] 55.5 ns +- 1.8 ns: 1.20x faster
list(10): insert(0, None) + pop(1): Mean +- std dev: [m2] 71.9 ns +- 0.6 ns -> [pr_special] 62.4 ns +- 2.8 ns: 1.15x faster
list(100): insert(0, None) + pop(1): Mean +- std dev: [m2] 108 ns +- 4 ns -> [pr_special] 93.5 ns +- 3.3 ns: 1.16x faster
list(10000): insert(0, None) + pop(1): Mean +- std dev: [m2] 5.41 us +- 0.02 us -> [pr_special] 5.37 us +- 0.02 us: 1.01x faster
Benchmark hidden because not significant (2): x=[1]; x.pop(0), sortedcontainer
Geometric mean: 1.12x faster
Benchmark code
```python
import pyperf
import sortedcontainers

runner = pyperf.Runner()
runner.timeit(name=f"x=[1]; x.pop(0)", stmt='x=[1]; x.pop(0)')
runner.timeit(name=f"x=[1,2]; x.pop(0)", stmt='x=[1,2]; x.pop(0)')
runner.timeit(name=f"x=[1,2]; x.pop(1)", stmt='x=[1,2]; x.pop(1)')
runner.timeit(name=f"x=[]; x.insert(0,1); x.pop(0)", stmt='x=[]; x.insert(0,1); x.pop(0)')
runner.timeit(name=f"list(0): insert(0, None) + pop(0)", stmt="lst.insert(0, None); lst.pop(0)", setup='lst=[]')

for n in [1, 2, 10, 100, 10_000]:
    runner.timeit(name=f"list({n}): insert(0, None) + pop(1)",
                  stmt="lst.insert(0, None); lst.pop(1)",
                  setup=f'import random; lst=[random.random() for ii in range({n})]')

setup = "import random; import sortedcontainers; sc=sortedcontainers.SortedList(); x=[random.random() for n in range(1_00_000)]"
runner.timeit(name=f"sortedcontainer", stmt="[sc.add(v) for v in x]", setup=setup)
```
- Issue: gh-94603
@serhiy-storchaka is the expert who can decide whether these kinds of micro-optimizations are worth applying :)
I am a bit surprised to see a significant difference in these microbenchmarks. The method call adds an overhead, and since this method is not idempotent, you need to add even more expensive code to create a list or add an item to the existing list at every iteration, so I expected the difference to be rather unnoticeable. But I repeated the microbenchmarking myself and confirm the results. It looks interesting.
Now we need to analyze: why do we get such a significant speedup, can the optimization be generalized to other methods, and can we get the same results with simpler changes?
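As a rough check of how much of these measurements is per-iteration setup rather than the pop itself, the bare list construction can be timed on its own; a minimal sketch (assuming pyperf is installed, not part of the PR):
```python
import pyperf

runner = pyperf.Runner()
# Time only the list construction, without the pop, to estimate the share of
# "x=[1,2]; x.pop(0)" that is setup overhead rather than the pop call itself.
runner.timeit(name="x=[1,2] (construction only)", stmt="x=[1,2]")
runner.timeit(name="x=[1,2]; x.pop(0)", stmt="x=[1,2]; x.pop(0)")
```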
@serhiy-storchaka Here are some more details
- The `test/test_list.py` is not part of the PGO cycle. `list.pop` is called multiple times though, so it is part of the profiling, so I did not change that.
- The specialization for popping the last element of a list is required to achieve the same performance as the current `main`. Benchmarks on `runner.timeit(name=f"pop last item", stmt='x=[1,2,3]; x.pop(-1)')` (a complete script for this and the next snippet follows the list):
Main
pop last item: Mean +- std dev: 68.2 ns +- 0.8 ns
PR
pop last item: Mean +- std dev: 65.0 ns +- 1.1 ns
PR: No specialization for last element
pop last item: Mean +- std dev: 73.6 ns +- 0.5 ns
- The specialization for empty lists is required as well. Benchmark `runner.timeit(name=f"pop to empty", stmt='x=[1,]; x.pop(0)')` (also in the script below):
Main
pop to empty: Mean +- std dev: 64.0 ns +- 0.4 ns
PR
pop to empty: Mean +- std dev: 61.5 ns +- 0.9 ns
PR: No specialization for last element
pop to empty: Mean +- std dev: 64.5 ns +- 0.4 ns
(differences are small though)
- The reason for the speedup might be that we avoid an extra call to `memcpy` in `list_ass_slice` (e.g. we replace `memcpy(recycle, &item[ilow], s);` with a direct assignment).
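For reference, the two quoted snippets wrapped into a complete pyperf script (my wrapping, mirroring the benchmark code at the top of this PR):
```python
# Complete script for the two follow-up benchmarks quoted above.
# Run once against main and once against the PR build to reproduce the numbers.
import pyperf

runner = pyperf.Runner()
runner.timeit(name="pop last item", stmt='x=[1,2,3]; x.pop(-1)')
runner.timeit(name="pop to empty", stmt='x=[1,]; x.pop(0)')
```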
Adding similar optimizations to `list_ass_slice` does not seem to have the same performance benefits. Not sure why that is.
https://github.com/python/cpython/compare/main...eendebakpt:list_assignment_optimization
results in:
x=[1,2]; x.pop(0): Mean +- std dev: [m5] 80.8 ns +- 1.1 ns -> [pr_slice3] 79.3 ns +- 3.2 ns: 1.02x faster
x=[1,2]; x.pop(1): Mean +- std dev: [m5] 65.0 ns +- 0.4 ns -> [pr_slice3] 66.7 ns +- 2.1 ns: 1.03x slower
list(0): insert(0, None) + pop(0): Mean +- std dev: [m5] 79.1 ns +- 0.6 ns -> [pr_slice3] 83.8 ns +- 0.5 ns: 1.06x slower
list(50): x.append(1); x.pop(-2): Mean +- std dev: [m5] 48.1 ns +- 1.4 ns -> [pr_slice3] 45.6 ns +- 0.3 ns: 1.06x faster
list(1): insert(0, None) + pop(1): Mean +- std dev: [m5] 57.6 ns +- 1.0 ns -> [pr_slice3] 61.7 ns +- 0.3 ns: 1.07x slower
list(2): insert(0, None) + pop(1): Mean +- std dev: [m5] 64.8 ns +- 0.5 ns -> [pr_slice3] 68.0 ns +- 2.2 ns: 1.05x slower
list(10): insert(0, None) + pop(1): Mean +- std dev: [m5] 69.1 ns +- 1.0 ns -> [pr_slice3] 73.3 ns +- 1.8 ns: 1.06x slower
list(100): insert(0, None) + pop(1): Mean +- std dev: [m5] 119 ns +- 2 ns -> [pr_slice3] 108 ns +- 4 ns: 1.11x faster
list(10000): insert(0, None) + pop(1): Mean +- std dev: [m5] 5.82 us +- 0.03 us -> [pr_slice3] 5.40 us +- 0.16 us: 1.08x faster
sortedcontainer: Mean +- std dev: [m5] 105 ms +- 1 ms -> [pr_slice3] 110 ms +- 4 ms: 1.05x slower
Benchmark hidden because not significant (2): x=[1]; x.pop(0), x=[]; x.insert(0,1); x.pop(0)
Geometric mean: 1.00x slower
> Adding similar optimizations to `list_ass_slice` does not seem to have the same performance benefits. Not sure why that is.
A small variant https://github.com/python/cpython/compare/main...eendebakpt:list_assignment_optimization2 also gives performance similar to main. Avoiding a single memcpy does not fully explain the performance gain.
The `list_ass_slice` does more work than is required for the case of a simple pop. When called from `list_pop_impl` there are 7 redundant if statements, i.e. if statements that are needed for the general implementation of `list_ass_slice` but not when the function is only called from `list_pop_impl`.
The additional 7 if statements (and the avoided `memcpy`) can perhaps explain the performance difference we see. So the performance gain is not due to a single improvement, but to a combination of several small gains added together.
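To make the point concrete, here is a hypothetical pure-Python model (my own illustration, not the actual C code in `Objects/listobject.c`): a generic slice-deletion helper has to clamp its bounds and handle empty ranges on every call, while a dedicated pop path for a single, already-validated index can skip those checks entirely.
```python
# Hypothetical model of the argument above; the real logic lives in C
# (list_ass_slice / list_pop_impl), this only mirrors its shape.
def delete_slice_generic(lst, ilow, ihigh):
    """Generic slice deletion: must defend against arbitrary bounds."""
    n = len(lst)
    if ilow < 0:
        ilow = 0
    elif ilow > n:
        ilow = n
    if ihigh < ilow:
        ihigh = ilow
    elif ihigh > n:
        ihigh = n
    if ihigh == ilow:
        return  # empty range, nothing to delete
    del lst[ilow:ihigh]


def pop_specialized(lst, index):
    """Pop a single element whose index the caller already validated:
    none of the clamping branches above are needed."""
    value = lst[index]
    del lst[index]
    return value
```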
@serhiy-storchaka
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.
Once you have made the requested changes, please leave a comment on this pull request containing the phrase `I have made the requested changes; please review again`. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.
I have made the requested changes; please review again
Thanks for making the requested changes!
@corona10: please review the changes made to this pull request.
:robot: New build scheduled with the buildbot fleet by @corona10 for commit 7c1a8355d7ca723039d05546d171a2359644ce1c :robot:
If you want to schedule another build, you need to add the :hammer: test-with-refleak-buildbots label again.
:robot: New build scheduled with the buildbot fleet by @corona10 for commit 23eb3aa79c6e0fe9d0bcf4a969adb975f6d40bae :robot:
If you want to schedule another build, you need to add the :hammer: test-with-refleak-buildbots label again.
:warning::warning::warning: Buildbot failure :warning::warning::warning:
Hi! The buildbot ARM Raspbian 3.x has failed when building commit b3da6989524001d707ebb386bfab998452aa57bb.
What do you need to do:
- Don't panic.
- Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
- Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/424/builds/3192) and take a look at the build logs.
- Check if the failure is related to this commit (b3da6989524001d707ebb386bfab998452aa57bb) or if it is a false positive.
- If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.
You can take a look at the buildbot page here:
https://buildbot.python.org/all/#builders/424/builds/3192
Failed tests:
- test_eintr
Failed subtests:
- test_all - test.test_eintr.EINTRTests.test_all
- test_flock - main.FNTLEINTRTest.test_flock
Summary of the results of the build (if available):
== Tests result: FAILURE then ENV CHANGED ==
411 tests OK.
10 slowest tests:
- test_largefile: 6 min 34 sec
- test_venv: 6 min 6 sec
- test_eintr: 5 min 14 sec
- test_multiprocessing_spawn: 3 min 54 sec
- test_asyncio: 3 min 10 sec
- test_concurrent_futures: 2 min 56 sec
- test_multiprocessing_forkserver: 2 min 36 sec
- test_compileall: 2 min 10 sec
- test_tokenize: 2 min 4 sec
- test_gdb: 1 min 59 sec
1 test altered the execution environment: test_multiprocessing_forkserver
21 tests skipped: test_check_c_globals test_devpoll test_idle test_ioctl test_kqueue test_launcher test_msilib test_peg_generator test_perf_profiler test_startfile test_tcl test_tix test_tkinter test_ttk test_ttk_textonly test_turtle test_winconsoleio test_winreg test_winsound test_wmi test_zipfile64
1 re-run test: test_eintr
Total duration: 31 min 18 sec
Click to see traceback logs
Traceback (most recent call last):
File "/var/lib/buildbot/workers/3.x.gps-raspbian.nondebug/build/Lib/test/test_eintr.py", line 16, in test_all
script_helper.run_test_script(script)
File "/var/lib/buildbot/workers/3.x.gps-raspbian.nondebug/build/Lib/test/support/script_helper.py", line 300, in run_test_script
raise AssertionError(f"{name} failed")
AssertionError: script _test_eintr.py failed
Traceback (most recent call last):
File "/var/lib/buildbot/workers/3.x.gps-raspbian.nondebug/build/Lib/test/_test_eintr.py", line 526, in test_flock
self._lock(fcntl.flock, "flock")
File "/var/lib/buildbot/workers/3.x.gps-raspbian.nondebug/build/Lib/test/_test_eintr.py", line 508, in _lock
raise Exception("failed to sync child in %.1f sec" % dt)
Exception: failed to sync child in 300.3 sec