
gh-94603: micro optimize list.pop

Open eendebakpt opened this issue 3 years ago • 5 comments

This PR optimizes list.pop.

Benchmark results

x=[1,2]; x.pop(0): Mean +- std dev: [m2] 81.8 ns +- 2.5 ns -> [pr_special] 68.9 ns +- 2.0 ns: 1.19x faster
x=[1,2]; x.pop(1): Mean +- std dev: [m2] 65.4 ns +- 2.1 ns -> [pr_special] 64.4 ns +- 0.6 ns: 1.02x faster
x=[]; x.insert(0,1); x.pop(0): Mean +- std dev: [m2] 111 ns +- 3 ns -> [pr_special] 80.2 ns +- 1.1 ns: 1.38x faster
list(0): insert(0, None) + pop(0): Mean +- std dev: [m2] 78.4 ns +- 0.8 ns -> [pr_special] 62.1 ns +- 1.3 ns: 1.26x faster
list(1): insert(0, None) + pop(1): Mean +- std dev: [m2] 59.7 ns +- 0.4 ns -> [pr_special] 57.7 ns +- 0.9 ns: 1.04x faster
list(2): insert(0, None) + pop(1): Mean +- std dev: [m2] 66.9 ns +- 0.6 ns -> [pr_special] 55.5 ns +- 1.8 ns: 1.20x faster
list(10): insert(0, None) + pop(1): Mean +- std dev: [m2] 71.9 ns +- 0.6 ns -> [pr_special] 62.4 ns +- 2.8 ns: 1.15x faster
list(100): insert(0, None) + pop(1): Mean +- std dev: [m2] 108 ns +- 4 ns -> [pr_special] 93.5 ns +- 3.3 ns: 1.16x faster
list(10000): insert(0, None) + pop(1): Mean +- std dev: [m2] 5.41 us +- 0.02 us -> [pr_special] 5.37 us +- 0.02 us: 1.01x faster

Benchmark hidden because not significant (2): x=[1]; x.pop(0), sortedcontainer

Geometric mean: 1.12x faster
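The shrinking speedup for longer lists is consistent with the cost of pop being dominated by shifting the tail: CPython moves every element after the popped index one slot to the left, so in the list(10000) case the move of ~10,000 pointers dwarfs the constant overhead removed here. A tiny illustrative helper (not part of the PR or CPython) showing the element counts:

```python
def elements_moved_by_pop(n, i):
    """Number of elements that must shift left when popping index i
    from a list of length n (this tail move dominates for large lists)."""
    return n - 1 - i

print(elements_moved_by_pop(2, 0))        # x=[1,2]; x.pop(0) moves 1 element
print(elements_moved_by_pop(10_000, 1))   # list(10000) case moves 9998 elements
print(elements_moved_by_pop(3, 2))        # popping the last element moves nothing
```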
Benchmark code
import pyperf
import sortedcontainers


runner = pyperf.Runner()

runner.timeit(name="x=[1]; x.pop(0)", stmt='x=[1]; x.pop(0)')
runner.timeit(name="x=[1,2]; x.pop(0)", stmt='x=[1,2]; x.pop(0)')
runner.timeit(name="x=[1,2]; x.pop(1)", stmt='x=[1,2]; x.pop(1)')
runner.timeit(name="x=[]; x.insert(0,1); x.pop(0)", stmt='x=[]; x.insert(0,1); x.pop(0)')
runner.timeit(name="list(0): insert(0, None) + pop(0)", stmt="lst.insert(0, None); lst.pop(0)", setup='lst=[]')

for n in [1, 2, 10, 100, 10_000]:
    runner.timeit(name=f"list({n}): insert(0, None) + pop(1)",
                  stmt="lst.insert(0, None); lst.pop(1)",
                  setup=f'import random; lst=[random.random() for ii in range({n})]')

setup = "import random; import sortedcontainers; sc=sortedcontainers.SortedList(); x=[random.random() for n in range(100_000)]"

runner.timeit(name="sortedcontainer", stmt="[sc.add(v) for v in x]", setup=setup)
  • Issue: gh-94603

eendebakpt avatar Jul 06 '22 12:07 eendebakpt

@serhiy-storchaka is the expert who can decide whether these kinds of micro-optimizations are worth applying :)

corona10 avatar Jul 08 '22 05:07 corona10

I am a bit surprised to see a significant difference in these microbenchmarks. The method call adds overhead, and since this method is not idempotent, you need to add even more expensive code to create a list or add an item to the existing list at every iteration, so I expected the difference to be rather unnoticeable. But I repeated the microbenchmarks myself and confirm the results. It looks interesting.

Now we need to analyze why we get such a significant speedup, whether the optimization can be generalized to other methods, and whether we can get the same results with simpler changes.

serhiy-storchaka avatar Jul 09 '22 07:07 serhiy-storchaka

@serhiy-storchaka Here are some more details

  • test/test_list.py is not part of the PGO cycle. list.pop is called multiple times elsewhere during profiling, though, so it is still covered by PGO and I did not change that.
  • The specialization for popping the last element of a list is required to achieve the same performance as current main. Benchmarks on runner.timeit(name=f"pop last item",stmt='x=[1,2,3]; x.pop(-1)'):

Main

pop last item: Mean +- std dev: 68.2 ns +- 0.8 ns

PR

pop last item: Mean +- std dev: 65.0 ns +- 1.1 ns

PR: No specialization for last element

pop last item: Mean +- std dev: 73.6 ns +- 0.5 ns
  • The specialization for empty lists is required as well. Benchmark runner.timeit(name=f"pop to empty",stmt='x=[1,]; x.pop(0)'):

Main

pop to empty: Mean +- std dev: 64.0 ns +- 0.4 ns

PR

pop to empty: Mean +- std dev: 61.5 ns +- 0.9 ns

PR: No specialization for empty lists

pop to empty: Mean +- std dev: 64.5 ns +- 0.4 ns

(differences are small though)

  • The reason for the speedup might be that we avoid an extra call to memcpy in list_ass_slice (e.g. we replace memcpy(recycle, &item[ilow], s); with a direct assignment)
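The avoided memcpy can be modelled in pure Python. This is only an illustrative sketch (the actual code is C, in Objects/listobject.c): the generic slice-deletion path first copies the removed range into a temporary "recycle" buffer before shifting the tail, whereas a pop specialized for a single index can take a reference to the item directly.

```python
# Pure-Python model (not the actual C implementation) contrasting the
# generic slice-deletion path with a specialized single-item pop.

def pop_via_slice(lst, i):
    """Model of the generic path: copy the removed range into a temporary
    'recycle' buffer (cf. the memcpy in list_ass_slice), then delete it."""
    recycle = lst[i:i + 1]        # extra copy of the removed item(s)
    del lst[i:i + 1]              # shift the tail down
    return recycle[0]

def pop_direct(lst, i):
    """Model of the specialized path: grab the item first, then shift
    the tail; no temporary buffer is needed for a single element."""
    item = lst[i]                 # direct reference, no intermediate copy
    del lst[i:i + 1]              # same tail shift
    return item

x = [1, 2, 3]
assert pop_via_slice(x, 0) == 1 and x == [2, 3]
y = [1, 2, 3]
assert pop_direct(y, 0) == 1 and y == [2, 3]
```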

eendebakpt avatar Jul 11 '22 18:07 eendebakpt

Adding similar optimizations to list_ass_slice does not seem to yield the same performance benefits. I am not sure why that is.

https://github.com/python/cpython/compare/main...eendebakpt:list_assignment_optimization

results in:

x=[1,2]; x.pop(0): Mean +- std dev: [m5] 80.8 ns +- 1.1 ns -> [pr_slice3] 79.3 ns +- 3.2 ns: 1.02x faster
x=[1,2]; x.pop(1): Mean +- std dev: [m5] 65.0 ns +- 0.4 ns -> [pr_slice3] 66.7 ns +- 2.1 ns: 1.03x slower
list(0): insert(0, None) + pop(0): Mean +- std dev: [m5] 79.1 ns +- 0.6 ns -> [pr_slice3] 83.8 ns +- 0.5 ns: 1.06x slower
list(50): x.append(1); x.pop(-2): Mean +- std dev: [m5] 48.1 ns +- 1.4 ns -> [pr_slice3] 45.6 ns +- 0.3 ns: 1.06x faster
list(1): insert(0, None) + pop(1): Mean +- std dev: [m5] 57.6 ns +- 1.0 ns -> [pr_slice3] 61.7 ns +- 0.3 ns: 1.07x slower
list(2): insert(0, None) + pop(1): Mean +- std dev: [m5] 64.8 ns +- 0.5 ns -> [pr_slice3] 68.0 ns +- 2.2 ns: 1.05x slower
list(10): insert(0, None) + pop(1): Mean +- std dev: [m5] 69.1 ns +- 1.0 ns -> [pr_slice3] 73.3 ns +- 1.8 ns: 1.06x slower
list(100): insert(0, None) + pop(1): Mean +- std dev: [m5] 119 ns +- 2 ns -> [pr_slice3] 108 ns +- 4 ns: 1.11x faster
list(10000): insert(0, None) + pop(1): Mean +- std dev: [m5] 5.82 us +- 0.03 us -> [pr_slice3] 5.40 us +- 0.16 us: 1.08x faster
sortedcontainer: Mean +- std dev: [m5] 105 ms +- 1 ms -> [pr_slice3] 110 ms +- 4 ms: 1.05x slower

Benchmark hidden because not significant (2): x=[1]; x.pop(0), x=[]; x.insert(0,1); x.pop(0)

Geometric mean: 1.00x slower

eendebakpt avatar Jul 11 '22 20:07 eendebakpt

Adding similar optimizations to list_ass_slice does not seem to yield the same performance benefits. I am not sure why that is.

A small variant https://github.com/python/cpython/compare/main...eendebakpt:list_assignment_optimization2 also gives performance similar to main. Avoiding a single memcpy does not fully explain the performance gain.

list_ass_slice does more work than is required for a simple pop. When called from list_pop_impl there are 7 redundant if statements: if statements that the general implementation of list_ass_slice needs, but that would not be needed if the method were only called from list_pop_impl. The 7 extra if statements (and the avoided memcpy) can perhaps explain the performance difference we see. So the performance gain is not due to a single improvement, but to a combination of several small gains.
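For illustration, the special cases discussed in this thread could be sketched in Python roughly as follows. This is a hypothetical model, not the PR's actual C code in list_pop_impl:

```python
def pop_specialized(lst, i=-1):
    """Illustrative Python model of the special cases discussed above;
    the PR implements these fast paths in C inside list_pop_impl."""
    n = len(lst)
    if n == 0:
        raise IndexError("pop from empty list")
    if i < 0:
        i += n
    if not 0 <= i < n:
        raise IndexError("pop index out of range")
    item = lst[i]
    if n == 1:
        lst.clear()                      # special case: list becomes empty
    elif i == n - 1:
        del lst[-1]                      # special case: last element, no shift
    else:
        lst[i:n - 1] = lst[i + 1:n]      # general case: shift the tail left
        del lst[-1]
    return item

x = [1, 2, 3]
assert pop_specialized(x, 0) == 1 and x == [2, 3]
assert pop_specialized(x) == 3 and x == [2]
assert pop_specialized(x, 0) == 2 and x == []
```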

@serhiy-storchaka

eendebakpt avatar Aug 02 '22 21:08 eendebakpt

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

bedevere-bot avatar Dec 24 '22 05:12 bedevere-bot

I have made the requested changes; please review again

eendebakpt avatar Dec 26 '22 20:12 eendebakpt

Thanks for making the requested changes!

@corona10: please review the changes made to this pull request.

bedevere-bot avatar Dec 26 '22 20:12 bedevere-bot

:robot: New build scheduled with the buildbot fleet by @corona10 for commit 7c1a8355d7ca723039d05546d171a2359644ce1c :robot:

If you want to schedule another build, you need to add the :hammer: test-with-refleak-buildbots label again.

bedevere-bot avatar Dec 27 '22 02:12 bedevere-bot

:robot: New build scheduled with the buildbot fleet by @corona10 for commit 23eb3aa79c6e0fe9d0bcf4a969adb975f6d40bae :robot:

If you want to schedule another build, you need to add the :hammer: test-with-refleak-buildbots label again.

bedevere-bot avatar Dec 27 '22 03:12 bedevere-bot

:warning::warning::warning: Buildbot failure :warning::warning::warning:

Hi! The buildbot ARM Raspbian 3.x has failed when building commit b3da6989524001d707ebb386bfab998452aa57bb.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/424/builds/3192) and take a look at the build logs.
  4. Check if the failure is related to this commit (b3da6989524001d707ebb386bfab998452aa57bb) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/424/builds/3192

Failed tests:

  • test_eintr

Failed subtests:

  • test_all - test.test_eintr.EINTRTests.test_all
  • test_flock - main.FNTLEINTRTest.test_flock

Summary of the results of the build (if available):

== Tests result: FAILURE then ENV CHANGED ==

411 tests OK.

10 slowest tests:

  • test_largefile: 6 min 34 sec
  • test_venv: 6 min 6 sec
  • test_eintr: 5 min 14 sec
  • test_multiprocessing_spawn: 3 min 54 sec
  • test_asyncio: 3 min 10 sec
  • test_concurrent_futures: 2 min 56 sec
  • test_multiprocessing_forkserver: 2 min 36 sec
  • test_compileall: 2 min 10 sec
  • test_tokenize: 2 min 4 sec
  • test_gdb: 1 min 59 sec

1 test altered the execution environment: test_multiprocessing_forkserver

21 tests skipped: test_check_c_globals test_devpoll test_idle test_ioctl test_kqueue test_launcher test_msilib test_peg_generator test_perf_profiler test_startfile test_tcl test_tix test_tkinter test_ttk test_ttk_textonly test_turtle test_winconsoleio test_winreg test_winsound test_wmi test_zipfile64

1 re-run test: test_eintr

Total duration: 31 min 18 sec

Click to see traceback logs
Traceback (most recent call last):
  File "/var/lib/buildbot/workers/3.x.gps-raspbian.nondebug/build/Lib/test/test_eintr.py", line 16, in test_all
    script_helper.run_test_script(script)
  File "/var/lib/buildbot/workers/3.x.gps-raspbian.nondebug/build/Lib/test/support/script_helper.py", line 300, in run_test_script
    raise AssertionError(f"{name} failed")
AssertionError: script _test_eintr.py failed


Traceback (most recent call last):
  File "/var/lib/buildbot/workers/3.x.gps-raspbian.nondebug/build/Lib/test/_test_eintr.py", line 526, in test_flock
    self._lock(fcntl.flock, "flock")
  File "/var/lib/buildbot/workers/3.x.gps-raspbian.nondebug/build/Lib/test/_test_eintr.py", line 508, in _lock
    raise Exception("failed to sync child in %.1f sec" % dt)
Exception: failed to sync child in 300.3 sec

bedevere-bot avatar Dec 27 '22 11:12 bedevere-bot