rust Add `optimize_for_size` variants for stable and unstable sort as well as select_nth

Stable sort uses a simple merge-sort that re-uses the existing - rather gnarly - merge function.
Unstable sort jumps directly to the branchless heapsort fallback.
select_nth_unstable jumps directly to the median_of_medians fallback, which is augmented with a custom tiny smallsort and partition impl.

Some code is duplicated but de-duplication would bring it's own problems. For example swap_if_less is critical for performance, if the sorting networks don't inline it perf drops drastically, however #[inline(always)] is also a poor fit, if the provided comparison function is huge, it gives the compiler an out to only instantiate swap_if_less once and call it. Another aspect that would suffer when making swap_if_less pub, is having to cfg out dozens of functions in in smallsort module.

Part of https://github.com/rust-lang/rust/issues/125612

r? @Kobzol

Aug 25 '24 18:08 Voultapher

r? @scottmcm

rustbot has assigned @scottmcm. They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Aug 25 '24 18:08 rustbot

r? @Kobzol

Aug 25 '24 18:08 Voultapher

Could not assign reviewer from: Kobzol. User(s) Kobzol are either the PR author, already assigned, or on vacation, and there are no other candidates. Use r? to specify someone else to assign.

Aug 25 '24 18:08 rustbot

The job mingw-check failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)

#16 2.865 Building wheels for collected packages: reuse
#16 2.866   Building wheel for reuse (pyproject.toml): started
#16 3.117   Building wheel for reuse (pyproject.toml): finished with status 'done'
#16 3.118   Created wheel for reuse: filename=reuse-4.0.3-cp310-cp310-manylinux_2_35_x86_64.whl size=132715 sha256=dfa09868353292d98f811d3efdb0d54d07389e808efc71d68e3b93c514bf8bec
#16 3.119   Stored in directory: /tmp/pip-ephem-wheel-cache-nq13luie/wheels/3d/8d/0a/e0fc6aba4494b28a967ab5eaf951c121d9c677958714e34532
#16 3.121 Installing collected packages: boolean-py, binaryornot, tomlkit, reuse, python-debian, markupsafe, license-expression, jinja2, chardet, attrs
#16 3.532 Successfully installed attrs-23.2.0 binaryornot-0.4.4 boolean-py-4.0 chardet-5.2.0 jinja2-3.1.4 license-expression-30.3.0 markupsafe-2.1.5 python-debian-0.1.49 reuse-4.0.3 tomlkit-0.13.0
#16 3.532 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#16 DONE 3.6s

Aug 25 '24 18:08 rust-log-analyzer

r? libs

Aug 25 '24 19:08 Kobzol

One thing that is still outstanding is, how do we document the different implementation behavior?

I think adding even more text to the 10 places that document the current implementation of sort, sort_unstable and select_nth_unstable isn't a good idea. Maybe there could be something in a specific Rust book that mentions the implications of using optimize_for_size.

That said I think with how recent and WIP optimize_for_size still is, I'd say it's not a blocker for this PR.

Aug 26 '24 07:08 Voultapher

I think the open points have been addressed, if not, please re-open. Good to be merged from my side.

Aug 29 '24 08:08 Voultapher

Would be nice to measure the binary-size impact of this change in a real setting, here are the result of our tool that we also used in the design documents:

slice::sort:

Configuration	Type	Size default (bytes)	Size `optimize_for_size` (bytes)
release	`u64`	5304	602
release	`String`	6630	838
release_lto_thin	`u64`	5317	570
release_lto_thin	`String`	6607	806
release_lto_thin_opt_level_s	`u64`	3748	584
release_lto_thin_opt_level_s	`String`	5448	813

slice::sort_unstable:

Configuration	Type	Size default (bytes)	Size `optimize_for_size` (bytes)
release	`u64`	3308	182
release	`String`	4968	488
release_lto_thin	`u64`	3308	182
release_lto_thin	`String`	4968	488
release_lto_thin_opt_level_s	`u64`	3034	168
release_lto_thin_opt_level_s	`String`	4281	410

slice::select_nth_unstable: index = v.len() / 2

Configuration	Type	Size default (bytes)	Size `optimize_for_size` (bytes)
release	`u64`	4331	3075
release	`String`	6204	4002
release_lto_thin	`u64`	4331	3075
release_lto_thin	`String`	6204	4002
release_lto_thin_opt_level_s	`u64`	3081	2032
release_lto_thin_opt_level_s	`String`	5603	3697

Aug 29 '24 08:08 Voultapher

Pushed a bunch of changes, and updated the table above.

Sep 04 '24 18:09 Voultapher

Some perf results:

Sep 05 '24 11:09 Voultapher

LGTM, and thanks for the detailed tables and graphs as well!

@bors r+ rollup=never (just to be sure of independent perf)

Sep 24 '24 17:09 cuviper

:pushpin: Commit 54391983486d5271c378821c8e6a8f018dfdd14d has been approved by cuviper

It is now in the queue for this repository.

Sep 24 '24 17:09 bors

:hourglass: Testing commit 54391983486d5271c378821c8e6a8f018dfdd14d with merge 363ae4188316b8b22cf6c1890bc73d84d05f70a4...

Sep 24 '24 18:09 bors

:sunny: Test successful - checks-actions Approved by: cuviper Pushing 363ae4188316b8b22cf6c1890bc73d84d05f70a4 to master...

Sep 24 '24 21:09 bors

Finished benchmarking commit (363ae4188316b8b22cf6c1890bc73d84d05f70a4): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression cc @rust-lang/wg-compiler-performance

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	0.3%	[0.2%, 0.3%]	3
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	0.3%	[0.2%, 0.3%]	3

Max RSS (memory usage)

Results (primary 5.7%, secondary 2.6%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	5.7%	[4.0%, 9.1%]	3
Regressions ❌ (secondary)	2.6%	[2.0%, 3.1%]	3
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	5.7%	[4.0%, 9.1%]	3

Cycles

Results (primary -0.5%, secondary 3.8%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	3.8%	[2.8%, 4.3%]	5
Improvements ✅ (primary)	-0.5%	[-0.5%, -0.5%]	1
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.5%	[-0.5%, -0.5%]	1

Binary size

Results (primary 0.0%, secondary -0.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	0.2%	[0.0%, 0.8%]	14
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.1%	[-0.1%, -0.0%]	21
Improvements ✅ (secondary)	-0.1%	[-0.2%, -0.1%]	38
All ❌✅ (primary)	0.0%	[-0.1%, 0.8%]	35

Bootstrap: 767.328s -> 768.432s (0.14%) Artifact size: 340.88 MiB -> 340.84 MiB (-0.01%)

Sep 24 '24 22:09 rust-timer

Not sure the rust-timer run tells us anything useful here, given that all changes here happened to code behind a libcore feature that isn't being built IIUC.

Sep 25 '24 13:09 Voultapher

I don't think that further investigation is needed. This has seemingly perturbed codegen a little bit, but otherwise it's hidden behind an optional flag. The image regression has flipped back right after.

@rustbot label: +perf-regression-triaged

Oct 01 '24 07:10 Kobzol

rust
rust copied to clipboard

Add `optimize_for_size` variants for stable and unstable sort as well as select_nth_unstable

Overall result: ❌ regressions - ACTION NEEDED

Instruction count

Max RSS (memory usage)

Cycles

Binary size

rust rust copied to clipboard

Add `optimize_for_size` variants for stable and unstable sort as well as select_nth_unstable

Overall result: ❌ regressions - ACTION NEEDED

Instruction count

Max RSS (memory usage)

Cycles

Binary size

rust
rust copied to clipboard