SortingAlgorithms.jl
SortingAlgorithms.jl copied to clipboard
Add PatternDefeatingQuicksort
Add Pattern-defeating Quicksort, an unstable, O(1) space, guaranteed* O(n log n) time variant of quicksort. Both normal, and branchless (block-) partitioning schemes are implemented.
(*) by falling back to HeapSort for pathological inputs
Highlights (BranchlessPdqSort):
- 20% - 50% faster than current default when sorting Int128s and UInt128s (while using O(1) space instead of O(n))
- Significantly faster than default for Int64s and UInt64s for special inputs:
- Almost presorted
- Almost reverse presorted
- Many duplicates, but wide value range
- Edit: Enables sortperm using the same memory footprint as current default sortperm, by simultaneously sorting values and indices. This achieves 5x-20x speedup vs sortperm for 2 GiB of Int64s, depending on hardware. It is always faster than sortperm, except for
- very small problem sizes (that fit in L1$),
- medium problem sizes (that fit in L3$) when sortperm uses special optimizations (Floats, Missing).
Checklist
- [x] Test and implement feedback from PR #71 where applicable.
- [x] Additional correctness checks
- [x] Function argument ordering
- [x] Variable names
- [x] Midpoint off by one? (only relevant for performance here)
- [x] Rebase and include initial optimizations from PR #70.
- [ ] This PR introduces a dependency on StaticArrays with very little benefit. If PR #71 is not merged (which needs StaticArrays), use regular arrays instead.
- [x] Cover special cases in testing.
- [ ] Update README.md
- [ ] Bump version number.
Open Question
- BranchlessPdqSort is the main star of this PR. The only advantage of BranchyPdqSort is, that it uses zero allocations.
Maybe define
const PdqSort = BranchlessPdqSort
, or delete BranchyPdqSort entirely?
Notes
- Uses randomization instead of deterministic pattern breaking after an unbalanced partition. The implementations in Go and Rust do aswell, and the orginal author endorses that approach.
- There are potentially faster QuickSort variants ([1],[2]), but implementing them in pure Julia is hard. /impossible?
Benchmarks
The benchmarks can be recreated using this Pluto notebook. A script for very quick benchmarking is below at the end.
Full benchmark results (9d36ce4):
AMD 3900X, 2ch DDR4-3600 CL16 (≙ 8.9ns) Intel 7820X, 4ch DDR4-2666 CL19 (≙14.3ns) Intel 6700HQ, 2ch DDR4-2400 CL14 (≙11.7ns)
Here are the results from the 7820X (b603963):
Sort Int64
Sort Int128
Sort special inputs
SimultaneousSortperm
A package that implements sortperm using PdqSort is here. The benchmarking code and more results are found here.
Results (Intel 7820x)
Quick benchmarking script
using BenchmarkTools, Random, SortingAlgorithms
versioninfo()
for i in 2:5
n = 17^i
println("sort!(rand(Int128, $n))")
v = rand(Int128, n)
print("Default: "); @btime sort!($v) setup=(rand!($v)) evals=1
print("Quicksort: "); @btime sort!($v; alg=QuickSort) setup=(rand!($v)) evals=1
print("BranchyPdqSort: "); @btime sort!($v; alg=BranchyPdqSort) setup=(rand!($v)) evals=1
print("BranchlessPdqSort: "); @btime sort!($v; alg=BranchlessPdqSort) setup=(rand!($v)) evals=1
end
function setupv!(v)
v[1:end-1] = 1:length(v)-1
v[end] = 10
end
for i in 2:5
n = 17^i
println("sort!([1:$n;10])")
v = [1:n;10]
print("Default: "); @btime sort!($v) setup=(setupv!($v)) evals=1
print("Quicksort: "); @btime sort!($v; alg=QuickSort) setup=(setupv!($v)) evals=1
print("BranchyPdqSort: "); @btime sort!($v; alg=BranchyPdqSort) setup=(setupv!($v)) evals=1
print("BranchlessPdqSort: "); @btime sort!($v; alg=BranchlessPdqSort) setup=(setupv!($v)) evals=1
end
for i in 2:5
n = 17^i
println("sort!(rand(Complex{Float64},$n), by=abs2)")
v = rand(Complex{Float64},n)
print("Default: "); @btime sort!($v, by=abs2) setup=(rand!($v)) evals=1
print("Quicksort: "); @btime sort!($v, by=abs2; alg=QuickSort) setup=(rand!($v)) evals=1
print("BranchyPdqSort: "); @btime sort!($v, by=abs2; alg=BranchyPdqSort) setup=(rand!($v)) evals=1
print("BranchlessPdqSort: "); @btime sort!($v, by=abs2; alg=BranchlessPdqSort) setup=(rand!($v)) evals=1
end
Result on the 3900x (b603963)
Julia Version 1.9.0-beta4
Commit b75ddb787f (2023-02-07 21:53 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
Threads: 12 on 24 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS = 12
sort!(rand(Int128, 289))
Default: 5.400 μs (1 allocation: 4.69 KiB)
Quicksort: 6.200 μs (0 allocations: 0 bytes)
BranchyPdqSort: 6.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort: 4.700 μs (2 allocations: 1.06 KiB)
sort!(rand(Int128, 4913))
Default: 119.900 μs (2 allocations: 76.86 KiB)
Quicksort: 165.500 μs (0 allocations: 0 bytes)
BranchyPdqSort: 171.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort: 103.500 μs (2 allocations: 1.06 KiB)
sort!(rand(Int128, 83521))
Default: 2.711 ms (2 allocations: 1.27 MiB)
Quicksort: 4.190 ms (0 allocations: 0 bytes)
BranchyPdqSort: 4.260 ms (0 allocations: 0 bytes)
BranchlessPdqSort: 2.294 ms (2 allocations: 1.06 KiB)
sort!(rand(Int128, 1419857))
Default: 60.424 ms (2 allocations: 21.67 MiB)
Quicksort: 89.065 ms (0 allocations: 0 bytes)
BranchyPdqSort: 91.504 ms (0 allocations: 0 bytes)
BranchlessPdqSort: 46.587 ms (2 allocations: 1.06 KiB)
sort!([1:289;10])
Default: 1.200 μs (2 allocations: 2.77 KiB)
Quicksort: 1.200 μs (0 allocations: 0 bytes)
BranchyPdqSort: 1.000 μs (0 allocations: 0 bytes)
BranchlessPdqSort: 1.200 μs (2 allocations: 1.06 KiB)
sort!([1:4913;10])
Default: 19.900 μs (3 allocations: 39.55 KiB)
Quicksort: 29.800 μs (0 allocations: 0 bytes)
BranchyPdqSort: 14.900 μs (0 allocations: 0 bytes)
BranchlessPdqSort: 15.000 μs (2 allocations: 1.06 KiB)
sort!([1:83521;10])
Default: 448.400 μs (3 allocations: 656.80 KiB)
Quicksort: 651.300 μs (0 allocations: 0 bytes)
BranchyPdqSort: 257.300 μs (0 allocations: 0 bytes)
BranchlessPdqSort: 257.700 μs (2 allocations: 1.06 KiB)
sort!([1:1419857;10])
Default: 13.690 ms (3 allocations: 10.83 MiB)
Quicksort: 13.954 ms (0 allocations: 0 bytes)
BranchyPdqSort: 4.400 ms (0 allocations: 0 bytes)
BranchlessPdqSort: 4.178 ms (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},289), by=abs2)
Default: 8.200 μs (1 allocation: 4.69 KiB)
Quicksort: 11.600 μs (0 allocations: 0 bytes)
BranchyPdqSort: 11.800 μs (0 allocations: 0 bytes)
BranchlessPdqSort: 7.900 μs (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},4913), by=abs2)
Default: 211.000 μs (2 allocations: 76.86 KiB)
Quicksort: 314.600 μs (0 allocations: 0 bytes)
BranchyPdqSort: 323.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort: 179.800 μs (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},83521), by=abs2)
Default: 5.041 ms (2 allocations: 1.27 MiB)
Quicksort: 7.692 ms (0 allocations: 0 bytes)
BranchyPdqSort: 7.922 ms (0 allocations: 0 bytes)
BranchlessPdqSort: 3.739 ms (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},1419857), by=abs2)
Default: 109.978 ms (2 allocations: 21.67 MiB)
Quicksort: 166.103 ms (0 allocations: 0 bytes)
BranchyPdqSort: 172.113 ms (0 allocations: 0 bytes)
BranchlessPdqSort: 75.914 ms (2 allocations: 1.06 KiB)
Additional correctness checks
No stable sorting guarantee -> simpler checks.
test script
using Base.Order
using Test
using SortingAlgorithms
for T in (Float64, Int, UInt8, UInt128)
println(T)
for alg in [BranchlessPdqSort, BranchyPdqSort]
for order in [Forward, Reverse, By(identity)]
for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
v = rand(T, n)
@test sort(v; order) == sort(v; alg, order)
end
for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
v = rand(T, n) .% 4
@test sort(v; order) == sort(v; alg, order)
end
for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
v = [9;sort(rand(T, n));8]
@test sort(v; order) == sort(v; alg, order)
end
end
end
end
Codecov Report
Merging #72 (8b44ef3) into master (2659334) will increase coverage by
0.96%
. The diff coverage is98.37%
.
@@ Coverage Diff @@
## master #72 +/- ##
==========================================
+ Coverage 96.02% 96.98% +0.96%
==========================================
Files 1 1
Lines 352 598 +246
==========================================
+ Hits 338 580 +242
- Misses 14 18 +4
Impacted Files | Coverage Δ | |
---|---|---|
src/SortingAlgorithms.jl | 96.98% <98.37%> (+0.96%) |
:arrow_up: |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more