Add PatternDefeatingQuicksort

Open LSchwerdt opened this issue 2 years ago • 1 comments

Add Pattern-defeating Quicksort, an unstable, O(1) space, guaranteed* O(n log n) time variant of quicksort. Both normal, and branchless (block-) partitioning schemes are implemented.

(*) by falling back to HeapSort for pathological inputs

Highlights (BranchlessPdqSort):

20% - 50% faster than current default when sorting Int128s and UInt128s (while using O(1) space instead of O(n))
Significantly faster than default for Int64s and UInt64s for special inputs:
- Almost presorted
- Almost reverse presorted
- Many duplicates, but wide value range
Edit: Enables sortperm using the same memory footprint as current default sortperm, by simultaneously sorting values and indices. This achieves 5x-20x speedup vs sortperm for 2 GiB of Int64s, depending on hardware. It is always faster than sortperm, except for
- very small problem sizes (that fit in L1$),
- medium problem sizes (that fit in L3$) when sortperm uses special optimizations (Floats, Missing).

Checklist

[x] Test and implement feedback from PR #71 where applicable.
- [x] Additional correctness checks
- [x] Function argument ordering
- [x] Variable names
- [x] Midpoint off by one? (only relevant for performance here)
[x] Rebase and include initial optimizations from PR #70.
[ ] This PR introduces a dependency on StaticArrays with very little benefit. If PR #71 is not merged (which needs StaticArrays), use regular arrays instead.
[x] Cover special cases in testing.
[ ] Update README.md
[ ] Bump version number.

Open Question

BranchlessPdqSort is the main star of this PR. The only advantage of BranchyPdqSort is, that it uses zero allocations. Maybe define const PdqSort = BranchlessPdqSort, or delete BranchyPdqSort entirely?

Notes

Uses randomization instead of deterministic pattern breaking after an unbalanced partition. The implementations in Go and Rust do aswell, and the orginal author endorses that approach.
There are potentially faster QuickSort variants ([1],[2]), but implementing them in pure Julia is hard. /impossible?

Benchmarks

The benchmarks can be recreated using this Pluto notebook. A script for very quick benchmarking is below at the end.

Full benchmark results (9d36ce4):

AMD 3900X, 2ch DDR4-3600 CL16 (≙ 8.9ns) Intel 7820X, 4ch DDR4-2666 CL19 (≙14.3ns) Intel 6700HQ, 2ch DDR4-2400 CL14 (≙11.7ns)

Here are the results from the 7820X (b603963):

Sort Int64

pdqSort_bench_Int64 pdqSort_bench_Int64_speedup

Sort Int128

pdqSort_bench_Int128 pdqSort_bench_Int128_speedup

Sort special inputs

pdqSort_bench_specialinputs

SimultaneousSortperm

A package that implements sortperm using PdqSort is here. The benchmarking code and more results are found here.

Results (Intel 7820x)

Int64 Int64_speedup Int64_almost_presorted Int64_almost_presorted_speedup Float64 Float64_speedup Float64_by_abs2 Float64_by_abs2_speedup Categorical

Quick benchmarking script

using BenchmarkTools, Random, SortingAlgorithms
versioninfo()
for i in 2:5
    n = 17^i
    println("sort!(rand(Int128, $n))")
    v = rand(Int128, n)
    print("Default:           "); @btime sort!($v) setup=(rand!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v; alg=QuickSort) setup=(rand!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v; alg=BranchyPdqSort) setup=(rand!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v; alg=BranchlessPdqSort) setup=(rand!($v)) evals=1
end

function setupv!(v)
    v[1:end-1] = 1:length(v)-1
    v[end] = 10
end

for i in 2:5
    n = 17^i
    println("sort!([1:$n;10])")
    v = [1:n;10]
    print("Default:           "); @btime sort!($v) setup=(setupv!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v; alg=QuickSort) setup=(setupv!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v; alg=BranchyPdqSort) setup=(setupv!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v; alg=BranchlessPdqSort) setup=(setupv!($v)) evals=1
end

for i in 2:5
    n = 17^i
    println("sort!(rand(Complex{Float64},$n), by=abs2)")
    v = rand(Complex{Float64},n)
    print("Default:           "); @btime sort!($v, by=abs2) setup=(rand!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v, by=abs2; alg=QuickSort) setup=(rand!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v, by=abs2; alg=BranchyPdqSort) setup=(rand!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v, by=abs2; alg=BranchlessPdqSort) setup=(rand!($v)) evals=1
end

Result on the 3900x (b603963)

Julia Version 1.9.0-beta4
Commit b75ddb787f (2023-02-07 21:53 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 12 on 24 virtual cores      
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12
sort!(rand(Int128, 289))
Default:             5.400 μs (1 allocation: 4.69 KiB)
Quicksort:           6.200 μs (0 allocations: 0 bytes)
BranchyPdqSort:      6.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   4.700 μs (2 allocations: 1.06 KiB)
sort!(rand(Int128, 4913))
Default:             119.900 μs (2 allocations: 76.86 KiB)
Quicksort:           165.500 μs (0 allocations: 0 bytes)
BranchyPdqSort:      171.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   103.500 μs (2 allocations: 1.06 KiB)
sort!(rand(Int128, 83521))
Default:             2.711 ms (2 allocations: 1.27 MiB)
Quicksort:           4.190 ms (0 allocations: 0 bytes)
BranchyPdqSort:      4.260 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   2.294 ms (2 allocations: 1.06 KiB)
sort!(rand(Int128, 1419857))
Default:             60.424 ms (2 allocations: 21.67 MiB)
Quicksort:           89.065 ms (0 allocations: 0 bytes)
BranchyPdqSort:      91.504 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   46.587 ms (2 allocations: 1.06 KiB)
sort!([1:289;10])
Default:             1.200 μs (2 allocations: 2.77 KiB)
Quicksort:           1.200 μs (0 allocations: 0 bytes)
BranchyPdqSort:      1.000 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   1.200 μs (2 allocations: 1.06 KiB)
sort!([1:4913;10])
Default:             19.900 μs (3 allocations: 39.55 KiB)
Quicksort:           29.800 μs (0 allocations: 0 bytes)
BranchyPdqSort:      14.900 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   15.000 μs (2 allocations: 1.06 KiB)
sort!([1:83521;10])
Default:             448.400 μs (3 allocations: 656.80 KiB)
Quicksort:           651.300 μs (0 allocations: 0 bytes)
BranchyPdqSort:      257.300 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   257.700 μs (2 allocations: 1.06 KiB)
sort!([1:1419857;10])
Default:             13.690 ms (3 allocations: 10.83 MiB)
Quicksort:           13.954 ms (0 allocations: 0 bytes)
BranchyPdqSort:      4.400 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   4.178 ms (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},289), by=abs2)
Default:             8.200 μs (1 allocation: 4.69 KiB)
Quicksort:           11.600 μs (0 allocations: 0 bytes)
BranchyPdqSort:      11.800 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   7.900 μs (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},4913), by=abs2)
Default:             211.000 μs (2 allocations: 76.86 KiB)
Quicksort:           314.600 μs (0 allocations: 0 bytes)
BranchyPdqSort:      323.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   179.800 μs (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},83521), by=abs2)
Default:             5.041 ms (2 allocations: 1.27 MiB)
Quicksort:           7.692 ms (0 allocations: 0 bytes)
BranchyPdqSort:      7.922 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   3.739 ms (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},1419857), by=abs2)
Default:             109.978 ms (2 allocations: 21.67 MiB)
Quicksort:           166.103 ms (0 allocations: 0 bytes)
BranchyPdqSort:      172.113 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   75.914 ms (2 allocations: 1.06 KiB)

Additional correctness checks

No stable sorting guarantee -> simpler checks.

test script

using Base.Order
using Test
using SortingAlgorithms
for T in (Float64, Int, UInt8, UInt128)
    println(T)
    for alg in [BranchlessPdqSort, BranchyPdqSort]
        for order in [Forward, Reverse, By(identity)]
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = rand(T, n)
                @test sort(v; order) == sort(v; alg, order)
            end
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = rand(T, n) .% 4
                @test sort(v; order) == sort(v; alg, order)
            end
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = [9;sort(rand(T, n));8]
                @test sort(v; order) == sort(v; alg, order)
            end
        end
    end
end

Feb 11 '23 10:02 LSchwerdt

Codecov Report

Merging #72 (8b44ef3) into master (2659334) will increase coverage by 0.96%. The diff coverage is 98.37%.

@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
+ Coverage   96.02%   96.98%   +0.96%     
==========================================
  Files           1        1              
  Lines         352      598     +246     
==========================================
+ Hits          338      580     +242     
- Misses         14       18       +4

Impacted Files	Coverage Δ
src/SortingAlgorithms.jl	`96.98% <98.37%> (+0.96%)`	:arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

Feb 11 '23 11:02 codecov-commenter

SortingAlgorithms.jl SortingAlgorithms.jl copied to clipboard

Add PatternDefeatingQuicksort

Highlights (BranchlessPdqSort):

Checklist

Open Question

Notes

Benchmarks

Sort Int64

Sort Int128

Sort special inputs

SimultaneousSortperm

Quick benchmarking script

Additional correctness checks

Codecov Report

SortingAlgorithms.jl
SortingAlgorithms.jl copied to clipboard