SortingAlgorithms.jl icon indicating copy to clipboard operation
SortingAlgorithms.jl copied to clipboard

Add PatternDefeatingQuicksort

Open LSchwerdt opened this issue 2 years ago • 1 comments

Add Pattern-defeating Quicksort, an unstable, O(1) space, guaranteed* O(n log n) time variant of quicksort. Both normal, and branchless (block-) partitioning schemes are implemented.

(*) by falling back to HeapSort for pathological inputs

Highlights (BranchlessPdqSort):

  • 20% - 50% faster than current default when sorting Int128s and UInt128s (while using O(1) space instead of O(n))
  • Significantly faster than default for Int64s and UInt64s for special inputs:
    • Almost presorted
    • Almost reverse presorted
    • Many duplicates, but wide value range
  • Edit: Enables sortperm using the same memory footprint as current default sortperm, by simultaneously sorting values and indices. This achieves 5x-20x speedup vs sortperm for 2 GiB of Int64s, depending on hardware. It is always faster than sortperm, except for
    • very small problem sizes (that fit in L1$),
    • medium problem sizes (that fit in L3$) when sortperm uses special optimizations (Floats, Missing).

Checklist

  • [x] Test and implement feedback from PR #71 where applicable.
    • [x] Additional correctness checks
    • [x] Function argument ordering
    • [x] Variable names
    • [x] Midpoint off by one? (only relevant for performance here)
  • [x] Rebase and include initial optimizations from PR #70.
  • [ ] This PR introduces a dependency on StaticArrays with very little benefit. If PR #71 is not merged (which needs StaticArrays), use regular arrays instead.
  • [x] Cover special cases in testing.
  • [ ] Update README.md
  • [ ] Bump version number.

Open Question

  • BranchlessPdqSort is the main star of this PR. The only advantage of BranchyPdqSort is, that it uses zero allocations. Maybe define const PdqSort = BranchlessPdqSort, or delete BranchyPdqSort entirely?

Notes

  • Uses randomization instead of deterministic pattern breaking after an unbalanced partition. The implementations in Go and Rust do aswell, and the orginal author endorses that approach.
  • There are potentially faster QuickSort variants ([1],[2]), but implementing them in pure Julia is hard. /impossible?

Benchmarks

The benchmarks can be recreated using this Pluto notebook. A script for very quick benchmarking is below at the end.

Full benchmark results (9d36ce4):

AMD 3900X, 2ch DDR4-3600 CL16 (≙ 8.9ns) Intel 7820X, 4ch DDR4-2666 CL19 (≙14.3ns) Intel 6700HQ, 2ch DDR4-2400 CL14 (≙11.7ns)

Here are the results from the 7820X (b603963):

Sort Int64

pdqSort_bench_Int64 pdqSort_bench_Int64_speedup

Sort Int128

pdqSort_bench_Int128 pdqSort_bench_Int128_speedup

Sort special inputs

pdqSort_bench_specialinputs

SimultaneousSortperm

A package that implements sortperm using PdqSort is here. The benchmarking code and more results are found here.

Results (Intel 7820x)

Int64 Int64_speedup Int64_almost_presorted Int64_almost_presorted_speedup Float64 Float64_speedup Float64_by_abs2 Float64_by_abs2_speedup Categorical

Quick benchmarking script

using BenchmarkTools, Random, SortingAlgorithms
versioninfo()
for i in 2:5
    n = 17^i
    println("sort!(rand(Int128, $n))")
    v = rand(Int128, n)
    print("Default:           "); @btime sort!($v) setup=(rand!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v; alg=QuickSort) setup=(rand!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v; alg=BranchyPdqSort) setup=(rand!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v; alg=BranchlessPdqSort) setup=(rand!($v)) evals=1
end

function setupv!(v)
    v[1:end-1] = 1:length(v)-1
    v[end] = 10
end

for i in 2:5
    n = 17^i
    println("sort!([1:$n;10])")
    v = [1:n;10]
    print("Default:           "); @btime sort!($v) setup=(setupv!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v; alg=QuickSort) setup=(setupv!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v; alg=BranchyPdqSort) setup=(setupv!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v; alg=BranchlessPdqSort) setup=(setupv!($v)) evals=1
end

for i in 2:5
    n = 17^i
    println("sort!(rand(Complex{Float64},$n), by=abs2)")
    v = rand(Complex{Float64},n)
    print("Default:           "); @btime sort!($v, by=abs2) setup=(rand!($v)) evals=1
    print("Quicksort:         "); @btime sort!($v, by=abs2; alg=QuickSort) setup=(rand!($v)) evals=1
    print("BranchyPdqSort:    "); @btime sort!($v, by=abs2; alg=BranchyPdqSort) setup=(rand!($v)) evals=1
    print("BranchlessPdqSort: "); @btime sort!($v, by=abs2; alg=BranchlessPdqSort) setup=(rand!($v)) evals=1
end
Result on the 3900x (b603963)
Julia Version 1.9.0-beta4
Commit b75ddb787f (2023-02-07 21:53 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 12 on 24 virtual cores      
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12
sort!(rand(Int128, 289))
Default:             5.400 μs (1 allocation: 4.69 KiB)
Quicksort:           6.200 μs (0 allocations: 0 bytes)
BranchyPdqSort:      6.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   4.700 μs (2 allocations: 1.06 KiB)
sort!(rand(Int128, 4913))
Default:             119.900 μs (2 allocations: 76.86 KiB)
Quicksort:           165.500 μs (0 allocations: 0 bytes)
BranchyPdqSort:      171.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   103.500 μs (2 allocations: 1.06 KiB)
sort!(rand(Int128, 83521))
Default:             2.711 ms (2 allocations: 1.27 MiB)
Quicksort:           4.190 ms (0 allocations: 0 bytes)
BranchyPdqSort:      4.260 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   2.294 ms (2 allocations: 1.06 KiB)
sort!(rand(Int128, 1419857))
Default:             60.424 ms (2 allocations: 21.67 MiB)
Quicksort:           89.065 ms (0 allocations: 0 bytes)
BranchyPdqSort:      91.504 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   46.587 ms (2 allocations: 1.06 KiB)
sort!([1:289;10])
Default:             1.200 μs (2 allocations: 2.77 KiB)
Quicksort:           1.200 μs (0 allocations: 0 bytes)
BranchyPdqSort:      1.000 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   1.200 μs (2 allocations: 1.06 KiB)
sort!([1:4913;10])
Default:             19.900 μs (3 allocations: 39.55 KiB)
Quicksort:           29.800 μs (0 allocations: 0 bytes)
BranchyPdqSort:      14.900 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   15.000 μs (2 allocations: 1.06 KiB)
sort!([1:83521;10])
Default:             448.400 μs (3 allocations: 656.80 KiB)
Quicksort:           651.300 μs (0 allocations: 0 bytes)
BranchyPdqSort:      257.300 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   257.700 μs (2 allocations: 1.06 KiB)
sort!([1:1419857;10])
Default:             13.690 ms (3 allocations: 10.83 MiB)
Quicksort:           13.954 ms (0 allocations: 0 bytes)
BranchyPdqSort:      4.400 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   4.178 ms (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},289), by=abs2)
Default:             8.200 μs (1 allocation: 4.69 KiB)
Quicksort:           11.600 μs (0 allocations: 0 bytes)
BranchyPdqSort:      11.800 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   7.900 μs (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},4913), by=abs2)
Default:             211.000 μs (2 allocations: 76.86 KiB)
Quicksort:           314.600 μs (0 allocations: 0 bytes)
BranchyPdqSort:      323.200 μs (0 allocations: 0 bytes)
BranchlessPdqSort:   179.800 μs (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},83521), by=abs2)
Default:             5.041 ms (2 allocations: 1.27 MiB)
Quicksort:           7.692 ms (0 allocations: 0 bytes)
BranchyPdqSort:      7.922 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   3.739 ms (2 allocations: 1.06 KiB)
sort!(rand(Complex{Float64},1419857), by=abs2)
Default:             109.978 ms (2 allocations: 21.67 MiB)
Quicksort:           166.103 ms (0 allocations: 0 bytes)
BranchyPdqSort:      172.113 ms (0 allocations: 0 bytes)
BranchlessPdqSort:   75.914 ms (2 allocations: 1.06 KiB)

Additional correctness checks

No stable sorting guarantee -> simpler checks.

test script
using Base.Order
using Test
using SortingAlgorithms
for T in (Float64, Int, UInt8, UInt128)
    println(T)
    for alg in [BranchlessPdqSort, BranchyPdqSort]
        for order in [Forward, Reverse, By(identity)]
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = rand(T, n)
                @test sort(v; order) == sort(v; alg, order)
            end
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = rand(T, n) .% 4
                @test sort(v; order) == sort(v; alg, order)
            end
            for n in [0:1000; 2 .^(10:20); 2 .^(10:20).-1]
                v = [9;sort(rand(T, n));8]
                @test sort(v; order) == sort(v; alg, order)
            end
        end
    end
end

LSchwerdt avatar Feb 11 '23 10:02 LSchwerdt

Codecov Report

Merging #72 (8b44ef3) into master (2659334) will increase coverage by 0.96%. The diff coverage is 98.37%.

@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
+ Coverage   96.02%   96.98%   +0.96%     
==========================================
  Files           1        1              
  Lines         352      598     +246     
==========================================
+ Hits          338      580     +242     
- Misses         14       18       +4     
Impacted Files Coverage Δ
src/SortingAlgorithms.jl 96.98% <98.37%> (+0.96%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Feb 11 '23 11:02 codecov-commenter