sort: -g parallel processing
This analysis addresses issue #8061 by implementing parallel number parsing for sort -g (general numeric sort). The change adds multi-core processing to improve performance on large datasets.
Benchmarking Setup
Machine: Mac M1 running macOS Sequoia Version 15.6 (24G84)
Test Data: 387,597 floating-point numbers (matches original issue dataset size)
Tools: Hyperfine (--shell=none --warmup 3 --min-runs 5 --prepare 'sync')
Performance Results
Original Implementation (main branch):
hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' \
'./target/release/sort -g tmp/floatdata' \
'./target/release/sort -n tmp/floatdata'
Benchmark 1: ./target/release/sort -g tmp/floatdata
Time (mean ± σ): 117.7 ms ± 7.7 ms [User: 341.1 ms, System: 13.8 ms]
Range (min … max): 113.5 ms … 147.0 ms 17 runs
Benchmark 2: ./target/release/sort -n tmp/floatdata
Time (mean ± σ): 41.2 ms ± 1.0 ms [User: 93.1 ms, System: 11.5 ms]
Range (min … max): 38.7 ms … 44.0 ms 34 runs
Summary
./target/release/sort -n tmp/floatdata ran
2.85 ± 0.20 times faster than ./target/release/sort -g tmp/floatdata
Parallel Implementation (sort-parallelize branch):
hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' \
'./target/release/sort -g tmp/floatdata' \
'./target/release/sort -n tmp/floatdata'
Benchmark 1: ./target/release/sort -g tmp/floatdata
Time (mean ± σ): 84.3 ms ± 8.3 ms [User: 350.9 ms, System: 22.3 ms]
Range (min … max): 77.2 ms … 119.3 ms 22 runs
Benchmark 2: ./target/release/sort -n tmp/floatdata
Time (mean ± σ): 42.8 ms ± 1.7 ms [User: 93.7 ms, System: 11.9 ms]
Range (min … max): 39.6 ms … 47.9 ms 31 runs
Summary
./target/release/sort -n tmp/floatdata ran
1.97 ± 0.21 times faster than ./target/release/sort -g tmp/floatdata
Analysis
- sort -g performance: 117.7ms → 84.3ms (28.4% speedup)
- Gap reduction vs sort -n: 2.85× → 1.97× (31% smaller gap).
- Multi-core utilization: Increased User CPU time shows effective parallelization
| Metric | Sequential | Parallel | Change |
|---|---|---|---|
| sort -g time | 117.7 ± 7.7 ms | 84.3 ± 8.3 ms | +28.4% |
| sort -n time | 41.2 ± 1.0 ms | 42.8 ± 1.7 ms | ~same |
| Gap (n/g) | 2.85× | 1.97× | −31% |
| User CPU | 341.1 ms | 350.9 ms | (parallel work) |
GNU testsuite comparison:
Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Thanks for picking this up!
I'm getting somewhat strange results here, on an Intel i7-1370P, that has 6 Performance cores (*2 with Hyperthreading) and 8 Efficient cores.
Basically no improvement on average:
hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' -L sort ./target/release/sort,./sort.main '{sort} -g tmp/floatdata'
Benchmark 1: ./target/release/sort -g tmp/floatdata
Time (mean ± σ): 131.7 ms ± 17.3 ms [User: 1134.5 ms, System: 141.6 ms]
Range (min … max): 92.4 ms … 159.4 ms 21 runs
Benchmark 2: ./sort.main -g tmp/floatdata
Time (mean ± σ): 136.9 ms ± 12.9 ms [User: 494.2 ms, System: 31.1 ms]
Range (min … max): 121.3 ms … 165.9 ms 20 runs
Summary
./target/release/sort -g tmp/floatdata ran
1.04 ± 0.17 times faster than ./sort.main -g tmp/floatdata
But if I restrict to the P cores only (virtual cores 0-11, there are 6 physical cores), performance is good:
taskset -c 0,1,2,3,4,5,6,7,8,9,10,11 hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' -L sort ./target/release/sort,./sort.main '{sort} -g tmp/floatdata'
Benchmark 1: ./target/release/sort -g tmp/floatdata
Time (mean ± σ): 74.2 ms ± 5.0 ms [User: 459.4 ms, System: 39.5 ms]
Range (min … max): 70.0 ms … 94.8 ms 37 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
Benchmark 2: ./sort.main -g tmp/floatdata
Time (mean ± σ): 115.7 ms ± 3.1 ms [User: 387.1 ms, System: 18.3 ms]
Range (min … max): 110.4 ms … 121.7 ms 25 runs
Summary
./target/release/sort -g tmp/floatdata ran
1.56 ± 0.11 times faster than ./sort.main -g tmp/floatdata
Adding in more and more E cores (12-19) to the mix makes things worse and worse:
taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12 hyperfine ...
./target/release/sort -g tmp/floatdata ran
1.23 ± 0.14 times faster than ./sort.main -g tmp/floatdata
taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12,13 hyperfine ...
./target/release/sort -g tmp/floatdata ran
1.04 ± 0.16 times faster than ./sort.main -g tmp/floatdata
taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14 hyperfine -...
./target/release/sort,./sort.main '{sort} -g tmp/floatdata'
1.01 ± 0.12 times faster than ./sort.main -g tmp/floatdata
I'm somewhat surprised as the M1 CPU also has heterogeneous cores, but for some reason that does not cause problems for you.
GNU testsuite comparison:
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
GNU testsuite comparison:
Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)
Thanks for picking this up!
ofc! not many opportunities to work on parallelism :)
I'm somewhat surprised as the M1 CPU also has heterogeneous cores, but for some reason that does not cause problems for you.
I suspect that the intel-E cores are much slower than the P-cores, which I believe differs than how apple's E/P cores work. I believe the M1's scheduler and memory system are more forgiving, where on i7-1370P the E-cores contend for bandwidth and extend the tail.
based off your results, i noticed the huge jump in user time with little wall-time gain when E-cores are present:
- With all cores: 1134.5ms User time but only 1.04x speedup
- P-cores only: 459.4ms User time with 1.56x speedup
showing that parallel work is being done, but my hunch is that the E-cores are creating a bottleneck. maybe no memory-bandwidth bound + merge i was doing?
so including them in workloads actually hurt the performance, which is great to know, thanks for pointing that out!
i tried making some changes based on this in the latest commit. will continue iterating with any feedback