coreutils icon indicating copy to clipboard operation
coreutils copied to clipboard

sort: -g parallel processing

Open hz2 opened this issue 5 months ago • 5 comments

This analysis addresses issue #8061 by implementing parallel number parsing for sort -g (general numeric sort). The change adds multi-core processing to improve performance on large datasets.

Benchmarking Setup

Machine: Mac M1 running macOS Sequoia Version 15.6 (24G84) Test Data: 387,597 floating-point numbers (matches original issue dataset size) Tools: Hyperfine (--shell=none --warmup 3 --min-runs 5 --prepare 'sync')

Performance Results

Original Implementation (main branch):

hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' \
  './target/release/sort -g tmp/floatdata' \
  './target/release/sort -n tmp/floatdata'

Benchmark 1: ./target/release/sort -g tmp/floatdata
  Time (mean ± σ):     117.7 ms ±   7.7 ms    [User: 341.1 ms, System: 13.8 ms]
  Range (min … max):   113.5 ms … 147.0 ms    17 runs

Benchmark 2: ./target/release/sort -n tmp/floatdata
  Time (mean ± σ):      41.2 ms ±   1.0 ms    [User: 93.1 ms, System: 11.5 ms]
  Range (min … max):    38.7 ms …  44.0 ms    34 runs

Summary
  ./target/release/sort -n tmp/floatdata ran
    2.85 ± 0.20 times faster than ./target/release/sort -g tmp/floatdata

Parallel Implementation (sort-parallelize branch):

hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' \
  './target/release/sort -g tmp/floatdata' \
  './target/release/sort -n tmp/floatdata'

Benchmark 1: ./target/release/sort -g tmp/floatdata
  Time (mean ± σ):      84.3 ms ±   8.3 ms    [User: 350.9 ms, System: 22.3 ms]
  Range (min … max):    77.2 ms … 119.3 ms    22 runs

Benchmark 2: ./target/release/sort -n tmp/floatdata
  Time (mean ± σ):      42.8 ms ±   1.7 ms    [User: 93.7 ms, System: 11.9 ms]
  Range (min … max):    39.6 ms …  47.9 ms    31 runs

Summary
  ./target/release/sort -n tmp/floatdata ran
    1.97 ± 0.21 times faster than ./target/release/sort -g tmp/floatdata

Analysis

  • sort -g performance: 117.7ms → 84.3ms (28.4% speedup)
  • Gap reduction vs sort -n: 2.85× → 1.97× (31% smaller gap).
  • Multi-core utilization: Increased User CPU time shows effective parallelization
Metric Sequential Parallel Change
sort -g time 117.7 ± 7.7 ms 84.3 ± 8.3 ms +28.4%
sort -n time 41.2 ± 1.0 ms 42.8 ± 1.7 ms ~same
Gap (n/g) 2.85× 1.97× −31%
User CPU 341.1 ms 350.9 ms (parallel work)

hz2 avatar Aug 10 '25 21:08 hz2

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)

github-actions[bot] avatar Aug 10 '25 22:08 github-actions[bot]

Thanks for picking this up!

I'm getting somewhat strange results here, on an Intel i7-1370P, that has 6 Performance cores (*2 with Hyperthreading) and 8 Efficient cores.

Basically no improvement on average:

hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' -L sort ./target/release/sort,./sort.main  '{sort} -g tmp/floatdata'
Benchmark 1: ./target/release/sort -g tmp/floatdata
  Time (mean ± σ):     131.7 ms ±  17.3 ms    [User: 1134.5 ms, System: 141.6 ms]
  Range (min … max):    92.4 ms … 159.4 ms    21 runs
 
Benchmark 2: ./sort.main -g tmp/floatdata
  Time (mean ± σ):     136.9 ms ±  12.9 ms    [User: 494.2 ms, System: 31.1 ms]
  Range (min … max):   121.3 ms … 165.9 ms    20 runs
 
Summary
  ./target/release/sort -g tmp/floatdata ran
    1.04 ± 0.17 times faster than ./sort.main -g tmp/floatdata

But if I restrict to the P cores only (virtual cores 0-11, there are 6 physical cores), performance is good:

taskset -c 0,1,2,3,4,5,6,7,8,9,10,11 hyperfine --shell=none --warmup 3 --min-runs 5 --prepare 'sync' -L sort ./target/release/sort,./sort.main  '{sort} -g tmp/floatdata'
Benchmark 1: ./target/release/sort -g tmp/floatdata
  Time (mean ± σ):      74.2 ms ±   5.0 ms    [User: 459.4 ms, System: 39.5 ms]
  Range (min … max):    70.0 ms …  94.8 ms    37 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
 
Benchmark 2: ./sort.main -g tmp/floatdata
  Time (mean ± σ):     115.7 ms ±   3.1 ms    [User: 387.1 ms, System: 18.3 ms]
  Range (min … max):   110.4 ms … 121.7 ms    25 runs
 
Summary
  ./target/release/sort -g tmp/floatdata ran
    1.56 ± 0.11 times faster than ./sort.main -g tmp/floatdata

Adding in more and more E cores (12-19) to the mix makes things worse and worse:

taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12 hyperfine ...
  ./target/release/sort -g tmp/floatdata ran
    1.23 ± 0.14 times faster than ./sort.main -g tmp/floatdata
taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12,13 hyperfine ... 
  ./target/release/sort -g tmp/floatdata ran
    1.04 ± 0.16 times faster than ./sort.main -g tmp/floatdata
taskset -c 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14 hyperfine -...
 ./target/release/sort,./sort.main  '{sort} -g tmp/floatdata'
    1.01 ± 0.12 times faster than ./sort.main -g tmp/floatdata

I'm somewhat surprised as the M1 CPU also has heterogeneous cores, but for some reason that does not cause problems for you.

drinkcat avatar Aug 11 '25 02:08 drinkcat

GNU testsuite comparison:

Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

github-actions[bot] avatar Aug 11 '25 17:08 github-actions[bot]

GNU testsuite comparison:

Skip an intermittent issue tests/misc/stdbuf (fails in this run but passes in the 'main' branch)
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

github-actions[bot] avatar Aug 11 '25 18:08 github-actions[bot]

Thanks for picking this up!

ofc! not many opportunities to work on parallelism :)

I'm somewhat surprised as the M1 CPU also has heterogeneous cores, but for some reason that does not cause problems for you.

I suspect that the intel-E cores are much slower than the P-cores, which I believe differs than how apple's E/P cores work. I believe the M1's scheduler and memory system are more forgiving, where on i7-1370P the E-cores contend for bandwidth and extend the tail.

based off your results, i noticed the huge jump in user time with little wall-time gain when E-cores are present:

  • With all cores: 1134.5ms User time but only 1.04x speedup
  • P-cores only: 459.4ms User time with 1.56x speedup

showing that parallel work is being done, but my hunch is that the E-cores are creating a bottleneck. maybe no memory-bandwidth bound + merge i was doing?

so including them in workloads actually hurt the performance, which is great to know, thanks for pointing that out!

i tried making some changes based on this in the latest commit. will continue iterating with any feedback

hz2 avatar Aug 11 '25 18:08 hz2