highway icon indicating copy to clipboard operation
highway copied to clipboard

contrib/sort: enable f16/f64 sorting on no native FP

Open seiko2plus opened this issue 3 months ago • 11 comments

Add an integer-domain total-order comparator and use it for float16_t/double on targets lacking native fp operations. Preserves IEEE intent (e.g., -0.0 < +0.0); Wires into sort/select traits, removes FP-only guards, and enables basic tests.

This is part of enabling Highway vqsort for NumPy on x86, as intel-x86-sort provides emulation for half-precision

seiko2plus avatar Sep 28 '25 11:09 seiko2plus

but the -0 != 0 part is surprising, do we really need that?

IMHO, treating -0.0 < 0.0, and -0.0 != 0.0 as true is numerically reasonable for sorting. Enforcing (-0 == 0) == true and (-0 < 0) == false would require extra mask/blend steps to emulate floating-point behavior, which seems unnecessary here and adds overhead. If I’m not mistaken, the ordering of ±0 is currently unspecified, so allowing -0.0 < 0.0 is acceptable and improves performance.

For reference, here’s a benchmark: hwy::vqsort vs Intel’s sort using the NumPy benchmark suite.

NumPy’s upcoming patch (abandoning x86-simd-sort), tested on an AMD Ryzen 7 7700X (no native FP16 support), Clang 19.1.7

| Change   | Before [4486271c] <hwy-x86-vqsort~4>   | After [3310265e] <hwy-x86-vqsort>   |   Ratio | Benchmark (Parameter)                                                                    |
|----------|----------------------------------------|-------------------------------------|---------|------------------------------------------------------------------------------------------|
| +        | 19.4±0.08ms                            | 23.5±0.3ms                          |    1.21 | bench_function_base.Sort.time_sort('heap', 'uint32', ('reversed',))                      |
| +        | 19.4±0.01ms                            | 23.5±0.3ms                          |    1.21 | bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',))                      |
| +        | 19.4±0.01ms                            | 23.4±0.7ms                          |    1.2  | bench_function_base.Sort.time_sort('heap', 'int32', ('reversed',))                       |
| +        | 19.5±0.01ms                            | 23.4±0.6ms                          |    1.2  | bench_function_base.Sort.time_sort('quick', 'uint32', ('ordered',))                      |
| +        | 19.4±0.01ms                            | 23.3±0.4ms                          |    1.2  | bench_function_base.Sort.time_sort('quick', 'uint32', ('reversed',))                     |
| +        | 19.5±0.01ms                            | 23.1±0.2ms                          |    1.19 | bench_function_base.Sort.time_sort('heap', 'int32', ('ordered',))                        |
| +        | 19.5±0.01ms                            | 23.1±0.3ms                          |    1.19 | bench_function_base.Sort.time_sort('heap', 'uint32', ('ordered',))                       |
| +        | 19.5±0.01ms                            | 23.1±0.3ms                          |    1.19 | bench_function_base.Sort.time_sort('quick', 'int32', ('ordered',))                       |
| +        | 598±20μs                               | 701±40μs                            |    1.17 | bench_function_base.Sort.time_argsort('merge', 'float32', ('ordered',))                  |
| +        | 19.9±0.1ms                             | 23.4±0.2ms                          |    1.17 | bench_function_base.Sort.time_sort('heap', 'float32', ('ordered',))                      |
| +        | 19.9±0.07ms                            | 23.3±0.3ms                          |    1.17 | bench_function_base.Sort.time_sort('heap', 'float32', ('reversed',))                     |
| +        | 19.9±0.04ms                            | 23.3±0.5ms                          |    1.17 | bench_function_base.Sort.time_sort('quick', 'float32', ('reversed',))                    |
| +        | 19.9±0.02ms                            | 23.1±0.6ms                          |    1.16 | bench_function_base.Sort.time_sort('quick', 'float32', ('ordered',))                     |
| +        | 205±0.6μs                              | 229±0.6μs                           |    1.11 | bench_function_base.Partition.time_argpartition('float16', ('sorted_block', 1000), 10)   |
| +        | 206±0.5μs                              | 229±0.3μs                           |    1.11 | bench_function_base.Partition.time_argpartition('float16', ('sorted_block', 1000), 100)  |
| +        | 207±0.08μs                             | 231±0.2μs                           |    1.11 | bench_function_base.Partition.time_argpartition('float16', ('sorted_block', 1000), 1000) |
| +        | 20.3±0.03ms                            | 22.5±0.05ms                         |    1.11 | bench_function_base.Sort.time_sort('heap', 'int32', ('random',))                         |
| +        | 20.3±0.08ms                            | 22.5±0.03ms                         |    1.11 | bench_function_base.Sort.time_sort('quick', 'int32', ('random',))                        |
| +        | 20.3±0.01ms                            | 22.5±0.05ms                         |    1.11 | bench_function_base.Sort.time_sort('quick', 'uint32', ('random',))                       |
| +        | 20.4±0.1ms                             | 22.5±0.03ms                         |    1.1  | bench_function_base.Sort.time_sort('heap', 'uint32', ('random',))                        |
| +        | 20.9±0.1ms                             | 22.8±0.02ms                         |    1.09 | bench_function_base.Sort.time_sort('heap', 'float32', ('random',))                       |
| +        | 21.2±0.01ms                            | 23.1±0.05ms                         |    1.09 | bench_function_base.Sort.time_sort('heap', 'int32', ('sorted_block', 10))                |
| +        | 21.2±0.02ms                            | 23.2±0.03ms                         |    1.09 | bench_function_base.Sort.time_sort('heap', 'uint32', ('sorted_block', 10))               |
| +        | 20.9±0.08ms                            | 22.8±0.2ms                          |    1.09 | bench_function_base.Sort.time_sort('quick', 'float32', ('random',))                      |
| +        | 21.3±0.03ms                            | 23.3±0.04ms                         |    1.09 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 10))               |
| +        | 21.3±0.06ms                            | 23.1±0.2ms                          |    1.09 | bench_function_base.Sort.time_sort('quick', 'int32', ('sorted_block', 100))              |
| +        | 21.3±0.04ms                            | 23.1±0.06ms                         |    1.09 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 10))              |
| +        | 21.8±0.03ms                            | 23.5±0.05ms                         |    1.08 | bench_function_base.Sort.time_sort('heap', 'float32', ('sorted_block', 10))              |
| +        | 21.3±0.04ms                            | 22.9±0.07ms                         |    1.08 | bench_function_base.Sort.time_sort('heap', 'int32', ('sorted_block', 100))               |
| +        | 21.8±0.03ms                            | 23.6±0.07ms                         |    1.08 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 10))             |
| +        | 21.3±0.03ms                            | 22.9±0.3ms                          |    1.08 | bench_function_base.Sort.time_sort('quick', 'uint32', ('sorted_block', 100))             |
| +        | 21.3±0.04ms                            | 22.8±0.1ms                          |    1.07 | bench_function_base.Sort.time_sort('heap', 'uint32', ('sorted_block', 100))              |
| +        | 608±20μs                               | 644±20μs                            |    1.06 | bench_function_base.Sort.time_argsort('merge', 'float32', ('uniform',))                  |
| +        | 21.9±0.05ms                            | 23.2±0.1ms                          |    1.06 | bench_function_base.Sort.time_sort('heap', 'float32', ('sorted_block', 100))             |
| +        | 21.8±0.05ms                            | 23.2±0.07ms                         |    1.06 | bench_function_base.Sort.time_sort('quick', 'float32', ('sorted_block', 100))            |
| -        | 79.9±0.1μs                             | 73.5±0.2μs                          |    0.92 | bench_function_base.Partition.time_argpartition('int16', ('sorted_block', 10), 10)       |
| -        | 79.9±0.1μs                             | 73.3±0.1μs                          |    0.92 | bench_function_base.Partition.time_argpartition('int16', ('sorted_block', 10), 100)      |
| -        | 80.9±0.2μs                             | 74.2±0.2μs                          |    0.92 | bench_function_base.Partition.time_argpartition('int16', ('sorted_block', 10), 1000)     |
| -        | 357±0.03μs                             | 321±3μs                             |    0.9  | bench_function_base.Partition.time_partition('float32', ('sorted_block', 100), 100)      |
| -        | 360±0.09μs                             | 316±3μs                             |    0.88 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 100), 1000)     |
| -        | 734±0.6μs                              | 645±3μs                             |    0.88 | bench_function_base.Sort.time_argsort('merge', 'float64', ('uniform',))                  |
| -        | 357±0.07μs                             | 310±5μs                             |    0.87 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 100), 10)       |
| -        | 311±0.2μs                              | 270±4μs                             |    0.87 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 1000), 10)        |
| -        | 734±0.8μs                              | 638±20μs                            |    0.87 | bench_function_base.Sort.time_argsort('merge', 'float64', ('ordered',))                  |
| -        | 81.0±0.3μs                             | 70.2±0.7μs                          |    0.87 | bench_function_base.Sort.time_sort('quick', 'float16', ('uniform',))                     |
| -        | 81.0±0.2μs                             | 70.0±0.6μs                          |    0.86 | bench_function_base.Sort.time_sort('heap', 'float16', ('uniform',))                      |
| -        | 312±0.2μs                              | 265±3μs                             |    0.85 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 1000), 100)       |
| -        | 310±0.2μs                              | 261±8μs                             |    0.84 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 1000), 1000)      |
| -        | 344±0.08μs                             | 270±3μs                             |    0.79 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 1000), 10)      |
| -        | 343±0.07μs                             | 270±5μs                             |    0.79 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 1000), 1000)    |
| -        | 439±0.09μs                             | 342±2μs                             |    0.78 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 10), 10)          |
| -        | 438±0.2μs                              | 343±3μs                             |    0.78 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 10), 1000)        |
| -        | 344±0.1μs                              | 265±7μs                             |    0.77 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 1000), 100)     |
| -        | 439±0.2μs                              | 338±3μs                             |    0.77 | bench_function_base.Partition.time_partition('int32', ('sorted_block', 10), 100)         |
| -        | 474±0.07μs                             | 344±0.9μs                           |    0.73 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 10), 10)        |
| -        | 474±0.09μs                             | 348±2μs                             |    0.73 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 10), 100)       |
| -        | 473±0.08μs                             | 342±2μs                             |    0.72 | bench_function_base.Partition.time_partition('float32', ('sorted_block', 10), 1000)      |
| -        | 14.6±0.01ms                            | 5.04±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('heap', 'int16', ('random',))                         |
| -        | 14.8±0.02ms                            | 5.17±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 100))               |
| -        | 14.7±0.03ms                            | 5.13±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 1000))              |
| -        | 14.6±0.01ms                            | 5.04±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('quick', 'int16', ('random',))                        |
| -        | 14.8±0.02ms                            | 5.16±0.01ms                         |    0.35 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 100))              |
| -        | 14.8±0.02ms                            | 5.15±0.04ms                         |    0.35 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 1000))             |
| -        | 15.0±0.01ms                            | 5.18±0.01ms                         |    0.34 | bench_function_base.Sort.time_sort('heap', 'int16', ('sorted_block', 10))                |
| -        | 15.0±0.01ms                            | 5.17±0.01ms                         |    0.34 | bench_function_base.Sort.time_sort('quick', 'int16', ('sorted_block', 10))               |
| -        | 225±0.07μs                             | 72.2±0.1μs                          |    0.32 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 10), 10)          |
| -        | 225±0.04μs                             | 72.2±0.2μs                          |    0.32 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 10), 100)         |
| -        | 225±0.08μs                             | 72.6±0.2μs                          |    0.32 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 10), 1000)        |
| -        | 15.0±0ms                               | 4.72±0.04ms                         |    0.32 | bench_function_base.Sort.time_sort('quick', 'int16', ('ordered',))                       |
| -        | 15.0±0ms                               | 4.71±0.07ms                         |    0.31 | bench_function_base.Sort.time_sort('heap', 'int16', ('ordered',))                        |
| -        | 246±0.1μs                              | 69.9±0.6μs                          |    0.28 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 100), 10)         |
| -        | 246±0.1μs                              | 69.2±0.5μs                          |    0.28 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 100), 100)        |
| -        | 247±0.05μs                             | 69.7±0.4μs                          |    0.28 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 100), 1000)       |
| -        | 209±0.1μs                              | 55.6±0.9μs                          |    0.27 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 1000), 10)        |
| -        | 209±0.2μs                              | 55.8±0.6μs                          |    0.27 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 1000), 1000)      |
| -        | 209±0.1μs                              | 55.0±0.2μs                          |    0.26 | bench_function_base.Partition.time_partition('int16', ('sorted_block', 1000), 100)       |
| -        | 2.33±0ms                               | 514±10μs                            |    0.22 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 10))              |
| -        | 2.33±0ms                               | 523±4μs                             |    0.22 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 10))             |
| -        | 2.15±0.01ms                            | 440±3μs                             |    0.21 | bench_function_base.Sort.time_sort('heap', 'float16', ('ordered',))                      |
| -        | 2.36±0.01ms                            | 495±6μs                             |    0.21 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',))                       |
| -        | 2.15±0ms                               | 444±2μs                             |    0.21 | bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',))                     |
| -        | 2.37±0ms                               | 499±10μs                            |    0.21 | bench_function_base.Sort.time_sort('quick', 'float16', ('random',))                      |
| -        | 2.31±0.01ms                            | 460±4μs                             |    0.2  | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 100))            |
| -        | 2.33±0.01ms                            | 454±4μs                             |    0.19 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100))             |
| -        | 36.5±0.05ms                            | 5.74±0.06ms                         |    0.16 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 10))              |
| -        | 37.0±0.08ms                            | 5.78±0.05ms                         |    0.16 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 100))             |
| -        | 35.1±0.08ms                            | 5.56±0.1ms                          |    0.16 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 10))                |
| -        | 36.5±0.06ms                            | 5.68±0.09ms                         |    0.16 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 10))             |
| -        | 37.1±0.1ms                             | 5.85±0.2ms                          |    0.16 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 100))            |
| -        | 35.1±0.08ms                            | 5.56±0.07ms                         |    0.16 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 10))               |
| -        | 33.8±0.3ms                             | 5.05±0.06ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('ordered',))                      |
| -        | 34.9±0.1ms                             | 5.12±0.02ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('random',))                       |
| -        | 34.0±0.4ms                             | 5.14±0.05ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('reversed',))                     |
| -        | 39.2±0.08ms                            | 5.97±0.08ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'float64', ('sorted_block', 1000))            |
| -        | 32.5±0.3ms                             | 4.94±0.03ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('ordered',))                        |
| -        | 33.7±0.2ms                             | 4.98±0.01ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('random',))                         |
| -        | 32.8±0.3ms                             | 4.99±0.03ms                         |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('reversed',))                       |
| -        | 35.6±0.07ms                            | 5.34±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 100))               |
| -        | 37.7±0.1ms                             | 5.85±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 1000))              |
| -        | 33.8±0.2ms                             | 5.12±0.04ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))                     |
| -        | 35.0±0.2ms                             | 5.15±0.08ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('random',))                      |
| -        | 33.9±0.2ms                             | 5.17±0.04ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('reversed',))                    |
| -        | 39.3±0.1ms                             | 5.98±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('quick', 'float64', ('sorted_block', 1000))           |
| -        | 32.4±0.1ms                             | 4.94±0.05ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('ordered',))                       |
| -        | 33.6±0.1ms                             | 5.19±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('random',))                        |
| -        | 32.6±0.2ms                             | 5.03±0.09ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('reversed',))                      |
| -        | 35.6±0.09ms                            | 5.50±0.06ms                         |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 100))              |
| -        | 37.8±0.08ms                            | 5.79±0.1ms                          |    0.15 | bench_function_base.Sort.time_sort('quick', 'int64', ('sorted_block', 1000))             |
| -        | 35.1±0.2ms                             | 5.31±0.03ms                         |    0.15 | bench_function_base.Sort.time_sort_worst                                                 |
| -        | 3.37±0.01ms                            | 440±4μs                             |    0.13 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000))            |
| -        | 3.37±0.02ms                            | 437±6μs                             |    0.13 | bench_function_base.Sort.time_sort('quick', 'float16', ('sorted_block', 1000))           |
| -        | 331±0.03μs                             | 38.6±0.2μs                          |    0.12 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 10), 10)        |
| -        | 331±0.09μs                             | 38.6±0.06μs                         |    0.12 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 10), 100)       |
| -        | 331±0.08μs                             | 38.7±0.2μs                          |    0.12 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 10), 1000)      |
| -        | 323±0.1μs                              | 34.0±0.2μs                          |    0.11 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 100), 10)       |
| -        | 323±0.08μs                             | 34.2±0.1μs                          |    0.11 | bench_function_base.Partition.time_partition('float16', ('sorted_block', 100), 100)      |
| -        | 323±0.05μs                             | 33.8±0.07μs                         |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 100), 1000)     |
| -        | 288±0.2μs                              | 29.9±0.2μs                          |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 1000), 10)      |
| -        | 288±0.07μs                             | 29.9±0.2μs                          |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 1000), 100)     |
| -        | 287±0.05μs                             | 29.9±0.03μs                         |    0.1  | bench_function_base.Partition.time_partition('float16', ('sorted_block', 1000), 1000)    |
| -        | 537±1μs                                | 50.1±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 10), 10)        |
| -        | 536±0.2μs                              | 50.1±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 10), 100)       |
| -        | 536±0.1μs                              | 50.3±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 10), 1000)      |
| -        | 484±0.7μs                              | 44.6±0.07μs                         |    0.09 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 10), 10)          |
| -        | 484±0.2μs                              | 44.6±0.2μs                          |    0.09 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 10), 100)         |
| -        | 485±0.2μs                              | 44.6±0.1μs                          |    0.09 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 10), 1000)        |
| -        | 639±0.7μs                              | 48.8±0.3μs                          |    0.08 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 1000), 10)      |
| -        | 639±0.8μs                              | 48.5±0.5μs                          |    0.08 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 1000), 100)     |
| -        | 635±0.08μs                             | 48.9±0.2μs                          |    0.08 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 1000), 1000)    |
| -        | 582±0.5μs                              | 43.7±0.3μs                          |    0.08 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 1000), 100)       |
| -        | 578±0.9μs                              | 43.8±0.2μs                          |    0.08 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 1000), 1000)      |
| -        | 581±0.5μs                              | 43.4±0.1μs                          |    0.07 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 1000), 10)        |
| -        | 807±0.1μs                              | 51.6±0.4μs                          |    0.06 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 100), 10)       |
| -        | 807±0.3μs                              | 51.2±0.3μs                          |    0.06 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 100), 100)      |
| -        | 808±0.4μs                              | 52.0±0.4μs                          |    0.06 | bench_function_base.Partition.time_partition('float64', ('sorted_block', 100), 1000)     |
| -        | 743±0.7μs                              | 46.6±0.2μs                          |    0.06 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 100), 10)         |
| -        | 743±0.5μs                              | 46.6±0.3μs                          |    0.06 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 100), 100)        |
| -        | 745±1μs                                | 46.1±0.3μs                          |    0.06 | bench_function_base.Partition.time_partition('int64', ('sorted_block', 100), 1000)       |

seiko2plus avatar Sep 29 '25 11:09 seiko2plus

FWIW -0 vs 0 was not specified because we just went with what IEEE754 says. So the main driver seems to be that we just want to do the integer comparison after conditionally flipping bits. But the extra check might not add much to the cost of the bit flipping, it's one extra comparison which can run in parallel, plus a masked XOR (somehow we only have MaskedOr, not MaskedXor, but that can be added). What do you think, would it be worth a try to see how (in)expensive that is?

jan-wassenberg avatar Sep 29 '25 12:09 jan-wassenberg

As discussed, I understand that the ordering of equivalent keys can anyway change in quicksort, so it's fine to keep the current -0 != 0.

jan-wassenberg avatar Sep 29 '25 17:09 jan-wassenberg

apart from that good to go.

Not, yet. CI trigger errors due to missing float16_t's operator+ by clang & gcc compiler: https://github.com/google/highway/actions/runs/18126110238/job/51581687220?pr=2737

/home/runner/work/highway/highway/hwy/contrib/sort/sort_unit_test.cc:84:73: error: invalid operands to binary expression ('hwy::float16_t' and 'hwy::float16_t')
    const Vec<D> epsp1 = Set(d, BitCastScalar<T>(ConvertScalarTo<TF>(1) + hwy::Epsilon<TF>()));
                                                 ~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~
/home/runner/work/highway/highway/hwy/tests/test_util-inl.h:265:7: note: in instantiation of function template specialization 'hwy::N_SSSE3::(anonymous namespace)::TestFloatLargerSmaller::operator()<hwy::float16_t, hwy::N_SSSE3::Simd<hwy::float16_t, 8, 0>>' requested here
      Test()(T(), d);
      ^
/home/runner/work/highway/highway/hwy/tests/test_util-inl.h:411:61: note: in instantiation of member function 'hwy::N_SSSE3::detail::ForeachCappedR<hwy::float16_t, 8, 1, hwy::N_SSSE3::(anonymous namespace)::TestFloatLargerSmaller, 0>::Do' requested here

seems the following branch need to refine: https://github.com/google/highway/blob/0bb1960251e6529bab84daa12c6717f46131e397/hwy/base.h#L1228-L1250

seiko2plus avatar Sep 30 '25 11:09 seiko2plus

I'm going to avoid using operator + in here instead for now to bypass ci errors:

https://github.com/google/highway/blob/746c513ae5486c9d955f3bbaae7b746b4fb30b23/hwy/contrib/sort/sort_unit_test.cc#L80-L85

seiko2plus avatar Sep 30 '25 11:09 seiko2plus

I'm going to avoid using operator + in here instead for now to bypass ci errors:

https://github.com/google/highway/blob/746c513ae5486c9d955f3bbaae7b746b4fb30b23/hwy/contrib/sort/sort_unit_test.cc#L80-L85

Agreed, what we usually do for bf16 and fp16 is to ConvertScalarTo<float>, add, then convert back to the actual type.

jan-wassenberg avatar Sep 30 '25 11:09 jan-wassenberg

To clarify:

  • With flush-to-zero enabled, this emulation still handles subnormals, so it’s fine to treat them as integers by the following branches: https://github.com/google/highway/blob/0bb1960251e6529bab84daa12c6717f46131e397/hwy/contrib/sort/vqsort-inl.h#L1283-L1290

  • https://github.com/google/highway/blob/0bb1960251e6529bab84daa12c6717f46131e397/hwy/contrib/sort/vqsort-inl.h#L1434-L1458

  • NumPy CI (numpy/numpy#29829) triggered errors, mostly related to select and MMX prefetch. I’ll follow up with fixes after addressing these issues.

seiko2plus avatar Sep 30 '25 12:09 seiko2plus

NumPy CI (https://github.com/numpy/numpy/pull/29829) triggered errors, mostly related to select and MMX prefetch. I’ll follow up with fixes after addressing these issues.

Regarding VQselect, the runtime test failure was in NumPy’s test itself—since the order within partitions is undefined.

Now, all expected tests passed except for a _mm_prefetch build error on clang-cl:

umpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
  102 |   _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0);
      |   ^
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx
..\numpy\_core\src\highway\hwy/cache_control.h(102,3): error: '_mm_prefetch' needs target feature mmx

We pass -mno-mmx when targeting AVX512* due to a previously discovered bug that corrupts the x87 FP stack; I haven’t revisited this since. I don’t know why clang-cl requires MMX—wouldn’t SSE be enough? (Possibly something related to Windows) In any case, I think this PR is ready to go. I’ll follow up with another PR to fix this clang-cl build error by falling back to __builtin_prefetch when clang-cl is built with -mno-mmx.

seiko2plus avatar Sep 30 '25 17:09 seiko2plus

Internal tests/builds are unfortunately failing due to a build timeout in optimized asan builds for the f16 file. What do you think of splitting it up into separate files for Sort vs Select vs PartialSort? Disabling f16 for asan is probably not desirable. Any other way we can reduce compile time?

jan-wassenberg avatar Oct 01 '25 15:10 jan-wassenberg

What do you think of splitting it up into separate files for Sort vs Select vs PartialSort?

That could be a straightforward fix. The current CPU dispatch approach can exhaust available physical memory, forcing the system to fall back on swap, which is likely why may hitting the limit.

Have you tried reducing the build parallelism e.g. -j4? That could be a quick fix.

Any other way we can reduce compile time?

Using PCH could help. Let me give it a try.

seiko2plus avatar Oct 01 '25 17:10 seiko2plus

Ah, possible that swapping might slow things down, yes. Unfortunately we don't have control over the build machines.

Good point about PCH. We can least clang header modules in the BUILD file. I'll try that.

jan-wassenberg avatar Oct 02 '25 14:10 jan-wassenberg