x86-simd-sort
C++ template library for high performance SIMD based sorting algorithms
32-bit argsort uses ymm registers: we can switch to zmm registers (use 2x i64gather instructions) and add new bitonic networks.
Have you tried comparing to: https://github.com/damageboy/vxsort-cpp
This lib looks awesome! But I have a requirement to output the original indices of the sorted data, for example the input array is {300 200 400 99 150 50...
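To illustrate what "output the original index" means, here is a minimal scalar sketch of an index sort (argsort) built only on the C++ standard library. The helper names `argsort_scalar` and `gather_by_index` are hypothetical and are not part of x86-simd-sort; the sketch makes no claim about how the library implements or exposes this, and it uses only the six values visible in the excerpt above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not the library's API): returns the permutation of
// indices that orders `data` ascending. `data` itself is left unmodified.
template <typename T>
std::vector<std::size_t> argsort_scalar(const std::vector<T>& data) {
    std::vector<std::size_t> idx(data.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});  // 0, 1, 2, ...
    // stable_sort keeps equal values in their original relative order.
    std::stable_sort(idx.begin(), idx.end(),
                     [&data](std::size_t a, std::size_t b) {
                         return data[a] < data[b];
                     });
    return idx;
}

// Hypothetical helper: rebuilds the sorted values from the index permutation.
template <typename T>
std::vector<T> gather_by_index(const std::vector<T>& data,
                               const std::vector<std::size_t>& idx) {
    std::vector<T> out;
    out.reserve(idx.size());
    for (std::size_t i : idx) {
        assert(i < data.size());  // every index must refer to a valid element
        out.push_back(data[i]);
    }
    return out;
}
```

For the values 300, 200, 400, 99, 150, 50, `argsort_scalar` yields the indices 5, 3, 4, 1, 0, 2, and `gather_by_index` then returns 50, 99, 150, 200, 300, 400.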
Hello, I ran benchmarks on a 7950X CPU. In some tests performance is up to 2.3x faster, but in other tests it is much slower (like 0.3x) compared to classical sorting. Is...
I suspect this function https://github.com/intel/x86-simd-sort/blob/7d7591cf5927e83e4a1e7c4b6f2c4dc91a97889f/src/avx512-16bit-qsort.hpp#L65 can be improved with fewer operations. See: https://github.com/numpy/numpy/blob/0bd56e7ec12f8ceeb8d082340e71e60b873d5c57/numpy/core/src/npysort/npysort_common.h#L153 for reference.
Dear x86-simd-sort maintainers, I am currently working on a high-performance sorting algorithm to handle billions of uint64_t data entries. To optimize the sorting process, I am leveraging parallel execution and...
This patch rewrites all of the single vector sorting and bitonic merging to use swizzle ops and generic masks to reduce code duplication. It also centralizes all of this logic...
Fixes the bug with nested OpenMP by adding #pragma omp taskwait. Changes the task_threshold, when OpenMP is enabled but parallelization isn't chosen, from 0 to the max value for arrsize_t;...
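The role of `#pragma omp taskwait` and of a task threshold can be sketched with a small task-parallel quicksort. This is an illustrative assumption, not the library's code: `task_qsort` and `sort_with_tasks` are hypothetical names, the element type is fixed to `int`, and `std::ptrdiff_t` stands in for the library's `arrsize_t`. Compiled without OpenMP support, the pragmas are ignored and the code runs sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sketch: sorts arr[left..right] (inclusive bounds).
// A sub-range becomes a deferred task only if it is larger than task_threshold.
void task_qsort(int* arr, std::ptrdiff_t left, std::ptrdiff_t right,
                std::ptrdiff_t task_threshold) {
    assert(arr != nullptr);
    if (left >= right) return;
    const int pivot = arr[left + (right - left) / 2];
    std::ptrdiff_t i = left, j = right;
    while (i <= j) {  // Hoare-style partition around the pivot value
        while (arr[i] < pivot) ++i;
        while (arr[j] > pivot) --j;
        if (i <= j) {
            std::swap(arr[i], arr[j]);
            ++i;
            --j;
        }
    }
    const bool spawn = (right - left) > task_threshold;
    #pragma omp task if (spawn) firstprivate(arr, left, j, task_threshold)
    task_qsort(arr, left, j, task_threshold);
    #pragma omp task if (spawn) firstprivate(arr, i, right, task_threshold)
    task_qsort(arr, i, right, task_threshold);
    // Wait for both child tasks, so the range is fully sorted when this call
    // returns rather than only at some later barrier.
    #pragma omp taskwait
}

// Hypothetical entry point: one thread seeds the recursion, the team runs tasks.
void sort_with_tasks(std::vector<int>& v, std::ptrdiff_t task_threshold) {
    if (v.empty()) return;
    int* data = v.data();
    const std::ptrdiff_t last = static_cast<std::ptrdiff_t>(v.size()) - 1;
    #pragma omp parallel
    #pragma omp single
    task_qsort(data, 0, last, task_threshold);
}
```

In this sketch, passing `std::numeric_limits<std::ptrdiff_t>::max()` as the threshold means no range ever exceeds it, so no task is deferred and the sort runs serially, whereas a threshold of 0 turns every range of two or more elements into a deferred task.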
Fixes some of the CI failures we are seeing with missing g++-13