ImagePut
Experiment with instruction-level parallelism in pixelsearch1x.c
If you are interested, please check whether this version performs better than the non-parallel version.
- Dispatch 4 vector operations per loop iteration in pixelsearch1x.c to allow higher throughput. --I guess a CPU with a decode width of 5+ would reach the same throughput with just 2 vector operations per iteration--
- MOVMSKPS has twice the throughput of PMOVMSKB on AMD Zen2. --I guess this might help with the bottleneck on AMD Zen2--
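
For reviewers who want to see the idea in isolation, here is a minimal sketch of the two points above. It is not the code in pixelsearch1x.c: the function name `find_pixel_4x`, its signature, and the assumption of 32-bit pixels in a contiguous buffer are all hypothetical, chosen only for illustration. It issues four independent SSE2 compares per iteration and reads the results with MOVMSKPS (`_mm_movemask_ps`) rather than PMOVMSKB (`_mm_movemask_epi8`).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from pixelsearch1x.c.
   Returns a pointer to the first 32-bit pixel equal to `color`,
   or `end` if no pixel matches. */
static const uint32_t *find_pixel_4x(const uint32_t *start,
                                     const uint32_t *end,
                                     uint32_t color)
{
    const __m128i key = _mm_set1_epi32((int)color);
    const uint32_t *p = start;

    /* Main loop: 16 pixels per iteration as four independent compares,
       so the CPU can overlap them instead of waiting on one chain. */
    while (end - p >= 16) {
        __m128i c0 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(p + 0)),  key);
        __m128i c1 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(p + 4)),  key);
        __m128i c2 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(p + 8)),  key);
        __m128i c3 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(p + 12)), key);

        /* MOVMSKPS yields one bit per 32-bit lane (4 bits per vector). */
        int m0 = _mm_movemask_ps(_mm_castsi128_ps(c0));
        int m1 = _mm_movemask_ps(_mm_castsi128_ps(c1));
        int m2 = _mm_movemask_ps(_mm_castsi128_ps(c2));
        int m3 = _mm_movemask_ps(_mm_castsi128_ps(c3));

        /* Combine into a 16-bit mask; bit i corresponds to pixel p[i]. */
        int mask = m0 | (m1 << 4) | (m2 << 8) | (m3 << 12);
        if (mask) {
            int i = 0;
            while (!((mask >> i) & 1))  /* index of the lowest set bit */
                i++;
            return p + i;
        }
        p += 16;
    }

    /* Scalar tail for the remaining 0..15 pixels. */
    for (; p < end; p++)
        if (*p == color)
            return p;
    return end;
}
```

Because each compare produces all-ones or all-zeros per 32-bit lane, the sign bit that MOVMSKPS extracts is enough to identify the matching pixel, which is why the byte-granular PMOVMSKB is not needed in this sketch.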
Best regards.