
Experiment with instruction level parallelism in pixelsearch1x.c

Open wind0204 opened this issue 1 year ago • 0 comments

If you are interested, please check whether this version performs better than the non-parallel version.

  • Dispatch 4 vector operations per loop iteration in pixelsearch1x.c to allow higher throughput. (I guess a CPU with a decode width of 5 or more would reach the same throughput with just 2 vector operations per iteration.)
  • MOVMSKPS has twice the throughput of PMOVMSKB on AMD Zen2. (I guess this might help with the bottleneck on AMD Zen2.)
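
For readers without the patch at hand, the two points above can be sketched roughly as follows. This is not the actual code from pixelsearch1x.c; it is a minimal illustration that assumes 32-bit pixels, an exact-match search, and SSE2. The function name `find_pixel_4x` is hypothetical. It issues four independent compares per iteration and extracts each result with MOVMSKPS (via `_mm_movemask_ps`) instead of PMOVMSKB.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from pixelsearch1x.c.
 * Returns the index of the first pixel equal to `color`, or `n` if none. */
static size_t find_pixel_4x(const uint32_t *px, size_t n, uint32_t color)
{
    const __m128i needle = _mm_set1_epi32((int)color);
    size_t i = 0;

    /* 16 pixels per iteration: four independent 128-bit compares that
     * the CPU can overlap, since none depends on another's result. */
    for (; i + 16 <= n; i += 16) {
        __m128i c0 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + i)),      needle);
        __m128i c1 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + i + 4)),  needle);
        __m128i c2 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + i + 8)),  needle);
        __m128i c3 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + i + 12)), needle);

        /* MOVMSKPS: one bit per 32-bit lane (the lane's sign bit). */
        unsigned m0 = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(c0));
        unsigned m1 = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(c1));
        unsigned m2 = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(c2));
        unsigned m3 = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(c3));

        unsigned mask = m0 | (m1 << 4) | (m2 << 8) | (m3 << 12);
        if (mask) {
            /* Portable scan for the lowest set bit. */
            unsigned k = 0;
            while (!(mask & 1u)) { mask >>= 1; ++k; }
            return i + k;
        }
    }

    /* Scalar tail for the remaining 0..15 pixels. */
    for (; i < n; ++i)
        if (px[i] == color) return i;
    return n;
}
```

Whether this beats a one-vector-per-iteration loop depends on the microarchitecture and on memory bandwidth, so it would need to be benchmarked on the target CPU.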

Best regards.

wind0204 · Jan 02 '24 01:01