Performance improvements using alignment and pipelining

Open Metalnem opened this issue 5 years ago • 4 comments

Hi Egor,

I wrote a blog post about alignment and pipelining, which you could use to further boost the performance of most examples in your repository. In essence, you can transform this code:

public static unsafe int Sum(int[] source)
{
  const int VectorSizeInInts = 8;

  fixed (int* ptr = &source[0])
  {
    var pos = 0;
    var sum = Avx.SetZeroVector256<int>();

    for (; pos <= source.Length - VectorSizeInInts; pos += VectorSizeInInts)
    {
      var current = Avx.LoadVector256(ptr + pos);
      sum = Avx2.Add(current, sum);
    }

    var temp = stackalloc int[VectorSizeInInts];
    Avx.Store(temp, sum);

    var final = Sum(temp, VectorSizeInInts);
    final += Sum(ptr + pos, source.Length - pos);

    return final;
  }
}

Into this:

public static unsafe int SumAlignedPipelined(int[] source)
{
  const ulong AlignmentMask = 31UL;
  const int VectorSizeInInts = 8;
  const int BlockSizeInInts = 32;

  fixed (int* ptr = &source[0])
  {
    var aligned = (int*)(((ulong)ptr + AlignmentMask) & ~AlignmentMask);
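    var pos = (int)(aligned - ptr);
    if (pos > source.Length) pos = source.Length; // a short array may end before the first aligned address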
    var sum = Avx.SetZeroVector256<int>();
    var final = Sum(ptr, pos);

    for (; pos <= source.Length - BlockSizeInInts; pos += BlockSizeInInts)
    {
      var block0 = Avx.LoadAlignedVector256(ptr + pos + 0 * VectorSizeInInts);
      var block1 = Avx.LoadAlignedVector256(ptr + pos + 1 * VectorSizeInInts);
      var block2 = Avx.LoadAlignedVector256(ptr + pos + 2 * VectorSizeInInts);
      var block3 = Avx.LoadAlignedVector256(ptr + pos + 3 * VectorSizeInInts);

      sum = Avx2.Add(block0, sum);
      sum = Avx2.Add(block1, sum);
      sum = Avx2.Add(block2, sum);
      sum = Avx2.Add(block3, sum);
    }

    for (; pos <= source.Length - VectorSizeInInts; pos += VectorSizeInInts)
    {
      var current = Avx.LoadAlignedVector256(ptr + pos);
      sum = Avx2.Add(current, sum);
    }

    var temp = stackalloc int[VectorSizeInInts];
    Avx.Store(temp, sum);

    final += Sum(temp, VectorSizeInInts);
    final += Sum(ptr + pos, source.Length - pos);

    return final;
  }
}

On my machine this results in a 27% performance boost on aligned arrays and a 34% boost on unaligned arrays (some arrays in your benchmarks are aligned and some are not, which can result in a performance difference of around 10%).
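
The alignment step at the top of SumAlignedPipelined can be read in isolation. The following is a minimal sketch of the same mask arithmetic on plain integers, not part of the original code: the class name, the method name IntsUntilAligned, and the sample addresses are made up for illustration. It rounds an address up to the next 32-byte boundary and reports how many leading ints would have to be summed with scalar code first. A result of 0 means the data is already aligned.

```csharp
using System;

public static class AlignmentSketch
{
    // Hypothetical helper: number of ints between 'address' and the first
    // 32-byte boundary at or after it. Mirrors the mask arithmetic in
    // SumAlignedPipelined, but on a raw ulong instead of a pinned pointer.
    public static int IntsUntilAligned(ulong address)
    {
        const ulong AlignmentMask = 31UL; // 32-byte alignment for 256-bit loads

        // Adding the mask and clearing the low five bits rounds up to a multiple of 32.
        var aligned = (address + AlignmentMask) & ~AlignmentMask;

        // The original subtracts two int* values, which already yields an element
        // count; with raw addresses the byte distance has to be divided explicitly.
        return (int)((aligned - address) / (ulong)sizeof(int));
    }

    public static void Main()
    {
        Console.WriteLine(IntsUntilAligned(0x1000UL)); // 0: already on a 32-byte boundary
        Console.WriteLine(IntsUntilAligned(0x1004UL)); // 7: 28 bytes short of 0x1020
        Console.WriteLine(IntsUntilAligned(0x1010UL)); // 4: 16 bytes short of 0x1020
    }
}
```

Since int arrays are at least 4-byte aligned, the scalar prefix is never longer than 7 elements.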

I don't have the time to send you a pull request, but I thought you might be interested in this.

Jul 20 '18 18:07 Metalnem

Wow, amazing!! @Metalnem I'll check it out!

Jul 25 '18 09:07 EgorBo

I don't see the method Avx.SetZeroVector256... 'Avx' does not contain a definition for 'SetZeroVector256'. I also can barely find any reference to it in Google.

@Floccinaucinihilipilification11 this method existed in an early preview version of the hardware intrinsics in .NET Core 3.0. It was later removed and replaced by Vector256<T>.Zero.
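
For anyone hitting the same compile error, here is a minimal sketch of how the first Sum method from this issue might look with Vector256<int>.Zero. It assumes .NET Core 3.0 or later and a project with AllowUnsafeBlocks enabled. The scalar Sum(int*, int) helper is called in the original snippets but never shown, so its body here is an assumption, as are the SumSketch class name, the Avx2.IsSupported fallback, and the empty-array guard.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SumSketch
{
    // Assumed scalar helper: adds 'length' ints starting at 'ptr'.
    private static unsafe int Sum(int* ptr, int length)
    {
        var total = 0;
        for (var i = 0; i < length; i++)
        {
            total += ptr[i];
        }
        return total;
    }

    public static unsafe int Sum(int[] source)
    {
        const int VectorSizeInInts = 8;

        if (source.Length == 0)
        {
            return 0;
        }

        fixed (int* ptr = source)
        {
            // Fall back to scalar code on hardware without AVX2 (added for this sketch).
            if (!Avx2.IsSupported)
            {
                return Sum(ptr, source.Length);
            }

            var pos = 0;
            var sum = Vector256<int>.Zero; // replaces Avx.SetZeroVector256<int>()

            for (; pos <= source.Length - VectorSizeInInts; pos += VectorSizeInInts)
            {
                var current = Avx.LoadVector256(ptr + pos);
                sum = Avx2.Add(current, sum);
            }

            // Spill the eight partial sums to the stack and add them with scalar code.
            int* temp = stackalloc int[VectorSizeInInts];
            Avx.Store(temp, sum);

            // Add the tail elements that did not fill a whole vector.
            return Sum(temp, VectorSizeInInts) + Sum(ptr + pos, source.Length - pos);
        }
    }
}
```

The only change needed for the missing method is the initialization of sum; the loads, adds, and store use the same Avx and Avx2 calls as the original snippet.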

Nov 05 '20 14:11 gfoidl

@gfoidl Thanks, documentation on SIMD is scant to say the least, so I appreciate the reply.