Folding integer additions with operands of mixed bit widths

Open bjacob opened this issue 5 years ago • 1 comments

ARM NEON has pairwise-folding addition instructions where pairs of narrow (e.g. 8-bit) input lanes are added together and the sum is written to, or accumulated into, wider (e.g. 16-bit) integer lanes. Examples are SADDLP (widening pairwise add) and SADALP (widening pairwise add and accumulate).

This is in addition to plain pairwise-folding additions with all operands of the same bit width, like SADDP.
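To make the lane arithmetic concrete, here is a minimal scalar sketch in portable C of what SADDLP and SADALP compute on a 128-bit vector of sixteen 8-bit lanes. The function names `saddlp_s8_model` and `sadalp_s8_model` are hypothetical and chosen for illustration only; this is a reference model of the semantics, not NEON code.

```c
#include <stdint.h>

/* Scalar model of SADDLP (8-bit lanes -> 16-bit lanes).
   Each output lane is the widened sum of two adjacent input lanes.
   The sum of two int8 values always fits in int16, so nothing is lost. */
void saddlp_s8_model(const int8_t in[16], int16_t out[8]) {
    for (int i = 0; i < 8; ++i) {
        out[i] = (int16_t)((int16_t)in[2 * i] + (int16_t)in[2 * i + 1]);
    }
}

/* Scalar model of SADALP (8-bit lanes accumulated into 16-bit lanes).
   Same pairwise sum, but added to the existing accumulator lane.
   The hardware accumulation wraps modulo 2^16; the cast chain below
   mimics that on the usual two's-complement targets. */
void sadalp_s8_model(int16_t acc[8], const int8_t in[16]) {
    for (int i = 0; i < 8; ++i) {
        int32_t sum = (int32_t)acc[i] + in[2 * i] + in[2 * i + 1];
        acc[i] = (int16_t)(uint16_t)sum;
    }
}
```

One instruction thus consumes sixteen 8-bit inputs and performs eight additions (SADDLP) or sixteen (SADALP, counting the accumulation), which is the point of the folding forms.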

An extreme case of such folding is the dot-product instructions (SDOT, see PR #127), where the folding addition is performed 4-fold. When one of the source operands has all lanes set to 1, this acts as a 4-fold addition of 8-bit values into 32-bit accumulators.
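The SDOT-with-ones trick can be sketched in scalar C as follows. This is a hedged model of the vector form of SDOT on 128-bit registers (sixteen 8-bit lanes folded into four 32-bit lanes); `sdot_model` and `fold4_add_model` are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Scalar model of SDOT: each 32-bit accumulator lane receives the sum of
   four 8-bit x 8-bit products taken from the matching group of four lanes. */
void sdot_model(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 4; ++i) {
        int32_t sum = 0;
        for (int j = 0; j < 4; ++j) {
            sum += (int32_t)a[4 * i + j] * (int32_t)b[4 * i + j];
        }
        /* Accumulate with wraparound, done in unsigned arithmetic to avoid
           signed-overflow undefined behavior in C. */
        acc[i] = (int32_t)((uint32_t)acc[i] + (uint32_t)sum);
    }
}

/* With the second operand set to all ones, SDOT degenerates into a
   4-fold folding addition of 8-bit values into 32-bit accumulators. */
void fold4_add_model(int32_t acc[4], const int8_t a[16]) {
    int8_t ones[16];
    for (int i = 0; i < 16; ++i) {
        ones[i] = 1;
    }
    sdot_model(acc, a, ones);
}
```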

This combination of folding behavior and mixed bit widths makes it possible to maximize the number of scalar operations performed per instruction.

This pattern is widely used in integer arithmetic code. For example, matrix multiplication kernels using plain NEON without SDOT multiply 8-bit input values into 16-bit local products (see Issue #226), then pairwise-fold those 16-bit products into 32-bit accumulators: https://github.com/google/ruy/blob/808ff748e0c7dc746a413fe45fa022d63e6253e8/ruy/kernel_arm64.cc#L1233
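As a rough illustration of that multiply-then-fold idea, the scalar C sketch below widens 8-bit products to 16 bits and then pairwise-folds them into 32-bit accumulators. It is a simplified model written for this discussion, not a transcription of the linked ruy kernel; the function name `mul8_fold_into_32` and the choice of eight input lanes are assumptions.

```c
#include <stdint.h>

/* Step 1: widening multiply, 8-bit x 8-bit -> 16-bit local products.
   Step 2: pairwise fold of the 16-bit products into 32-bit accumulators. */
void mul8_fold_into_32(int32_t acc[4], const int8_t a[8], const int8_t b[8]) {
    int16_t prod[8];
    for (int i = 0; i < 8; ++i) {
        /* The largest magnitude is (-128) * (-128) = 16384, which fits in int16. */
        prod[i] = (int16_t)((int16_t)a[i] * (int16_t)b[i]);
    }
    for (int i = 0; i < 4; ++i) {
        int32_t pair = (int32_t)prod[2 * i] + (int32_t)prod[2 * i + 1];
        /* Wrapping accumulate, done in unsigned arithmetic to stay well defined. */
        acc[i] = (int32_t)((uint32_t)acc[i] + (uint32_t)pair);
    }
}
```

The 16-bit intermediate is what lets a plain NEON kernel process twice as many products per register as it could with 32-bit products, while the widening fold keeps the running sums from overflowing.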

May 12 '20 18:05 bjacob

This is partially covered by the Extended Pairwise Addition instructions (#380).

Jan 14 '21 21:01 Maratyszcza