simde icon indicating copy to clipboard operation
simde copied to clipboard

_mm_testc_si128 and _mm_testnzc_si128 faulty for arm

Open p0nce opened this issue 2 years ago • 0 comments

Hello,

I implement intrinsincs in D and reference simd-everywhere a lot for arm versions of the intrinsics. Thanks @nemequ for that! It is tremendously helpful.

I believe the SIMDE_ARM_NEON_A32V7_NATIVE versions of simde_mm_testc_si128 and _mm_testnzc_si128 are faulty in simd-everywhere.

About _mm_testc_si128

This intrinsic tests if all bits masked by b are 1's in a.

Implementation is:

int64x2_t s64 = vbicq_s64(b_.neon_i64, a_.neon_i64);
return !(vgetq_lane_s64(s64, 0) & vgetq_lane_s64(s64, 1));

I suggest the following fix:

int64x2_t s64 = vbicq_s64(b_.neon_i64, a_.neon_i64);
return !(vgetq_lane_s64(s64, 0) | vgetq_lane_s64(s64, 1));

with a | instead of &.

Indeed spec says: "Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the CF value."

The unittest that did reveal this bug in D is:

unittest
{
    __m128i A  = _mm_setr_epi32(0x01, 0x02, 0x04, 0xf8);
    __m128i M1 = _mm_setr_epi32(0xfe, 0xfd, 0x00, 0x00);
    __m128i M2 = _mm_setr_epi32(0x00, 0x00, 0x04, 0x00);
    assert(_mm_testc_si128(A, A) == 1);
    assert(_mm_testc_si128(A, M1) == 0);
    assert(_mm_testc_si128(A, M2) == 1);
}

In my testing this makes the intrinsic agree with the instruction.

About the barely useful _mm_testnzc_si128

(not 100% sure about this one) This intrinsic tests if all bits masked by b have both 0's and 1's in a, but are not only 0's or only 1's.

A similar problem exist with _mm_testnzc_si128 arm version. Implementation is:

int64x2_t s640 = vandq_s64(b_.neon_i64, a_.neon_i64);
int64x2_t s641 = vbicq_s64(b_.neon_i64, a_.neon_i64);
return (((vgetq_lane_s64(s640, 0) | vgetq_lane_s64(s640, 1)) & (vgetq_lane_s64(s641, 0) | vgetq_lane_s64(s641, 1)))!=0);

The guide says (warning headache):

Compute the bitwise AND of 128 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

I suggest the following fix (D version):

long2 s640 = vandq_s64(cast(long2)b, cast(long2)a);
long2 s641 = vbicq_s64(cast(long2)b, cast(long2)a);
return !( !(vgetq_lane_s64(s641, 0) | vgetq_lane_s64(s641, 1))
            | !(vgetq_lane_s64(s640, 0) | vgetq_lane_s64(s640, 1)) );

The unittest that revealed the bug for me was:

unittest
{
  __m128i A  = _mm_setr_epi32(0x01, 0x02, 0x04, 0xf8);
  __m128i M  = _mm_setr_epi32(0x01, 0x40, 0x00, 0x00);
  __m128i Z = _mm_setzero_si128();
  assert(_mm_testnzc_si128(A, Z) == 0);
  assert(_mm_testnzc_si128(A, M) == 1);
  assert(_mm_testnzc_si128(A, A) == 0);
}

In my testing this makes the intrinsic agree with the instruction on that unittest.

p0nce avatar Aug 02 '21 09:08 p0nce