simde icon indicating copy to clipboard operation
simde copied to clipboard

Macros and postincrement

Open jean343 opened this issue 2 years ago • 3 comments

When using the postincrement operator (i++) in an SIMDE function call, it sometimes gets incremented multiple times because of macro syntax. See this question for reference.

In my specific case, the function _mm_srli_epi32(InReg1, shift++) ends up incrementing shift by 4 only on the ARM platform. This is very difficult to debug as the code is working on Intel machines, and the code in question is nested very deep in a third party library.

I would recommend something like this. The code does work on ARM. It does not compile on Visual Studio, but it's also not required in VS.

#elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
  #define simde_mm_srli_epi32(a, _imm8) \
    ({ typeof (_imm8) imm8 = (_imm8); \
    (((imm8) <= 0) ? \
      (a) : \
      simde__m128i_from_neon_u32( \
        ((imm8) > 31) ? \
          vandq_u32(simde__m128i_to_neon_u32(a), vdupq_n_u32(0)) : \
          vshrq_n_u32(simde__m128i_to_neon_u32(a), ((imm8) & 31) | (((imm8) & 31) == 0)))); })

jean343 avatar Mar 31 '23 15:03 jean343

@jean343 Can you send a PR , and also add a test for this to prevent a future regression? Let us know if you have questions about how to do that and we can help.

mr-c avatar Apr 01 '23 09:04 mr-c

@mr-c I have been evaluating the best solution for this problem, and it's very difficult to find a good cross-compiler solution.

My snippet above works well in gcc, but fails in clang with error: argument to '__builtin_neon_vshrq_n_v' must be a constant integer https://godbolt.org/z/zYKsYcE69

Also note that gcc returns significantly more efficient code as the argument for simde_mm_srli_epi32 is a const int.

To my understanding, gcc should fail to compile the code above as we pass a non const variable to a function expecting a const.

Looking for your advise :)

jean343 avatar Apr 03 '23 18:04 jean343

it sometimes gets incremented multiple times because of macro syntax

That is weird. From here, I'd say try replacing typeof with __auto_type and see if that works. However, that wouldn't be a good fix for SIMDE. Maybe one could just replace typeof with int... but that leaves us doing bitwise operations on a signed integer. (grrrrr...)

To my understanding, gcc should fail to compile the code above as we pass a non const variable to a function expecting a const.

The portable fix would be to implement simde_mm_srli_epi32(a, b) as _mm_srl_epi32(a, _mm_cvtsi32_si128(b)). However, that would have a performance impact on some compilers (a SSE2 variable shift costs more than a shift by immediate).

I think the 3rd party code is trying to force gcc to unroll that loop properly... So I'd say either unroll the loop entirely by hand... or simplify it and let the compiler worry about it:

// untested
  for (unsigned i = 0; i < 8; ++i) { // loop thru 32-bits (4 bits at a time)
    OutReg1 = _mm_and_si128(InReg1, mask);
    OutReg2 = _mm_and_si128(_mm_srli_epi32(InReg1, 1), mask);
    OutReg3 = _mm_and_si128(_mm_srli_epi32(InReg1, 2), mask);
    OutReg4 = _mm_and_si128(_mm_srli_epi32(InReg1, 3), mask);
    InReg1 = _mm_srli_epi32(InReg1, 4); // <-- gcc doesn't understand this when unrolling
    MM_STORE_SI_128(out++, OutReg1);
    MM_STORE_SI_128(out++, OutReg2);
    MM_STORE_SI_128(out++, OutReg3);
    MM_STORE_SI_128(out++, OutReg4);
  }

aqrit avatar Apr 06 '23 02:04 aqrit