simde
simde copied to clipboard
_mm_insert_ps is broken on ARM processor
Hi,
I noticed that _mm_insert_ps is broken on 0.7.2 (and I noticed still on master as well).
I have a fix that worked for me looking at: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=SSE4_1&ig_expand=4048
SIMDE_FUNCTION_ATTRIBUTES
simde__m128
simde_mm_insert_ps (simde__m128 a, simde__m128 b, const int imm8)
SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) {
simde__m128_private
r_,
a_ = simde__m128_to_private(a),
b_ = simde__m128_to_private(b);
float tmp1_ = b_.f32[(imm8 >> 6) & 3];
a_.f32[(imm8 >> 4) & 3] = tmp1_;
SIMDE_VECTORIZE
for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) {
r_.f32[i] = ((imm8 >> i) & 1 ) ? SIMDE_FLOAT32_C(0.0) : a_.f32[i];
}
return simde__m128_from_private(r_);
}
Please consider adding this to the new release or a similar fix.
The issue was a wrong copy and later missed masking.
For context the original is:
SIMDE_FUNCTION_ATTRIBUTES
simde__m128
simde_mm_insert_ps (simde__m128 a, simde__m128 b, const int imm8)
SIMDE_REQUIRE_CONSTANT_RANGE(imm8, 0, 255) {
simde__m128_private
r_,
a_ = simde__m128_to_private(a),
b_ = simde__m128_to_private(b);
a_.f32[0] = b_.f32[(imm8 >> 6) & 3];
a_.f32[(imm8 >> 4) & 3] = a_.f32[0];
SIMDE_VECTORIZE
for (size_t i = 0 ; i < (sizeof(r_.f32) / sizeof(r_.f32[0])) ; i++) {
r_.f32[i] = (imm8 >> i) ? SIMDE_FLOAT32_C(0.0) : a_.f32[i];
}
return simde__m128_from_private(r_);
}
Best Regards, Jawad
@MirJawadMairaj Thank you for your report and suggested fix. Can you make a pull request?
Hi! I added here: https://github.com/simd-everywhere/simde/pull/932
Hopefully that works. I made the pull request on master.
Didn't change the unit test as I am not running the test suite. If that is a requirement than please guide how to set that up on windows.
I might have some time later next week to look into it more.
Best Regards, Jawad