easyaspi314
> Ah, I'll try `__attribute__((optimize("no-tree-vectorize")))`
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106322#c14

Attribute `optimize` does work temporarily, but it disables inlining, which emits a warning/error. ```c /* GCC 12.1-12.2.0 emit garbage if vector.umulh...
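For illustration, a minimal sketch of how such a per-function opt-out might look. The macro name `EXAMPLE_NO_VECTORIZE` and the helper `example_umulh_u32x4` are hypothetical and not taken from the actual patch. The `optimize` function attribute is GCC-specific, so the sketch excludes Clang, which also defines `__GNUC__`. Since the attribute blocks inlining (the source of the warning/error mentioned above), the sketch does not combine it with `always_inline`.

```c
#include <stdint.h>

/* Hypothetical macro: expands to GCC's `optimize` attribute only on real GCC. */
#if defined(__GNUC__) && !defined(__clang__)
#  define EXAMPLE_NO_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
#else
#  define EXAMPLE_NO_VECTORIZE
#endif

/* Hypothetical helper: high 32 bits of each 32x32->64 product, lane by lane.
   The attribute asks GCC not to auto-vectorize this loop. */
EXAMPLE_NO_VECTORIZE
static void example_umulh_u32x4(uint32_t out[4], const uint32_t a[4], const uint32_t b[4]) {
    for (int i = 0; i < 4; i++) {
        out[i] = (uint32_t)(((uint64_t)a[i] * b[i]) >> 32);
    }
}
```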
That backwards loop is off by one.
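The loop under review is not reproduced here, so the following is only a generic sketch of the usual pitfall, with a hypothetical helper name. Starting at `n` and indexing with `i` directly touches index `n` and skips index 0, while `i >= 0` never terminates for an unsigned index. Decrementing inside the condition avoids both.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies n bytes, walking from index n-1 down to 0. */
static void example_copy_backwards(uint8_t *dst, const uint8_t *src, size_t n) {
    /* `i-- > 0` tests the old value, then decrements, so the body sees
       n-1, n-2, ..., 0 and the loop is skipped entirely when n == 0. */
    for (size_t i = n; i-- > 0; ) {
        dst[i] = src[i];
    }
}
```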
So primarily looking at neon, what if we had ```c typedef struct { ... } simde_uint8x8_t; typedef struct { ... #ifdef SIMDE_X86_SSE2_NATIVE __m128i m128i; #endif } simde_private_uint8x8_t; simde_private_uint8x8_t simde_uint8x8_t_to_private(simde_uint8x8_t x)...
Yes, but also I say no MMX allowed, period. It isn't something that is transparently handled by the compiler (unlike, say, `vzeroupper`) which is inappropriate for a library that attempts...
[Example of MMX breaking things](https://tio.run/##XVBha4MwEP2eX3F0DGJ7TdU6V2g72PfBfkAZEo26gCZF7WhX/OtzF7t1o5Dkcu/uvbxcNs8qacphuNMmqw4qh01da9M12oj3J/aHtp3S1kEfVisorOUMIEnqOIJpk7dEyDqQSGBlTQnKHtIq/1dKmcfOV8oRtpDUddLm3Wfe2KTVccS99Q37RF2@cGhhG@BkC7SD1hQ2EFGYzTxwqnBVlEole70M@RFB7vTbKAokNdtCSrlLe9py578R5ejy9HI/rVnPmHulltrwH7/PlS6NbPnKA1eSO9d6hgBDXGKEDxjjI676W@/pb5/AUOBSYCSgv3zFcj5OYepJhHQ0uFhAEEPoQ@ALn/I9zb8r@ORegVsvxQRHx@4M6HSOidgPw1dWVLJsh/lr@A0) Messing with the optimization levels can result in differing values, even if you put `_mm_empty()` after each intrinsic (which would force the vector to memory...
> 128 bit intrinsics are used to implement smaller data structures in other APIs to avoid use of MMX.

This should actually be kept in general, unless array indexing is...
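As a rough sketch of the quoted idea, a 64-bit vector operation can run in the low half of an SSE2 register instead of an MMX register. The type `example_uint8x8_t` and the function `example_vadd_u8` are hypothetical stand-ins, not the library's real definitions. The scalar fallback keeps the example portable to targets without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#  include <emmintrin.h>
#endif

/* Hypothetical 64-bit vector type: eight unsigned 8-bit lanes. */
typedef struct { uint8_t values[8]; } example_uint8x8_t;

/* Lane-wise wrapping add of two 8x8-bit vectors without touching MMX. */
static example_uint8x8_t example_vadd_u8(example_uint8x8_t a, example_uint8x8_t b) {
    example_uint8x8_t r;
#if defined(__SSE2__)
    /* Load 64 bits into the low half of an XMM register (upper half zeroed). */
    __m128i va = _mm_loadl_epi64((const __m128i *)a.values);
    __m128i vb = _mm_loadl_epi64((const __m128i *)b.values);
    /* Add all 16 lanes, then store only the low 64 bits. */
    _mm_storel_epi64((__m128i *)r.values, _mm_add_epi8(va, vb));
#else
    for (int i = 0; i < 8; i++) {
        r.values[i] = (uint8_t)(a.values[i] + b.values[i]);
    }
#endif
    return r;
}
```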
Some deep dives into codegen brought me to some interesting things:

1. The ABI for `__attribute__((vector_size(8)))/__m64` is inconsistent on 32-bit x86.

| Compiler | Features | `vector_size(8)` | `__m64` |...
NEON_2_SSE uses an array union. https://github.com/intel/ARM_NEON_2_x86_SSE/blob/master/NEON_2_SSE.h

I say that for best results we should use
- `int64_t` for x86 (because `double` uses x87 and the GCC vector type uses MMX). We can...
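A minimal sketch of the `int64_t` suggestion, assuming a public half-vector type that stores its 64 bits in a plain integer and a private view with individual lanes. All names here are hypothetical. `memcpy` is used for the conversion because it is a well-defined way to reinterpret the bits, and compilers typically reduce it to a register move.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical public type: 64 bits held in an integer, so neither an x87
   nor an MMX register is needed to pass it around. */
typedef struct { int64_t bits; } example_half_t;

/* Hypothetical private view exposing the eight 8-bit lanes. */
typedef struct { uint8_t values[8]; } example_half_private_t;

static example_half_private_t example_to_private(example_half_t v) {
    example_half_private_t p;
    memcpy(&p, &v, sizeof p);   /* bit-for-bit copy, no value conversion */
    return p;
}

static example_half_t example_from_private(example_half_private_t p) {
    example_half_t v;
    memcpy(&v, &p, sizeof v);
    return v;
}
```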
Ok idea for a roadmap:

High priority:
1. Remove MMX intrinsics
2. Change half vector type to avoid dangerous+slow MMX vectors

Low priority:
3. Convert most half NEON intrinsics to...
I have the minimum cost half vector ABI and conversion figured out. It unfortunately needs a hack on GCC to avoid excess `movq`s though. ### x86 ABI ```c typedef struct...