easyaspi314
Putting it in a double is *extremely* convenient because, by default, doubles are already going to be in XMM registers. All we need is for the compiler to eliminate the zero...
WASM is ~~a blatant copy of~~ very similar to NEON (just without half vectors), so it would probably be an easy port — especially since the emscripten toolchain includes a...
Hmm. Well I did some tests and apparently with the exception of `mask == 0x888` (which lands in the rarer 4 byte path) the first and second halves of the...
Well an initial test on an ARM Cortex-X1 (yes it is my phone) + Clang 16 shows about 5% overhead and probably can be tuned to be less. I can...
Cortex-X1, original, clang 16.0.2
```
1.780 GB/s (3.2 %) 0.997 Gc/s 1.78 byte/char
2.899 GB/s (2.2 %) 0.974 Gc/s 2.98 byte/char
1.226 GB/s (4.7 %) 0.306 Gc/s 4.00 byte/char
1.774...
```
> So this PR appears to shave 480 bytes from the library on my machine right now. (So 0.1%.) The `.a` file isn't really indicative of the actual binary size,...
I think there is a micro-optimization issue: GCC on x64 seems to get wildly different performance depending on where the branches are placed.
Drafting this for now, as I want to improve the intrinsics first before touching the table so I can get a better performance analysis.
> While the minimum requirements of Windows 95 is a 386 with 4MB RAM, it is not really a useful system and will start swapping as soon as you do...
This is because `lcc` does not accept `__restrict`. Defining it as an empty macro fixes the issue.
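For anyone hitting the same error, here is a minimal sketch of that workaround. It assumes `lcc` identifies itself with a `__LCC__` predefined macro, which I have not verified for every `lcc` variant, so check your toolchain first. `add_arrays` is a made-up function just to show the qualifier in use, not anything from the library.

```c
#include <stddef.h>

/*
 * Assumption: lcc predefines __LCC__. If your lcc does not, replace the
 * check with whatever macro it does define, or pass -D__restrict= on the
 * command line instead.
 */
#if defined(__LCC__) && !defined(__restrict)
#  define __restrict /* expands to nothing, so the qualifier just disappears */
#endif

/* Hypothetical example function: dst and src are promised not to alias. */
static void add_arrays(int *__restrict dst, const int *__restrict src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] += src[i];
    }
}
```

Dropping the qualifier only removes an aliasing hint for the optimizer. The code behaves the same, it just may not be optimized as aggressively.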