Testing Simde to convert SSE2 to NEON on ARM (radxa rock5b sbc)
Is your feature request related to a problem? Please describe.
I wanted to test if DT could be build (and used !) with SIMDE (https://github.com/simd-everywhere/simde/tree/master ) which translate SSE2 intrinsic to NEON intrinsic. Could maybe have an interest too for Apple arm users. My (little) problem was to try to speed up DT on my SBC (Radxa rock5b with 16GB RAM, CPU RK3588, GPU Mali G610). And curiosity ! ;-) Describe the solution you'd like
I do not ask for anything but want to share the results and the gain I had. It could maybe give ideas to developers (thank you for all the work !)
Alternatives
I made a dirty patch on darktable 4.5.0+1421~g4098679849 (cloned a few days ago). Just :
- made a fake SSE/SSE2 compatibility to clang with :
-D__SSE__ -D__SSE2__ - forced simde to keep the name of real SSE/SSE2 intrinsic with :
-DSIMDE_ENABLE_NATIVE_ALIASES - managed some problem on SDL by deactivating support of SSE/SSE2 with some other clang
-D - add compile options to clang :
-march=armv8-a+fp+simd+crypto+crcAfter that I changed the's of all DT sources following the rules given by simde (with some simple "replace in project" calls) :
+# mmintrin.h → simde/x86/mmx.h
+# xmmintrin.h → simde/x86/sse.h
+# emmintrin.h → simde/x86/sse2.h
+# pmmintrin.h → simde/x86/sse3.h
+# tmmintrin.h → simde/x86/ssse3.h
+# smmintrin.h → simde/x86/sse4.1.h
+# nmmintrin.h → simde/x86/sse4.2.h
The result is that DT-simde runs well as far as I can see.
I made some tests with DT benchmard (with v4.2 XMP) and, on CPU only (problem with my OpenCL configuration...), the troughput rating is at max (on 4 threads) 12% higher with simde (see graph below) :
Additional context
I haven't wrote C code since the year 2000... I'm not sure this is of interest for the project but I made a patch (first time with Git...) if you want more details on the changes I made for this test only. I am a new user of DT and only make some works with it the last weeks.
It's quite likely that nearly all of the speedup you saw comes from a single occurrence of an SSE intrinsic - the mm_stream_ps in copy_pixel_nontemporal in src/develop/imageop.h. It probably makes more sense to add an implementation of that function which directly calls the corresponding NEON intrinsic rather than hacking in a bunch of alternate compilation flags and includes to be able to use a 'compatibility' library. That would have benefit for any ARM compilation, whether or not SIMDE is available.
Most of the SSE code remaining in darktable is due to the legacy colorbalance (non-RGB) module, for which the SSE code is still faster than the compiler's autovectorized code on x86. That pulls in the implementations of mm_exp2_ps and mm_log_ps needed for mm_pow_ps in src/common/sse.h and the remaining SSE colorspace conversions in src/common/colorspaces_inline_conversions.h.
Ok.Thank you @ralfbrown. Very interesting. I'll have a look at this intrinsic.
@ralfbrown , you were right.
Just modified src/develop/imageop.h like follow :
#if defined(__aarch64__)
#include <arm_neon.h>
#endif
(...)
static inline void copy_pixel_nontemporal(
float *const __restrict__ out,
const float *const __restrict__ in)
{
#if defined(__SSE__)
_mm_stream_ps(out, *((__m128*)in));
#elif defined(__aarch64__)
vst1q_f32(out, *((float32x4_t *)in));
#elif (__clang__+0 > 7) && (__clang__+0 < 10)
for_each_channel(k,aligned(in,out:16)) __builtin_nontemporal_store(in[k],out[k]);
#else
for_each_channel(k,aligned(in,out:16) dt_omp_nontemporal(out)) out[k] = in[k];
#endif
}
on the same version of DT (to be comparable) and I have Throughput rating 6.4% higher than the "normal" (non modified) DT with the same parameters ( -C -t 4 -v4.2 )
It's more than half of 12% gain with Simde with just little change.
The only other change is -march=armv8-a+fp+simd+crypto+crc forced in the clang call because I'm not easy at all with CMake...
Good to hear. It occurred to me after my previous response that one other function in that file needs to be updated for safety - nontemporal writes can violate the normal write-ordering rules, so there should be a memory fence after the module finishes processing the image buffer. That's provided by dt_sfence, defined right below copy_pixel_nontemporal.
A quick check looking into https://github.com/DLTcollab/sse2neon/blob/master/sse2neon.h indicates that the standard C/C++ atomic_thread_fence function should be used as a replacement for _mm_sfence, which is already in place for the non-SSE path. I guess NEON doesn't have a separate store fence (which is faster, but dt_sfence gets called infrequently enough that a slower full fence doesn't actually make a measurable difference).
I'm not too familiar with CMake either, so if one of the other devs could chime in, I think we can make a PR between us (won't make it into 4.6 of course, but 4.6.1 is likely). BTW, is the +crypto+crc actually needed? I don't think darktable uses any crypto or CRC functions, so it wouldn't need the corresponding instructions.
A 6% overall speedup on a system of dt's size is a pretty good return for adding five lines of code....
Happy to help ! I will look for fences this weekend. I think there is something in NEON.
As far as CMake is concerned, you probably need to play w/ https://github.com/darktable-org/darktable/blob/master/cmake/march-mtune.cmake
Is ensuring -march=native through setting -DBINARY_PACKAGE_BUILD=OFF not sufficient?
If not, we probably need to refactor that file and add a case like elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|[Aa][Rr][Mm]64") while taking APPLE and AppleClang into account as well...
Thank you @kmilos. I test this tonight. It must be good with '-march=native' and my -march not necessary : https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/AArch64-Options.html#aarch64%2dfeature%2dmodifiers 3.18.1.1 -march and -mcpu Feature Modifiers (...) ‘fp’ Enable floating-point instructions. This is on by default for all possible values for options -march and -mcpu. ‘simd’ Enable Advanced SIMD instructions. This also enables floating-point instructions. This is on by default for all possible values for options -march and -mcpu.
As far as I understand ! ;-)
Thank you @kmilos.
DT compile with initial
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${MARCH} ${DT_REQ_INSTRUCTIONS} -g")
and benchmark runs and DT open my RAF/XMP.
I'd like to have an Apple M2 or M3 to test...
@zisoft @MStraeten ping :-)
I'd like to have an Apple M2 or M3 to test...
I'm still on an Intel Mac, unfortunately...
I don’t know what to do to help - my experience in setting up a cmke build environment is just being able to ask ChatGPT ;) But I can do a build if I get a proper macports based description like in https://github.com/darktable-org/darktable/blob/master/packaging/macosx/BUILD.txt.
Thank you. I'll have a look. I discover a lot of things then I am not very fast. After Christmas... 😉
Hi, I have been out for personal reasons for a few week. I manage to rebase my little code on last dev version and make a little doc to test on Apple M* this weekend. Sorry for the delay.
This issue has been marked as stale due to inactivity for the last 60 days. It will be automatically closed in 300 days if no update occurs. Please check if the master branch has fixed it and report again or close the issue.