darktable Testing Simde to convert SSE2 to NEON on ARM (radxa rock5b sbc)

Is your feature request related to a problem? Please describe.

I wanted to test if DT could be build (and used !) with SIMDE (https://github.com/simd-everywhere/simde/tree/master ) which translate SSE2 intrinsic to NEON intrinsic. Could maybe have an interest too for Apple arm users. My (little) problem was to try to speed up DT on my SBC (Radxa rock5b with 16GB RAM, CPU RK3588, GPU Mali G610). And curiosity ! ;-) Describe the solution you'd like

I do not ask for anything but want to share the results and the gain I had. It could maybe give ideas to developers (thank you for all the work !)

Alternatives

I made a dirty patch on darktable 4.5.0+1421~g4098679849 (cloned a few days ago). Just :

made a fake SSE/SSE2 compatibility to clang with : -D__SSE__ -D__SSE2__
forced simde to keep the name of real SSE/SSE2 intrinsic with : -DSIMDE_ENABLE_NATIVE_ALIASES
managed some problem on SDL by deactivating support of SSE/SSE2 with some other clang -D
add compile options to clang : -march=armv8-a+fp+simd+crypto+crc After that I changed the 's of all DT sources following the rules given by simde (with some simple "replace in project" calls) :

+#    mmintrin.h → simde/x86/mmx.h
+#    xmmintrin.h → simde/x86/sse.h
+#    emmintrin.h → simde/x86/sse2.h
+#    pmmintrin.h → simde/x86/sse3.h
+#    tmmintrin.h → simde/x86/ssse3.h
+#    smmintrin.h → simde/x86/sse4.1.h
+#    nmmintrin.h → simde/x86/sse4.2.h

The result is that DT-simde runs well as far as I can see.

I made some tests with DT benchmard (with v4.2 XMP) and, on CPU only (problem with my OpenCL configuration...), the troughput rating is at max (on 4 threads) 12% higher with simde (see graph below) : DT-simde

Additional context

I haven't wrote C code since the year 2000... I'm not sure this is of interest for the project but I made a patch (first time with Git...) if you want more details on the changes I made for this test only. I am a new user of DT and only make some works with it the last weeks.

Dec 13 '23 23:12 StDudule

It's quite likely that nearly all of the speedup you saw comes from a single occurrence of an SSE intrinsic - the mm_stream_ps in copy_pixel_nontemporal in src/develop/imageop.h. It probably makes more sense to add an implementation of that function which directly calls the corresponding NEON intrinsic rather than hacking in a bunch of alternate compilation flags and includes to be able to use a 'compatibility' library. That would have benefit for any ARM compilation, whether or not SIMDE is available.

Most of the SSE code remaining in darktable is due to the legacy colorbalance (non-RGB) module, for which the SSE code is still faster than the compiler's autovectorized code on x86. That pulls in the implementations of mm_exp2_ps and mm_log_ps needed for mm_pow_ps in src/common/sse.h and the remaining SSE colorspace conversions in src/common/colorspaces_inline_conversions.h.

Dec 14 '23 06:12 ralfbrown

Ok.Thank you @ralfbrown. Very interesting. I'll have a look at this intrinsic.

Dec 14 '23 08:12 StDudule

@ralfbrown , you were right. Just modified src/develop/imageop.h like follow :

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

(...)

static inline void copy_pixel_nontemporal(
	float *const __restrict__ out,
        const float *const __restrict__ in)
{
#if defined(__SSE__)
  _mm_stream_ps(out, *((__m128*)in));
#elif defined(__aarch64__)
  vst1q_f32(out, *((float32x4_t *)in));
#elif (__clang__+0 > 7) && (__clang__+0 < 10)
  for_each_channel(k,aligned(in,out:16)) __builtin_nontemporal_store(in[k],out[k]);
#else
  for_each_channel(k,aligned(in,out:16) dt_omp_nontemporal(out)) out[k] = in[k];
#endif
}

on the same version of DT (to be comparable) and I have Throughput rating 6.4% higher than the "normal" (non modified) DT with the same parameters ( -C -t 4 -v4.2 ) It's more than half of 12% gain with Simde with just little change.

The only other change is -march=armv8-a+fp+simd+crypto+crc forced in the clang call because I'm not easy at all with CMake...

Dec 14 '23 22:12 StDudule

Good to hear. It occurred to me after my previous response that one other function in that file needs to be updated for safety - nontemporal writes can violate the normal write-ordering rules, so there should be a memory fence after the module finishes processing the image buffer. That's provided by dt_sfence, defined right below copy_pixel_nontemporal.

A quick check looking into https://github.com/DLTcollab/sse2neon/blob/master/sse2neon.h indicates that the standard C/C++ atomic_thread_fence function should be used as a replacement for _mm_sfence, which is already in place for the non-SSE path. I guess NEON doesn't have a separate store fence (which is faster, but dt_sfence gets called infrequently enough that a slower full fence doesn't actually make a measurable difference).

I'm not too familiar with CMake either, so if one of the other devs could chime in, I think we can make a PR between us (won't make it into 4.6 of course, but 4.6.1 is likely). BTW, is the +crypto+crc actually needed? I don't think darktable uses any crypto or CRC functions, so it wouldn't need the corresponding instructions.

A 6% overall speedup on a system of dt's size is a pretty good return for adding five lines of code....

Dec 14 '23 23:12 ralfbrown

Happy to help ! I will look for fences this weekend. I think there is something in NEON.

Dec 15 '23 10:12 StDudule

As far as CMake is concerned, you probably need to play w/ https://github.com/darktable-org/darktable/blob/master/cmake/march-mtune.cmake

Is ensuring -march=native through setting -DBINARY_PACKAGE_BUILD=OFF not sufficient?

If not, we probably need to refactor that file and add a case like elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|[Aa][Rr][Mm]64") while taking APPLE and AppleClang into account as well...

Dec 15 '23 15:12 kmilos

Thank you @kmilos. I test this tonight. It must be good with '-march=native' and my -march not necessary : https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/AArch64-Options.html#aarch64%2dfeature%2dmodifiers 3.18.1.1 -march and -mcpu Feature Modifiers (...) ‘fp’ Enable floating-point instructions. This is on by default for all possible values for options -march and -mcpu. ‘simd’ Enable Advanced SIMD instructions. This also enables floating-point instructions. This is on by default for all possible values for options -march and -mcpu.

As far as I understand ! ;-)

Dec 15 '23 16:12 StDudule

Thank you @kmilos. DT compile with initial set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${MARCH} ${DT_REQ_INSTRUCTIONS} -g") and benchmark runs and DT open my RAF/XMP. I'd like to have an Apple M2 or M3 to test...

Dec 15 '23 22:12 StDudule

@zisoft @MStraeten ping :-)

Dec 20 '23 05:12 jenshannoschwalm

I'd like to have an Apple M2 or M3 to test...

I'm still on an Intel Mac, unfortunately...

Dec 20 '23 07:12 zisoft

I don’t know what to do to help - my experience in setting up a cmke build environment is just being able to ask ChatGPT ;) But I can do a build if I get a proper macports based description like in https://github.com/darktable-org/darktable/blob/master/packaging/macosx/BUILD.txt.

Dec 20 '23 08:12 MStraeten

Thank you. I'll have a look. I discover a lot of things then I am not very fast. After Christmas... 😉

Dec 21 '23 19:12 StDudule

Hi, I have been out for personal reasons for a few week. I manage to rebase my little code on last dev version and make a little doc to test on Apple M* this weekend. Sorry for the delay.

Jan 30 '24 12:01 StDudule

This issue has been marked as stale due to inactivity for the last 60 days. It will be automatically closed in 300 days if no update occurs. Please check if the master branch has fixed it and report again or close the issue.

Mar 31 '24 00:03 github-actions[bot]