Non-temporal data transfers
- Adding stream API for non-temporal data transfers
- Adding xsimd::fence as a wrapper around std atomic for cache coherence
- Adding tests
~~Draft because I need to double check the API levels (i.e., I am not using AVX2 functions in AVX and so on).~~ I just wanted some feedback while I do the finishing touches.
Some generic thoughts:
- I'm unsure the fence belongs to xsimd, but I like being proven wrong, maybe show us a code example that uses it?
- `load_stream` or `stream_load` or `streaming_load`?
- On arm64, there's no support for non-temporal loads (https://developer.arm.com/documentation/100048/0100/level-1-memory-system/memory-prefetching/non-temporal-loads); the corresponding instructions do exist (LDNP/STNP), but I failed to find the related intrinsics.
- There seems to be something equivalent in RISC-V (see https://github.com/riscv-non-isa/riscv-c-api-doc/pull/47).
- I couldn't find anything for WebAssembly or Power. So that's quite a niche, but I'm fine with adding those.
- I went for `load_stream` and `store_stream` so that it is consistent with `[load|store]_[un]aligned`... (Also, `load_non_temporal` was too long and `load_nta` is not clear.)
- I added fence for convenience. I have no strong feelings about it. We can always think about adding it in the future. In the end, on x86, I was recently made aware that it is not needed in a single-core application. In parallel applications, `atomic` is likely to be imported anyway.
- About ARM and RISC-V, what about making our own intrinsics by wrapping the inline assembly? Sadly, I do not know ARM well enough to promise that I will be able to help.
Cheers, Marco
PS: SSE2 adds APIs for non-temporal stores of 32/64-bit scalars. I am not sure they fit within xsimd, though.
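
For reference, here is a minimal sketch of the kind of code where a fence matters after streaming stores. It uses the raw SSE intrinsics `_mm_stream_ps` and `_mm_sfence` rather than the API proposed in this PR, and falls back to plain stores on other targets. The helper names `stream_fill` and `produce` are hypothetical and exist only for illustration; `dst` is assumed to be 16-byte aligned.

```cpp
#include <atomic>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE_STREAM 1
#else
#define SKETCH_HAS_SSE_STREAM 0
#endif

// Hypothetical helper (not part of xsimd): fill n floats with value.
// Assumes dst is 16-byte aligned when the SSE path is taken.
inline void stream_fill(float* dst, std::size_t n, float value)
{
    std::size_t i = 0;
#if SKETCH_HAS_SSE_STREAM
    const __m128 v = _mm_set1_ps(value);
    for (; i + 4 <= n; i += 4)
    {
        // Non-temporal store: hints that the data should bypass the cache.
        _mm_stream_ps(dst + i, v);
    }
    // Streaming stores are weakly ordered; the store fence makes them
    // globally visible before any store issued after this point.
    _mm_sfence();
#endif
    // Scalar tail (or the whole range when the SSE path is unavailable).
    for (; i < n; ++i)
    {
        dst[i] = value;
    }
}

// Hypothetical producer: write the buffer, then publish a flag that a
// consumer thread would read with memory_order_acquire.
inline void produce(float* dst, std::size_t n, float value, std::atomic<bool>& ready)
{
    stream_fill(dst, n, value);
    ready.store(true, std::memory_order_release);
}
```

In a single-threaded program the fence could be dropped, which matches the point above about single-core applications. It only matters when another thread observes `ready` and then reads the buffer.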