xsimd Implementing intrinsics that were released along a wider register type

Some intrinsics for a register of size `N` are introduced in the same generation that introduces a register of size `2N`.

_mm_srlv_epi32 (128 bits) is introduced in AVX2, along with _mm256
Plenty of 128-bit and 256-bit APIs are introduced in AVX512

I'm wondering how to implement that in xsimd. My best understanding is that when compiling with AVX2, xsimd::make_sized_batch<uint8_t, 16> will yield a batch on the xsimd::sse4_2 architecture, so the dispatch mechanism cannot know from requires_arch that the AVX2 128-bit instruction is available.

One way to work around it is to use if constexpr(supported_architectures::contains<avx2>()), but that seems to duplicate the dispatch mechanism.
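To make the workaround concrete, here is a minimal self-contained sketch. It does not use xsimd: the tags, `arch_list`, `batch` and the `chosen` enum are toy stand-ins invented for illustration, and the build is assumed to target AVX2. It only shows why the check ends up hand-written inside the sse4_2 kernel.

```cpp
#include <cassert>
#include <type_traits>

// Toy stand-ins for architecture tags; these are not the real xsimd types.
struct generic {};
struct sse4_2 : generic {};
struct avx2 : sse4_2 {}; // the real hierarchy has more levels in between

// Toy compile-time list of the architectures enabled by the compiler flags.
template <class... Archs>
struct arch_list
{
    template <class A>
    static constexpr bool contains() noexcept
    {
        return (std::is_same<A, Archs>::value || ...);
    }
};

// Assumption for this sketch: the build targets AVX2.
using supported_architectures = arch_list<avx2, sse4_2, generic>;

// Toy batch that only carries its architecture tag.
template <class T, class A>
struct batch {};

enum class chosen { lane_by_lane, srlv_128 };

// Kernel for a 128-bit batch: A is sse4_2 even in an AVX2 build, so the
// AVX2 check has to be written by hand inside the sse4_2 code path.
template <class A>
chosen shift_right_variable(batch<unsigned, A> const&) noexcept
{
    if constexpr (std::is_same<A, sse4_2>::value
                  && supported_architectures::contains<avx2>())
    {
        return chosen::srlv_128; // where _mm_srlv_epi32 would be called
    }
    else
    {
        return chosen::lane_by_lane; // emulation with older instructions
    }
}
```

Every kernel that has a "narrow intrinsic from a wider generation" would need the same hand-written branch, which is the duplication mentioned above.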

Another possibility could be to decouple the architecture from the register type.

What do you think @JohanMabille @serge-sans-paille ?

Oct 30 '25 16:10 AntoinePrv

I was thinking of the same problem. My guess for now would be to introduce sse_avx, sse_avx2 and sse_vl registers in the hierarchy.

Then make_sized_batch returns the appropriate one. In avx, avx2 and avx512 we need to override the forward-to-sse and forward-to-avx functions.

An alternative is to have an avx<sse> class, as is done with fma.

PS: this is related to #1009 so probably requires some more thoughts.

Oct 30 '25 19:10 DiamonDinoia

I think there are two orthogonal problems here. The first one is the way we represent the instruction set extensions in xsimd. This is a topic we've been discussing for quite a long time with Serge, and so far the idea is to be able to add "flavors" to the instruction set tag, either with template parameters (as suggested in #1009 and by @DiamonDinoia), or with expressions like "avx & fma".

The second one is that the arch we pass to the implementation functions is that of the batch (see https://github.com/xtensor-stack/xsimd/blob/master/include/xsimd/types/xsimd_api.hpp#L60 for instance). That could be fixed with something like:

    template <class T, class A>
    XSIMD_INLINE batch<T, A> abs(batch<T, A> const& x) noexcept
    {
        detail::static_check_supported_config<T, A>();
        return kernel::abs<A>(x, detected_arch{});
    }
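A rough, self-contained sketch of how such a split could behave is given here. It is a toy model rather than xsimd code: the tags, `batch`, the `chosen` enum and the `srl_variable` kernels are made up for illustration, and `detected_arch` is assumed to name the best instruction set enabled at compile time. The template argument `A` selects the register type, while the tag argument selects the instructions.

```cpp
#include <cassert>
#include <type_traits>

// Toy architecture tags; not the real xsimd types.
struct generic {};
struct sse4_2 : generic {};
struct avx2 : sse4_2 {};

// Assumption: stands for the best architecture enabled at compile time.
using detected_arch = avx2;

template <class T, class A>
struct batch {};

enum class chosen { lane_by_lane, srlv_128 };

namespace kernel
{
    // Fallback for any register type A, whatever instruction set is enabled.
    template <class A>
    chosen srl_variable(batch<unsigned, A> const&, generic) noexcept
    {
        return chosen::lane_by_lane;
    }

    // 128-bit registers (A == sse4_2) when AVX2 instructions are enabled.
    template <class A,
              typename std::enable_if<std::is_same<A, sse4_2>::value, int>::type = 0>
    chosen srl_variable(batch<unsigned, A> const&, avx2) noexcept
    {
        return chosen::srlv_128; // where _mm_srlv_epi32 would be called
    }
}

// API entry point: A comes from the batch, the tag from the build flags.
template <class A>
chosen srl_variable(batch<unsigned, A> const& x) noexcept
{
    return kernel::srl_variable<A>(x, detected_arch{});
}
```

In this toy, a `batch<unsigned, sse4_2>` reaches the AVX2-aware kernel because overload resolution prefers the exact `avx2` tag over the derived-to-base conversion to `generic`.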

Oct 31 '25 08:10 JohanMabille

My naive thinking is to use XSIMD_DECLARE_SIMD_REGISTER_ALIAS to declare an sse_avx register that inherits from sse4_2 and override just the kernels that benefit from avx on sse. May I ask where this falls apart?

Nov 01 '25 19:11 DiamonDinoia

> My naive thinking is to use XSIMD_DECLARE_SIMD_REGISTER_ALIAS to declare an sse_avx register that inherits from sse4_2 and override just the kernels that benefit from avx on sse. May I ask where this falls apart?

That's the easy part. You also want this type to derive from sse4_2, so that automatic fallback works as expected.
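As an illustration of the fallback through inheritance, here is a small toy model, not xsimd code. The `sse_avx2` tag is the hypothetical type from this discussion, and `batch`, `requires_arch`, the `chosen` enum and the kernels are simplified stand-ins. Kernels without an `sse_avx2` overload are reached through the base class, while overridden ones win by exact match.

```cpp
#include <cassert>

// Toy architecture tags; not the real xsimd types.
struct generic {};
struct sse4_2 : generic {};
// Hypothetical tag: 128-bit registers with AVX2 instructions allowed.
struct sse_avx2 : sse4_2 {};

template <class T, class A>
struct batch {};

// Toy version of the tag parameter used to select kernel overloads.
template <class A>
using requires_arch = A const&;

enum class chosen { generic_impl, sse4_2_impl, sse_avx2_impl };

namespace kernel
{
    // add: only an sse4_2 kernel exists; sse_avx2 reaches it via inheritance.
    template <class A>
    chosen add(batch<unsigned, A> const&, requires_arch<sse4_2>) noexcept
    {
        return chosen::sse4_2_impl;
    }

    // srl_variable: generic emulation...
    template <class A>
    chosen srl_variable(batch<unsigned, A> const&, requires_arch<generic>) noexcept
    {
        return chosen::generic_impl;
    }

    // ...overridden where _mm_srlv_epi32 would be available.
    template <class A>
    chosen srl_variable(batch<unsigned, A> const&, requires_arch<sse_avx2>) noexcept
    {
        return chosen::sse_avx2_impl;
    }
}

// Entry points pass the batch's own architecture as the dispatch tag.
template <class A>
chosen add(batch<unsigned, A> const& x) noexcept
{
    return kernel::add<A>(x, A{});
}

template <class A>
chosen srl_variable(batch<unsigned, A> const& x) noexcept
{
    return kernel::srl_variable<A>(x, A{});
}
```

With this layout only the kernels that gain from the newer instructions need an extra overload; everything else keeps resolving to the sse4_2 or generic version.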

Now for the difficult part: what would be the naming scheme? So far we've used template composition, e.g.

fma3<sse4_2> to specify sse4.2 with fma3 extension.

so in that spirit we would have

avx512f<sse4_2> to specify sse4.2 with avx512f extensions.

Unfortunately we already use avx512f as an architectural type. But maybe

ext::avx512f<sse4_2> would be good? That way we would also have

ext::avx512f

This may mean we'd use ext::fma3 instead of fma3; that's an API break, but I'm fine with it.

I like that idea and can implement it before the release, but it's a non-negligible feature change, so I think it should be merged after the release. That way we can peacefully explore the consequences.

Nov 01 '25 22:11 serge-sans-paille

I agree that this change should be merged after the release. Regarding the scheme, we could keep backward compatibility by defining ext::fma3 as fma3 first, and then remove fma3 later when we decide to cut a major release.
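A possible shape for that transition, sketched in a stand-in `toy` namespace rather than in xsimd itself. All type definitions here are simplified assumptions: the real tags carry more members, and the actual layout of an `ext` namespace has not been decided in this thread.

```cpp
#include <cassert>
#include <type_traits>

namespace toy // stand-in namespace, not xsimd itself
{
    struct generic {};
    struct sse4_2 : generic {};

    // Existing spelling: an extension layered on top of a base architecture.
    template <class Base>
    struct fma3 : Base {};

    // Existing architecture tag that already owns the name avx512f.
    struct avx512f : generic {};

    namespace ext
    {
        // First step: the new spelling is just an alias of the old one,
        // so existing user code keeps compiling.
        template <class Base>
        using fma3 = toy::fma3<Base>;

        // New extension tag; living in ext avoids the clash with the
        // architecture tag of the same name.
        template <class Base>
        struct avx512f : Base {};
    }

    static_assert(std::is_same<ext::fma3<sse4_2>, fma3<sse4_2>>::value,
                  "both spellings name the same type");
    static_assert(std::is_base_of<sse4_2, ext::avx512f<sse4_2>>::value,
                  "the extension still falls back to sse4_2");
    static_assert(!std::is_same<avx512f, ext::avx512f<sse4_2>>::value,
                  "architecture tag and extension tag stay distinct");
}
```

At a later major release the roles could be swapped, with the definition moved into `ext` and the old top-level name removed.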

Nov 02 '25 22:11 JohanMabille

Sure, I also agree that this is something for after the release! I just wanted to brainstorm since the discussion was open

Nov 03 '25 00:11 DiamonDinoia