xsimd icon indicating copy to clipboard operation
xsimd copied to clipboard

dispatch assumes that target architectures are supported

Open tomjnixon opened this issue 2 years ago • 6 comments

First, sorry for the long issue, or if it's wildly off-base.

Compiler architecture flags change both the set of instructions available (through intrinsics) and the set of instructions used by normal code. This means that in general it is not safe to run a program compiled with -mavx2 (for example) on a CPU that doesn't support AVX2, even if any code using AVX2 intrinsics is disabled at run-time, because the compiler is free to generate them for normal code too.

I've demonstrated this here: https://gcc.godbolt.org/z/Mvf5GxGGe

The way I've implemented this before is to compile vector functions multiple times for different architectures (producing different symbols), and link them with the rest of the application and dispatch code which is compiled for the lowest supported architecture.

This seems messy, but I think it's the only way to implement run-time dispatch reliably. I guess it's been fine up to now because most people are running a CPU supporting AVX2+FMA and aren't enabling options to generate avx512, but that could change.

This way of doing things should work fine with xsimd, but there's a couple of issues:

  • dispatch checks that all architectures are supported, but when compiled with the lowest supported architecture they will not all be. Could this check just be removed?

    The default architecture list only contains supported architectures, and the dispatched function would fail to compile anyway. The downside would be worse error messages if you pass an unsupported architecture to dispatch, but it's only as bad as manually trying to use a batch with an unsupported architecture, and that could probably be made better with a static_assert in batch instead.

  • The documentation doesn't mention this. Explaining it might be a bit tricky and off-putting, especially without an example and/or cmake support, so maybe that should come first?

tomjnixon avatar Nov 04 '21 14:11 tomjnixon

This means that in general it is not safe to run a program compiled with -mavx2 (for example) on a CPU that doesn't support AVX2, even if any code using AVX2 intrinsics is disabled at run-time, because the compiler is free to generate them for normal code too.

The only thing that xsimd's Arch template parameter changes is the register size. The compiler is free to optimize it to match the -march target, as shown here: https://gcc.godbolt.org/z/r5qxcYrj3.

The way I've implemented this before is to compile vector functions multiple times for different architectures (producing different symbols), and link them with the rest of the application and dispatch code which is compiled for the lowest supported architecture.

Vc does a similar thing, it's the way we use at Krita to runtime dispatch correctly: https://github.com/VcDevel/Vc/blob/1.4/cmake/VcMacros.cmake#L447

dispatch checks that all architectures are supported, but when compiled with the lowest supported architecture they will not all be. Could this check just be removed?

There's something worse, in the (unlikely) case an architecture isn't available, it will crash with an assertion. Shouldn't it have the generic arch as a lowest common denominator?

amyspark avatar Nov 04 '21 15:11 amyspark

This means that in general it is not safe to run a program compiled with -mavx2 (for example) on a CPU that doesn't support AVX2, even if any code using AVX2 intrinsics is disabled at run-time, because the compiler is free to generate them for normal code too.

The only thing that xsimd's Arch template parameter changes is the register size. The compiler is free to optimize it to match the -march target, as shown here: https://gcc.godbolt.org/z/r5qxcYrj3.

Right, yeah; that's even worse.

The way I've implemented this before is to compile vector functions multiple times for different architectures (producing different symbols), and link them with the rest of the application and dispatch code which is compiled for the lowest supported architecture.

Vc does a similar thing, it's the way we use at Krita to runtime dispatch correctly: https://github.com/VcDevel/Vc/blob/1.4/cmake/VcMacros.cmake#L447

Yeah, that looks familiar. Shame it's so complicated to make it work with lots of compiler versions.

dispatch checks that all architectures are supported, but when compiled with the lowest supported architecture they will not all be. Could this check just be removed?

There's something worse, in the (unlikely) case an architecture isn't available, it will crash with an assertion. Shouldn't it have the generic arch as a lowest common denominator?

That seems like something for the user to chose. For me this is fine as I don't support really old CPUs and so it would be a waste of code size, but I could see the argument for it to be added to supported_architectures. (As far as i can tell available() always returns true anyway so the assert doesn't fire, but still...)

tomjnixon avatar Nov 04 '21 16:11 tomjnixon

https://github.com/xtensor-stack/xsimd/pull/675 proposes an approach to fix that issue, I'd happily take feedbacks.

serge-sans-paille avatar Feb 22 '22 16:02 serge-sans-paille

Thanks.

I think the issues in the code were fixed in #635, as your test shows. It looks sensible, though I'd be interested to know whether the xsimd functions used end up being compiled for the right architecture. I've used intel SDE in the past to test this kind of thing -- it would be really cool to see that integrated into the test suite.

The remaining problems for me are that it's not easy to do this correctly (i.e. through cmake or some clever macros), and that the docs imply that it's much easier than it is. These are perhaps separate issues that could be more clearly described, though.

tomjnixon avatar Feb 22 '22 16:02 tomjnixon

https://github.com/xtensor-stack/xsimd/pull/675 proposes an approach to fix that issue, I'd happily take feedbacks.

I'd say to templatize the architecture parameter or make it part of the function signature, the current dispatch is very much still unreadable.

The remaining problems for me are that it's not easy to do this correctly (i.e. through cmake or some clever macros), and that the docs imply that it's much easier than it is. These are perhaps separate issues that could be more clearly described, though.

Though this is the real problem. The CMake macros I planned for this are definitely in the hellish side of the aisle, and https://github.com/xtensor-stack/xsimd/issues/643#issuecomment-1014924684 doesn't make it any less unbearable.

amyspark avatar Feb 22 '22 16:02 amyspark

Thanks for the feedback. I force-pushed a documentation update in #675, reverting to the previous API. I think that with that documentation, the CMake approach becomes super obvious: one just needs to compile every template explicit instantiation in a separate source with the appropriate arch flag.

serge-sans-paille avatar Feb 26 '22 16:02 serge-sans-paille

No extra request on documentation → I'm closing the issue, I consider the original issue addressed.

serge-sans-paille avatar Oct 14 '22 07:10 serge-sans-paille