STL
STL copied to clipboard
std::swap of arrays, why is there no specialization for trivial types
I was looking at what swap was doing for arrays and in the MSVC stl it loops trough the array and swap each element. (libstdc++ and libc++ do the same thing).
The thing is, it was an array of std::byte, I was surprised that none of the STL have specializations for trivially copyable/movable types that would call memcopy. Now I figure, this is maybe how the spec is. So my first question is there something preventing those specialization from existing.
My second question which is maybe off-topic for this is even tough clang and gcc also do a loop, "I believe" their auto vectorizer see the pattern and emits SSE instructions but MSVC doesn't it still loops.
Here is a compiler explorer link https://godbolt.org/z/W8GxPv1ov
Now this is maybe something more for the compiler team, I don't know. In any case I saw the differences and I thought I would point it out.
Thank you
I think this is a compiler issue, as libstdc++'s implementation of std::swap is naive.
Perhaps we can use the same mechanism as that of std::swap_ranges for vectorization, but I don't know if it is suitable.
@monamimani , note that we can't simply use memcpy/memmove as they don't swap, they overwrite (so we'd need temporary storage).
@frederick-vs-ja , I believe that the vectorized swap_ranges implementation is indeed applicable to std::swap() for built-in arrays; I also observe that std::array::swap already uses it:
https://github.com/microsoft/STL/blob/314f65f831f827dbfb05ebcc0f08c496f5ddf64f/stl/inc/array#L461-L463 https://github.com/microsoft/STL/blob/314f65f831f827dbfb05ebcc0f08c496f5ddf64f/stl/inc/xutility#L6129-L6143
@StephanTLavavej, of course it would need temporary storage like swapping ints. Sorry that I wasn't clear enough. :)
Because I feel this is more an issue with the vectorizer I also opened an issue on the Developper Community
I can implement this - it just needs a bit of surgery in <xutility> and <utility>. PR coming soon...
I can implement this - it just needs a bit of surgery in
<xutility>and<utility>. PR coming soon...
Thank you!
But I found that <utility> is categorized as a core header in MSVC STL, while <tuple> is not. I'm afraid that it is possibly the vectorized algorithms declared in <xutility> that make <tuple> non-core. BTW, I think <tuple> should also be a core header.
Great thanks!
I need to revert my PR as this broke users of std::swap who were including only <utility>, see #2699. Reopening this issue as we may be able to solve this later, but we need to carefully think about users like ATL who are including <type_traits> and expecting no IDL mismatch pragmas to be dragged in. (We could possibly separate out the pragma that drags in the import lib / static lib from the rest of the <yvals.h> machinery.)
As I figured, at the core it might be a Vectorizer issue I also opened an issue over at the Developper Community, here
They have acknowledge the issue and are working on it, it may take time. People that have access might be able to know more. I at least wanted to tell you they are aware of this.
I don't exactly know what the mechanism of std::swap_ranges does but, as a stop gap is it possible to just do a specialization for trivial type, constrain it and use memcpy? I bet the standard specify more and would prevent that.
Why not do something stupid like this?
std::byte storageTmp[Size];
std::memcpy(storageTmp, arrayA, Size);
std::memcpy(arrayA, arrayB, Size);
std::memcpy(arrayB, storageTmp, Size);
I am pretty sure that something that I don't know about will make this none desirable, but this should get vectorized.
I might add that from their own page The 13xx reason codes apply to the vectorizer.
void code_1300(int *A, int *B)
{
// Code 1300 is emitted when the compiler detects that there is
// no computation in the loop body.
for (int i=0; i<1000; ++i)
{
A[i] = B[i]; // Do not vectorize, instead emit memcpy
}
}
It says it should emit memcpy. but it doesn't. :)
Thanks - the autovectorizer may not be able to deal with the swap algorithm, but we do plan to properly fix this in the STL in the future. (Our vectorized algorithm is faster than memcpy, by the way.)
There's a possibility to implement that in headers using memcpy and fixed power-of-two portions.
See how copying by 32-bit portions is vectorized using SSE2 and AVX2: https://godbolt.org/z/YcEPz848W
Note how despite using immediate buffer in C++ it is not used in the assembly.
Can do this in two or more loops with descending sizes to handle the variety of sizes.
The exact portions size and the way of handling the tail (memcpy of variable length vs manual loop vs unrolled loop) is sort of trade-off between code size and performance.