`remove` vectorization
📜 The algorithm
An in-place remove algorithm that uses the bit mask of vector comparisons as a shuffle index to remove certain elements.
The destination is advanced by a size taken from a lookup table too, although `popcnt` could have been used.
The details vary depending on element size:
- The tables are selected to have up to 256 entries, to process up to 8 elements at once. Bigger tables don't fit top-level caches well. Some tables are smaller, when fewer than 8 elements fit the vector.
- 8- and 16-bit variants use SSE only, as 8 elements of 8 or 16 bits fit an SSE register, and more elements would take a bigger table. Also, there's no cross-lane shuffle within SSE4.2 / AVX2 that works with elements of such sizes. They use the famous `pshufb`/`_mm_shuffle_epi8` to remove elements.
- 32- and 64-bit variants use AVX2, and the 64-bit variant uses only a table of 16 entries, as there are up to 4 elements only. They use `vpermd`/`_mm256_permutevar8x32_epi32` to remove elements, which is cross-lane. SSE fallbacks with smaller tables are used, and are still surprisingly more efficient than scalar code.
- We need a contiguous mask of bits, a single bit per element, not a few bits for the same comparison. 8-bit uses `pmovmskb`. 32- and 64-bit use `vmovmskps`/`vmovmskpd`; though these are for floating-point types, they fit well and avoid the need for cross-lane swizzling to compress the mask. For 16-bit, `packsswb` is used, although `pshufb` could have been used as well.
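To make the table-driven compress step concrete, here is a scalar model of the idea (names and layout are mine, not the actual MSVC STL internals): for each possible 8-bit comparison mask, a precomputed entry stores the source indices of the kept elements in order, which is exactly the kind of control data `pshufb`/`vpermd` consume, plus the count by which the destination advances.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the 256-entry compress table.
// For each 8-bit mask (bit set = element removed), the entry stores the
// indices of the kept elements in order, plus how many survive.
struct CompressEntry {
    std::array<std::uint8_t, 8> shuffle; // source index for each output slot
    std::uint8_t kept;                   // how many elements survive
};

constexpr std::array<CompressEntry, 256> make_table() {
    std::array<CompressEntry, 256> table{};
    for (int mask = 0; mask < 256; ++mask) {
        std::uint8_t out = 0;
        for (std::uint8_t src = 0; src < 8; ++src) {
            if ((mask & (1 << src)) == 0) { // bit clear: element is kept
                table[mask].shuffle[out++] = src;
            }
        }
        table[mask].kept = out; // could also be popcnt of the inverted mask
    }
    return table;
}

// Compact one 8-element chunk; returns the number of elements written.
// The real code does the gather in a single shuffle instruction.
inline std::size_t compress8(const std::uint32_t* src, std::uint32_t* dst,
                             std::uint8_t remove_mask) {
    static constexpr auto table = make_table();
    const CompressEntry& e    = table[remove_mask];
    for (std::uint8_t i = 0; i < e.kept; ++i) {
        dst[i] = src[e.shuffle[i]];
    }
    return e.kept;
}
```

The destination advance (`kept`) is read from the table here, mirroring the lookup-table choice described above rather than the `popcnt` alternative.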
🔍 Find first!
Before even starting, a find is performed to locate the first element to remove. This is done for correctness, and there are also performance reasons why it is good:
- Correctness. [algorithms.requirements]/3 states: *For purposes of determining the existence of data races, algorithms shall not modify objects referenced through an iterator argument unless the specification requires such modification.* Whereas [alg.remove] is vague on how the algorithm should work, I think we should only write to elements that have to be written to
- Vectorization. We can always use the full AVX2 vector size as the step, not only for 32- and 64-bit elements
- Memory bandwidth. The vectorized algorithm might be memory bound, saving writes may make it faster
- Number of operations. Fewer operations are needed to merely test the contents
The existing find implementation is called. Hypothetically I could implement it inline and save some instructions in some cases, but such an optimization would have a negligible effect on performance while noticeably increasing complexity. This might be revisited for a future remove_copy, if that and this were to share the implementation.
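The overall structure can be sketched in scalar form (a hypothetical model, not the actual vectorized implementation): the initial find guarantees that nothing before the first match is ever written, and the in-place compaction starts only from there.

```cpp
#include <algorithm>

// Hypothetical scalar model of the algorithm's structure: find first,
// then compact in place starting from the first match.
template <class T>
T* remove_model(T* first, T* last, const T& value) {
    first = std::find(first, last, value); // no writes before the first match
    T* dest = first;
    for (T* cur = first; cur != last; ++cur) {
        if (!(*cur == value)) {
            *dest++ = *cur; // vectorized code compacts up to 8 elements per step
        }
    }
    return dest; // new logical end
}
```

In the vectorized version, the loop body is the mask-and-shuffle compress described above, but the find-first structure is the same.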
⚠️ Correctness doubt - superfluous writes
The algorithm removes elements from the source vector (of 8 or fewer elements) with a shuffle operation, so that non-removed elements are placed contiguously in that vector. Then it writes the whole vector to the destination, and advances the destination pointer by the number of non-removed elements.
As a result:
- In the remaining range, some elements are overwritten with intermediate values before they are overwritten with the expected values
- In the removed range, some elements (up to 8 of them) are overwritten with values of other elements, and never restored
I have no doubts that overwriting elements in the resulting range to some intermediate values before setting them to the expected values is correct. The write and the data race (in abstract machine terms) exist anyway, so the extra write is not observable.
I have concerns regarding damaging the removed range. Changing these values is observable.
I'd appeal to the fact that elements in the removed range stay in a valid-but-unspecified state, although I understand that the purpose of the standard saying that is to enable moving of non-trivially-copyables, not to do what I did.
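The effect on the removed range can be demonstrated with a scalar model of the full-width store (hypothetical names, for illustration only): the entire 8-element compacted chunk is written, but the destination advances only by the number of kept elements, so the trailing slots receive copies of other elements.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of the full-width store: all 8 lanes are written to dst
// (one unmasked vector store in the real code), but the caller advances
// dst only by `kept`. Slots dst[kept..7] get copies of other elements;
// if they lie in the removed range they are never restored.
inline std::size_t store_full_chunk(const std::uint32_t* compacted,
                                    std::uint32_t* dst, std::size_t kept) {
    for (std::size_t i = 0; i < 8; ++i) {
        dst[i] = compacted[i];
    }
    return kept; // caller advances dst by this amount only
}
```

This is why up to 8 elements past the resulting range can end up with values that were never there originally.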
Note that:
- It is possible to avoid any superfluous writes, but doing so has some cost
- The cost of avoiding superfluous writes is small for 32- and 64-bit elements, and larger for 8- and 16-bit elements
- When/if vectorizing `remove_copy` in a similar way, it would have to avoid superfluous writes anyway
⏱️ Benchmark results
| Benchmark | main | this |
|---|---|---|
| r<alg_type::std_fn, std::uint8_t> | 944 ns | 294 ns |
| r<alg_type::std_fn, std::uint16_t> | 1470 ns | 297 ns |
| r<alg_type::std_fn, std::uint32_t> | 1059 ns | 403 ns |
| r<alg_type::std_fn, std::uint64_t> | 1498 ns | 884 ns |
| r<alg_type::rng, std::uint8_t> | 1208 ns | 307 ns |
| r<alg_type::rng, std::uint16_t> | 1386 ns | 288 ns |
| r<alg_type::rng, std::uint32_t> | 1218 ns | 397 ns |
| r<alg_type::rng, std::uint64_t> | 1411 ns | 842 ns |
Irony: a PR that adds vectorization entitled "remove vectorization".
> - Correctness. [algorithms.requirements]/3 states: *For purposes of determining the existence of data races, algorithms shall not modify objects referenced through an iterator argument unless the specification requires such modification.* Whereas [alg.remove] is vague on how the algorithm should work, I think we should only write to elements that have to be written to
There's a caveat: only the elements before the first removed one are not written. Elements after the resulting range (up to 8 of them) may be overwritten with values of some other elements. That's a valid-but-unspecified state, as in [alg.remove]/7.
If that's not fine, we'd have to use AVX2 masked stores, which would be slower. For 8- and 16-bit elements it is possible to handle this with a temporary destination, but that's even slower.
`remove_copy`, if it succeeds, is going to be very similar, but would always use an AVX2 mask, and so would be available only for 32- and 64-bit elements. I can try this within the same PR, although it seems big already.
Modern AMD data would be interesting. To avoid tricky swizzling, I added mixing of integers and floats in 39974d1. The overall results are better for me, but this mixing seems to carry a bigger penalty on AMD than on Intel.
> `remove_copy`, if it succeeds, is going to be very similar, but would always use an AVX2 mask, and so would be available only for 32- and 64-bit elements. I can try this within the same PR, although it seems big already.
I'm now seeing multiple possible solutions for `remove_copy` for 8- and 16-bit elements.
Thanks for the detailed writeup! I have no correctness concerns here - the Standard has a note that the garbage values are valid-but-unspecified, so we're fully within our rights to leave totally random values in there, even if the range originally contained (for example) only 10 and 20.
"If you want partition, you know where to find it."
Thanks! :heart_eyes_cat: Pushed changes as usual, the most significant one making the `<algorithm>`/`<xmemory>` layer responsible for performing the initial find. Good results on my 5950X:
| Benchmark | Before | After | Speedup |
|---|---|---|---|
| r<alg_type::std_fn, std::uint8_t> | 1291 ns | 360 ns | 3.59 |
| r<alg_type::std_fn, std::uint16_t> | 1291 ns | 338 ns | 3.82 |
| r<alg_type::std_fn, std::uint32_t> | 1285 ns | 491 ns | 2.62 |
| r<alg_type::std_fn, std::uint64_t> | 1504 ns | 1151 ns | 1.31 |
| r<alg_type::rng, std::uint8_t> | 1317 ns | 355 ns | 3.71 |
| r<alg_type::rng, std::uint16_t> | 1250 ns | 330 ns | 3.79 |
| r<alg_type::rng, std::uint32_t> | 1330 ns | 489 ns | 2.72 |
| r<alg_type::rng, std::uint64_t> | 2090 ns | 1082 ns | 1.93 |
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
I've pushed a merge with main to resolve merge conflicts in search.cpp (structured as one commit to replicate the stupid thing I did in MSVC, followed by a proper fix, so I can mirror the latter).
Now, this directly constructs std::string from the std::string_views src_haystack and src_needle, and constructs std::vector from their .begin() and .end(). These are stylistic improvements to reduce verbosity.
Thanks for removing time when users call remove! :joy_cat: :zany_face: :stopwatch: