cpp-TimSort
Replace std::vector for better speed; also fix clang C++11
First, sorry that this PR contains more than one thing. I have a few other changes I want to propose for timsort and I'll try to keep those in standalone PRs. Because testing this required working C++11 support, I ended up having to fix that bug as part of this performance work. I can break the libc++ changes out into a separate PR if you'd rather merge those first.
In my application I need to sort a vector of pointers. I was profiling timsort and was surprised at how much time was being spent in `copy_to_tmp()`. This method was implemented as:
```cpp
tmp_.clear();
tmp_.reserve(len);
GFX_TIMSORT_MOVE_RANGE(begin, begin + len, std::back_inserter(tmp_));
```
Unfortunately this performs badly on simple types like pointers. The `GFX_TIMSORT_MOVE_RANGE()` macro itself can reduce to a single `memmove()`, but using `std::back_inserter` (which is required for move-only types) breaks this optimization. Instead of a nice SSE-optimized memory copy you end up with an element-by-element loop, with a vector capacity check on every insertion.
As an experiment I did some metaprogramming so that TriviallyCopyable types would just do:
```cpp
tmp_.assign(begin, begin + len);
```
This basically fixed the problem, but it's still not quite perfect, since the non-trivial case is still doing extra capacity checks that aren't required. I did a more aggressive fix that replaced the `std::vector` use entirely with a custom data structure (since we don't really need a general-purpose vector here, just a managed array), which bought another couple percent in speed.
First I had to make two preparatory changes, though:
- The C++11 support was broken on libc++ (which is the default STL for most clang installs, including Xcode). The changes @mattreecebentley made to auto-detect C++11 were mostly good, but they specifically rejected anything with `_LIBCPP_VERSION` set; I'm not sure why. Rather than one large binary expression, I instead used a larger (but hopefully easier to understand) ifdef decision tree. This can also be overridden on the command line with `-DGFX_TIMSORT_USE_CXX11=[0|1]`. As before, we default to using C++11 constructs unless we see some evidence that they won't work; however, we now let modern versions of clang/libc++ pass.
- The parts of the `TimSort` class that don't vary based on `LessFunction` are now in their own set of classes, such as `TimSortState`. This is partially just a cleanup I needed to make some template metaprogramming less gross, but it's a good idea in any case. It's not unusual for a program to need to sort a type of data multiple ways, which means expanding the `TimSort<RandomAccessIterator, LessFunction>` template multiple times. With this change, the compiler can reuse the expansion of `TimSortState<RandomAccessIterator>` between them. If nothing else, this should compile faster.
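To illustrate the shape of that split, here's a rough sketch -- the names and members are mine, not the actual code in this PR:

```cpp
#include <iterator>
#include <vector>

// Everything that depends only on the iterator type lives here, so one
// instantiation can be shared by every comparator used with that iterator.
template <typename RandomAccessIterator>
struct SortStateSketch {
    typedef typename std::iterator_traits<RandomAccessIterator>::value_type value_t;
    std::vector<value_t> tmp;  // merge buffer
    // ... run stack, minrun computation, etc.
};

// Only the comparison-dependent merge/sort logic is parameterized on
// LessFunction; sorting the same value_type two different ways re-expands
// this template but reuses SortStateSketch<RandomAccessIterator>.
template <typename RandomAccessIterator, typename LessFunction>
struct SorterSketch {
    SortStateSketch<RandomAccessIterator> state;
    LessFunction less;
    // ... merge_collapse(), merge_at(), etc.
};
```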
Now with those things out of the way, I could get to the meat of the change: replacing the `std::vector<value_t> tmp_;` with a custom `TimSortMergeSpace<>` type.
This class allocates storage like a vector, but the only "setter" is a `move_in()` method that replaces its contents via move construction (similar to what the `std::back_inserter` loop was doing). We don't construct elements before they're needed (even if we've allocated space for them), so it works even for non-default-constructible types. The move loop is faster than before since we don't need to re-check capacity at every insertion.
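In rough outline, it looks something like this -- a minimal sketch with hypothetical names; the actual `TimSortMergeSpace<>` in the PR handles growth and exception safety more carefully:

```cpp
#include <cstddef>
#include <new>
#include <utility>

// Raw storage plus a move_in() "setter". Elements are only constructed when
// moved in, so T does not need to be default-constructible.
template <typename T>
class MergeSpaceSketch {
    T* storage_;            // uninitialized memory
    std::size_t size_;      // number of currently constructed elements
    std::size_t capacity_;  // number of elements the allocation can hold
public:
    MergeSpaceSketch() : storage_(0), size_(0), capacity_(0) {}
    ~MergeSpaceSketch() {
        destroy_all();
        ::operator delete(storage_);
    }

    // Replace the contents with moves of [begin, begin + len).
    template <typename Iter>
    void move_in(Iter begin, std::size_t len) {
        destroy_all();
        if (len > capacity_) {
            ::operator delete(storage_);
            storage_ = 0;
            capacity_ = 0;
            storage_ = static_cast<T*>(::operator new(len * sizeof(T)));
            capacity_ = len;
        }
        for (std::size_t i = 0; i < len; ++i, ++begin)
            ::new (static_cast<void*>(storage_ + i)) T(std::move(*begin));
        size_ = len;
    }

    T* begin() { return storage_; }
    T* end()   { return storage_ + size_; }

private:
    void destroy_all() {
        for (std::size_t i = 0; i < size_; ++i) storage_[i].~T();
        size_ = 0;
    }
};
```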
However, on C++11 we do even better: we use template specialization to provide an alternate implementation of this data type for types that pass `std::is_trivially_copyable<>`. The big advantage is that we can just use `std::memcpy()` to refill the merge buffer. The code is also simpler in general, since we don't need to worry about construction/destruction of the buffer elements.
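The dispatch itself is along these lines -- again only an illustrative sketch, and the memcpy path assumes the source iterator refers into contiguous storage so that `&*begin` covers the whole range:

```cpp
#include <cstddef>
#include <cstring>
#include <iterator>
#include <new>
#include <type_traits>
#include <utility>

// memcpy path for trivially copyable value types: one bulk copy, no
// per-element construction or destruction.
template <typename Iter, typename T>
void refill_buffer(Iter begin, std::size_t len, T* buf, std::true_type) {
    std::memcpy(buf, &*begin, len * sizeof(T));
}

// Fallback path: move-construct each element into the raw buffer.
template <typename Iter, typename T>
void refill_buffer(Iter begin, std::size_t len, T* buf, std::false_type) {
    for (std::size_t i = 0; i < len; ++i, ++begin)
        ::new (static_cast<void*>(buf + i)) T(std::move(*begin));
}

// Tag-dispatch on std::is_trivially_copyable to pick the path at compile time.
template <typename Iter>
void refill(Iter begin, std::size_t len,
            typename std::iterator_traits<Iter>::value_type* buf) {
    typedef typename std::iterator_traits<Iter>::value_type value_t;
    refill_buffer(begin, len, buf, std::is_trivially_copyable<value_t>());
}
```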
Since a lot of the overall cost of the timsort algorithm is spent merging, making this data structure as fast as possible is important. This change makes sorting randomized sequences about 10% faster when working with trivially-copyable types.
While I was there I also replaced the `std::vector<run> pending_` with my own `TimSortRunStack<>` type. This doesn't have the same importance for performance, but it's another place where we don't really need the full STL `vector` support -- just a simple resizable stack. Since I was replacing one vector, I thought it was more consistent to just replace both. This also removes the header dependency on `<vector>`.
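For reference, a minimal sketch of the kind of structure that suffices here (hypothetical names; the PR's `TimSortRunStack<>` may differ in details):

```cpp
#include <cstddef>
#include <memory>

// A "run" is just a (base, length) record, so a tiny grow-on-demand array
// with push/pop is all the pending-run stack requires.
struct run_sketch { std::size_t base; std::size_t len; };

class RunStackSketch {
    std::unique_ptr<run_sketch[]> data_;
    std::size_t size_;
    std::size_t capacity_;
public:
    RunStackSketch() : size_(0), capacity_(0) {}

    void push(run_sketch r) {
        if (size_ == capacity_) {
            std::size_t new_cap = capacity_ ? capacity_ * 2 : 8;
            std::unique_ptr<run_sketch[]> bigger(new run_sketch[new_cap]);
            for (std::size_t i = 0; i < size_; ++i) bigger[i] = data_[i];
            data_.swap(bigger);
            capacity_ = new_cap;
        }
        data_[size_++] = r;
    }
    void pop()                            { --size_; }
    run_sketch& top()                     { return data_[size_ - 1]; }
    run_sketch& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const              { return size_; }
};
```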
RESULTS: "make bench" on Xcode 10.1, Mac Pro:
-
RANDOMIZED SEQUENCE [int] size=100K Before: 0.851 After: 0.745
-
RANDOMIZED SEQUENCE [std::string] size=100K Before: 5.389 After: 3.735
The improvement with `int` is due to the vector replacement. The bigger improvement with `std::string` comes from making C++11 work with the libc++ STL, so that move optimizations get applied.
Huh, it looks like the linux+clang environment (not sure which STL) that Travis is using doesn't have the type traits stuff in the expected place:
```
clang++ -I. -I/opt/brew/include -L/opt/brew/lib -Wall -Wextra -g -std=c++11 test/test.cpp -lboost_unit_test_framework -o .bin/test-with-cpp11
In file included from test/test.cpp:11:
./timsort.hpp:430:61: error: no member named 'is_trivially_copyable' in
namespace 'std'
```
...which is odd, since I'm including `<type_traits>`. I'll have to investigate that a little more, although I might not have time until next weekend at the earliest. Hopefully others can look through this PR in the meantime even if it can't be merged yet.
Concerning your issue with Travis: the Linux Clang uses libstdc++ so you need to install a matching g++ along with Clang.
Otherwise, I had already made similar changes to the timsort implementation based off this one that I ship in cpp-sort: https://github.com/Morwenn/cpp-sort/blob/master/include/cpp-sort/detail/timsort.h
I only replaced the big vector by a `std::unique_ptr` and leveraged `std::move` to get the free optimizations for trivially copyable types. I once tried to optimize further with `std::memcpy`, but there wasn't any noticeable difference, so I gave up on those changes (I guess that `std::memmove` only costs a few additional comparisons when the ranges don't overlap). I would have contributed my improvements back, but since they heavily use C++11 and C++14 features I didn't bother :/
Now I would be interested in comparing the relative speed of our improvements to see whether I could steal some parts of yours. I will benchmark those later to see whether there is a noticeable difference.
> Concerning your issue with Travis: the Linux Clang uses libstdc++

Yeah, it's not clear to me which version of libstdc++ is being used, though. The messages indicate that gcc 4.8.2 is installed in the environment, but the `__GLIBCXX__ < 20150422` check should disable C++11 if compiling against a libstdc++ older than the one that shipped with gcc 5.1. So I'm guessing it's compiled against a different STL than the gcc one. Unless `_LIBCPP_VERSION` is also set?
libstdc++ got `std::is_trivially_copyable` a bit later than the other `<type_traits>`, but it was still in 2014, so I would expect it to be available at that `__GLIBCXX__` level: https://gcc.gnu.org/viewcvs/gcc/trunk/libstdc%2B%2B-v3/include/std/type_traits?view=patch&r1=216032&r2=216031&pathrev=216032
> I guess that std::memmove only costs a few additional comparisons when the ranges don't overlap

Well, at least the glibc implementation always copies backwards in the case that `src < dst`. Whether that has any impact I don't know -- probably not.
Basically, performance-wise:
- Replacing the `back_inserter` with `std::move()`: an obvious improvement once you get rid of that per-element loop. `timsort<int>` gets ~9% faster.
- ...Then replacing the `vector` with a custom buffer: a much smaller win, but still seems statistically significant -- probably another ~2% or so.
- ...The `std::move()` -> `memcpy()` change was the last thing I did. I wouldn't swear that it's faster, but over a bunch of manual runs of `bench` it seemed to give slightly better numbers on average. This one was a sub-1% improvement though, so it's almost noise.
I left the `memcpy()` change in since there isn't any theoretical reason that it wouldn't be faster. My only concern is whether it's safe to do `&*in_begin` to convert the iterator to a pointer for any `RandomAccessIterator`. I think it's probably OK here -- most of the caveats about that concern the case where the iterator is pointing at `end()` and a debug STL build complains in `operator*()`, but that's not an issue here since `in_begin` is always pointing to a real input element. In theory an iterator that returns a proxy object might not work, but then is it likely that `std::is_trivially_copyable<std::iterator_traits<T>::value_type>` is true anyway?
OK, still busted in Travis. I'll have to set up an Ubuntu 14.04.5 LTS VM when I get a chance and see if it replicates there with clang.
OK, installed a VM. The issue is fairly clear.
By default 14.04 LTS comes with a libstdc++ snapshot from gcc 4.8 (`4.8.4-2ubuntu1~14.04.4`). However, because they packaged it themselves, it has `__GLIBCXX__` set to `20150426`.
I know you're not supposed to rely on the exact value of `__GLIBCXX__`, but it's not clear what the best way is to do this feature detection for a header-only library. I'll dig around and see if I can find something reasonable to key off of.
It seems the documented way of doing this is to rely instead on `_GLIBCXX_RELEASE`, but that only exists starting with gcc 7.1, so that would have the unfortunate effect of denying C++11 usage to gcc 5/6.
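For reference, the "documented" shape would look something like the sketch below, combined with the existing date check. This is only an illustration of the tradeoff, not what the PR ends up doing, since it still can't distinguish Ubuntu's repackaged 4.8 headers from real gcc 5.1 headers:

```cpp
#if defined(_GLIBCXX_RELEASE)
   /* gcc >= 7.1: the release macro exists, and those headers have the C++11 traits */
#  define GFX_TIMSORT_USE_CXX11 1
#elif defined(__GLIBCXX__) && __GLIBCXX__ >= 20150422
   /* older headers: fall back to the dated macro (gcc 5.1's header date) */
#  define GFX_TIMSORT_USE_CXX11 1
#else
#  define GFX_TIMSORT_USE_CXX11 0
#endif
```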
OK, it took some ugly hacks, but it now rejects the incomplete C++11 support in the 5-year-old Ubuntu headers that Travis is using, letting the CI tests pass.
> I only replaced the big vector by a `std::unique_ptr` and leveraged `std::move` to get the free optimizations for trivially copyable types. I once tried to optimize more to use `std::memcpy` but there wasn't any noticeable difference so I gave up on those changes

Depends on what optimization level your compiler is at. Did you test debug mode? Debug speed is just as important for many fields, including gaming. It's also unclear whether you're comparing std::move to memcpy here, or memcpy to memmove.
Some more context: https://www.plflib.org/blog.htm#notmystdfilln
> 1. The C++11 support was broken on libc++ (which is the default STL for most clang installs, including Xcode) The changes @mattreecebentley made to auto-detect C++11 were mostly good but they specifically rejected anything with `_LIBCPP_VERSION` set. Not sure why.

Thanks for that - I think at the time there was no type traits support in libc++, that's why. But thanks for updating me; now I have to update a bunch of my code too.
> Some more context: https://www.plflib.org/blog.htm#notmystdfilln

That's interesting. At least on libc++ it doesn't seem to be the case that one is faster than the other. With `-std=c++11 -O2 -march=native`:
```
------------------------------------------------------
Benchmark           Time             CPU     Iterations
------------------------------------------------------
memory_memcpy    8350 ns         8344 ns          82896
memory_memmove   8341 ns         8338 ns          84025
memory_copy      8352 ns         8347 ns          83763
memory_move      8365 ns         8360 ns          83269
```
With `-O0` they're a few percent slower but, again, all basically the same. Probably it's using generic `mem*()` implementations?
```
memory_memcpy    8753 ns         8744 ns          78766
memory_memmove   8733 ns         8724 ns          80247
memory_copy      8726 ns         8720 ns          79595
memory_move      8719 ns         8714 ns          79808
```
Now if you compiled with an STL that didn't understand modern `<type_traits>` you might see some wider differences... although it's still possible an optimization phase would detect the copy loop and replace it.
> Some more context: https://www.plflib.org/blog.htm#notmystdfilln
>
> That's interesting. At least on libc++ it doesn't seem to be the case that one is faster than the other. With `-std=c++11 -O2 -march=native`:

Yes, I noted in the blog that clang optimizes this at -O2; I didn't realise it did it at -O0 as well. The main point is, you can't rely on the compiler. In GCC it only optimized at -O3, but I brought it up and it may be fixed in a future revision.
As far as I know, every STL that supports C++14 (which is my own target) already has the type traits and uses compile-time dispatch to optimize `std::copy` and `std::move` down to `std::memcpy` and `std::memmove` when the iterator types can be treated as pointers to trivially copyable types, so apart from an overlap check with `std::memmove` there shouldn't be much of a difference. The difference wrt `std::fill` vs. `std::memset` in the blog post is explained by the fact that libstdc++ only optimizes `std::fill` to `std::memset` for character types (there is probably a reason but I don't know which); moreover, libc++ recently removed the optimization for `std::fill` because their built-in `memset` wasn't `constexpr` (but they might bring the optimization back with `std::is_constant_evaluated`).
If you want cool memory optimizations, tvanslyke/timsort-cpp implements more of them than usual algorithms, but as far as I know, the whole `reinterpret_cast` gig in the project is UB by the standard, despite the author taking great care to avoid alignment issues.
Yes, I agree with @Morwenn -- as long as the STL is taking advantage of C++11 `<type_traits>`, then the `std::copy()` -> `memmove()` transformation should always happen before the optimizer even sees the code.
A `std::fill` -> `memset()` transformation requires the compiler to actually detect that the int-wise initializing loop can be replaced with a byte-wise `memset()` (which is dependent on the `int` being a constant matching a particular pattern), so you won't get that at `-O0`.
That tvanslyke timsort version is interesting... unfortunately I need a more portable C++ version (fast on C++11, but still working on other standards), which is why I'm focussed on improving this gfx one. His changes to the binary_sort are interesting; I was looking in the same area for possible future work.
Like I said, it wasn't an issue with libstdc++, it was an issue with GCC. Cheers-