cpp-TimSort icon indicating copy to clipboard operation
cpp-TimSort copied to clipboard

Replace std::vector for better speed; also fix clang C++11

Open mitchblank opened this issue 6 years ago • 14 comments

First, sorry that this PR contains more than one thing. I have a few other changes I want to propose for timsort and I'll try to keep those in standalone PRs. Because of the need for C++11 support to test this I ended up having to fix that bug as part of this performance work. I can break the libc++ changes out into a different PR if you'd rather merge that as a separate PR first.

In my application I need to sort a vector of pointers. I was profiling timsort and was surprised at how much time was being spent in copy_to_tmp() This method was implemented as:

tmp_.clear();
tmp_.reserve(len);
GFX_TIMSORT_MOVE_RANGE(begin, begin + len, std::back_inserter(tmp_));

Unfortunately this performs badly on simple types like pointers. The GFX_TIMSORT_MOVE_RANGE() macro itself can reduce to a single memmove() but using std::back_inserter (which is required for move-only types) breaks this optimization. Instead of a nice SSE-optimized memory copy you end up with an element-by-element loop with a vector capacity check each time.

As an experiment I did some metaprogramming so that TriviallyCopyable types would just do:

tmp_.assign(begin, begin + len);

This basically fixed the problem, but it's still not quite perfect, since the non-trivial case still is doing extra capacity checks that aren't required. I did a more aggressive fix that replaced the std::vector use entirely with a custom data structure (since we don't really need a general purpose vector here, just a managed array) which bought another couple percent in speed.

First I had to make two preparatory changes, though:

  1. The C++11 support was broken on libc++ (which is the default STL for most clang installs, including Xcode) The changes @mattreecebentley made to auto-detect C++11 were mostly good but they specifically rejected anything with _LIBCPP_VERSION set. Not sure why.

Rather than one large binary expression I instead used a larger (but hopefully easier to understand) ifdef decision tree. This can also be overridden on the command-line with -DGFX_TIMSORT_USE_CXX11=[0|1] As before we default to using C++11 constructs unless we see some evidence that it won't work. However we now let modern versions of clang/libc++ pass.

  1. The parts of the TimSort class that don't very based on LessFunction are now in their own set of classes such as TimSortState

This is partially just a cleanup I needed to make some template metaprogramming less gross. However, it's a good idea in any case. It's not unusual for a program to need to sort a type of data multiple ways, which means expanding the TimSort<RandomAccessIterator,LessFunction> template multiple times. With this change, the compiler can reuse the expansion of TimSortState<RandomAccessIterator> between them. If nothing else, this should compile faster.

Now with those things out of the way, I could get to the meat of the change: replacing the std::vector<value_t> tmp_; with a custom TimSortMergeSpace<> type.

This class allocates itself like a vector, but the only "setter" is a move_in() method that replaces its contents via move construction (similar to what the std::back_inserter loop was doing) We don't construct elements before they're needed (even if we allocated them) so it will work even for non-default-constructable types. The move-loop is faster than before since we don't need to re-check for capacity at every insertion.

However, on C++11 we do even better: we use template-specialization to provide an alternate implementation of this data type for types that pass std::is_trivially_copyable<>. The big advantage is that we can just use std::memcpy() to refill the merge buffer. The code is also simpler in general since we don't need to worry about construction/destruction of the buffer elements.

Since a lot of the overall cost of the timsort algorithm is spent merging, making this data structure as fast as possible is important. This change makes sorting randomized sequences about 10% faster when working with trivially-copyable types.

While I was there I also replaced the std::vector<run> pending_ with my own TimSortRunStack<> type. This doesn't have the same importance for performance, but it's another place where we don't really need the full STL vector support... just a simple resizable stack. Since I was replacing one vector I thought it was more consistent to just replace both. This also removes the header dependency on <vector>

RESULTS: "make bench" on Xcode 10.1, Mac Pro:

  • RANDOMIZED SEQUENCE [int] size=100K Before: 0.851 After: 0.745

  • RANDOMIZED SEQUENCE [std::string] size=100K Before: 5.389 After: 3.735

The improvement with int is due to the vector replacement. The bigger improvement with std::string is making C++11 work with the libc++ STL so that move optimizations get applied.

mitchblank avatar Jan 14 '19 01:01 mitchblank

Huh, it looks like the linux+clang environment (not sure which STL) that Travis is using doesn't have the type traits stuff in the expected place:

clang++ -I. -I/opt/brew/include -L/opt/brew/lib -Wall -Wextra -g  -std=c++11 test/test.cpp -lboost_unit_test_framework -o .bin/test-with-cpp11
In file included from test/test.cpp:11:
./timsort.hpp:430:61: error: no member named 'is_trivially_copyable' in
      namespace 'std'

...which is odd since I'm including <type_traits>. I'll have to investigate that a little more, although I might not have time until next weekend at the earliest. Hopefully others can look through this PR in the meantime even if it can't be merged yet.

mitchblank avatar Jan 14 '19 01:01 mitchblank

Concerning your issue with Travis: the Linux Clang uses libstdc++ so you need to install a matching g++ along with Clang.

Otherwise I had made similar changes to the timsort implementation based of this one that I ship in cpp-sort: https://github.com/Morwenn/cpp-sort/blob/master/include/cpp-sort/detail/timsort.h

I only replaced the big vector by a std::unique_ptr and leveraged std::move to get the free optimizations for trvially copyable types. I once try to optimize more to use std::memcpy but there wasn't any noticeable difference so I gave up on those changes (I guess that std::memmove only costs a few additional comparisons when the ranges don't overlap). I would have contributed my improvements back but since they heavily use C++11 and C++14 features I didn't bother :/

Now I would be interested in comparing the relative speed of our improvements to see whether I could steal some parts of yours. I will benchmark those later to see whether there is a noticeable difference.

Morwenn avatar Jan 14 '19 08:01 Morwenn

Concerning your issue with Travis: the Linux Clang uses libstdc++

Yeah, it's not clear to me what version of libstdc++ is being used though. The messages indicate that gcc 4.8.2 is installed in the environment but the __GLIBCXX__ < 20150422 check should disable C++11 if compiling against a libstdc++ older than the one that shipped with gcc 5.1. So I'm guessing it's compiled against a different STL than the gcc one. Unless _LIBCPP_VERSION is also set?

libstdc++ got std::is_trivally_copyable a bit later than the other <type_traits> but it was still in 2014 and so I would expect them to be available at that __GLIBCXX__ level: https://gcc.gnu.org/viewcvs/gcc/trunk/libstdc%2B%2B-v3/include/std/type_traits?view=patch&r1=216032&r2=216031&pathrev=216032

mitchblank avatar Jan 14 '19 11:01 mitchblank

I guess that std::memmove only costs a few additional comparisons when the ranges don't overlap)

Well at least the glibc implementation always copies backwards in the case that src<dst Whether that has any impact I don't know -- probably not.

Basically performance-wise:

  • Replacing the back_inserter with std::move(): obvious improvement once you get rid of that per-element loop. timsort<int> gets ~9% faster
  • ...Then replacing the vector with a custom buffer: a much smaller win, but still seems statistically significant.. probably another ~2% or so
  • ...The std::move()->memcpy() change was the last thing I did. I wouldn't swear that it's faster, but over a bunch of manual runs of bench it seemed to give slightly better numbers on average. This one was a sub-1% improvement though, so it's almost noise.

I left the memcpy() change since there isn't any theoretical reason that it wouldn't be faster. My only concern is whether it's safe to do &*in_begin to convert the iterator to a pointer for any RandomAccessIterator

I think it's probably OK here -- most of the caveats about that concern the case where the iterator is pointing at end() and a debug STL build complains in operator*(), but that's not an issue here since in_begin is always pointing to a real input element. In theory though an iterator that returned a proxy object might not work though... but then is it likely that std::is_trivally_copyable<std::iterator_traits<T>::value_type> is true anyways?

mitchblank avatar Jan 14 '19 11:01 mitchblank

OK, still busted in travis. I'll have to set up a Ubuntu 14.04.5 LTS VM when I get a chance and see if it replicates there with clang.

mitchblank avatar Jan 14 '19 12:01 mitchblank

OK, installed a VM. Issue is fairly clear.

By default 14.04LTS came with a libstdc++ snapshot from gcc 4.8 (4.8.4-2ubuntu1~14.04.4) However because they packaged it themselves it has __GLIBCXX__ set to 20150426

I know you're not supposed to rely on the exact value of __GLIBCXX__, but it's not clear what the best way to do this feature detection for a header-only library. I'll dig around and see if I can find something reasonable to key off of.

mitchblank avatar Jan 14 '19 12:01 mitchblank

It seems the documented way of doing this is to rely instead on _GLIBCXX_RELEASE, but that only exists starting at gcc 7.1, so that would have the unfortunate effect of denying c++11 usage to gcc 5/6

mitchblank avatar Jan 14 '19 12:01 mitchblank

OK, it took some ugly hacks, but it now rejects the incomplete C++11 support in the 5-year-old Ubuntu headers that Travis is using, letting the CI tests pass.

mitchblank avatar Jan 14 '19 13:01 mitchblank

I only replaced the big vector by a std::unique_ptr and leveraged std::move to get the free optimizations for trvially copyable types. I once try to optimize more to use std::memcpy but there wasn't any noticeable difference so I gave up on those changes

Depends what optimization level your compiler is at. Did you test debug mode? Debug speed is just as important for many fields including gaming. Unclear whether you're talking about comparing using std::move to memcpy here or memcpy to memmove.

Some more context: https://www.plflib.org/blog.htm#notmystdfilln

1. The C++11 support was broken on libc++ (which is the default STL for most clang installs, including Xcode)  The changes @mattreecebentley made to auto-detect C++11 were mostly good but they specifically rejected anything with `_LIBCPP_VERSION` set.  Not sure why.

Thanks for that - I think at the time there was no type traits support in libcpp, that's why. But thanks for updating me, now I have to update a bunch of my code too.

mattreecebentley avatar Jan 14 '19 22:01 mattreecebentley

Some more context: https://www.plflib.org/blog.htm#notmystdfilln

That's interesting. At least on libc++ it doesn't seem to be the case that one is faster than the other. With -std=c++11 -O2 -march=native:

------------------------------------------------------
Benchmark               Time           CPU Iterations
------------------------------------------------------
memory_memcpy        8350 ns       8344 ns      82896
memory_memmove       8341 ns       8338 ns      84025
memory_copy          8352 ns       8347 ns      83763
memory_move          8365 ns       8360 ns      83269

With -O0 they're a few percent slower but, again, all basically the same. Probably it's using generic mem*() implementations?

memory_memcpy        8753 ns       8744 ns      78766
memory_memmove       8733 ns       8724 ns      80247
memory_copy          8726 ns       8720 ns      79595
memory_move          8719 ns       8714 ns      79808

Now if you compiled with an STL that didn't understand modern type_traits you might see some wider differences... although it's still possible an optimization phase would detect the copy-loop and replace it.

mitchblank avatar Jan 15 '19 01:01 mitchblank

Some more context: https://www.plflib.org/blog.htm#notmystdfilln

That's interesting. At least on libc++ it doesn't seem to be the case that one is faster than the other. With -std=c++11 -O2 -march=native:

Yes, I noted in the blog that clang optimizes this at -02, didn't realise it did it at -O0 as well. The main point is, you can't rely on the compiler. In GCC it only optimized at -O3, but I brought it up and it may be fixed in a future revision.

mattreecebentley avatar Jan 15 '19 06:01 mattreecebentley

As far as I know every STL that supports C++14 (which is my own target) already has the type traits and uses compile-time dispatch to optimize std::copy and std::move down to std::memcpy and std::memmove when the iterator types can be treated as pointers to trivially copyable types, so the apart from an overlap check with std::memmove there shouldn't be much of a difference. The difference wrt std::fill vs. std::memset in the blog post is explained by the fast that libstdc++ only optimizes std::fill to std::memset for character types (there is probably a reason but I don't know which), moreover libc++ recently removed the optimization for std::fill because their built-in memset wasn't constexpr (but they might bring the optimization back with std::is_constant_evaluated).

If you want cool memory optimizations, tvanslyke/timsort-cpp implements more of them than usual algorithms, but as far as I know, the whole reinterpret_cast gig in the project is UB by the standard despite the author taking great care to avoid alignment issues.

Morwenn avatar Jan 15 '19 08:01 Morwenn

Yes, I agree with @Morwenn -- as long as the STL is taking advantage of C++11 <type_traits>, then the std::copy()->memmove() transformation should always happen before the optimizer even sees the code.

A std::fill->memset() transformation requires the compiler to actually detect the int-wise initializing loop can be replaced with a byte-wise memset() (which is dependent on the int being a constant matching a particular pattern), so you won't get that at -O0

That tvanslyke timsort version is interesting.. unfortunately I need a more portable C++ version (fast on C++11, but still works on others) which is why I'm focussed on improving this gfx one. His changes to the binary_sort are interesting; I was looking in the same area for possible future work.

mitchblank avatar Jan 15 '19 13:01 mitchblank

Like I said, it wasn't an issue with libstdc++, it was an issue with GCC. Cheers-

mattreecebentley avatar Jan 15 '19 21:01 mattreecebentley