range-v3
range-v3 copied to clipboard
Major Performance Issues with view::concat
Hello, during the course of adding some ranges views to my code I started suddenly to see a major performance regression. After a lot of testing I reduced it down to the following. I have a "no-op" pipeable that has three possible implementations, each of which should be equivalent for the small ranges I'm dealing with:
auto my_pipe = ranges::make_pipeable( []( auto r ) {
return r; // (1)
return view::concat( r ); // (2)
return r | view::take( 10000000000 ); // (3)
} );
and this will get used many times in an inner loop with r
having ~10 elements. What surprised me is that (1) and (3) are fast, whereas (2) seems to cause a major slowdown. I am using a recent clang trunk compiler with optimizations off (I demand that my code be reasonably performant without them). Any ideas?
Thanks
Here are some profiling numbers and a standalone reproducer:
int counter = 0;
vector<int> v{1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
auto my_pipe = rg::make_pipeable( []( auto r ) {
// choose one of the following:
return r; // (1)
return view::concat( r ); // (2)
return r | view::take( 10000000000 ); // (3)
} );
auto under_test = [&] {
for( auto e : v | my_pipe )
counter++;
};
for( int i = 0; i < 4000000; ++i )
under_test();
cout << counter; // correctly prints 40,000,000 in all cases
Times on my machine:
(1): ~2.5 seconds
(2): ~36 seconds (!!)
(3): ~2.5 seconds
Info about my setup:
- recent clang trunk with -std=c++2a
- linux64 debug build / no optimizations
- range-v3 version 64467567acad438195ad25ca13a63e134f476d65
A few extra points:
- with optimizations at -O2 all three cases perform the same (and much faster)
- same results with gcc 8 for both debug/release
So it appears that this issue does not surface in optimized builds, but I am hoping that someone can shed some light on it anyway because, even for a debug build, it seems abnormally slow, to the point that I will have to remove usage of it from my code because for me it's too slow even for debug.
concat
is much harder for a compiler to optimize. It iterates one range at a time, so the concat
iterator only references one subrange's iterator at a time. In the interest of keeping concat
iterators small, it stores them in a variant
. variant
s do type punning, which makes optimization difficult.
Certainly in this case when you are "concat
"-ing only one range, there is no need for a variant. So that is one potential optimization. Beyond that, though, the alternative is between fatter concat
iterators (not using a variant
) and slower debug (and possibly even release) perf. It's a judgement call.
Would it be feasible/desirable to treat the 1,2 argument cases specially and have them use fat iterators, then beyond that, just use the variant? Do I understand correctly that fat iterators would be more performant in both debug and release builds? If so then that approach would be nice for my use cases, though I'm not sure what the most common number of arguments will be to this function general use cases.
In any case, I will leave this to the experts to decide, just wanted to point it out. Thanks!
A couple of points:
-
There's never a good reason to write
view::concat(single_range)
; its result is equal toview::all(single_range)
except that it takes 10-100x as much compiletime and/or runtime. We should optimize this in the library by havingview::concat(single_range)
returnview::all(single_range)
- sincesingle_range
could e.g. be the result of expanding a pack of one ranges - but we haven't yet done so. -
The lambda from your repro:
[]( auto r ) { return r; // (1) return view::concat( r ); // (2) return r | view::take( 10000000000 ); // (3) }
returns dangling
view
s that adaptr
. When you create aview
of an lvaluerange
that is not a model ofview
- we refer to suchrange
s as "containers" sometimes, despite that the formal definition of "container" from the C++ Standard applies only to a subset of these types - you are responsible for ensuring that the lifetime of the lvaluerange
encompasses the lifetime of theview
. We generally recommend takingrange
arguments by forwarding reference:[]( auto&& r ) { return std::forward<decltype(r)>( r ); // (1) return view::concat( std::forward<decltype(r)>( r ) ); // (2) return std::forward<decltype(r)>( r ) | view::take( 10000000000 ); // (3) }
which makes it harder to make such lifetime errors since the library will refuse to create views of rvalue "containers".
Hi Casey, thanks for pointing that out!
And just to clarify, in my actual code I am calling concat
on two ranges (I agree there is no need to concat
a single range). I thought I would reduce it to a single range in the reproducer just to make things simpler, given that it also demonstrates the performance issue.
I might venture to guess that calling concat
with two arguments would be the most common use case, no? If so, perhaps it would make sense (if feasible) to treat that case specially and use fat iterators instead of the variant?
Thanks for that optimization, I bet that will help improve the performance of generic code that might end up concat
'ing a single range.
Since my original issue still remains (slow performance of concat
with multiple arguments) I have done some more profiling for that situation with results below. TL;DR: views::concat
with two arguments is staggeringly slower than the equivalent "manual" concatenation:
template<typename Rng1, typename Rng2>
int concat_using_range_v3( Rng1&& r1, Rng2&& r2 ) {
int sum = 0;
for( auto i : views::concat( FWD( r1 ), FWD( r2 ) ) )
sum += i;
return sum;
}
template<typename Rng1, typename Rng2>
int concat_manually( Rng1&& r1, Rng2&& r2 ) {
int sum = 0;
for( auto i : FWD( r1 ) ) sum += i;
for( auto i : FWD( r2 ) ) sum += i;
return sum;
}
int to_profile() {
vector<int> v1{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
vector<int> v2{10, 11, 12, 13, 14, 15, 16, 17, 18, 19};
int sum = 0;
for( int i = 0; i < 1000000; ++i ) {
// Choose one below.
sum += concat_using_range_v3( v1, v2 );
sum += concat_manually( v1, v2 );
}
return sum;
}
Upon running the to_profile
function (which correctly returns 190000000
in all cases) and timing it in isolation:
Debug Build:
-
concat_using_range_v3
: 18000ms (!!) -
concat_manually
: 270ms
Release (-O3
) Build:
-
concat_using_range_v3
: 17ms -
concat_manually
: 0ms
I think ideally the performance of these two cases should be roughly the same. Can anything be done about this?
Not sure it's better to put here or elsewhere, but I'm running into similar problems with slow concat (on -O3). Here's some fairly simple code to reproduce:
#include <range/v3/view.hpp>
#include <variant>
#include <iostream>
constexpr bool use_concat = true;
constexpr bool use_range_value = true;
typedef std::array<int,1> type_0;
typedef std::array<int,2> type_1;
int main(int argc, char** argv) {
size_t counter = 0;
if (argc != 3) {
std::cout << "This program takes 2 runtimes arguments:\n\t- The range start \n\t- The range end\n";
return -1;
}
size_t start = atoi(argv[1]);
size_t end= atoi(argv[2]);
const auto view_0 = ranges::views::iota(start, end) |
ranges::views::transform([](size_t && x){
type_0 rc;
rc[0] = x;
return(rc);
});
const auto view_1 = ranges::views::iota(start, end) |
ranges::views::transform([](size_t && x){
type_1 rc;
rc[0] = x;
rc[1] = x;
return(rc);
});
if constexpr (use_concat){
for (auto&& e : ranges::views::concat(view_0, view_1)) {
std::visit(
[&counter](const auto&x){
if constexpr (use_range_value){
counter += x[0];
} else {
++counter;
}
}, e);
}
} else {
for (auto&& e : view_0) {
if constexpr (use_range_value){
counter += e[0];
} else {
++counter;
}
}
for (auto&& e : view_1) {
if constexpr (use_range_value){
counter += e[0];
} else {
++counter;
}
}
}
std::cout << counter << "\n";
}
use_concat | use_range_value | time (ms) |
---|---|---|
true | true | ~11800 |
true | false | ~11800 |
false | true | ~630 |
false | false | ~1 |
I realize use_range_value=false
is not a reasonable use case, but it seems like it might help clarify what's happening.
I ran all of this using clang version 11.0.2
, with -O3 -std=c++20
and -O3 -std=c++17
to similar results