range-v3 Major Performance Issues with view::concat

Hello, during the course of adding some ranges views to my code I started suddenly to see a major performance regression. After a lot of testing I reduced it down to the following. I have a "no-op" pipeable that has three possible implementations, each of which should be equivalent for the small ranges I'm dealing with:

auto my_pipe = ranges::make_pipeable( []( auto r ) { 
  return r; //  (1)
  return view::concat( r ); //  (2)
  return r | view::take( 10000000000 ); //  (3)
} );

and this will get used many times in an inner loop with r having ~10 elements. What surprised me is that (1) and (3) are fast, whereas (2) seems to cause a major slowdown. I am using a recent clang trunk compiler with optimizations off (I demand that my code be reasonably performant without them). Any ideas?

Thanks

Mar 11 '19 14:03 no-more-secrets

Here are some profiling numbers and a standalone reproducer:

int counter = 0;
vector<int> v{1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

auto my_pipe = rg::make_pipeable( []( auto r ) {
  // choose one of the following:
  return r;                             //  (1)
  return view::concat( r );             //  (2)
  return r | view::take( 10000000000 ); //  (3)
} );

auto under_test = [&] {
  for( auto e : v | my_pipe )
    counter++;
};

for( int i = 0; i < 4000000; ++i )
  under_test();

cout << counter;  // correctly prints 40,000,000 in all cases

Times on my machine:

(1): ~2.5 seconds
(2): ~36 seconds (!!)
(3): ~2.5 seconds

Info about my setup:

recent clang trunk with -std=c++2a
linux64 debug build / no optimizations
range-v3 version 64467567acad438195ad25ca13a63e134f476d65

Mar 11 '19 16:03 no-more-secrets

A few extra points:

with optimizations at -O2 all three cases perform the same (and much faster)
same results with gcc 8 for both debug/release

So it appears that this issue does not surface in optimized builds, but I am hoping that someone can shed some light on it anyway because, even for a debug build, it seems abnormally slow, to the point that I will have to remove usage of it from my code because for me it's too slow even for debug.

Mar 11 '19 21:03 no-more-secrets

concat is much harder for a compiler to optimize. It iterates one range at a time, so the concat iterator only references one subrange's iterator at a time. In the interest of keeping concat iterators small, it stores them in a variant. variants do type punning, which makes optimization difficult.

Certainly in this case when you are "concat"-ing only one range, there is no need for a variant. So that is one potential optimization. Beyond that, though, the alternative is between fatter concat iterators (not using a variant) and slower debug (and possibly even release) perf. It's a judgement call.

Sep 12 '19 17:09 ericniebler

Would it be feasible/desirable to treat the 1,2 argument cases specially and have them use fat iterators, then beyond that, just use the variant? Do I understand correctly that fat iterators would be more performant in both debug and release builds? If so then that approach would be nice for my use cases, though I'm not sure what the most common number of arguments will be to this function general use cases.

In any case, I will leave this to the experts to decide, just wanted to point it out. Thanks!

Sep 12 '19 17:09 no-more-secrets

A couple of points:

There's never a good reason to write view::concat(single_range); its result is equal to view::all(single_range) except that it takes 10-100x as much compiletime and/or runtime. We should optimize this in the library by having view::concat(single_range) return view::all(single_range) - since single_range could e.g. be the result of expanding a pack of one ranges - but we haven't yet done so.
The lambda from your repro:
```
[]( auto r ) { 
  return r; //  (1)
  return view::concat( r ); //  (2)
  return r | view::take( 10000000000 ); //  (3)
}
```
returns dangling views that adapt r. When you create a view of an lvalue range that is not a model of view - we refer to such ranges as "containers" sometimes, despite that the formal definition of "container" from the C++ Standard applies only to a subset of these types - you are responsible for ensuring that the lifetime of the lvalue range encompasses the lifetime of the view. We generally recommend taking range arguments by forwarding reference:
```
[]( auto&& r ) { 
  return std::forward<decltype(r)>( r ); //  (1)
  return view::concat( std::forward<decltype(r)>( r ) ); //  (2)
  return std::forward<decltype(r)>( r ) | view::take( 10000000000 ); //  (3)
}
```
which makes it harder to make such lifetime errors since the library will refuse to create views of rvalue "containers".

Sep 12 '19 18:09 CaseyCarter

Hi Casey, thanks for pointing that out!

And just to clarify, in my actual code I am calling concat on two ranges (I agree there is no need to concat a single range). I thought I would reduce it to a single range in the reproducer just to make things simpler, given that it also demonstrates the performance issue.

I might venture to guess that calling concat with two arguments would be the most common use case, no? If so, perhaps it would make sense (if feasible) to treat that case specially and use fat iterators instead of the variant?

Sep 12 '19 19:09 no-more-secrets

Thanks for that optimization, I bet that will help improve the performance of generic code that might end up concat'ing a single range.

Since my original issue still remains (slow performance of concat with multiple arguments) I have done some more profiling for that situation with results below. TL;DR: views::concat with two arguments is staggeringly slower than the equivalent "manual" concatenation:

template<typename Rng1, typename Rng2>
int concat_using_range_v3( Rng1&& r1, Rng2&& r2 ) {
  int sum = 0;
  for( auto i : views::concat( FWD( r1 ), FWD( r2 ) ) )
    sum += i;
  return sum;
}

template<typename Rng1, typename Rng2>
int concat_manually( Rng1&& r1, Rng2&& r2 ) {
  int sum = 0;
  for( auto i : FWD( r1 ) ) sum += i;
  for( auto i : FWD( r2 ) ) sum += i;
  return sum;
}

int to_profile() {
  vector<int> v1{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
  vector<int> v2{10, 11, 12, 13, 14, 15, 16, 17, 18, 19};

  int sum = 0;

  for( int i = 0; i < 1000000; ++i ) {
    // Choose one below.
    sum += concat_using_range_v3( v1, v2 );
    sum += concat_manually( v1, v2 );
  }
  return sum;
}

Upon running the to_profile function (which correctly returns 190000000 in all cases) and timing it in isolation:

Debug Build:

concat_using_range_v3: 18000ms (!!)
concat_manually: 270ms

Release (-O3) Build:

concat_using_range_v3: 17ms
concat_manually: 0ms

I think ideally the performance of these two cases should be roughly the same. Can anything be done about this?

Sep 13 '19 16:09 no-more-secrets

Not sure it's better to put here or elsewhere, but I'm running into similar problems with slow concat (on -O3). Here's some fairly simple code to reproduce:

#include <range/v3/view.hpp>
#include <variant>
#include <iostream>


constexpr bool use_concat = true;
constexpr bool use_range_value = true;

typedef std::array<int,1> type_0;
typedef std::array<int,2> type_1;


int main(int argc, char** argv) {
  size_t counter = 0;
  if (argc != 3) {
    std::cout << "This program takes 2 runtimes arguments:\n\t- The range start \n\t- The range end\n";
    return -1;
  }
  size_t start = atoi(argv[1]);
  size_t end= atoi(argv[2]);

  const auto view_0 = ranges::views::iota(start, end) |
    ranges::views::transform([](size_t && x){
      type_0 rc;
      rc[0] = x;
      return(rc);
    });
  
  const auto view_1 = ranges::views::iota(start, end) |
    ranges::views::transform([](size_t && x){
      type_1 rc;
      rc[0] = x;
      rc[1] = x;
      return(rc);
    });
  
  if constexpr (use_concat){
    for (auto&& e : ranges::views::concat(view_0, view_1)) {
      std::visit(
        [&counter](const auto&x){
          if constexpr (use_range_value){
            counter += x[0];
          } else {
            ++counter;
          }
        }, e);
    }
  } else {
    for (auto&& e : view_0) {
      if constexpr (use_range_value){
        counter += e[0];
      } else {
        ++counter;
      }
    }
    for (auto&& e : view_1) {
      if constexpr (use_range_value){
        counter += e[0];
      } else {
        ++counter;
      }
    }
  }

  std::cout << counter << "\n";
}

use_concat	use_range_value	time (ms)
true	true	~11800
true	false	~11800
false	true	~630
false	false	~1

I realize use_range_value=false is not a reasonable use case, but it seems like it might help clarify what's happening.

I ran all of this using clang version 11.0.2, with -O3 -std=c++20 and -O3 -std=c++17 to similar results

Feb 22 '21 20:02 jkamins7

range-v3 range-v3 copied to clipboard

Major Performance Issues with view::concat

range-v3
range-v3 copied to clipboard