thrust variadic template implementation of thrust::tuple

thrust::tuple does not use variadic templates. A variadic template implementation of thrust::tuple would not just make the internal implementation more elegant, there are serious practical issues with the current thrust::tuple implementation:

Only tuples of size 10 may be used. Nested tuples are often an inconvenient workaround and are not always feasible.
Compilation speed is poor, which could otherwise be significantly improved by using variadic templates. [1]
Compiler error diagnostics are often unreadable due to thrust::null_type everywhere.

In consideration of the fact that C++11 support in nvcc is still in the release candidate, undocumented, and requires host compiler support, I don't expect C++03 support to be dropped anytime soon. Instead, can a user-supplied compile-time macro (e.g. THRUST_USE_CXX11) or auto-detection of __cplusplus >= 201103L please be considered to switch to an alternate implementation?

[1] http://www.jot.fm/issues/issue_2008_02/article2.pdf

Jul 10 '14 12:07 andrewcorrigan

See also the discussion on the mailing list [1].

[1] https://groups.google.com/d/msg/thrust-users/FKf5n11OuAQ/1tmoQABfqNwJ

Sep 26 '14 15:09 andrewcorrigan

I've begun a variadic branch in my repository. There are numerous tasks that must be accomplished to have a variadic thrust::tuple. I've already begun making progress in my first commit. However, so far that was just low-hanging fruit -- I am neither a Thrust nor C++11 expert, I am just a user in desperate need of a variadic thrust::tuple. I've only just started learning how to implement variadic templates and functions over the past few days, so I welcome guidance or hands-on collaboration on this.

First, the rest of Thrust, needs to be made compatible with the public interface of tuple, and no longer rely on the implementation details of thrust::tuple.

Provide variadic implementations of functions and meta-programming constructs which generalize their current assumption of a tuple of size up to or exactly ten.
Avoid thrust::null_type. In one instance it was simply replaced with thrust::tuple<>, but otherwise, completely eliminating thrust::null_type completely may have to wait until a true variadic tuple is used since null_type is what makes a lot of code/meta-programming with thrust::tuple work.
Avoid tuple::head_type and tuple::tail_type.
Avoid detail::cons.

Then, an equivalent to <tuple> needs to be provided. I am aware currently of two options:

Just #include <tuple> and using std::tuple. This would only work for host code, but could be used for development for the time being, since nvcc does not provide a __device__ aware std::tuple, but perhaps a future release will provide this and it would just be a matter of waiting?
This standalone std::tuple implementation.

I have a question: should tuple_cat be allowed to make use of the internal implementation details of tuple?

IMO, anything provided by #include <tuple> should be allowed to. So for example tuple_cat could use internal implementation detail, but tuple_io would not be allowed to. If ultimately std::tuple is used, then tuple_cat will be provided automatically, so the current implementation-detail-specific tuple_cat would be overridden.
The current tuple_cat in the tuple_io + tuple_cat pull request assumes sized-10 tuples and this is necessary for the current non-variadic thrust::tuple implementation to work.
On the other hand, this standalone std::tuple implementation does not provide tuple_cat.

Finally, I want to mention, that in my branch I am simply replacing the non-variadic code. I don't expect that approach to be taken with mainline Thrust. If and when this code is ready to be merged, I would only expect it to be experimental functionality guarded by __cplusplus >= 201103L or some other Thrust-specific macro that a user must opt-in to.

Jan 26 '15 15:01 andrewcorrigan

I have fixed all the relevant code in the rest of Thrust to support variadic tuples.

I then tried the two approaches to replace non-variadic thrust::tuple. Neither is currently working, although the first is further along, and that is what I have committed to the variadic Thrust branch.

Including #include <tuple> and aliasing into the Thrust namespace. This can be enabled with the current repository by #define THRUST_VARIADIC_TUPLE before #include <thrust/tuple.h>. I am currently stuck on the error:

/thrust/thrust/tuple.h:57:35: Implicit instantiation of undefined template 'std::__1::tuple_size<thrust::detail::tuple_of_iterator_references<int, thrust::device_reference<int> > >'

I tried various things to define tuple_size for tuple_of_iterator_references, but nothing seems to work.

I tried embedding @jaredhoberock's std::tuple implementation, but there is an issue (https://github.com/jaredhoberock/tuple/issues/2), and I am still seeing the same error as above when using std::tuple, so I didn't commit this code. Hopefully long-term this will work out, since then we can annotate tuple with __device__ decorators and have everything working with CUDA.

Jan 27 '15 04:01 andrewcorrigan

Thrust now has variadic tuple support. The tuple implementation from agency did the trick. You need to define THRUST_VARIADIC_TUPLE to use the variadic implementation.

Example:

#include <sstream>
#include <iostream>
#define THRUST_VARIADIC_TUPLE
#include <thrust/tuple.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>

int main(int argc, char** argv)
{
    auto zip11_start =
    thrust::make_zip_iterator(thrust::make_tuple(thrust::make_counting_iterator(1),
                                                 thrust::make_counting_iterator(2),
                                                 thrust::make_counting_iterator(3),
                                                 thrust::make_counting_iterator(4),
                                                 thrust::make_counting_iterator(5),
                                                 thrust::make_counting_iterator(6),
                                                 thrust::make_counting_iterator(7),
                                                 thrust::make_counting_iterator(8),
                                                 thrust::make_counting_iterator(9),
                                                 thrust::make_counting_iterator(10),
                                                 thrust::make_counting_iterator(11) ));

    auto a = *zip11_start;

    std::cout << "C++" << thrust::tuple_size<decltype(a)>::value << " enables tuples of length " << thrust::tuple_size<decltype(a)>::value << std::endl;
    std::cout << "For example: " << a << std::endl;

    auto b = thrust::tuple_cat(a,a);
    std::cout << "Thankfully, we don't have to wait for C++" << thrust::tuple_size<decltype(b)>::value << " for tuples of length " << thrust::tuple_size<decltype(b)>::value << std::endl;
    std::cout << "We already have them right now: " << b << std::endl; 

    return 0;
}

Gives output:

% clang++ -std=c++11 test.cpp -I/Users/acorriga/code/thrust
% ./a.out
C++11 enables tuples of length 11
For example: (1 2 3 4 5 6 7 8 9 10 11)
Thankfully we don't have to wait for C++22 for tuples of length 22
We already have them right now: (1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11)

Known issues:

I defined operator==, operator!=, operator<. Other tuple operators are not defined
std::swap is called from __device__ code.
-std=c++1y does not work. I suspect this has to do with:

             class = typename std::enable_if<
               (sizeof...(Types) == 1) &&
               std::is_constructible<Type1,const UType1&>::value // XXX fill in the rest of these
             >::type>

I have not run the unit tests, but I have run my code which uses most of Thrust with many different ways and everything has held together.

Jan 28 '15 05:01 andrewcorrigan

I have updated my variadic branch [1] using the latest non-recursive, variadic tuple code from [2].

Using the new variadic tuple, my code sped up by 50% when using the cpp backend with clang. I encourage other Thrust users to give this a try. You might be pleasantly surprised by a similar performance boost in your code if you use tuple a lot like my code does!

Known issues: The known issues mentioned previously are resolved. I compiled and ran with g++-4.8, AppleClang 6.0 on the CPU backend, and nvcc 7.0 rc on Mac, and the code works. A new known issue is that compiling on Linux, with GCC triggers an internal compiler error. I just submitted this to NVIDIA with bug-id 1608271:

thrust/thrust/detail/tuple/variadic_tuple.h(395): internal error: assertion failed at: "/dvs/p4/build/sw/rel/gpu_drv/r346/r346_00/drivers/compiler/edg/EDG_4.9/src/expr.c", line 3108

[1] https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/tree/variadic [2] https://github.com/jaredhoberock/tuple

Feb 05 '15 21:02 andrewcorrigan

When will this variadic template be merged into the thrust master branch?

Jul 01 '15 14:07 schiller-manuel

I've been wondering the same thing for a long time. @jaredhoberock any word on a roadmap for C++11, including variadic tuple, support in Thrust?

Jul 01 '15 14:07 andrewcorrigan

If thrust::tuple was variadic, metaprogramming would be a lot easier. @andrewcorrigan I am currently using your variadic branch, it is really working great.

Jul 01 '15 15:07 schiller-manuel

I should mention, I haven't updated the variadic branch since Feb. 5. I am pretty sure that there have been changes and bug fixes since then in my local copy of Thrust based on the latest https://github.com/jaredhoberock/tuple. I should get you the latest version. If you don't see an update from me in the next few days, please ping me.

Jul 01 '15 16:07 andrewcorrigan

@jaredhoberock Will you entertain a pull request for variadic tuple as well?

Jul 08 '15 23:07 andrewcorrigan

Variadic tuple is a huge, risky change. Is there any evidence that it is ready or that the compiler is ready for it?

Here are some questions I'd like to see answered before contemplating how to merge variadic tuple:

What bug fixes exist which have not been merged from jaredhoberock/tuple?
How does compilation performance change after enabling variadic tuple?
How does performance of algorithms change?
Do any unit tests fail?
What is the delta in total lines of code? If the old version of tuple was replaced with the new, would there be a net decrease in code size?

Jul 09 '15 00:07 jaredhoberock

Quick response off the top of my head.

Variadic tuple is a huge, risky change.

What's risky about it? You already implemented variadic tuple, I already got it integrated. We can just leave it optional if you don't want it be the default behavior right away.

Is there any evidence that it is ready or that the compiler is ready for it?

Yes... See above. I demonstrated an example above. We are both aware of a compiler workaround to get this compiling with CUDA 7, and that the compiler bug is fixed in CUDA 7.5.

What bug fixes exist which have not been merged from jaredhoberock/tuple?

I already have them merged in my local copy. If you are interested in moving forward I will push the latest very soon (otherwise, I have limited time to work on stuff that you may or may not be interested in pursuing).

How does compilation performance change after enabling variadic tuple?

Compilation performance is almost exclusively a function of compiler frontend (clang is great, gcc is good, icpc and nvcc are terrible). Qualitatively: It's atrocious either way with nvcc. It's fine either way with clang or gcc on the host side.

How does performance of algorithms change?

My Thrust-based code sped up 50%. That is huge... I am guessing you want more fine-grained comparisons. As per my post here https://github.com/thrust/thrust/pull/488#issuecomment-71380706, myself and others I work with have tried to run the performance tests built-in to Thrust, but with no luck. Would you be willing to help with this? I can get you precise error messages and description of what we tried.

Do any unit tests fail?

As per my post here https://github.com/thrust/thrust/pull/488#issuecomment-71380706, I have not had success getting the performance / unit tests to actually run. Would you be willing to help with this? I can get you precise error messages and description of what we tried.

What is the delta in total lines of code? If the old version of tuple was replaced with the new, would there be a net decrease in code size?

Net decrease.

There is a lot of repetitive code in master due to the lack of variadic templates. See for example this commit: [https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commit/d000fcf7f1302cdaa7b615dc3c903fadb0cbc18f. Variadic tuple [https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/blob/variadic/thrust/detail/tuple/variadic_tuple.h] itself is ~700 lines. This replaces, for example [https://github.com/thrust/thrust/blob/master/thrust/detail/tuple.inl] which is larger.

Jul 09 '15 00:07 andrewcorrigan

If you would care to file a new issue for getting scons to work on Mac, we can discuss the problems you encountered with unit tests there.

Besides the functional correctness of the variadic tuple implementation, I am concerned about the compile time and runtime performance of algorithms which use tuples internally in their implementation. Tuple is ubiquitous within Thrust's implementation. That means that casual users of Thrust who are uninterested in using variadic tuple will be indirectly affected by them. For example, if introducing variadic tuple were to increase the compile time of, say, reduce_by_key for no additional runtime performance benefit, then it is premature to do it.

In order to be convinced that it is safe to do this merge, we need to convince ourselves that this change does no harm.

Jul 09 '15 01:07 jaredhoberock

I want to say that it's wonderful to finally get some form of feedback on all this. I was confused when I went through the trouble of getting this new significant functionality working, observed a significant performance boost on top of that, and yet there was complete silence in response.

Can you please confirm: Is the THRUST_VARIADIC_TUPLE option not viable? In other words, would you prefer that #if __cplusplus >= 201103L, then variadic tuple is used automatically? If we go with the optional approach, then there will be no risk of the performance issues you are concerned about.
I also wonder, how would you want to handle commits like (https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commit/d000fcf7f1302cdaa7b615dc3c903fadb0cbc18f https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commit/47cb40380b87a66d1a7fe230a71380337e30b3c9 https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commit/ca5873209812296100b7b4a1ee8a5d9d465ed1d6)? That code is not the tuple implementation itself, but other code in Thrust that assumes tuple of size 10. Do you want two versions of everything? If it were up to me, I would not, and would want a separate C++11 branch entirely (which is what my personal copy of Thrust is already at this point), and let the C++03 version go into maintenance mode.
What do you want to do in terms of namespace. In https://github.com/jaredhoberock/tuple you put a lot of stuff in std, even when __TUPLE_NAMESPACE is thrust. When I integrated your variadic tuple with Thrust, I changed this so everything goes in thrust, since the rest of Thrust assumes this. What approach do you want to take, put everything in Thrust? Or do you want to clean up the rest of Thrust, so that it uses, for example, std::get instead of thrust::get?

Jul 09 '15 11:07 andrewcorrigan

Yes, as discussed in #624, I do not think a new preprocessor switch is viable.
Yes, all code in Thrust which assumes there is a maximum tuple size needs to be generalized as a corollary to this change. Unless you can think of some other way to achieve the same result, it seems like there needs to be two versions of everything. I do not think a separate branch for C++11 is viable. I think I would duplicate things on a file-by-file basis. I would take the existing file names and suffix them with _cpp03.h and suffix the new stuff with _cpp11.h.
tuple needs to go in namespace thrust. We should just keep it simple for now and simply generalize Thrust's current implementation to use variadics without introducing any new functionality.

Jul 09 '15 18:07 jaredhoberock

Thanks. I will start using file suffixes to handle the differences. If there is anything else you already don't like that we haven't already discussed, or anything you discover if/when you look at the code, can you please let me know?

Aside from any possible performance issues putting things on hold, I want this code to be as ready-to-merge as possible.

Jul 09 '15 19:07 andrewcorrigan

I don't think we've discussed unit tests -- there will need to be new tests which test the functionality of tuples larger than 10 elements. These should go in the tuple-specific unit tests and also the tests for zip_iterator.

Jul 09 '15 20:07 jaredhoberock

What about tuple_io and tuple_cat? We either need to get the open pull request merged (less work on my end), or that also needs to be folded into one massive PR.

Jul 10 '15 11:07 andrewcorrigan

Let's keep the scope of this effort focused on variadic tuple for now in order to maximize the chance of its success.

Jul 10 '15 17:07 jaredhoberock

So then are you going to merge the older PR first? If you don't merge that first, I have to redo a lot of variadic tuple work since variadic tuple was done according to the new directory structure you asked me to follow in the tuple_io PR.

In terms of duplicating things on a file-by-file basis, what about thrust/detail/type_traits.h for example. That is a large file, and only the implementation of and_ and or_ needs to be done using variadic templates. Can that be done with an #ifdef like was done to implement result_of?

Jul 14 '15 21:07 andrewcorrigan

@dachziegel I just pushed to https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/tree/variadic, which includes the latest from https://github.com/jaredhoberock/tuple, as well as a workaround for a bug in CUDA 7.0. See https://github.com/andrewcorrigan/thrust-multi-permutation-iterator/commits/variadic for details. Please let me know if it works for you or not.

Jul 14 '15 21:07 andrewcorrigan

Thanks Andrew, I will try it.

Jul 14 '15 21:07 schiller-manuel

Should this also work with THRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP?

I tried this code: thrust_variadic_tuple_test.cpp

It fails to compile for me using

clang++-3.7 -I/tmp/thrust-variadic -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP -std=c++11 thrust_variadic_tuple_test.cpp

However, using std::tuple (hacky #define thrust std), it compiles.

Jul 14 '15 22:07 schiller-manuel

Yes, I agree that integrating variadic tuple will require quite a bit of work. Nonetheless, I do not wish to condition this pull request on a separate tangentially related request.

The and_ & or_ traits are pretty simple -- why not create variadic versions of them which are local to the implementation of variadic tuple? Once something else requires them, they can be integrated into the type_traits.h file at that time.

Jul 14 '15 23:07 jaredhoberock

Can you please clarify what you want to do about the request you made in https://github.com/thrust/thrust/pull/488#issuecomment-42870381? Which directory structure do you want now?

If you still want the new directory structure, I am going to do a PR with just that reorganization. If not, then I am going to move the contents of detail/tuple back into detail. Either way, the impending variadic tuple PR needs to have a clean diff against master for it to be manageable.

Jul 15 '15 00:07 andrewcorrigan

That directory structure I suggested looks good to me. Please keep everything confined to this pull request. Thanks.

Jul 15 '15 00:07 jaredhoberock

Thanks for testing it. It should absolutely work with THRUST_DEVICE_SYSTEM_CPP. I was not able to reproduce the error you posted.

Here's what I did:

I removed the line std::cout << call_functor(t) << std::endl; since call_functor is not defined.
When I compiled like you did (i.e., without defining THRUST_VARIADIC_TUPLE), except with Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn), your test code compiled.

clang++ -I/Users/acorriga/code/thrust -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP -std=c++11 test.cpp

I then added -DTHRUST_VARIADIC_TUPLE to the compilation, and that compiled as well.

It's pretty hard for me to diagnose without reproducing the error. The only thing I can say is that you are getting errors in code that would be ignored if you had defined THRUST_VARIADIC_TUPLE. What happens when you actually use variadic tuple?

Jul 15 '15 00:07 andrewcorrigan

Are you sure your test code is correct? See the error:

thrust_variadic_tuple_test.cpp:43:15: note: while substituting deduced template arguments into function template 'get' [with N = 3, HT = int, TT = thrust::detail::cons<int, thrust::detail::cons<int, thrust::null_type> >]
  return test(thrust::get<Indices>(t)...);

It seems like your code is calling get<3> on a cons list (aka tuple) of length 3, which is obviously wrong.

Jul 15 '15 00:07 andrewcorrigan

Sorry for the false alarm, I thought that setting -std=c+11 was already sufficient and that THRUST_VARIADIC_TUPLE was not needed anymore.

When setting this define, my code compiles fine for me.

Jul 15 '15 04:07 schiller-manuel

It seems to me that if it works with variadic tuple, then it's by accident. get<3> of a tuple of size 3 should be an error.

Jul 15 '15 11:07 andrewcorrigan