cub block reductions fail to compile correctly with nvrtc for certain block sizes
Kernels that use cub's block reductions and are compiled with NVRTC (via Jitify) fail to compile for specific block sizes. See the error produced below, where template type deduction appears to fail in thread_store.cuh. Curiously, this only occurs for "unusual" block sizes; more regular block sizes compile fine.
```
../../thread/thread_store.cuh(351): error: argument list for template "cub::IterateThreadStore<COUNT, MAX>::Dereference [with COUNT=0, MAX=VOLATILE_MULTIPLE]" is missing
          detected during:
            instantiation of "void cub::ThreadStore(T *, T, cub::Int2Type<5>, cub::Int2Type<1>) [with T=quda::complex<double>]"
(410): here
            instantiation of "void cub::ThreadStore<MODIFIER,OutputIteratorT,T>(OutputIteratorT, T) [with MODIFIER=cub::STORE_VOLATILE, OutputIteratorT=quda::complex<double> *, T=quda::complex<double>]"
specializations/warp_reduce_smem.cuh(149): here
            instantiation of "T cub::WarpReduceSmem<T, LOGICAL_WARP_THREADS, PTX_ARCH>::ReduceStep<ALL_LANES_VALID,FOLDED_ITEMS_PER_LANE,ReductionOp,STEP>(T, int, ReductionOp, cub::Int2Type<STEP>) [with T=quda::complex<double>, LOGICAL_WARP_THREADS=18, PTX_ARCH=600, ALL_LANES_VALID=true, FOLDED_ITEMS_PER_LANE=1, ReductionOp=cub::Sum, STEP=0]"
specializations/warp_reduce_smem.cuh(349): here
            instantiation of "T cub::WarpReduceSmem<T, LOGICAL_WARP_THREADS, PTX_ARCH>::Reduce<ALL_LANES_VALID,FOLDED_ITEMS_PER_LANE,ReductionOp>(T, int, ReductionOp) [with T=quda::complex<double>, LOGICAL_WARP_THREADS=18, PTX_ARCH=600, ALL_LANES_VALID=true, FOLDED_ITEMS_PER_LANE=1, ReductionOp=cub::Sum]"
specializations/block_reduce_warp_reductions.cuh(184): here
            instantiation of "T cub::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::Sum<FULL_TILE>(T, int) [with T=quda::complex<double>, BLOCK_DIM_X=9, BLOCK_DIM_Y=2, BLOCK_DIM_Z=1, PTX_ARCH=600, FULL_TILE=true]"
cub/block/block_reduce.cuh(500): here
            instantiation of "T cub::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::Sum(T) [with T=quda::complex<double>, BLOCK_DIM_X=9, ALGORITHM=cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=2, BLOCK_DIM_Z=1, PTX_ARCH=600]"
```
A simple patch for the above problem is to provide the template arguments explicitly rather than rely on deduction, e.g.:
```diff
--- a/include/externals/cub/thread/thread_store.cuh
+++ b/include/externals/cub/thread/thread_store.cuh
@@ -348,7 +348,7 @@ __device__ __forceinline__ void ThreadStoreVolatilePtr(
     for (int i = 0; i < SHUFFLE_MULTIPLE; ++i)
         reinterpret_cast<ShuffleWord*>(words)[i] = reinterpret_cast<ShuffleWord*>(&val)[i];
-    IterateThreadStore<0, VOLATILE_MULTIPLE>::template Dereference(
+    IterateThreadStore<0, VOLATILE_MULTIPLE>::template Dereference<volatile VolatileWord*,VolatileWord>(
         reinterpret_cast<volatile VolatileWord*>(ptr),
         words);
 }
```
This fixes the compilation problem, and the kernels execute correctly.
Whose fault this is, I don't know. It looks like this issue may be hit elsewhere in cub, but my use case doesn't extend much beyond reductions at present.
Looks like it might be an issue with arrays decaying to pointers (or not)? https://stackoverflow.com/questions/33729176/why-does-array-type-not-decay-to-pointer-for-class-templates/33734238#33734238
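For illustration only, here is a standalone sketch of the deduction-from-an-array pattern involved, alongside the explicit-template-argument form used in the patch above. These are not the actual CUB signatures, just a hypothetical stand-in:

```cuda
struct Iterate
{
  // Hypothetical stand-in for IterateThreadStore<...>::Dereference: the second
  // argument is deduced from a raw array, which must decay to a pointer during
  // template argument deduction.
  template <typename PtrT, typename WordT>
  static __device__ void Dereference(PtrT ptr, WordT *vals) { ptr[0] = vals[0]; }
};

__device__ void caller(volatile int *ptr)
{
  int words[4] = {};
  // Implicit deduction -- the form NVRTC rejected in the report:
  Iterate::Dereference(ptr, words);
  // Spelling the template arguments out, as in the patch, sidesteps deduction:
  Iterate::Dereference<volatile int *, int>(ptr, words);
}
```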
Did you see my email? Give the branch https://github.com/NVlabs/cub/tree/array_type_decay_issue_nvrtc a shot. (It switches from structural-template-unrolling to functional-template-unrolling.)
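Roughly speaking, the contrast is the following (hypothetical sketch, not the actual code on that branch):

```cuda
// "Structural" unrolling: recursion driven by a class template with a
// terminating partial specialization (roughly the style in thread_store.cuh).
template <int COUNT, int MAX>
struct IterateStore
{
  template <typename T>
  static __device__ void Dereference(volatile T *ptr, T *vals)
  {
    ptr[COUNT] = vals[COUNT];
    IterateStore<COUNT + 1, MAX>::Dereference(ptr, vals);
  }
};

template <int MAX>
struct IterateStore<MAX, MAX>
{
  template <typename T>
  static __device__ void Dereference(volatile T *, T *) {}
};

// "Functional" unrolling: recursion driven by overloaded function templates
// dispatching on a tag type, so the call site never has to name a member
// template of a class template (and avoids the deduction problem above).
template <int N> struct IntTag {};

template <typename T>
__device__ void StoreStep(volatile T *, T *, IntTag<0>) {}

template <int N, typename T>
__device__ void StoreStep(volatile T *ptr, T *vals, IntTag<N>)
{
  ptr[N - 1] = vals[N - 1];
  StoreStep(ptr, vals, IntTag<N - 1>());
}
```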
@dumerrill @maddyscientist Did anything come from the fix that Duane mentioned above?
I can't find the branch now; maybe it was lost in the reorganization :/
Oops, I forgot to circle back to check on this.
- Beyond my CUB code patch, I found I could also WAR the issue in my own code by rewriting my block reduction to use converged warps (with zero padding); see the sketch after this list. It was the non-converged-warp path that was failing to compile with NVRTC.
- I never got around to checking @dumerrill's patched branch to see if this also WARed the issue.
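For concreteness, here is a hypothetical sketch of that converged-warp / zero-padding workaround (illustrative names and block shape, not the actual QUDA code):

```cuda
#include <cub/block/block_reduce.cuh>

// Launched with a 1-D block of `padded` threads; the padding lanes contribute
// the identity (0), so every warp participating in the reduction is full and
// the non-converged-warp path is never instantiated.
extern "C" __global__ void padded_reduce(const double *in, double *out)
{
  constexpr int active = 18;                          // "real" threads (e.g. 9x2)
  constexpr int padded = ((active + 31) / 32) * 32;   // round up to full warps: 32

  using BlockReduce = cub::BlockReduce<double, padded>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  int tid = threadIdx.x;
  double v = (tid < active) ? in[blockIdx.x * active + tid] : 0.0;
  double sum = BlockReduce(temp_storage).Sum(v);

  if (tid == 0) out[blockIdx.x] = sum;
}
```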
What do you suggest, @allisonvacanti? Shall I test whether this issue is still present with NVRTC in CUDA 11.x?
That is bizarre. It sounds like a bug in the compilation toolchain, especially since it works in some cases but not others.
> Shall I test if this issue is still present with NVRTC with CUDA 11.x?
If you get a chance, that'd be great. If it's fixed, we can close this out. If not, let's make a PR with the patch you provided in the original report.