cub block reductions fail to compile correctly with nvrtc for certain block sizes
Kernels that use cub's block reductions and are compiled with NVRTC (via Jitify) fail to compile for specific block sizes. See the error produced below, where template type deduction appears to fail in thread_store.cuh. Curiously, this only occurs for "unusual" block sizes; more regular block sizes compile fine.
```
../../thread/thread_store.cuh(351): error: argument list for template "cub::IterateThreadStore<COUNT, MAX>::Dereference [with COUNT=0, MAX=VOLATILE_MULTIPLE]" is missing
          detected during:
            instantiation of "void cub::ThreadStore(T *, T, cub::Int2Type<5>, cub::Int2Type<1>) [with T=quda::complex<double>]"
(410): here
            instantiation of "void cub::ThreadStore<MODIFIER,OutputIteratorT,T>(OutputIteratorT, T) [with MODIFIER=cub::STORE_VOLATILE, OutputIteratorT=quda::complex<double> *, T=quda::complex<double>]"
specializations/warp_reduce_smem.cuh(149): here
            instantiation of "T cub::WarpReduceSmem<T, LOGICAL_WARP_THREADS, PTX_ARCH>::ReduceStep<ALL_LANES_VALID,FOLDED_ITEMS_PER_LANE,ReductionOp,STEP>(T, int, ReductionOp, cub::Int2Type<STEP>) [with T=quda::complex<double>, LOGICAL_WARP_THREADS=18, PTX_ARCH=600, ALL_LANES_VALID=true, FOLDED_ITEMS_PER_LANE=1, ReductionOp=cub::Sum, STEP=0]"
specializations/warp_reduce_smem.cuh(349): here
            instantiation of "T cub::WarpReduceSmem<T, LOGICAL_WARP_THREADS, PTX_ARCH>::Reduce<ALL_LANES_VALID,FOLDED_ITEMS_PER_LANE,ReductionOp>(T, int, ReductionOp) [with T=quda::complex<double>, LOGICAL_WARP_THREADS=18, PTX_ARCH=600, ALL_LANES_VALID=true, FOLDED_ITEMS_PER_LANE=1, ReductionOp=cub::Sum]"
specializations/block_reduce_warp_reductions.cuh(184): here
            instantiation of "T cub::BlockReduceWarpReductions<T, BLOCK_DIM_X, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::Sum<FULL_TILE>(T, int) [with T=quda::complex<double>, BLOCK_DIM_X=9, BLOCK_DIM_Y=2, BLOCK_DIM_Z=1, PTX_ARCH=600, FULL_TILE=true]"
cub/block/block_reduce.cuh(500): here
            instantiation of "T cub::BlockReduce<T, BLOCK_DIM_X, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::Sum(T) [with T=quda::complex<double>, BLOCK_DIM_X=9, ALGORITHM=cub::BLOCK_REDUCE_WARP_REDUCTIONS, BLOCK_DIM_Y=2, BLOCK_DIM_Z=1, PTX_ARCH=600]"
```
A simple patch for the above problem is to provide the template arguments explicitly rather than rely on deduction, e.g.:
```diff
--- a/include/externals/cub/thread/thread_store.cuh
+++ b/include/externals/cub/thread/thread_store.cuh
@@ -348,7 +348,7 @@ __device__ __forceinline__ void ThreadStoreVolatilePtr(
     for (int i = 0; i < SHUFFLE_MULTIPLE; ++i)
         reinterpret_cast<ShuffleWord*>(words)[i] = reinterpret_cast<ShuffleWord*>(&val)[i];
-    IterateThreadStore<0, VOLATILE_MULTIPLE>::template Dereference(
+    IterateThreadStore<0, VOLATILE_MULTIPLE>::template Dereference<volatile VolatileWord*,VolatileWord>(
         reinterpret_cast<volatile VolatileWord*>(ptr),
         words);
 }
```
This fixes the compilation problem, and the kernels execute correctly.
Whose fault this is, I don't know. It looks like this issue may be hit elsewhere in cub, but my use case doesn't extend much beyond reductions at present.
Looks like it might be an issue with arrays decaying to pointers (or not)? https://stackoverflow.com/questions/33729176/why-does-array-type-not-decay-to-pointer-for-class-templates/33734238#33734238
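For illustration only, here is a standalone sketch of the deduction-from-an-array pattern involved, alongside the explicit-template-argument form used in the patch above. These are not the actual CUB signatures, just a hypothetical stand-in:

```cuda
struct Iterate
{
  // Hypothetical stand-in for IterateThreadStore<...>::Dereference: the second
  // argument is deduced from a raw array, which must decay to a pointer during
  // template argument deduction.
  template <typename PtrT, typename WordT>
  static __device__ void Dereference(PtrT ptr, WordT *vals) { ptr[0] = vals[0]; }
};

__device__ void caller(volatile int *ptr)
{
  int words[4] = {};
  // Implicit deduction -- the form NVRTC rejected in the report:
  Iterate::Dereference(ptr, words);
  // Spelling the template arguments out, as in the patch, sidesteps deduction:
  Iterate::Dereference<volatile int *, int>(ptr, words);
}
```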
Did you see my email? Give the branch https://github.com/NVlabs/cub/tree/array_type_decay_issue_nvrtc a shot. (It switches from structural-template-unrolling to functional-template-unrolling.)
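Roughly speaking, the contrast is the following (hypothetical sketch, not the actual code on that branch):

```cuda
// "Structural" unrolling: recursion driven by a class template with a
// terminating partial specialization (roughly the style in thread_store.cuh).
template <int COUNT, int MAX>
struct IterateStore
{
  template <typename T>
  static __device__ void Dereference(volatile T *ptr, T *vals)
  {
    ptr[COUNT] = vals[COUNT];
    IterateStore<COUNT + 1, MAX>::Dereference(ptr, vals);
  }
};

template <int MAX>
struct IterateStore<MAX, MAX>
{
  template <typename T>
  static __device__ void Dereference(volatile T *, T *) {}
};

// "Functional" unrolling: recursion driven by overloaded function templates
// dispatching on a tag type, so the call site never has to name a member
// template of a class template (and avoids the deduction problem above).
template <int N> struct IntTag {};

template <typename T>
__device__ void StoreStep(volatile T *, T *, IntTag<0>) {}

template <int N, typename T>
__device__ void StoreStep(volatile T *ptr, T *vals, IntTag<N>)
{
  ptr[N - 1] = vals[N - 1];
  StoreStep(ptr, vals, IntTag<N - 1>());
}
```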
@dumerrill @maddyscientist Did anything come from the fix that Duane mentioned above?
I can't find the branch now; maybe it was lost in the reorganization :/
Oops, I forgot to circle back to check on this.
- Beyond my CUB code patch, I found I could also WAR the issue in my own code by rewriting my block reduction to use converged warps (with zero padding); see the sketch after this list. It was the non-converged-warp path that was failing to compile with NVRTC.
- I never got around to checking @dumerrill's patched branch to see if this also WARed the issue.
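For concreteness, here is a hypothetical sketch of that converged-warp / zero-padding workaround (illustrative names and block shape, not the actual QUDA code):

```cuda
#include <cub/block/block_reduce.cuh>

// Launched with a 1-D block of `padded` threads; the padding lanes contribute
// the identity (0), so every warp participating in the reduction is full and
// the non-converged-warp path is never instantiated.
extern "C" __global__ void padded_reduce(const double *in, double *out)
{
  constexpr int active = 18;                          // "real" threads (e.g. 9x2)
  constexpr int padded = ((active + 31) / 32) * 32;   // round up to full warps: 32

  using BlockReduce = cub::BlockReduce<double, padded>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  int tid = threadIdx.x;
  double v = (tid < active) ? in[blockIdx.x * active + tid] : 0.0;
  double sum = BlockReduce(temp_storage).Sum(v);

  if (tid == 0) out[blockIdx.x] = sum;
}
```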
What do you suggest, @allisonvacanti? Shall I test whether this issue is still present with NVRTC in CUDA 11.x?
That is bizarre. It sounds like a bug in the compilation toolchain, especially since it works in some cases but not others.
> Shall I test if this issue is still present with NVRTC with CUDA 11.x?
If you get a chance, that'd be great. If it's fixed, we can close this out. If not, let's make a PR with the patch you provided in the original report.