Compilation error with histogram test program
I am trying to compile the following CUB test program -
#include <cub/device/device_histogram.cuh>
int main()
{
uint32_t* d_samples;
uint8_t* d_histogram;
uint8_t* d_levels;
size_t temp_storage_bytes;
int num_levels;
size_t num_samples;
cub::DeviceHistogram::HistogramRange(nullptr,
temp_storage_bytes,
d_samples,
d_histogram,
num_levels,
d_levels,
num_samples);
return 0;
}
using the following command line -
/usr/local/cuda-11.0/bin/nvcc cubDeviceHistogramTest.cu -c -o cubDeviceHistogramTest.o -ccbin /usr/bin/cc -m64 -Xcompiler ,\"-Wall\",\"-Wextra\",\"-Werror\",\"-fPIC\",\"-g\" -std=c++14 --expt-extended-lambda --expt-relaxed-constexpr -lineinfo --generate-code arch=compute_61,code=sm_61 --generate-code arch=compute_70,code=sm_70 -DNVCC -I/usr/local/cuda-11.0/include
but I see the following error -
/usr/local/cuda-11.0/include/cub/device/dispatch/../../agent/agent_histogram.cuh(314): error: no instance of overloaded function "atomi
cAdd" matches the argument list
argument types are: (uint8_t *, uint8_t)
detected during:
instantiation of "void cub::AgentHistogram<AgentHistogramPolicyT, PRIVATIZED_SMEM_BINS, NUM_CHANNELS, NUM_ACTIVE_CHANNELS,
SampleIteratorT, CounterT, PrivatizedDecodeOpT, OutputDecodeOpT, OffsetT, PTX_ARCH>::StoreOutput(CounterT **) [with AgentHistogramPolic
yT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PtxHistogramSweepPolicy, PRIVATIZED_SMEM_BINS=256, NUM_CHANNELS=1, N
UM_ACTIVE_CHANNELS=1, SampleIteratorT=uint32_t *, CounterT=uint8_t, PrivatizedDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t
, uint8_t, int>::SearchTransform<uint8_t *>, OutputDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PassThruTr
ansform, OffsetT=int, PTX_ARCH=700]"
(329): here
instantiation of "void cub::AgentHistogram<AgentHistogramPolicyT, PRIVATIZED_SMEM_BINS, NUM_CHANNELS, NUM_ACTIVE_CHANNELS,
SampleIteratorT, CounterT, PrivatizedDecodeOpT, OutputDecodeOpT, OffsetT, PTX_ARCH>::StoreSmemOutput() [with AgentHistogramPolicyT=cub:
:DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PtxHistogramSweepPolicy, PRIVATIZED_SMEM_BINS=256, NUM_CHANNELS=1, NUM_ACTI
VE_CHANNELS=1, SampleIteratorT=uint32_t *, CounterT=uint8_t, PrivatizedDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8
_t, int>::SearchTransform<uint8_t *>, OutputDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PassThruTransform
, OffsetT=int, PTX_ARCH=700]"
(774): here
instantiation of "void cub::AgentHistogram<AgentHistogramPolicyT, PRIVATIZED_SMEM_BINS, NUM_CHANNELS, NUM_ACTIVE_CHANNELS,
SampleIteratorT, CounterT, PrivatizedDecodeOpT, OutputDecodeOpT, OffsetT, PTX_ARCH>::StoreOutput() [with AgentHistogramPolicyT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PtxHistogramSweepPolicy, PRIVATIZED_SMEM_BINS=256, NUM_CHANNELS=1, NUM_ACTIVE_$HANNELS=1, SampleIteratorT=uint32_t *, CounterT=uint8_t, PrivatizedDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t,
int>::SearchTransform<uint8_t *>, OutputDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PassThruTransform, O$fsetT=int, PTX_ARCH=700]"
/usr/local/cuda-11.0/include/cub/device/dispatch/dispatch_histogram.cuh(154): here
instantiation of "void cub::DeviceHistogramSweepKernel<AgentHistogramPolicyT,PRIVATIZED_SMEM_BINS,NUM_CHANNELS,NUM_ACTIVE_$HANNELS,SampleIteratorT,CounterT,PrivatizedDecodeOpT,OutputDecodeOpT,OffsetT>(SampleIteratorT, cub::ArrayWrapper<int, NUM_ACTIVE_CHANN$LS>, cub::ArrayWrapper<int, NUM_ACTIVE_CHANNELS>, cub::ArrayWrapper<CounterT *, NUM_ACTIVE_CHANNELS>, cub::ArrayWrapper<CounterT *, NU$_ACTIVE_CHANNELS>, cub::ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS>, cub::ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS$, OffsetT, OffsetT, OffsetT, int, cub::GridQueue<int>) [with AgentHistogramPolicyT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uin
t8_t, int>::PtxHistogramSweepPolicy, PRIVATIZED_SMEM_BINS=256, NUM_CHANNELS=1, NUM_ACTIVE_CHANNELS=1, SampleIteratorT=uint32_t *, Count
erT=uint8_t, PrivatizedDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::SearchTransform<uint8_t *>, OutputDeco
deOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PassThruTransform, OffsetT=int]"
/usr/local/cuda-11.0/include/cub/device/dispatch/dispatch_histogram.cuh(792): here
instantiation of "cudaError_t cub::DipatchHistogram<NUM_CHANNELS, NUM_ACTIVE_CHANNELS, SampleIteratorT, CounterT, LevelT, O
ffsetT>::DispatchRange(void *, size_t &, SampleIteratorT, CounterT **, int *, LevelT **, OffsetT, OffsetT, OffsetT, cudaStream_t, __nv_
bool, cub::Int2Type<0>) [with NUM_CHANNELS=1, NUM_ACTIVE_CHANNELS=1, SampleIteratorT=uint32_t *, CounterT=uint8_t, LevelT=uint8_t, Offs
etT=int]"
/usr/local/cuda-11.0/include/cub/device/device_histogram.cuh(849): here
instantiation of "cudaError_t cub::DeviceHistogram::MultiHistogramRange<NUM_CHANNELS,NUM_ACTIVE_CHANNELS,SampleIteratorT,Co
unterT,LevelT,OffsetT>(void *, size_t &, SampleIteratorT, CounterT **, int *, LevelT **, OffsetT, OffsetT, size_t, cudaStream_t, __nv_b
ool) [with NUM_CHANNELS=1, NUM_ACTIVE_CHANNELS=1, SampleIteratorT=uint32_t *, CounterT=uint8_t, LevelT=uint8_t, OffsetT=int]"
/usr/local/cuda-11.0/include/cub/device/device_histogram.cuh(557): here
instantiation of "cudaError_t cub::DeviceHistogram::HistogramRange(void *, size_t &, SampleIteratorT, CounterT *, int, Leve
lT *, OffsetT, cudaStream_t, __nv_bool) [with SampleIteratorT=uint32_t *, CounterT=uint8_t, LevelT=uint8_t, OffsetT=int]"
cubDeviceHistogramTest.cu(14): here
Looks like this is a bug.
@anshumang Does it help if you change num_samples from size_t to int? If so, this is indeed a bug and will be fixed in https://github.com/thrust/cub/pull/38.
No, it does not. @leofang
Hi @anshumang Sorry I dropped the ball. Actually I was being silly. In your case the error log seems clear: There's no atomicAdd defined for (uint8_t *, uint8_t). All of the versions that CUDA provides are listed here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd. It's needed to atomically add the counts to the corresponding bin.
One very quick and dirty hack is upcast to unsigned int, which has an atomicAdd defined, and then downcast:
uint8_t atomicAdd(uint8_t* address, uint8_t val) {
return (uint8_t)atomicAdd(reinterpret_cast<unsigned int*>address, static_cast<unsigned int>val);
}
This definition must show up before including device_histogram.cuh. However, there might be a performance impact for the upcast/downcast cycle, so some benchmarks are necessary.
@anshumang Is your issue resolved?
We should track down why we're trying to call atomics that don't exist here, and either error out or switch to a fallback implementation.
Trying to fake the atomicAdd as mentioned above will give an incorrect result (the wrong value will be read/written), and address is not likely to be aligned property for an atomic uint32_t access. To fake the atomic update, you'd need to
- load an aligned word
- do some shifts/masks to extract the bits you want from the word
- do the addition
- Replace the bits in the original word in (1) with the result
- Do a CAS operation to write back the memory, falling back to (1) if needed.
This should be investigated and fixed when we address #227.
Verified that this still happens with 1.16.0: https://www.godbolt.org/z/zGMv4enj8