cub icon indicating copy to clipboard operation
cub copied to clipboard

Compilation error with histogram test program

Open anshumang opened this issue 5 years ago • 7 comments

I am trying to compile the following CUB test program -

#include <cub/device/device_histogram.cuh>

int main()
{
        uint32_t* d_samples;
        uint8_t* d_histogram;
        uint8_t* d_levels;
        size_t temp_storage_bytes;
        int num_levels;
        size_t num_samples;
        cub::DeviceHistogram::HistogramRange(nullptr,
                        temp_storage_bytes,
                        d_samples,
                        d_histogram,
                        num_levels,
                        d_levels,
                        num_samples);
        return 0;
}

using the following command line - /usr/local/cuda-11.0/bin/nvcc cubDeviceHistogramTest.cu -c -o cubDeviceHistogramTest.o -ccbin /usr/bin/cc -m64 -Xcompiler ,\"-Wall\",\"-Wextra\",\"-Werror\",\"-fPIC\",\"-g\" -std=c++14 --expt-extended-lambda --expt-relaxed-constexpr -lineinfo --generate-code arch=compute_61,code=sm_61 --generate-code arch=compute_70,code=sm_70 -DNVCC -I/usr/local/cuda-11.0/include

but I see the following error -

/usr/local/cuda-11.0/include/cub/device/dispatch/../../agent/agent_histogram.cuh(314): error: no instance of overloaded function "atomi
cAdd" matches the argument list                                                                                                        
            argument types are: (uint8_t *, uint8_t)                                                                                  
          detected during:                                                                                                            
            instantiation of "void cub::AgentHistogram<AgentHistogramPolicyT, PRIVATIZED_SMEM_BINS, NUM_CHANNELS, NUM_ACTIVE_CHANNELS,
SampleIteratorT, CounterT, PrivatizedDecodeOpT, OutputDecodeOpT, OffsetT, PTX_ARCH>::StoreOutput(CounterT **) [with AgentHistogramPolic
yT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PtxHistogramSweepPolicy, PRIVATIZED_SMEM_BINS=256, NUM_CHANNELS=1, N
UM_ACTIVE_CHANNELS=1, SampleIteratorT=uint32_t *, CounterT=uint8_t, PrivatizedDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t
, uint8_t, int>::SearchTransform<uint8_t *>, OutputDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PassThruTr
ansform, OffsetT=int, PTX_ARCH=700]"                                                                                                  
(329): here                                                                                                                            
            instantiation of "void cub::AgentHistogram<AgentHistogramPolicyT, PRIVATIZED_SMEM_BINS, NUM_CHANNELS, NUM_ACTIVE_CHANNELS,
SampleIteratorT, CounterT, PrivatizedDecodeOpT, OutputDecodeOpT, OffsetT, PTX_ARCH>::StoreSmemOutput() [with AgentHistogramPolicyT=cub:
:DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PtxHistogramSweepPolicy, PRIVATIZED_SMEM_BINS=256, NUM_CHANNELS=1, NUM_ACTI
VE_CHANNELS=1, SampleIteratorT=uint32_t *, CounterT=uint8_t, PrivatizedDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8
_t, int>::SearchTransform<uint8_t *>, OutputDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PassThruTransform
, OffsetT=int, PTX_ARCH=700]"                                                                                                          
(774): here                                                                                                                            
            instantiation of "void cub::AgentHistogram<AgentHistogramPolicyT, PRIVATIZED_SMEM_BINS, NUM_CHANNELS, NUM_ACTIVE_CHANNELS,
SampleIteratorT, CounterT, PrivatizedDecodeOpT, OutputDecodeOpT, OffsetT, PTX_ARCH>::StoreOutput() [with AgentHistogramPolicyT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PtxHistogramSweepPolicy, PRIVATIZED_SMEM_BINS=256, NUM_CHANNELS=1, NUM_ACTIVE_$HANNELS=1, SampleIteratorT=uint32_t *, CounterT=uint8_t, PrivatizedDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t,
int>::SearchTransform<uint8_t *>, OutputDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PassThruTransform, O$fsetT=int, PTX_ARCH=700]"
/usr/local/cuda-11.0/include/cub/device/dispatch/dispatch_histogram.cuh(154): here
            instantiation of "void cub::DeviceHistogramSweepKernel<AgentHistogramPolicyT,PRIVATIZED_SMEM_BINS,NUM_CHANNELS,NUM_ACTIVE_$HANNELS,SampleIteratorT,CounterT,PrivatizedDecodeOpT,OutputDecodeOpT,OffsetT>(SampleIteratorT, cub::ArrayWrapper<int, NUM_ACTIVE_CHANN$LS>, cub::ArrayWrapper<int, NUM_ACTIVE_CHANNELS>, cub::ArrayWrapper<CounterT *, NUM_ACTIVE_CHANNELS>, cub::ArrayWrapper<CounterT *, NU$_ACTIVE_CHANNELS>, cub::ArrayWrapper<OutputDecodeOpT, NUM_ACTIVE_CHANNELS>, cub::ArrayWrapper<PrivatizedDecodeOpT, NUM_ACTIVE_CHANNELS$, OffsetT, OffsetT, OffsetT, int, cub::GridQueue<int>) [with AgentHistogramPolicyT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uin
t8_t, int>::PtxHistogramSweepPolicy, PRIVATIZED_SMEM_BINS=256, NUM_CHANNELS=1, NUM_ACTIVE_CHANNELS=1, SampleIteratorT=uint32_t *, Count
erT=uint8_t, PrivatizedDecodeOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::SearchTransform<uint8_t *>, OutputDeco
deOpT=cub::DipatchHistogram<1, 1, uint32_t *, uint8_t, uint8_t, int>::PassThruTransform, OffsetT=int]"
/usr/local/cuda-11.0/include/cub/device/dispatch/dispatch_histogram.cuh(792): here
            instantiation of "cudaError_t cub::DipatchHistogram<NUM_CHANNELS, NUM_ACTIVE_CHANNELS, SampleIteratorT, CounterT, LevelT, O
ffsetT>::DispatchRange(void *, size_t &, SampleIteratorT, CounterT **, int *, LevelT **, OffsetT, OffsetT, OffsetT, cudaStream_t, __nv_
bool, cub::Int2Type<0>) [with NUM_CHANNELS=1, NUM_ACTIVE_CHANNELS=1, SampleIteratorT=uint32_t *, CounterT=uint8_t, LevelT=uint8_t, Offs
etT=int]"
/usr/local/cuda-11.0/include/cub/device/device_histogram.cuh(849): here
            instantiation of "cudaError_t cub::DeviceHistogram::MultiHistogramRange<NUM_CHANNELS,NUM_ACTIVE_CHANNELS,SampleIteratorT,Co
unterT,LevelT,OffsetT>(void *, size_t &, SampleIteratorT, CounterT **, int *, LevelT **, OffsetT, OffsetT, size_t, cudaStream_t, __nv_b
ool) [with NUM_CHANNELS=1, NUM_ACTIVE_CHANNELS=1, SampleIteratorT=uint32_t *, CounterT=uint8_t, LevelT=uint8_t, OffsetT=int]"
/usr/local/cuda-11.0/include/cub/device/device_histogram.cuh(557): here
            instantiation of "cudaError_t cub::DeviceHistogram::HistogramRange(void *, size_t &, SampleIteratorT, CounterT *, int, Leve
lT *, OffsetT, cudaStream_t, __nv_bool) [with SampleIteratorT=uint32_t *, CounterT=uint8_t, LevelT=uint8_t, OffsetT=int]"
cubDeviceHistogramTest.cu(14): here

Looks like this is a bug.

anshumang avatar Jul 15 '20 01:07 anshumang

@anshumang Does it help if you change num_samples from size_t to int? If so, this is indeed a bug and will be fixed in https://github.com/thrust/cub/pull/38.

leofang avatar Jul 20 '20 17:07 leofang

No, it does not. @leofang

anshumang avatar Jul 20 '20 19:07 anshumang

Hi @anshumang Sorry I dropped the ball. Actually I was being silly. In your case the error log seems clear: There's no atomicAdd defined for (uint8_t *, uint8_t). All of the versions that CUDA provides are listed here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd. It's needed to atomically add the counts to the corresponding bin.

One very quick and dirty hack is upcast to unsigned int, which has an atomicAdd defined, and then downcast:

uint8_t atomicAdd(uint8_t* address, uint8_t val) {
    return (uint8_t)atomicAdd(reinterpret_cast<unsigned int*>address, static_cast<unsigned int>val);
}

This definition must show up before including device_histogram.cuh. However, there might be a performance impact for the upcast/downcast cycle, so some benchmarks are necessary.

leofang avatar Aug 04 '20 21:08 leofang

@anshumang Is your issue resolved?

leofang avatar Sep 09 '20 14:09 leofang

We should track down why we're trying to call atomics that don't exist here, and either error out or switch to a fallback implementation.

Trying to fake the atomicAdd as mentioned above will give an incorrect result (the wrong value will be read/written), and address is not likely to be aligned property for an atomic uint32_t access. To fake the atomic update, you'd need to

  1. load an aligned word
  2. do some shifts/masks to extract the bits you want from the word
  3. do the addition
  4. Replace the bits in the original word in (1) with the result
  5. Do a CAS operation to write back the memory, falling back to (1) if needed.

alliepiper avatar Oct 21 '20 18:10 alliepiper

This should be investigated and fixed when we address #227.

alliepiper avatar Nov 17 '20 21:11 alliepiper

Verified that this still happens with 1.16.0: https://www.godbolt.org/z/zGMv4enj8

alliepiper avatar May 09 '22 20:05 alliepiper