
CUDA Toolkit 12.4.0 `tuple` incompatibility

Open runer112 opened this issue 3 months ago • 9 comments

System information (version)
  • OpenCV => 4.9.0
  • Operating System / Platform => Windows 64 Bit
  • Compiler => Visual Studio 2022
Detailed description

OpenCV with CUDA support cannot be built using CUDA Toolkit 12.4.0.

While CUDA Toolkit 12.3.2 uses thrust version 2.2.0 (https://docs.nvidia.com/cuda/archive/12.3.2/cuda-toolkit-release-notes/index.html), CUDA Toolkit 12.4.0 updates to thrust version 2.3.1 (https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html). In thrust version 2.3.0, the tuple implementation was replaced with a standard tuple implementation (https://github.com/NVIDIA/cccl/pull/262). Notably, this changes the definition from a 10-parameter template to a variadic template. So instead of a tuple of n items being padded out with 10 - n null types to always have 10 template parameters, it now has only n template parameters. As a result, the function templates in cudev that specify 10 template parameters per tuple are no longer viable for tuples whose size is not 10.
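
The deduction failure described above can be reproduced on the host without CUDA. The following is a minimal sketch, assuming std::tuple as a stand-in for the variadic cuda::std::tuple; takes_ten and ten_param_overload_matches are hypothetical names used only for illustration, not cudev or Thrust code.

```cpp
#include <cassert>
#include <tuple>
#include <type_traits>
#include <utility>

// Old-style signature: names exactly 10 tuple parameters, as the cudev
// overloads do. std::tuple stands in for the variadic cuda::std::tuple
// (an assumption made so the sketch compiles on the host).
template <class P0, class P1, class P2, class P3, class P4,
          class P5, class P6, class P7, class P8, class P9>
void takes_ten(const std::tuple<P0, P1, P2, P3, P4, P5, P6, P7, P8, P9>&);

// Detection trait: true only if the call takes_ten(t) is well-formed for a T.
template <class T, class = void>
struct ten_param_overload_matches : std::false_type {};

template <class T>
struct ten_param_overload_matches<
    T, decltype(takes_ten(std::declval<const T&>()))> : std::true_type {};

using Pair = std::tuple<int, int>;
using Ten  = std::tuple<int, int, int, int, int, int, int, int, int, int>;

// With a variadic tuple, a 2-element tuple has 2 template arguments, so the
// 10 named parameters cannot be deduced from it.
constexpr bool pair_matches = ten_param_overload_matches<Pair>::value;
// Only a tuple that really has 10 elements still matches.
constexpr bool ten_matches = ten_param_overload_matches<Ten>::value;

static_assert(!pair_matches, "2-element tuple must not match the 10-parameter overload");
static_assert(ten_matches, "10-element tuple must match the 10-parameter overload");

// Runtime mirror of the compile-time checks above.
inline void check_deduction()
{
    assert(!pair_matches);
    assert(ten_matches);
}
```

With the old padded definition, Pair would itself have carried 10 template arguments, which is why the same overload used to be selected.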

An example of one such function template that's no longer viable, cv::cudev::blockReduce:

https://github.com/opencv/opencv_contrib/blob/6b5142ff657ca676ab35233556b49a532e75e2b7/modules/cudev/include/opencv2/cudev/block/reduce.hpp#L68-L81

An example of an error I encounter:

[build] Z:\dev\1\opencv_contrib\modules\cudev\include\opencv2\cudev\grid\detail/reduce.hpp(379): error : no instance of overloaded function "cv::cudev::blockReduce" matches the argument list [Z:\dev\1\opencv\out\build\user\modules\world\opencv_world.vcxproj]
[build]               argument types are: (cuda::std::__4::tuple<volatile int *, volatile int *>, cuda::std::__4::tuple<int &, int &>, int, cuda::std::__4::tuple<cv::cudev::minimum<int>, cv::cudev::maximum<int>>)
[build]                 blockReduce<BLOCK_SIZE>(smem_tuple(sminval, smaxval), tie(mymin, mymax), tid, make_tuple(minOp, maxOp));
[build]                 ^
[build]   Z:\dev\1\opencv_contrib\modules\cudev\include\opencv2\cudev/block/reduce.hpp(72): note #3327-D: candidate function template "cv::cudev::blockReduce<N,P0,P1,P2,P3,P4,P5,P6,P7,P8,P9,R0,R1,R2,R3,R4,R5,R6,R7,R8,R9,Op0,Op1,Op2,Op3,Op4,Op5,Op6,Op7,Op8,Op9>(const thrust::THRUST_200301_500_520_600_610_700_750_800_860_890_900_NS::tuple<P0, P1, P2, P3, P4, P5, P6, P7, P8, P9> &, const thrust::THRUST_200301_500_520_600_610_700_750_800_860_890_900_NS::tuple<R0, R1, R2, R3, R4, R5, R6, R7, R8, R9> &, uint, const thrust::THRUST_200301_500_520_600_610_700_750_800_860_890_900_NS::tuple<Op0, Op1, Op2, Op3, Op4, Op5, Op6, Op7, Op8, Op9> &)" failed deduction
[build]     __declspec(__device__) __forceinline void blockReduce(const tuple<P0, P1, P2, P3, P4, P5, P6, P7, P8, P9>& smem,
[build]                                               ^
[build]   Z:\dev\1\opencv_contrib\modules\cudev\include\opencv2\cudev/block/reduce.hpp(63): note #3327-D: candidate function template "cv::cudev::blockReduce<N,T,Op>(volatile T *, T &, uint, const Op &)" failed deduction
[build]     __declspec(__device__) __forceinline void blockReduce(volatile T* smem, T& val, uint tid, const Op& op)
[build]                                               ^
[build]             detected during:
[build]               instantiation of "void cv::cudev::grid_reduce_detail::MinMaxReductor<cv::cudev::grid_reduce_detail::both, src_type, work_type>::reduceGrid<BLOCK_SIZE>(work_type *, int) [with src_type=uchar, work_type=int, BLOCK_SIZE=256]" at line 412
[build]               instantiation of "void cv::cudev::grid_reduce_detail::reduce<Reductor,BLOCK_SIZE,PATCH_X,PATCH_Y,SrcPtr,ResType,MaskPtr>(SrcPtr, ResType *, MaskPtr, int, int) [with Reductor=cv::cudev::grid_reduce_detail::MinMaxReductor<cv::cudev::grid_reduce_detail::both, uchar, int>, BLOCK_SIZE=256, PATCH_X=4, PATCH_Y=4, SrcPtr=cv::cudev::GlobPtr<uchar>, ResType=int, MaskPtr=cv::cudev::WithOutMask]" at line 421
[build]               instantiation of "void cv::cudev::grid_reduce_detail::reduce<Reductor,Policy,SrcPtr,ResType,MaskPtr>(const SrcPtr &, ResType *, const MaskPtr &, int, int, cudaStream_t) [with Reductor=cv::cudev::grid_reduce_detail::MinMaxReductor<cv::cudev::grid_reduce_detail::both, uchar, int>, Policy=cv::cudev::DefaultGlobReducePolicy, SrcPtr=cv::cudev::GlobPtr<uchar>, ResType=int, MaskPtr=cv::cudev::WithOutMask]" at line 460
[build]               instantiation of "void cv::cudev::grid_reduce_detail::minMaxVal<Policy,SrcPtr,ResType,MaskPtr>(const SrcPtr &, ResType *, const MaskPtr &, int, int, cudaStream_t) [with Policy=cv::cudev::DefaultGlobReducePolicy, SrcPtr=cv::cudev::GlobPtr<uchar>, ResType=int, MaskPtr=cv::cudev::WithOutMask]" at line 206 of Z:\dev\1\opencv_contrib\modules\cudev\include\opencv2\cudev/grid/reduce.hpp
[build]               instantiation of "void cv::cudev::gridFindMinMaxVal_<Policy,SrcPtr,ResType>(const SrcPtr &, cv::cudev::GpuMat_<ResType> &, cv::cuda::Stream &) [with Policy=cv::cudev::DefaultGlobReducePolicy, SrcPtr=cv::cudev::GpuMat_<uchar>, ResType=int]" at line 349 of Z:\dev\1\opencv_contrib\modules\cudev\include\opencv2\cudev/grid/reduce.hpp
[build]               instantiation of "void cv::cudev::gridFindMinMaxVal(const SrcPtr &, cv::cudev::GpuMat_<ResType> &, cv::cuda::Stream &) [with SrcPtr=cv::cudev::GpuMat_<uchar>, ResType=int]" at line 68 of Z:\dev\1\opencv_contrib\modules\cudaarithm\src\cuda\minmax.cu
[build]               instantiation of "void <unnamed>::minMaxImpl<T,R>(const cv::cuda::GpuMat &, const cv::cuda::GpuMat &, cv::cuda::GpuMat &, cv::cuda::Stream &) [with T=uchar, R=int]" at line 92 of Z:\dev\1\opencv_contrib\modules\cudaarithm\src\cuda\minmax.cu

The first nonviable candidate shown in the error message is the function template linked above, which was viable and selected with previous CUDA Toolkit versions.

I think that all templates specifying 10 template parameters per tuple can be updated to work with the new tuple definition by replacing each set of 10 template parameters with a parameter pack. I think this should still be compatible with the old tuple definition, as well. For example, I think this would be a viable implementation of cv::cudev::blockReduce:

template <int N, typename... P, typename... R, class... Op>
__device__ __forceinline__ void blockReduce(const tuple<P...>& smem,
                                            const tuple<R...>& val,
                                            uint tid,
                                            const tuple<Op...>& op)
{
    block_reduce_detail::Dispatcher<N>::reductor::template reduce<
        const tuple<P...>&,
        const tuple<R...>&,
        const tuple<Op...>&>(smem, val, tid, op);
}
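
The same signature shape can be tried on the host to see how the three packs are deduced independently. This is a sketch under assumptions: std::tuple stands in for the toolkit's tuple, and reduce_arity is a hypothetical function that only reports the deduced element count instead of performing a reduction.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <tuple>

// Host-side analogue of the proposed signature. N is given explicitly at the
// call site; P, R and Op are each deduced from their own tuple argument.
template <int N, typename... P, typename... R, typename... Op>
std::size_t reduce_arity(const std::tuple<P...>&,
                         const std::tuple<R...>&,
                         unsigned /*tid*/,
                         const std::tuple<Op...>&)
{
    static_assert(N > 0, "block size must be positive");
    static_assert(sizeof...(P) == sizeof...(R) && sizeof...(R) == sizeof...(Op),
                  "all three tuples must have the same number of elements");
    return sizeof...(P);
}

// Mirrors the failing call site: two buffers, two references, two operators.
inline std::size_t two_element_call()
{
    int smin[4] = {}, smax[4] = {};
    int lo = 0, hi = 0;
    return reduce_arity<256>(std::make_tuple(&smin[0], &smax[0]),
                             std::tie(lo, hi), 0u,
                             std::make_tuple(std::less<int>(), std::greater<int>()));
}

// The same template also accepts three-element tuples without another overload.
inline std::size_t three_element_call()
{
    int a[4] = {}, b[4] = {}, c[4] = {};
    int x = 0, y = 0, z = 0;
    return reduce_arity<128>(std::make_tuple(&a[0], &b[0], &c[0]),
                             std::tie(x, y, z), 0u,
                             std::make_tuple(std::plus<int>(), std::plus<int>(),
                                             std::plus<int>()));
}

// Runtime check that the packs were deduced with the expected sizes.
inline void check_pack_deduction()
{
    assert(two_element_call() == 2);
    assert(three_element_call() == 3);
}
```
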
Steps to reproduce

Attempt to build cudev using CUDA Toolkit 12.4.0. I suspect that this error will be observed with any combination of OpenCV version, OS, platform, and compiler (that are modern enough to not encounter some other error first).

Issue submission checklist
  • [x] I report the issue, it's not a question
  • [x] I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
  • [x] I updated to the latest OpenCV version and the issue is still there
  • [x] There is reproducer code and related data files: videos, images, onnx, etc

runer112 avatar Mar 08 '24 18:03 runer112

Just to confirm your suspicion that this affects cross-platform builds, getting the same errors on Linux with GCC 13:

opencv_contrib-4.9.0/modules/cudev/include/opencv2/cudev/grid/detail/reduce_to_column.hpp(73): error: no instance of overloaded function "cv::cudev::blockReduce" matches the argument list                                                 
            argument types are: (cuda::std::__4::tuple<volatile float *, volatile float *>, cuda::std::__4::tuple<float &, float &>, const unsigned int, cuda::std::__4::tuple<cv::cudev::Sum<float>, cv::cudev::Sum<float>>)                                           
              blockReduce<BLOCK_SIZE>(smem_tuple(smem[0], smem[1]), tie(myVal.x, myVal.y), threadIdx.x, make_tuple(op, op));

^ one such error

sjuxax avatar Mar 12 '24 23:03 sjuxax

I have the same issue when building the latest OpenCV 4 from source with CUDA 12.4, cuDNN 9, and GCC 13 on Fedora 39.

General configuration for OpenCV 4.9.0-dev =====================================
--   Version control:               4.9.0-293-g912cf2a028
-- 
--   Extra modules:
--     Location (extra):            /home/coder/projects/opencv-src/opencv_contrib/modules
--     Version control (extra):     4.9.0-51-gab821068
-- 
--   Platform:
--     Timestamp:                   2024-03-29T14:50:46Z
--     Host:                        Linux 6.7.10-200.fc39.x86_64 x86_64
--     CMake:                       3.27.7
--     CMake generator:             Unix Makefiles
--     CMake build tool:            /usr/bin/gmake
--     Configuration:               RELEASE
-- 
--   CPU/HW features:
--     Baseline:                    SSE SSE2 SSE3
--       requested:                 SSE3
--     Dispatched code generation:  SSE4_1 SSE4_2 FP16 AVX AVX2 AVX512_SKX
--       requested:                 SSE4_1 SSE4_2 AVX FP16 AVX2 AVX512_SKX
--       SSE4_1 (18 files):         + SSSE3 SSE4_1
--       SSE4_2 (2 files):          + SSSE3 SSE4_1 POPCNT SSE4_2
--       FP16 (1 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
--       AVX (9 files):             + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
--       AVX2 (38 files):           + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
--       AVX512_SKX (8 files):      + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2 AVX_512F AVX512_COMMON AVX512_SKX
-- 
--   C/C++:
--     Built as dynamic libs?:      YES
--     C++ standard:                11
--     C++ Compiler:                /usr/lib64/ccache/c++  (ver 13.2.1)
--     C++ flags (Release):         -fsigned-char -ffast-math -fno-finite-math-only -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -O3 -DNDEBUG  -DNDEBUG
--     C++ flags (Debug):           -fsigned-char -ffast-math -fno-finite-math-only -W -Wall -Wreturn-type -Wnon-virtual-dtor -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wundef -Winit-self -Wpointer-arith -Wshadow -Wsign-promo -Wuninitialized -Wsuggest-override -Wno-delete-non-virtual-dtor -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -fvisibility-inlines-hidden -g  -O0 -DDEBUG -D_DEBUG
--     C Compiler:                  /usr/lib64/ccache/cc
--     C flags (Release):           -fsigned-char -ffast-math -fno-finite-math-only -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -O3 -DNDEBUG  -DNDEBUG
--     C flags (Debug):             -fsigned-char -ffast-math -fno-finite-math-only -W -Wall -Wreturn-type -Waddress -Wsequence-point -Wformat -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wstrict-prototypes -Wundef -Winit-self -Wpointer-arith -Wshadow -Wuninitialized -Wno-comment -Wimplicit-fallthrough=3 -Wno-strict-overflow -fdiagnostics-show-option -Wno-long-long -pthread -fomit-frame-pointer -ffunction-sections -fdata-sections  -msse -msse2 -msse3 -fvisibility=hidden -g  -O0 -DDEBUG -D_DEBUG
--     Linker flags (Release):      -Wl,--exclude-libs,libippicv.a -Wl,--exclude-libs,libippiw.a   -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined  
--     Linker flags (Debug):        -Wl,--exclude-libs,libippicv.a -Wl,--exclude-libs,libippiw.a   -Wl,--gc-sections -Wl,--as-needed -Wl,--no-undefined  
--     ccache:                      YES
--     Precompiled headers:         NO
--     Extra dependencies:          m pthread cudart_static dl rt nppc nppial nppicc nppidei nppif nppig nppim nppist nppisu nppitc npps cublas cudnn cufft -L/usr/local/cuda/lib64 -L/lib64
--     3rdparty dependencies:
-- 
--   OpenCV modules:
--     To be built:                 alphamat aruco bgsegm bioinspired calib3d ccalib core cudaarithm cudabgsegm cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping cudev datasets dnn dnn_objdetect dnn_superres dpm face features2d flann freetype fuzzy gapi hdf hfs highgui img_hash imgcodecs imgproc intensity_transform java line_descriptor mcc ml objdetect optflow phase_unwrapping photo plot python3 quality rapid reg rgbd saliency sfm shape signal stereo stitching structured_light superres surface_matching text tracking ts video videoio videostab viz wechat_qrcode xfeatures2d ximgproc xobjdetect xphoto
--     Disabled:                    cudacodec world
--     Disabled by dependency:      -
--     Unavailable:                 cannops cvv julia matlab ovis python2
--     Applications:                tests perf_tests apps
--     Documentation:               NO
--     Non-free algorithms:         YES
-- 
--   GUI:                           GTK3
--     GTK+:                        YES (ver 3.24.41)
--       GThread :                  YES (ver 2.78.3)
--       GtkGlExt:                  NO
--     VTK support:                 YES (ver 9.2.6)
-- 
--   Media I/O: 
--     ZLib:                        /lib64/libz.so (ver 1.2.13)
--     JPEG:                        /lib64/libjpeg.so (ver 62)
--     WEBP:                        /lib64/libwebp.so (ver encoder: 0x020f)
--     PNG:                         /lib64/libpng.so (ver 1.6.37)
--     TIFF:                        /lib64/libtiff.so (ver 42 / 4.4.0)
--     JPEG 2000:                   OpenJPEG (ver 2.5.0)
--     OpenEXR:                     OpenEXR::OpenEXR (ver 3.1.10)
--     HDR:                         YES
--     SUNRASTER:                   YES
--     PXM:                         YES
--     PFM:                         YES
-- 
--   Video I/O:
--     DC1394:                      YES (2.2.7)
--     FFMPEG:                      YES
--       avcodec:                   YES (60.31.102)
--       avformat:                  YES (60.16.100)
--       avutil:                    YES (58.29.100)
--       swscale:                   YES (7.5.100)
--       avresample:                NO
--     GStreamer:                   YES (1.22.9)
--     v4l/v4l2:                    YES (linux/videodev2.h)
-- 
--   Parallel framework:            TBB (ver 2020.3 interface 11103)
-- 
--   Trace:                         YES (with Intel ITT)
-- 
--   Other third-party libraries:
--     Intel IPP:                   2021.10.1 [2021.10.1]
--            at:                   /home/coder/projects/opencv-src/opencv/build/3rdparty/ippicv/ippicv_lnx/icv
--     Intel IPP IW:                sources (2021.10.1)
--               at:                /home/coder/projects/opencv-src/opencv/build/3rdparty/ippicv/ippicv_lnx/iw
--     VA:                          YES
--     Lapack:                      YES (/lib64/libopenblas.so)
--     Eigen:                       YES (ver 3.4.0)
--     Custom HAL:                  NO
--     Protobuf:                    build (3.19.1)
--     Flatbuffers:                 builtin/3rdparty (23.5.9)
-- 
--   NVIDIA CUDA:                   YES (ver 12.4, CUFFT CUBLAS FAST_MATH)
--     NVIDIA GPU arch:             75
--     NVIDIA PTX archs:            75
-- 
--   cuDNN:                         YES (ver 9.0.0)
-- 
--   OpenCL:                        YES (INTELVA)
--     Include path:                /home/coder/projects/opencv-src/opencv/3rdparty/include/opencl/1.2
--     Link libraries:              Dynamic load
-- 
--   Python 3:
--     Interpreter:                 /usr/bin/python3 (ver 3.12.2)
--     Libraries:                   /lib64/libpython3.12.so (ver 3.12.2)
--     Limited API:                 NO
--     numpy:                       /usr/lib64/python3.12/site-packages/numpy/core/include (ver 1.24.4)
--     install path:                lib/python3.12/site-packages/cv2/python-3.12
-- 
--   Python (for build):            /usr/bin/python3
-- 
--   Java:                          
--     ant:                         NO
--     Java:                        YES (ver 17.0.10)
--     JNI:                         /home/coder/.sdkman/candidates/java/current/include /home/coder/.sdkman/candidates/java/current/include/linux /home/coder/.sdkman/candidates/java/current/include
--     Java wrappers:               YES (JAVA)
--     Java tests:                  NO
-- 
--   Install to:                    /usr/local
-- -----------------------------------------------------------------

juls007 avatar Mar 24 '24 03:03 juls007

I'm having the same issue when building the latest OpenCV 4 from source on Windows 11.

moodzunl avatar Mar 24 '24 05:03 moodzunl

I agree, this should be fixable the way you describe it. However, tuple_size would need to be replaced as well. That is probably straightforward, but it is used inside templates where the parameter types are not directly visible. Example (last line):

template <class SrcPtr, class DstPtrTuple, class OpTuple, class MaskPtr>
__global__ void transform_tuple(const SrcPtr src, DstPtrTuple dst, const OpTuple op, const MaskPtr mask, const int rows, const int cols)
{   
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    const int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= cols || y >= rows || !mask(y, x)) 
        return;

    typename PtrTraits<SrcPtr>::value_type srcVal = src(y, x); 

    Unroll<tuple_size<DstPtrTuple>::value>::transform(srcVal, dst, op, y, x); 
}

Here one instance is compiled with DstPtrTuple=cv::cudev::ZipPtr<cuda::std::__4::tuple<cv::cudev::GlobPtr<float>, cv::cudev::GlobPtr<float>>>. How can the correct template parameters be recovered? Not expanding to the correct nested template would give the wrong result. The best I could come up with was:

template <typename T>
struct tuple_size {};

template <typename... P>
struct tuple_size< tuple<P...> >
{
    static const int value = sizeof...(P);
};

template <template <typename S> typename T, typename... P>
struct tuple_size< T<tuple<P...>> >
{   
    static const int value = sizeof...(P);
}; 

This seems to work for the case mentioned above. However, I am not sure whether it gives the correct result in all cases. Maybe someone can give some feedback, or has ideas on how this could be solved more elegantly?

HellmannM avatar Mar 26 '24 14:03 HellmannM

Alternatively, Thrust's tuple_size can be specialized for ZipPtr<tuple<...>>. Something like:

// placed at the end of modules/cudev/include/opencv2/cudev/ptr2d/zip.hpp, in the global namespace

_LIBCUDACXX_BEGIN_NAMESPACE_STD

template<class Ptr0, class Ptr1>
struct tuple_size<cv::cudev::ZipPtr<tuple<Ptr0, Ptr1>>> : tuple_size<tuple<Ptr0, Ptr1>> {};

template<class Ptr0, class Ptr1, class Ptr2>
struct tuple_size<cv::cudev::ZipPtr<tuple<Ptr0, Ptr1, Ptr2>>> : tuple_size<tuple<Ptr0, Ptr1, Ptr2>> {};

template<class Ptr0, class Ptr1, class Ptr2, class Ptr3>
struct tuple_size<cv::cudev::ZipPtr<tuple<Ptr0, Ptr1, Ptr2, Ptr3>>> : tuple_size<tuple<Ptr0, Ptr1, Ptr2, Ptr3>> {};


template<class Ptr0, class Ptr1>
struct tuple_size<cv::cudev::ZipPtrSz<tuple<Ptr0, Ptr1>>> : tuple_size<tuple<Ptr0, Ptr1>> {};

template<class Ptr0, class Ptr1, class Ptr2>
struct tuple_size<cv::cudev::ZipPtrSz<tuple<Ptr0, Ptr1, Ptr2>>> : tuple_size<tuple<Ptr0, Ptr1, Ptr2>> {};

template<class Ptr0, class Ptr1, class Ptr2, class Ptr3>
struct tuple_size<cv::cudev::ZipPtrSz<tuple<Ptr0, Ptr1, Ptr2, Ptr3>>> : tuple_size<tuple<Ptr0, Ptr1, Ptr2, Ptr3>> {};


template<size_t N, class Ptr0, class Ptr1>
struct tuple_element<N, cv::cudev::ZipPtr<tuple<Ptr0, Ptr1>>> : tuple_element<N, tuple<Ptr0, Ptr1>> {};

template<size_t N, class Ptr0, class Ptr1, class Ptr2>
struct tuple_element<N, cv::cudev::ZipPtr<tuple<Ptr0, Ptr1, Ptr2>>> : tuple_element<N, tuple<Ptr0, Ptr1, Ptr2>> {};

template<size_t N, class Ptr0, class Ptr1, class Ptr2, class Ptr3>
struct tuple_element<N, cv::cudev::ZipPtr<tuple<Ptr0, Ptr1, Ptr2, Ptr3>>> : tuple_element<N, tuple<Ptr0, Ptr1, Ptr2, Ptr3>> {};

_LIBCUDACXX_END_NAMESPACE_STD

Thrust does this for backwards compatibility with the old style of tuples as well. It also appears that tuple_element needs to be fixed as well, so I've included that for completeness.

Combined with the parameter pack changes mentioned above, this method let me compile OpenCV successfully.
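
The per-arity specializations above might also be collapsed into one variadic partial specialization. The following host-side sketch illustrates that idea under assumptions: ZipLike is a hypothetical stand-in for cv::cudev::ZipPtr, and std::tuple_size / std::tuple_element stand in for the libcu++ traits. It is not the code that was compiled in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical stand-in for cv::cudev::ZipPtr: a wrapper whose single
// template argument is a tuple of pointer-like types.
template <class PtrTuple>
struct ZipLike
{
    PtrTuple ptrs;
};

namespace std {

// One variadic specialization covers every arity of the wrapped tuple.
template <class... Ptrs>
struct tuple_size<ZipLike<tuple<Ptrs...>>> : tuple_size<tuple<Ptrs...>> {};

// Forward element lookup to the wrapped tuple as well.
template <size_t N, class... Ptrs>
struct tuple_element<N, ZipLike<tuple<Ptrs...>>> : tuple_element<N, tuple<Ptrs...>> {};

} // namespace std

using Zip2 = ZipLike<std::tuple<int*, float*>>;
using Zip3 = ZipLike<std::tuple<int*, float*, double*>>;

constexpr std::size_t zip2_size = std::tuple_size<Zip2>::value;
constexpr std::size_t zip3_size = std::tuple_size<Zip3>::value;
constexpr bool zip2_second_is_float_ptr =
    std::is_same<std::tuple_element<1, Zip2>::type, float*>::value;

// Runtime mirror of the compile-time values above.
inline void check_zip_traits()
{
    assert(zip2_size == 2);
    assert(zip3_size == 3);
    assert(zip2_second_is_float_ptr);
}
```

In cudev the equivalent specializations would presumably live inside the same namespace macros as the snippet above and name ZipPtr and ZipPtrSz explicitly.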

stefanboca avatar Mar 28 '24 18:03 stefanboca

I am one of the maintainers of the cccl libraries at NVIDIA.

We recently updated our old thrust::tuple implementation to be an alias for cuda::std::tuple. Unfortunately, when providing the necessary backfills for thrust::tuple_size to work with thrust::null_type, someone (me) forgot to add the final overload for a 10-element tuple. My apologies for the disruption this bug has caused here.

This has been fixed after this issue was raised here.

There are different potential ways of working around this issue in the near / mid term:

  1. You could pull in latest cccl from github and use that instead of the version packaged with the CTK 12.4. CMake should be able to find ToT with find_package(CCCL)
  2. You could provide a temporary workaround similar to the fix we employed
  3. (mid term) You should remove all explicit specializations of tuple that rely on the old 10-param tuple and simply use the number of tuple elements you actually need, aka replace thrust::tuple<T1, T2, thrust::null_type,...> with thrust::tuple<T1, T2>
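
To make the role of null_type padding concrete, here is a small host-side sketch of a size trait that ignores padding. It is not the CCCL fix; null_type, logical_size, and padded_tuple_size are hypothetical names, and std::tuple stands in for the toolkit's tuple.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>

struct null_type {};  // hypothetical stand-in for thrust::null_type

// Counts the elements of a pack that are not null_type padding.
template <class... Ts>
struct logical_size;

template <>
struct logical_size<> : std::integral_constant<std::size_t, 0> {};

template <class T, class... Ts>
struct logical_size<T, Ts...>
    : std::integral_constant<std::size_t,
                             (std::is_same<T, null_type>::value ? 0 : 1) +
                                 logical_size<Ts...>::value> {};

// Size of a tuple type as the old 10-parameter convention would report it.
template <class Tuple>
struct padded_tuple_size;

template <class... Ts>
struct padded_tuple_size<std::tuple<Ts...>> : logical_size<Ts...> {};

// Old-style spelling: two real elements padded out to ten parameters.
using Padded = std::tuple<int, float, null_type, null_type, null_type,
                          null_type, null_type, null_type, null_type, null_type>;
// New-style spelling of the same logical tuple.
using Plain = std::tuple<int, float>;
// A tuple that genuinely has ten elements and no padding.
using Full = std::tuple<int, int, int, int, int, int, int, int, int, int>;

constexpr std::size_t padded_size = padded_tuple_size<Padded>::value;
constexpr std::size_t plain_size  = padded_tuple_size<Plain>::value;
constexpr std::size_t full_size   = padded_tuple_size<Full>::value;

// Runtime mirror of the compile-time values above.
inline void check_padded_sizes()
{
    assert(padded_size == 2);
    assert(plain_size == 2);
    assert(full_size == 10);
}
```

Option 3 in the list above avoids the need for any such trait, because code that spells the tuple with only its real elements never sees the padding.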

miscco avatar Apr 15 '24 06:04 miscco

How do I replace it? Pull and run CMake? Which CMake parameters should I use? When I run cmake .. I get this error: Could not find libcudacxx_LIT using the following names: lit

devops-golang avatar Apr 24 '24 02:04 devops-golang

You could use CPM like:

include(cmake/CPM.cmake)
CPMAddPackage("gh:NVIDIA/cccl#main")
target_link_libraries(PROJECT CCCL::CCCL)

miscco avatar Apr 24 '24 19:04 miscco

I still don't quite get it. Do we have a solution already?

Do I need to build cccl and use it to replace the default one installed with CUDA Toolkit 12.4?

Thanks

jiapei100 avatar Apr 29 '24 06:04 jiapei100

I was able to build the library using CUDA Toolkit 12.3.2 in my environment (through vcpkg). This is one way to use it.

Also, the above cccl fixes seem to be going into v2.4.0. Even the latest version of the CUDA Toolkit at this time, 12.4.1, still seems to use an unfixed cccl. Please look at the NVIDIA CUDA Toolkit Release Notes for 12.4 Update 1.
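
To check which Thrust version a given toolkit actually bundles, the THRUST_VERSION macro from <thrust/version.h> can be decoded. The namespace in the error output earlier in this thread, THRUST_200301_..., appears to embed the same number. The sketch below decodes such a value on the host. ThrustVersion and decode_thrust_version are hypothetical helper names, and the encoding assumed is major * 100000 + minor * 100 + subminor.

```cpp
#include <cassert>

// Hypothetical helper type; the field names avoid the major/minor macros
// that some system headers define.
struct ThrustVersion
{
    int major_part;
    int minor_part;
    int subminor_part;
};

// Decodes a THRUST_VERSION-style integer, assuming the encoding
// major * 100000 + minor * 100 + subminor.
constexpr ThrustVersion decode_thrust_version(int v)
{
    return ThrustVersion{v / 100000, (v / 100) % 1000, v % 100};
}

// 200301 is the value embedded in the namespace seen in the error output.
constexpr ThrustVersion from_error_log = decode_thrust_version(200301);
static_assert(from_error_log.major_part == 2, "major");
static_assert(from_error_log.minor_part == 3, "minor");
static_assert(from_error_log.subminor_part == 1, "subminor");

// Runtime mirror of the compile-time checks, plus the 2.2.0 case.
inline void check_version_decoding()
{
    assert(from_error_log.major_part == 2);
    assert(from_error_log.minor_part == 3);
    assert(from_error_log.subminor_part == 1);
    assert(decode_thrust_version(200200).minor_part == 2);
}
```

In a CUDA build, passing THRUST_VERSION itself to decode_thrust_version would report the version of the headers the compiler actually picked up, which helps confirm whether a replacement cccl checkout is in use.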

LiuToki avatar May 07 '24 04:05 LiuToki

CUDA Toolkit 12.5 still has the bug.

asmorkalov avatar May 22 '24 08:05 asmorkalov