CUDA cooperative groups support
Hi,
I was trying to get one of my favourite parts of the CUDA toolkit, the cooperative groups API, to work with circle the other day. This issue documents the errors I ran into and provisional CUDA-header-patching workarounds for them, as well as a few suggestions. It ended up being quite long, but I hope all the information I crammed in here comes across as helpful, which was certainly the intention. I'm also slightly wary of posting here, given the low number of replies, but it still seems like the best option, so here goes.
A lot of issues pop up when, say, compiling CUDA samples that use cooperative groups. Most of them are due to circle, but a few big ones seem to be on NVidia's side and compilation also fails with nvcc's younger brother nvc++. I've submitted a bug report and opened a forum topic with NVidia regarding that.
The first issue to work around with circle is that using CUDA 12.2 bundled with HPC SDK 23.7 causes compilation of any `.cu` file to fail with the error message:
error: /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/include/sm_32_atomic_functions.hpp:94:43
redefinition of function long long atomicMin(long long*, long long)
<more-lines>
This is due to the removal of a check for whether `_NVHPC_CUDA` is defined somewhere in the `sm_32_atomic_functions.h` CUDA header; the check was present in a previous version of the toolkit, and its removal is the subject of my bug report to NVidia. It can be worked around without any adverse side effects by passing `-D__SM_32_ATOMIC_FUNCTIONS_H__` to circle for every GPU compilation.
I'm using circle build 200 on Kali Linux running in WSL2 with gcc 12.3.0-5 and libstdc++ 13.1.0-6 (somehow). I have a laptop GeForce RTX 2060 with SM level 7.5. I set the `CUDA_PATH` environment variable and used the following alias to compile CUDA samples:
alias mycirc='circle --std=c++20 -D__SM_32_ATOMIC_FUNCTIONS_H__ -I../../../Common --sm_75'
Here's a chronologically ordered list of issues I ran into when compiling CG-related CUDA samples:
- Compiling `binaryPartitionCG` immediately fails with a segmentation fault. The following change somehow resolves that:
diff --git b/include-old/cooperative_groups/details/reduce.h a/include/cooperative_groups/details/reduce.h
index 3c06df6..d483728 100755
--- b/include-old/cooperative_groups/details/reduce.h
+++ a/include/cooperative_groups/details/reduce.h
@@ -278,8 +278,9 @@ namespace details {
*warp_scratch_location =
details::reduce(warp, _CG_STL_NAMESPACE::forward<TyVal>(val), op);
};
+ using SubwarpType = details::internal_thread_block_tile<num_warps, warpType>;
auto inter_warp_lambda =
- [&] (const details::internal_thread_block_tile<num_warps, warpType>& subwarp, TyRet* thread_scratch_location) {
+ [&] (const SubwarpType& subwarp, TyRet* thread_scratch_location) {
*thread_scratch_location =
details::reduce(subwarp, *thread_scratch_location, _CG_STL_NAMESPACE::forward<TyFn>(op));
};
- The next error message that pops up is
error: TyTrunc cooperative_groups::__v1::details::vec3_to_linear(dim3, dim3)
failure during overload resolution for function TyTrunc cooperative_groups::__v1::details::vec3_to_linear(dim3, dim3)
function declared at /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups/details/helpers.h:77:34
/opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups/details/helpers.h:97:48
return vec3_to_linear<unsigned int>(threadIdx, blockDim);
^
error: /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups/details/helpers.h:97:48
... included from /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups.h:57:10
... included from CudaExample.cu:60:10
cannot convert lvalue const __type_threadIdx to dim3
__type_threadIdx declared at GPU implicit declarations:15:1
dim3 declared at /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/vector_types.h:418:1
return vec3_to_linear<unsigned int>(threadIdx, blockDim);
and a similar error message with `__type_blockIdx` instead of `__type_threadIdx`. `vec3_to_linear` expects its first parameter to be `dim3`.
The CUDA programming guide prescribes the type of `threadIdx` as `uint3`, and nvcc sees it as such. `dim3` has a non-explicit converting constructor from `uint3`, so passing `threadIdx` should work out of the box. However, circle sees its type as `__type_threadIdx`, which is distinct from `uint3` (somehow overriding the definition in `<device_launch_parameters.h>`) but is implicitly convertible to `uint3`. Unfortunately, implicit convertibility is not transitive, and `__type_threadIdx` is not implicitly convertible to `dim3`.
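The non-transitivity can be reproduced in plain host C++ with stand-in types (the names below are hypothetical stand-ins, not the real compiler internals): an implicit conversion sequence may contain at most one user-defined conversion, so two single-step conversions do not chain into one.

```cpp
#include <type_traits>

struct Uint3 { unsigned x, y, z; };                  // plays the role of uint3

struct TypeThreadIdx {                               // plays the role of __type_threadIdx
    unsigned x, y, z;
    operator Uint3() const { return {x, y, z}; }     // implicitly convertible to Uint3
};

struct Dim3 {                                        // plays the role of dim3
    unsigned x, y, z;
    Dim3(Uint3 v) : x(v.x), y(v.y), z(v.z) {}        // non-explicit ctor from Uint3
};

// Each single step exists, but chaining them would need two user-defined
// conversions, which implicit conversion sequences never allow:
static_assert(std::is_convertible_v<TypeThreadIdx, Uint3>);
static_assert(std::is_convertible_v<Uint3, Dim3>);
static_assert(!std::is_convertible_v<TypeThreadIdx, Dim3>);
```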
A workaround that allowed me to continue was to add the following implicitly converting constructors to `dim3` in `<vector_types.h>`:
#ifdef __circle_lang__
__host__ __device__ constexpr dim3(__type_threadIdx v) : x(v.x), y(v.y), z(v.z) {}
__host__ __device__ constexpr dim3(__type_blockIdx v) : x(v.x), y(v.y), z(v.z) {}
#endif
A slightly more robust solution would be to add a conversion-to-`dim3` operator to `__type_threadIdx`. But whenever someone writes a class with a converting constructor from `uint3` and then wants to convert `threadIdx` to it, circle will break, so a different solution would be ideal.
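The user-code breakage can be sketched with the same hypothetical stand-in types as above (`BuiltinIdx` stands in for circle's `__type_threadIdx`; `MyCoord` is an invented user class, not anything from the CUDA headers):

```cpp
struct Uint3 { unsigned x, y, z; };

struct BuiltinIdx {                       // stand-in for __type_threadIdx
    unsigned x, y, z;
    operator Uint3() const { return {x, y, z}; }
};

struct MyCoord {                          // hypothetical user class converting from "uint3"
    unsigned x, y, z;
    MyCoord(Uint3 v) : x(v.x), y(v.y), z(v.z) {}
};

MyCoord to_coord(MyCoord c) { return c; }

unsigned demo() {
    BuiltinIdx idx{1, 2, 3};
    // to_coord(idx);                              // ill-formed: would need two
    //                                             // user-defined conversions
    return to_coord(static_cast<Uint3>(idx)).x;    // works only with an explicit cast
}
```

With nvcc, where `threadIdx` really is a `uint3`, the commented-out call would compile; users would have to sprinkle explicit casts under circle.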
Inspecting the output of `strings circle` hints at circle implementing the `x`, `y`, `z` members of `__type_threadIdx` as properties (neat!) that delegate to function calls. Would it be possible to implement "namespace-level properties" that delegate global-variable accesses to function calls and implement `threadIdx` as an actual `uint3` that way (and similarly for the other built-in variables)? Just an idea.
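For illustration, something property-like can even be approximated in standard C++ with a global object whose member accesses go through function calls while the object still converts to a plain struct. Everything here is a hypothetical host-side sketch (the `read_tid_*` getters just return fixed values where the real implementation would read the hardware registers):

```cpp
struct Uint3 { unsigned x, y, z; };

// Stand-ins for per-axis intrinsic reads (fixed values for the sketch):
inline unsigned read_tid_x() { return 4; }
inline unsigned read_tid_y() { return 2; }
inline unsigned read_tid_z() { return 1; }

struct ThreadIdxProperty {
    struct Axis {
        unsigned (*get)();
        operator unsigned() const { return get(); }  // member access delegates to a call
    };
    Axis x{read_tid_x}, y{read_tid_y}, z{read_tid_z};
    operator Uint3() const { return {x, y, z}; }     // still usable as a real uint3
};

inline ThreadIdxProperty threadIdxLike;              // the "namespace-level property"
```

Usage: `unsigned i = threadIdxLike.x;` calls `read_tid_x` behind the scenes, and `Uint3 v = threadIdxLike;` works via the conversion operator, so downstream code that expects a `uint3`-like value is unaffected.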
- With this, `binaryPartitionCG` compiles and gives the same output as when compiled with nvcc. Next, compiling `reductionMultiBlockCG` yields
ptxas fatal : Unresolved extern function '__trap'
<more lines>
Looks like the `__trap` intrinsic is missing entirely. Adding
#ifdef __circle_lang__
__host__ __device__ constexpr void __trap() {}
#endif
to the top of the sample, before the includes, fixed compilation. There seem to be many more functions in `device_functions.h`, where `__trap` is declared, whose implementations are missing, including some surprising ones like `__expf` and the `__fsub` family. The commented ones in this test program are some (but perhaps not all) of them.
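In the same spirit as the `__trap` stub above, one could imagine guarding fallback shims behind `#ifdef __circle_lang__`. The versions below are hypothetical host-side stand-ins so they can actually run here; on-device, `__trap` would normally map to the PTX `trap;` instruction, and `__expf` is a fast approximate exponential, approximated here by the accurate one:

```cpp
#include <cmath>
#include <cstdlib>

// Hypothetical host-side fallbacks, NOT the real device intrinsics:
[[noreturn]] inline void trap_fallback() {
    std::abort();                         // stand-in for the device trap
}

inline float expf_fallback(float x) {
    return std::exp(x);                   // accurate exp as a stand-in for __expf
}

inline float fsub_rn_fallback(float a, float b) {
    return a - b;                         // stand-in for round-to-nearest __fsub_rn
}
```

These obviously don't reproduce the exact device semantics (rounding modes, approximation error bounds), so circle implementing the real intrinsics would be the proper fix.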
- Yet another reduction sample, compiled with
mycirc -c reduction.cpp -o reduction.circ.o
mycirc -c reduction_kernel.cu -o reduction_kernel.circ.o
circle "$CUDA_PATH/lib64/libcudart.so" reduction.circ.o reduction_kernel.circ.o -o c && ./c --kernel=9
first runs into a bunch of CUDA issues, resolved by the patch I posted here. An alternative to the patch is specifying `-D_CG_USER_PROVIDED_SHARED_MEMORY`. For SM level >= 8.0, this define has to be specified regardless when compiling with circle, as one otherwise runs into the following error:
ptxas /tmp/circle-tmp-dir-gUUrk9/reduction_kernel-compute-80.ptx, line 5883; error : Feature '%reserved_smem_offset_1' requires PTX ISA .version 7.6 or later
- After that, the program compiles, but we get
ptxas warning : Unresolved extern variable 'warpSize' in whole program compilation, ignoring extern qualifier
I'm assuming `warpSize` is seen as either 0 or some junk uninitialized value, because the program then crashes wildly with the message "Kernel execution failed : (700) an illegal memory access was encountered." But adding
#ifdef __circle_lang__
int const warpSize = 32;
#endif
before the includes in `reduction_kernel.cu` again fixes everything. It seems that `warpSize` is another internal symbol that circle needs to define.
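My guess at why a bogus `warpSize` turns into an illegal memory access: CG-style block reductions index per-warp scratch storage by dividing the thread id by the warp size, so a zero or junk value yields wildly out-of-range slots (or a division by zero). A hypothetical host-side sketch of that indexing, with invented names and a fixed block size:

```cpp
// Hypothetical sketch of per-warp scratch indexing, as done in CG-style
// reductions; not the actual header code.
constexpr int kBlockThreads = 256;
constexpr int kScratchSlots = kBlockThreads / 32;   // 8 slots when warpSize == 32

inline bool scratch_index_in_bounds(int tid, int warp_size) {
    if (warp_size <= 0) return false;               // 0 would even divide by zero
    int slot = tid / warp_size;                     // which warp's scratch slot
    return slot < kScratchSlots;
}
```

With `warp_size == 32` every thread of the block lands in bounds; with a junk value like 1 (or 0), the computed slot runs far past the scratch array, which matches the observed error 700.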
- An additional issue pops up when compiling something like `cooperative_groups::this_thread_block()` with CUDA 12.0 instead of 12.2:
error: /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups/details/memory.h:64:14
... included from /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/cuda/12.0/include/cooperative_groups.h:58:10
... included from CudaExample.cu:60:10
%s must be followed by an operand digit
asm ("{\n\t"
^
The full referenced assembler statement is
asm ("{\n\t"
" .reg .u32 %start;\n\t"
" .reg .u64 %extended;\n\t"
" mov.u32 %start, %%reserved_smem_offset_1;\n\t"
" cvt.u64.u32 %extended, %start;\n\t"
" cvta.shared.u64 %0, %extended;\n\t"
"}"
: "=" _CG_ASM_PTR_CONSTRAINT(ptr));
I'm no assembler expert, but I have the impression that circle expects GCC extended asm syntax in `asm` blocks, while CUDA's asm syntax, though not really documented, appears to be less restrictive. Namely, it seems to allow unescaped `%` characters when they're not followed by a digit, or by a single letter and a digit. The expressions `%start` and `%extended` in the referenced `asm` statement are like that and don't conform to the GCC syntax. After changing them in the header to `%%start` and `%%extended`, respectively, the code compiles and executes seemingly correctly with both circle and nvcc.
Rather than relaxing the `%` escaping rules in circle too, it's probably much better to just recommend using a newer version of the CUDA toolkit, such as 12.2, where `%` characters seem to be escaped more consistently.
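For reference, the GCC rule the header violates is: in an extended-asm template, `%0`, `%1`, ... refer to operands, and a literal `%` in the emitted assembly must be written `%%`. A minimal self-contained illustration (x86-64 AT&T syntax assumed, so the register name `%eax` needs the double escape):

```cpp
// GCC extended asm: %%eax emits the literal register name %eax,
// while %0 and %1 are substituted with the allocated operands.
inline int copy_via_eax(int in) {
    int out;
    asm("movl %1, %%eax\n\t"
        "movl %%eax, %0"
        : "=r"(out)        // output operand %0
        : "r"(in)          // input operand %1
        : "eax");          // clobbered register
    return out;
}
```

Writing a bare `%eax` in that template is exactly the kind of thing GCC (and apparently circle) rejects, while nvcc's PTX asm handling lets similar unescaped `%` tokens through.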
If it's useful to anyone, I'm attaching a diff of all the changes I had to make to the 12.2 CUDA headers to be able to use cooperative groups comfortably (for now). It extends the patch from here with circle-specific additions.
Cheers, Mat
Is there any progress on this? I have a C++/CUDA project which uses the g++ and nvcc compilers. I use grid synchronization `this_grid().sync()` heavily in the code. I was thinking about using circle in the project, but then found this thread...