ucc
Compilation of GPU kernel code generates warnings
When compiling ucc with support for GPUs, two different compilers might be used: the compiler used for the host code (e.g. gcc), and the compiler used for the kernel code (e.g. hipcc, nvcc). The two compilers do not necessarily have identical feature sets. At the moment, configure only captures the capabilities and features of the host compiler. This can lead to warnings when compiling the GPU kernel code.
One example is shown below, where hipcc does not recognize the attribute "optimize" that is set via arch/cpu.h:
/usr/bin/hipcc -c ec_rocm_reduce.cu -I/home/egabriel/UCX/include/ -D__HIP_PLATFORM_AMD__ -I/opt/rocm/include/hip -I/opt/rocm/include -I/opt/rocm/include/hsa -I/opt/rocm/include -I/home/egabriel/ucc -I/home/egabriel/ucc -I/home/egabriel/ucc/src -I/home/egabriel/ucc/src -I/home/egabriel/ucc/src/components/ec/rocm -fPIC -o ./.libs/ec_rocm_reduce.o
In file included from ec_rocm_reduce.cu:8:
In file included from /home/egabriel/ucc/src/components/ec/rocm/ec_rocm.h:11:
In file included from /home/egabriel/ucc/src/components/ec/base/ucc_ec_base.h:11:
In file included from /home/egabriel/ucc/src/utils/ucc_component.h:12:
In file included from /home/egabriel/ucc/src/utils/ucc_parser.h:16:
In file included from /home/egabriel/ucc/src/utils/arch/cpu.h:102:
/home/egabriel/ucc/src/utils/arch/x86_64/cpu.h:26:43: warning: unknown attribute 'optimize' ignored [-Wunknown-attributes]
ucc_cpu_model_t ucc_arch_get_cpu_model() UCC_F_NOOPTIMIZE;
^~~~~~~~~~~~~~~~
/home/egabriel/ucc/src/utils/ucc_compiler_def.h:53:41: note: expanded from macro 'UCC_F_NOOPTIMIZE'
#define UCC_F_NOOPTIMIZE __attribute__((optimize("O0")))
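One possible way to avoid this kind of warning is to guard the attribute per compiler instead of relying on the configure result for the host compiler. The following is only a minimal sketch, not the actual UCC fix: the names EXAMPLE_F_NOOPTIMIZE and example_get_cpu_model are hypothetical, and it assumes the compiler provides `__has_attribute` (GCC 5 and later, clang), falling back to an empty macro otherwise.

```c
/* Fallback for compilers that do not provide __has_attribute at all. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* Pick an attribute that the compiler currently in use understands.
 * GCC knows optimize("O0"); clang-based compilers know optnone instead. */
#if __has_attribute(optimize)
#define EXAMPLE_F_NOOPTIMIZE __attribute__((optimize("O0")))
#elif __has_attribute(optnone)
#define EXAMPLE_F_NOOPTIMIZE __attribute__((optnone))
#else
#define EXAMPLE_F_NOOPTIMIZE
#endif

/* Hypothetical stand-in for a declaration such as ucc_arch_get_cpu_model(). */
int example_get_cpu_model(void) EXAMPLE_F_NOOPTIMIZE;

int example_get_cpu_model(void)
{
    return 42; /* placeholder value, for illustration only */
}
```

Because the check happens in the preprocessor of whichever compiler reads the header, the host compiler and the kernel compiler can each select a different branch from the same source file.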
Which version of ROCm? Didn't see this with 5.2.3. I used
$ ../configure --prefix=/cm/shared/apps/ucc/1.2.0 --with-avx --with-sse42 --with-ucx=/cm/shared/apps/ucx/1.14.1 --with-cuda=/cm/shared/apps/cuda11.8/toolkit/11.8.0 --with-nccl --with-profiling --with-rocm=/cm/shared/apps/amd/rocm/5.2.3 --with-rccl
It's a very fundamental problem that was also discussed in the UCC developers meetings. I can't recall a ROCm version where I did not see this warning, but the most recent ones I have built against include 5.4.3, 5.5.1, and 5.6.0.
I see this with AMD ROCM 5.7.1 and ucc 1.2.0 with config
../configure --prefix=/cm/shared/apps/ucc/1.2.0 --with-avx2 --with-sse42 --with-ucx --with-cuda=/cm/shared/apps/cuda12.3/toolkit/12.3.2 --with-nccl --with-profiling --with-valgrind --with-avx --with-rocm=/cm/shared/apps/amd/rocm/5.7.1 --with-mpi=/cm/shared/apps/openmpi4-cuda11.8-ofed5-gcc11/4.1.4 --enable-gtest
and snippet
/cm/shared/apps/amd/rocm/5.7.1/bin/hipcc -c ../../../../../../src/components/ec/rocm/kernel/ec_rocm_executor_kernel.cu -D__HIP_PLATFORM_AMD__ -I/cm/shared/apps/amd/rocm/5.7.1/include/hip -I/cm/shared/apps/amd/rocm/5.7.1/include -I/cm/shared/apps/amd/rocm/5.7.1/include/hsa -I/cm/shared/apps/amd/rocm/5.7.1/include -I/home/torel/workspace/UCC/ucc-1.2.0/Build-x86_64 -I/home/torel/workspace/UCC/ucc-1.2.0 -I/home/torel/workspace/UCC/ucc-1.2.0/src -I/home/torel/workspace/UCC/ucc-1.2.0/Build-x86_64/src -I/home/torel/workspace/UCC/ucc-1.2.0/src/components/ec/rocm -fPIC -o ./.libs/ec_rocm_executor_kernel.o
In file included from /home/torel/workspace/UCC/ucc-1.2.0/src/utils/arch/rocm_def.h:16,
from /home/torel/workspace/UCC/ucc-1.2.0/src/components/ec/rocm/ec_rocm.h:15,
from ../../../../../../src/components/ec/rocm/kernel/ec_rocm_executor_kernel.cu:8:
/cm/shared/apps/amd/rocm/5.7.1/include/hip/hip_runtime_api.h:8486:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
8486 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
| ^~~~~
In file included from /home/torel/workspace/UCC/ucc-1.2.0/src/utils/arch/rocm_def.h:16,
from /home/torel/workspace/UCC/ucc-1.2.0/src/components/ec/rocm/ec_rocm.h:15,
from ../../../../../../src/components/ec/rocm/kernel/ec_rocm_reduce.cu:8:
/cm/shared/apps/amd/rocm/5.7.1/include/hip/hip_runtime_api.h:8486:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
8486 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
| ^~~~~
Is there any way to get around this?
I am confused: you are talking about ROCm 5.7.1, but then you set --with-cuda=... (instead of --with-rocm=...) and --with-nccl (instead of --with-rccl=...). CUDA/NCCL are for NVIDIA GPUs, ROCm/RCCL are for AMD GPUs; you cannot interchange them.
The output that you get is an actual error (not a warning).