Describe the bug
Compiling the CUDA sources of transformer_inference on ROCm fails.
The hipify step completes without error (e.g. layer_norm.cu -> layer_norm.hip);
however, compiling layer_norm.cuda.o and the other related objects produces the following errors:
`Total number of unsupported CUDA function calls: 0
Total number of replaced kernel launches: 22
[1/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS_=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.hip -o layer_norm.cuda.o
FAILED: layer_norm.cuda.o
/opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS_=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.hip -o layer_norm.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
In file included from /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.hip:8:
In file included from /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/conversion_utils_hip.h:9:
In file included from /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/ds_kernel_utils_hip.h:24:
In file included from /opt/rocm-5.4.2/include/hip/hip_cooperative_groups.h:38:
/opt/rocm-5.4.2/include/hip/amd_detail/amd_hip_cooperative_groups.h:527:3: error: static assertion failed due to requirement 'integral_constant<bool, false>::value': Tile size is either not a power of 2 or greater than the wavefront size
static_assert(is_valid_tile_size::value,
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm-5.4.2/include/hip/amd_detail/amd_hip_cooperative_groups.h:563:39: note: in instantiation of template class 'cooperative_groups::thread_block_tile_base<64>' requested here
class thread_block_tile_type : public thread_block_tile_base<tileSize>, public tiled_group {
^
/opt/rocm-5.4.2/include/hip/amd_detail/amd_hip_cooperative_groups.h:624:43: note: in instantiation of template class 'cooperative_groups::thread_block_tile_type<64>' requested here
class thread_block_tile_internal : public thread_block_tile_type<size, ParentCGTy> {
^
/opt/rocm-5.4.2/include/hip/amd_detail/amd_hip_cooperative_groups.h:650:46: note: in instantiation of template class 'cooperative_groups::impl::thread_block_tile_internal<64, void>' requested here
class thread_block_tile<size, void> : public impl::thread_block_tile_internal<size, void> {
^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:346:44: note: in instantiation of template class 'cooperative_groups::thread_block_tile<64, void>' requested here
data[0] = element<Op>(data[0], warp.shfl_xor(data[0], i));
^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:355:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
data[0] = element<Op1>(data[0], warp.shfl_xor(data[0], i));
~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:356:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
data[1] = element<Op2>(data[1], warp.shfl_xor(data[1], i));
~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:365:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
data[0] = element<Op1>(data[0], warp.shfl_xor(data[0], i));
~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:366:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
data[1] = element<Op2>(data[1], warp.shfl_xor(data[1], i));
~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:367:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
data[2] = element<Op3>(data[2], warp.shfl_xor(data[2], i));
~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:376:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
data[0] = element<Op1>(data[0], warp.shfl_xor(data[0], i));
~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:377:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
data[1] = element<Op2>(data[1], warp.shfl_xor(data[1], i));
~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:378:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
data[2] = element<Op3>(data[2], warp.shfl_xor(data[2], i));
~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:379:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
data[3] = element<Op4>(data[3], warp.shfl_xor(data[3], i));
~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:415:18: error: no member named 'meta_group_size' in 'cooperative_groups::thread_block_tile<64, void>'
if (warp_arg.meta_group_size() > 1 && total_warps != 1) {
~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:416:22: error: no member named 'thread_rank' in 'cooperative_groups::thread_block_tile<64, void>'
if (warp_arg.thread_rank() == 0) {
~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:420:54: error: no member named 'meta_group_rank' in 'cooperative_groups::thread_block_tile<64, void>'
reduce_buffer + elems * warp_arg.meta_group_rank() + i, data + i);
~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:427:22: error: no member named 'meta_group_rank' in 'cooperative_groups::thread_block_tile<64, void>'
if (warp_arg.meta_group_rank() == 0) {
~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:428:26: error: no member named 'thread_rank' in 'cooperative_groups::thread_block_tile<64, void>'
if (warp_arg.thread_rank() < warp_arg.meta_group_size()) {
~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:428:51: error: no member named 'meta_group_size' in 'cooperative_groups::thread_block_tile<64, void>'
if (warp_arg.thread_rank() < warp_arg.meta_group_size()) {
~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:432:68: error: no member named 'thread_rank' in 'cooperative_groups::thread_block_tile<64, void>'
data + i, reduce_buffer + elems * warp_arg.thread_rank() + i);
~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:442:82: error: no member named 'thread_rank' in 'cooperative_groups::thread_block_tile<64, void>'
mem_access::store_shared(reduce_buffer + elems * warp_arg.thread_rank() + i,
~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:453:69: error: no member named 'meta_group_rank' in 'cooperative_groups::thread_block_tile<64, void>'
reduce_buffer + warp_arg.meta_group_rank() * elems + i);
~~~~~~~~ ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated when compiling for gfx1030.
[2/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS_=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip -o apply_rotary_pos_emb.cuda.o
FAILED: apply_rotary_pos_emb.cuda.o
/opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS_=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip -o apply_rotary_pos_emb.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip:154:48: error: use of undeclared identifier 'shfl_sync'
auto q_rot_tmp = lane < half_dim ? shfl_sync(mask[lane], q_rot, lane + half_dim)
^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip:155:48: error: use of undeclared identifier 'shfl_sync'
: shfl_sync(mask[lane], q_rot, lane - half_dim);
^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip:156:48: error: use of undeclared identifier 'shfl_sync'
auto k_rot_tmp = lane < half_dim ? shfl_sync(mask[lane], k_rot, lane + half_dim)
^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip:157:48: error: use of undeclared identifier 'shfl_sync'
: shfl_sync(mask[lane], k_rot, lane - half_dim);
^
4 errors generated when compiling for gfx1030.
[3/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS=1 -D__HIP_NO_HALF_CONVERSIONS=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS -U__HIP_NO_HALF_CONVERSIONS -U__HIP_NO_HALF2_OPERATORS -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.hip -o transform.cuda.o
FAILED: transform.cuda.o
/opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS=1 -D__HIP_NO_HALF_CONVERSIONS=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS -U__HIP_NO_HALF_CONVERSIONS -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.hip -o transform.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.hip:745:6: error: function template partial specialization is not allowed
void launch_transform4d_0213<T>(T* out,
^ ~~~
1 error generated when compiling for gfx1030.
[4/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS_=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.hip -o relu.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
[5/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS_=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/dequantize.hip -o dequantize.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
[6/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS_=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.hip -o gelu.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
[7/9] c++ -MMD -MF pt_binding_hip.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding_hip.cpp -o pt_binding_hip.o -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1
FAILED: pt_binding_hip.o
c++ -MMD -MF pt_binding_hip.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding_hip.cpp -o pt_binding_hip.o -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1
In file included from /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding_hip.cpp:11:
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes/inference_context_hip.h:14:10: fatal error: rocblas/rocblas.h: No such file or directory
14 | #include "rocblas/rocblas.h"
| ^~~~~~~~~~~~~~~~~~~
compilation terminated.
[8/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS=1 -D__HIP_NO_HALF_CONVERSIONS_=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.hip -o softmax.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
ninja: build stopped: subcommand failed.
`
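For orientation, the first error in the log comes from a compile-time check in amd_hip_cooperative_groups.h. Going by the assertion message, a tile size is accepted only if it is a power of 2 and no larger than the wavefront size. The host-only C++ sketch below mirrors that condition. `is_valid_tile_size` here is a hypothetical stand-in written for illustration, not the ROCm trait itself, and the wavefront sizes (32 for gfx1030, 64 for the other targets on the command line) are assumptions rather than values taken from the log.

```cpp
// Hypothetical stand-in for the condition named in the assertion message:
// a tile size is valid when it is a power of 2 and does not exceed the
// wavefront size of the target being compiled.
constexpr bool is_valid_tile_size(unsigned size, unsigned wavefront_size) {
    return size != 0 && (size & (size - 1)) == 0 && size <= wavefront_size;
}

// Assumed wavefront sizes (not taken from the log).
constexpr unsigned kAssumedWave64 = 64;  // e.g. gfx900, gfx906, gfx908, gfx90a
constexpr unsigned kAssumedWave32 = 32;  // e.g. gfx1030

// A 64-wide tile passes on a 64-wide wavefront...
static_assert(is_valid_tile_size(64, kAssumedWave64),
              "64 is a power of 2 and fits a 64-wide wavefront");

// ...but fails on a 32-wide wavefront, which would match the failure being
// reported only for gfx1030 in the log above.
static_assert(!is_valid_tile_size(64, kAssumedWave32),
              "64 exceeds a 32-wide wavefront");
```

If this reading is right, the `thread_block_tile<64>` instantiated from reduction_utils_hip.h is rejected only for the gfx1030 target, and the later "no member named" errors on the same type may be follow-on errors from that failed instantiation.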
To Reproduce
Steps to reproduce the behavior:
- Any demo script that uses DeepSpeed with a Hugging Face config and sets replace_with_kernel_inject=True triggers this.
- Environment: ROCm 5.4.2, transformers 4.28.1
Expected behavior
I expected it to build and run as it does in the project's CI on MI100.
ds_report output
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn is not compatible with ROCM
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/proj/rdi/staff/ethany/anaconda3/envs/deepspeed_amd_new/lib/python3.7/site-packages/torch']
torch version .................... 1.13.1+rocm5.2
deepspeed install path ........... ['/proj/rdi/staff/ethany/anaconda3/envs/deepspeed_amd_new/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.9.2+unknown, unknown, unknown
torch cuda version ............... None
torch hip version ................ 5.2.21151-afdc89f8
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 1.13, hip 5.2
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU: 1 machine 8x Mi100 GPU
- deepspeed 0.9.2 and master
- transformers=4.28.1
- Python: 3.7
Docker context
None
Additional context
None
I am seeing and trying to debug the same errors. There are two types of errors above.
The first one is about an invalid tile size:
In file included from csrc/quantization/dequantize.hip:8:
In file included from csrc/includes/dequantization_utils_hip.h:7:
In file included from csrc/includes/conversion_utils_hip.h:9:
In file included from csrc/includes/ds_kernel_utils_hip.h:24:
In file included from /opt/rocm-5.4.2/hip/include/hip/hip_cooperative_groups.h:26:
In file included from /opt/rocm-5.4.2/hip/include/hip/../../../include/hip/hip_cooperative_groups.h:38:
In file included from /opt/rocm-5.4.2/hip/include/hip/amd_detail/amd_hip_cooperative_groups.h:26:
/opt/rocm-5.4.2/hip/include/hip/amd_detail/../../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:527:3: error: static assertion failed due to requirement 'integral_constant<bool, false>::value': Tile size is either not a power of 2 or greater than the wavefront size
static_assert(is_valid_tile_size<size>::value,
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
That appears to be triggered from this check:
https://github.com/ROCm-Developer-Tools/hipamd/blob/474e8620099a463ad2ced821ae7400609b29bf7f/include/hip/amd_detail/hip_cooperative_groups_helper.h#L53-L73
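To make the assertion message concrete, here is a minimal Python sketch of the condition it describes: a tile size is accepted only if it is a power of 2 and no larger than the wavefront size. This is a paraphrase of the error text, not the actual HIP header code. The function names and the example tile sizes are hypothetical, and the wavefront sizes (64 for gfx90a, 32 for gfx1030) are assumptions.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of 2."""
    return n > 0 and (n & (n - 1)) == 0


def is_valid_tile_size(size: int, wavefront_size: int) -> bool:
    """Mirror the wording of the static assertion: the tile size must be
    a power of 2 and must not exceed the wavefront size."""
    return is_power_of_two(size) and size <= wavefront_size


# Assumed wavefront sizes, for illustration only.
assumed_wavefront = {"gfx90a": 64, "gfx1030": 32}

for arch, wavefront in assumed_wavefront.items():
    for tile in (32, 64):  # hypothetical tile sizes
        ok = is_valid_tile_size(tile, wavefront)
        print(f"{arch}: tile size {tile} valid = {ok}")
```

Under these assumptions, a kernel that requests a 64-wide tile passes the check for a wavefront size of 64 but fails it for a wavefront size of 32, which would match the behavior seen when gfx1030 is in the target list.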
As in my case, the compile line above shows that hipcc is building for a set of gfx architectures:
--amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030
The gfx1030 entry seems to be the cause of this first error, apparently because it uses a wavefront size of 32 instead of 64. If you do not have that GPU, one workaround is to disable that target. For example, I have gfx90a, and I can create a build for just gfx90a with something like the following:
export PYTORCH_ROCM_ARCH=gfx90a
DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 python3 setup.py bdist_wheel
That probably requires one to build the DeepSpeed ops upfront.
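If you need to keep several targets, the same idea can be sketched as a small Python helper that drops any target assumed to use a wavefront size of 32 and prints the resulting export line. The wavefront table is an assumption based on the discussion above, not something queried from ROCm, and the helper name is hypothetical. Joining multiple targets with semicolons is also an assumption about how PYTORCH_ROCM_ARCH is parsed.

```python
import shlex

# Assumed wavefront sizes per target; only the gfx1030 value is
# suggested by the error discussed in this thread.
ASSUMED_WAVEFRONT_SIZE = {
    "gfx900": 64,
    "gfx906": 64,
    "gfx908": 64,
    "gfx90a": 64,
    "gfx1030": 32,
}


def wave64_targets(targets):
    """Keep only the targets assumed to use a wavefront size of 64."""
    return [t for t in targets if ASSUMED_WAVEFRONT_SIZE.get(t) == 64]


# Target list taken from the failing compile line.
targets = ["gfx900", "gfx906", "gfx908", "gfx90a", "gfx1030"]
arch_value = ";".join(wave64_targets(targets))

# Quote the value so the semicolons survive when pasted into a shell.
print(f"export PYTORCH_ROCM_ARCH={shlex.quote(arch_value)}")
```

Unknown targets are dropped rather than kept, so a typo in the list fails safe instead of reintroducing a target that trips the assertion.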
I don't have a workaround yet for the second type of error regarding the missing member functions.
@adammoody
Thanks. I wonder how they passed the CI test for this on MI100.
I'm not sure. Perhaps this is using a wheel with pre-built ops. I did find that I could build a wheel for AMD by disabling a number of unsupported ops.
DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 DS_BUILD_QUANTIZER=0 DS_BUILD_RANDOM_LTD=0 DS_BUILD_TRANSFORMER_INFERENCE=0 python3 setup.py bdist_wheel
AMD is actively working to implement support for those.
https://github.com/microsoft/DeepSpeed/issues/3091#issuecomment-1572701712
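For anyone scripting that build, here is a rough Python sketch that assembles the same environment variables from a list of ops to disable. The DS_BUILD_* names are the ones used in the command above. The helper name is hypothetical, and the mapping from op name to DS_BUILD_<OP> is an assumption that simply follows the pattern of those flags.

```python
import os


def build_env(disabled_ops, base_env=None):
    """Return an environment dict with DS_BUILD_OPS=1 and one
    DS_BUILD_<OP>=0 entry per op that should be skipped."""
    env = dict(os.environ if base_env is None else base_env)
    env["DS_BUILD_OPS"] = "1"
    for op in disabled_ops:
        # Assumed naming pattern: op "sparse_attn" -> DS_BUILD_SPARSE_ATTN
        env[f"DS_BUILD_{op.upper()}"] = "0"
    return env


# Ops disabled in the command above.
disabled = ["sparse_attn", "quantizer", "random_ltd", "transformer_inference"]
env = build_env(disabled, base_env={})

for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

To run the build itself, the returned dict (built from the real environment rather than an empty one) could be passed as `env=` to `subprocess.run([sys.executable, "setup.py", "bdist_wheel"], check=True)` from the DeepSpeed source directory.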
Hi @adammoody and @yuchen2580 - not all DeepSpeed ops currently build on ROCm. Although the MI200 CI test is broken, it does show some coverage. The ops that AMD is working to add support for will not be complete until ROCm 6, which is due in a few months.
For now, since this isn't a DS issue, I'm closing this. But if you do have more ROCm/DeepSpeed issues or questions, please either re-open this or open a new issue and link this one. We do want to get better ROCm support enabled and tested.