DeepSpeed [BUG] ROCm for transformer-inference model hipcc compile error

Open yuchen2580 opened this issue 1 year ago • 3 comments

Describe the bug

Errors are triggered when compiling the CUDA files of transformer_inference on ROCm. The HIP conversion step runs without error (e.g. layer_norm.cu -> layer_norm.hip); however, when compiling layer_norm.cuda.o and the other related object files, the following errors appear:

`Total number of unsupported CUDA function calls: 0

Total number of replaced kernel launches: 22
[1/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.hip -o layer_norm.cuda.o
FAILED: layer_norm.cuda.o
/opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.hip -o layer_norm.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
In file included from /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.hip:8:
In file included from /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/conversion_utils_hip.h:9:
In file included from /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/ds_kernel_utils_hip.h:24:
In file included from /opt/rocm-5.4.2/include/hip/hip_cooperative_groups.h:38:
/opt/rocm-5.4.2/include/hip/amd_detail/amd_hip_cooperative_groups.h:527:3: error: static assertion failed due to requirement 'integral_constant<bool, false>::value': Tile size is either not a power of 2 or greater than the wavefront size
    static_assert(is_valid_tile_size<size>::value,
    ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/rocm-5.4.2/include/hip/amd_detail/amd_hip_cooperative_groups.h:563:39: note: in instantiation of template class 'cooperative_groups::thread_block_tile_base<64>' requested here
    class thread_block_tile_type : public thread_block_tile_base<tileSize>, public tiled_group {
    ^
/opt/rocm-5.4.2/include/hip/amd_detail/amd_hip_cooperative_groups.h:624:43: note: in instantiation of template class 'cooperative_groups::thread_block_tile_type<64>' requested here
    class thread_block_tile_internal : public thread_block_tile_type<size, ParentCGTy> {
    ^
/opt/rocm-5.4.2/include/hip/amd_detail/amd_hip_cooperative_groups.h:650:46: note: in instantiation of template class 'cooperative_groups::impl::thread_block_tile_internal<64, void>' requested here
    class thread_block_tile<size, void> : public impl::thread_block_tile_internal<size, void> {
    ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:346:44: note: in instantiation of template class 'cooperative_groups::thread_block_tile<64, void>' requested here
    data[0] = element<Op>(data[0], warp.shfl_xor(data[0], i));
    ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:355:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
    data[0] = element<Op1>(data[0], warp.shfl_xor(data[0], i));
    ~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:356:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
    data[1] = element<Op2>(data[1], warp.shfl_xor(data[1], i));
    ~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:365:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
    data[0] = element<Op1>(data[0], warp.shfl_xor(data[0], i));
    ~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:366:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
    data[1] = element<Op2>(data[1], warp.shfl_xor(data[1], i));
    ~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:367:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
    data[2] = element<Op3>(data[2], warp.shfl_xor(data[2], i));
    ~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:376:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
    data[0] = element<Op1>(data[0], warp.shfl_xor(data[0], i));
    ~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:377:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
    data[1] = element<Op2>(data[1], warp.shfl_xor(data[1], i));
    ~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:378:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
    data[2] = element<Op3>(data[2], warp.shfl_xor(data[2], i));
    ~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:379:46: error: no member named 'shfl_xor' in 'cooperative_groups::thread_block_tile<64, void>'
    data[3] = element<Op4>(data[3], warp.shfl_xor(data[3], i));
    ~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:415:18: error: no member named 'meta_group_size' in 'cooperative_groups::thread_block_tile<64, void>'
    if (warp_arg.meta_group_size() > 1 && total_warps != 1) {
    ~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:416:22: error: no member named 'thread_rank' in 'cooperative_groups::thread_block_tile<64, void>'
    if (warp_arg.thread_rank() == 0) {
    ~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:420:54: error: no member named 'meta_group_rank' in 'cooperative_groups::thread_block_tile<64, void>'
    reduce_buffer + elems * warp_arg.meta_group_rank() + i, data + i);
    ~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:427:22: error: no member named 'meta_group_rank' in 'cooperative_groups::thread_block_tile<64, void>'
    if (warp_arg.meta_group_rank() == 0) {
    ~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:428:26: error: no member named 'thread_rank' in 'cooperative_groups::thread_block_tile<64, void>'
    if (warp_arg.thread_rank() < warp_arg.meta_group_size()) {
    ~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:428:51: error: no member named 'meta_group_size' in 'cooperative_groups::thread_block_tile<64, void>'
    if (warp_arg.thread_rank() < warp_arg.meta_group_size()) {
    ~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:432:68: error: no member named 'thread_rank' in 'cooperative_groups::thread_block_tile<64, void>'
    data + i, reduce_buffer + elems * warp_arg.thread_rank() + i);
    ~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:442:82: error: no member named 'thread_rank' in 'cooperative_groups::thread_block_tile<64, void>'
    mem_access::store_shared(reduce_buffer + elems * warp_arg.thread_rank() + i,
    ~~~~~~~~ ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes/reduction_utils_hip.h:453:69: error: no member named 'meta_group_rank' in 'cooperative_groups::thread_block_tile<64, void>'
    reduce_buffer + warp_arg.meta_group_rank() * elems + i);
    ~~~~~~~~ ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated when compiling for gfx1030.
[2/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip -o apply_rotary_pos_emb.cuda.o
FAILED: apply_rotary_pos_emb.cuda.o
/opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip -o apply_rotary_pos_emb.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip:154:48: error: use of undeclared identifier 'shfl_sync'
    auto q_rot_tmp = lane < half_dim ? shfl_sync(mask[lane], q_rot, lane + half_dim)
    ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip:155:48: error: use of undeclared identifier 'shfl_sync'
    : shfl_sync(mask[lane], q_rot, lane - half_dim);
    ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip:156:48: error: use of undeclared identifier 'shfl_sync'
    auto k_rot_tmp = lane < half_dim ? shfl_sync(mask[lane], k_rot, lane + half_dim)
    ^
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.hip:157:48: error: use of undeclared identifier 'shfl_sync'
    : shfl_sync(mask[lane], k_rot, lane - half_dim);
    ^
4 errors generated when compiling for gfx1030.
[3/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.hip -o transform.cuda.o
FAILED: transform.cuda.o
/opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.hip -o transform.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.hip:745:6: error: function template partial specialization is not allowed
    void launch_transform4d_0213<T>(T* out,
    ^ ~~~
1 error generated when compiling for gfx1030.
[4/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.hip -o relu.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
[5/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/dequantize.hip -o dequantize.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
[6/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.hip -o gelu.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
[7/9] c++ -MMD -MF pt_binding_hip.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding_hip.cpp -o pt_binding_hip.o -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1
FAILED: pt_binding_hip.o
c++ -MMD -MF pt_binding_hip.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding_hip.cpp -o pt_binding_hip.o -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1
In file included from /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding_hip.cpp:11:
/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes/inference_context_hip.h:14:10: fatal error: rocblas/rocblas.h: No such file or directory
   14 | #include "rocblas/rocblas.h"
      |          ^~~~~~~~~~~~~~~~~~~
compilation terminated.
[8/9] /opt/rocm-5.4.2/bin/hipcc -DWITH_HIP -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/includes -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/TH -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THC -isystem /proj/deepspeed_env/lib/python3.7/site-packages/torch/include/THH -isystem /opt/rocm-5.4.2/include -isystem /opt/rocm-5.4.2/miopen/include -isystem /opt/rocm-5.4.2/hip/include -isystem /proj/deepspeed_env/include/python3.7m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -fPIC -D__HIP_PLATFORM_HCC=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++14 -U__HIP_NO_HALF_OPERATORS__ -U__HIP_NO_HALF_CONVERSIONS__ -U__HIP_NO_HALF2_OPERATORS__ -DROCM_VERSION_MAJOR=5 -DROCM_VERSION_MINOR=4 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030 -fno-gpu-rdc -c /proj/deepspeed_env/lib/python3.7/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.hip -o softmax.cuda.o
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
ninja: build stopped: subcommand failed.`

To Reproduce

Steps to reproduce the behavior:

Any demo code that uses DeepSpeed with a Hugging Face config can trigger this (with replace_with_kernel_inject=True).
Environment: ROCm 5.4.2, transformers 4.28.1

Expected behavior

I expected it to run as it does in the project's CI on MI100.

ds_report output

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn is not compatible with ROCM
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/proj/rdi/staff/ethany/anaconda3/envs/deepspeed_amd_new/lib/python3.7/site-packages/torch']
torch version .................... 1.13.1+rocm5.2
deepspeed install path ........... ['/proj/rdi/staff/ethany/anaconda3/envs/deepspeed_amd_new/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.9.2+unknown, unknown, unknown
torch cuda version ............... None
torch hip version ................ 5.2.21151-afdc89f8
nvcc version ..................... None
deepspeed wheel compiled w. ...... torch 1.13, hip 5.2

Screenshots If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

OS: Ubuntu 20.04
GPU: 1 machine 8x Mi100 GPU
deepspeed 0.9.2 and master
transformers=4.28.1
Python: 3.7

Docker context None

Additional context None

May 16 '23 06:05 yuchen2580

I am seeing and trying to debug the same errors. There are two types of errors above.

The first one is about an invalid tile size:

In file included from csrc/quantization/dequantize.hip:8:
In file included from csrc/includes/dequantization_utils_hip.h:7:
In file included from csrc/includes/conversion_utils_hip.h:9:
In file included from csrc/includes/ds_kernel_utils_hip.h:24:
In file included from /opt/rocm-5.4.2/hip/include/hip/hip_cooperative_groups.h:26:
In file included from /opt/rocm-5.4.2/hip/include/hip/../../../include/hip/hip_cooperative_groups.h:38:
In file included from /opt/rocm-5.4.2/hip/include/hip/amd_detail/amd_hip_cooperative_groups.h:26:
/opt/rocm-5.4.2/hip/include/hip/amd_detail/../../../../include/hip/amd_detail/amd_hip_cooperative_groups.h:527:3: error: static assertion failed due to requirement 'integral_constant<bool, false>::value': Tile size is either not a power of 2 or greater than the wavefront size
  static_assert(is_valid_tile_size<size>::value,
  ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

That appears to be triggered from this check: https://github.com/ROCm-Developer-Tools/hipamd/blob/474e8620099a463ad2ced821ae7400609b29bf7f/include/hip/amd_detail/hip_cooperative_groups_helper.h#L53-L73
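The condition named in the assertion message can be sketched in a few lines of Python. This is only an illustration of the rule as the error text states it (power of 2, and no larger than the wavefront size), not the HIP header's actual code. The function names and the tile size of 64 used in the example are hypothetical.

```python
def is_power_of_two(n: int) -> bool:
    # A positive integer is a power of two when exactly one bit is set.
    return n > 0 and (n & (n - 1)) == 0


def is_valid_tile_size(size: int, wavefront_size: int) -> bool:
    # Mirrors the wording of the static assertion: the tile size must be
    # a power of 2 and must not exceed the wavefront size.
    return is_power_of_two(size) and size <= wavefront_size


# Hypothetical tile size of 64, checked against the two wavefront sizes
# discussed in this thread (64 and 32).
for wavefront in (64, 32):
    print(wavefront, is_valid_tile_size(64, wavefront))
```

Under this rule, a tile size that is accepted for a 64-wide wavefront can be rejected when the same source is compiled for a 32-wide target.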

As in my case, the compile line above shows that hipcc is building for a set of gfx architectures:

--amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 --amdgpu-target=gfx90a --amdgpu-target=gfx1030

The gfx1030 entry seems to be the cause of this first error, apparently because it uses a wavefront size of 32 instead of 64. If one does not have that GPU, one workaround is to disable that target. For example, I have gfx90a, and I can create a build for just gfx90a with something like the following:

export PYTORCH_ROCM_ARCH=gfx90a
DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 python3 setup.py bdist_wheel

That probably requires building the DeepSpeed ops up front.
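For a machine with several of the listed GPUs, the same idea can be sketched as a small Python helper that reads the targets from the failing hipcc command and drops the one to exclude. The helper names are hypothetical, and the semicolon-separated format for PYTORCH_ROCM_ARCH is an assumption that should be checked against the installed PyTorch version.

```python
import re

# Target flags copied from the failing hipcc command line in this issue.
HIPCC_FLAGS = (
    "--amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 "
    "--amdgpu-target=gfx90a --amdgpu-target=gfx1030"
)


def targets_from_flags(flags: str) -> list:
    # Collect every architecture named by an --amdgpu-target flag, in order.
    return re.findall(r"--amdgpu-target=(\S+)", flags)


def rocm_arch_value(targets, excluded) -> str:
    # Assumption: PYTORCH_ROCM_ARCH takes a semicolon-separated list.
    return ";".join(t for t in targets if t not in excluded)


targets = targets_from_flags(HIPCC_FLAGS)
arch = rocm_arch_value(targets, excluded={"gfx1030"})
print(arch)  # gfx900;gfx906;gfx908;gfx90a
```

The printed value could then be exported as PYTORCH_ROCM_ARCH before running the build command above.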

I don't have a workaround yet for the second type of error regarding the missing member functions.

Jun 01 '23 05:06 adammoody

@adammoody Thanks. I wonder how they passed the CI test for this on MI100.

Jun 08 '23 05:06 yuchen2580

I'm not sure. Perhaps this is using a wheel with pre-built ops. I did find that I could build a wheel for AMD by disabling a number of unsupported ops.

DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 DS_BUILD_QUANTIZER=0 DS_BUILD_RANDOM_LTD=0 DS_BUILD_TRANSFORMER_INFERENCE=0 python3 setup.py bdist_wheel

AMD is actively working to implement support for those.

https://github.com/microsoft/DeepSpeed/issues/3091#issuecomment-1572701712
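The long list of DS_BUILD_* variables in the command above can also be generated from a list of op names. This is a minimal sketch: the helper name build_env is hypothetical, and it assumes each flag follows the DS_BUILD_<OP NAME IN UPPERCASE> pattern seen in that command.

```python
import os


def build_env(disabled_ops, base_env=None):
    # Start from a copy of the given environment (default: the current one).
    env = dict(os.environ if base_env is None else base_env)
    # Ask setup.py to pre-build ops, then switch off the unsupported ones.
    env["DS_BUILD_OPS"] = "1"
    for op in disabled_ops:
        env["DS_BUILD_" + op.upper()] = "0"
    return env


env = build_env(
    ["sparse_attn", "quantizer", "random_ltd", "transformer_inference"],
    base_env={},
)
for name in sorted(env):
    print(name + "=" + env[name])
```

The resulting dictionary could be passed as the env argument of subprocess.run(["python3", "setup.py", "bdist_wheel"], env=...), in which case base_env should be left at its default so that PATH and similar variables are kept.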

Jun 08 '23 20:06 adammoody

Hi @adammoody and @yuchen2580 - not all DeepSpeed ops currently build on ROCm. Although the MI200 CI test is broken, it does show some coverage. For the ops that AMD is working to add support for, that work will not be complete until ROCm 6, which is due in a few months.

For now, since this isn't a DS issue, I am closing this. But if you do have more ROCm/DeepSpeed issues or questions, please either re-open this or open a new issue and link to this one. We do want to get better ROCm support enabled and tested.

Aug 14 '23 19:08 loadams
