Memory access fault error on ROCm 1.8.151 for MXNet HIP Port
Background: Porting the MXNet deep learning framework to the ROCm platform
Issue:
While executing some applications on the HIP/ROCm platform (ROCm 1.8.151), we are facing the issue mentioned below.
The issue is also reproducible on ROCm 1.9.
Error: Memory access fault by GPU node-x(1 or 2)(Agent handle: 0x236ceb0) on address 0x1000. Reason: Page not present or supervisor privilege.
Environment info:
- Operating System: Ubuntu 16.04
- ROCm version:
  $ dpkg -s rocm-dkms
  Package: rocm-dkms
  Status: install ok installed
  Priority: optional
  Section: devel
  Installed-Size: 13
  Maintainer: Advanced Micro Devices Inc.
  Architecture: amd64
  Version: 1.8.151
  Depends: rocm-dev, rock-dkms
  Description: Radeon Open Compute (ROCm) Runtime software stack
  Homepage: https://github.com/RadeonOpenCompute/ROCm
- HIP environment details:
  $ hipconfig --full
  HIP version: 1.5.18151
  == hipconfig
  HIP_PATH: /opt/rocm
  HIP_PLATFORM: hcc
  CPP_CONFIG: -D__HIP_PLATFORM_HCC__= -I/opt/rocm/include -I/opt/rocm/hcc/include
  == hcc
  HSA_PATH: /opt/rocm/hsa
  HCC_HOME: /opt/rocm/hcc
  HCC clang version 7.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 86791fc4961dc8ffde77bde20d7dfa5e5cbeff5e) (ssh://gerritgit/compute/ec/hcc-tot/llvm 0ccef158132e1222d549edf2da33d4bc0be6c2d1) (based on HCC 1.2.18184-74f5fa9-86791fc-0ccef15)
  Target: x86_64-unknown-linux-gnu
  Thread model: posix
  InstalledDir: /opt/rocm/hcc/bin
  LLVM (http://llvm.org/):
    LLVM version 7.0.0svn
    Optimized build.
    Default target: x86_64-unknown-linux-gnu
    Host CPU: broadwell
    Registered Targets:
      amdgcn - AMD GCN GPUs
      r600   - AMD GPUs HD2XXX-HD6XXX
      x86    - 32-bit X86: Pentium-Pro and above
      x86-64 - 64-bit X86: EM64T and AMD64
  HCC-cxxflags: -hc -std=c++amp -I/opt/rocm/hcc/include -I/opt/rocm/include
  HCC-ldflags: -hc -std=c++amp -L/opt/rocm/hcc/lib -Wl,--rpath=/opt/rocm/hcc/lib -ldl -lm -lpthread -lhc_am -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive
Analysis:
- Attaching the full log (log_cnn.txt) captured with the following environment variables: HCC_SERIALIZE_KERNEL=3 HIP_TRACE_API=2 HCC_DB=0x48A
- Issue:
hipLaunchKernel 'Cijk_Alik_Bljk_SB_MT128x128x08_BL0_BS1_GRVW01_GSU01_ISA000_K1_KLS_LPB00_NLCA01_NLCB01_PBC0_PGR1_PLR1_TT08_08_USFGRO00_VW01_WG16_16_01_WGM08' gridDim:{3072,1,1} groupDim:{256,1,1} sharedMem:+0 stream:0.0 @1126352127710240
hcc-cmd tid:1 pushing #0.0.309 completion_signal=0 commandKind=hcCommandKernel <unknown_kernel>
hcc-aql tid:1 dispatch_aql #0.0.309(hwq=0x762b000) kernargs=84 header=b02(type=2,barrier=1,acquire=1,release=1) setup=3 grid=[3072.1.1] group=[256.1.1] private_seg_size=0 group_seg_size=16384 kernel_object=0xa022aa900 kernarg_address=0x0x18adc00 completion_signal=0x473c000
hcc-kernarg tid:1 kernarg_address: 0x0x18adc00, total of 84 bytes:
0x00000000x18adc00: 0x12e00000 0x0000000a 0x12f9c000 0x0000000a
0x00000000x18adc10: 0x16200000 0x0000000a 0x3f800000 0x00000000
0x00000000x18adc20: 0x00000000 0x00000000 0x00000000 0x000005dc
0x00000000x18adc30: 0x000249f0 0x00000034 0x000130b0 0x00000034
0x00000000x18adc40: 0x00001450 0x000005dc 0x00000064 0x00000001
0x00000000x18adc50: 0x00000034
Memory access fault by GPU node-1 (Agent handle: 0x1109b10) on address 0x7000. Reason: Page not present or supervisor privilege.
Tried a couple of workarounds: exporting HSA_ENABLE_SDMA=0, and also calling hipDeviceSynchronize after each kernel launch. But the problem is still seen.
Steps to reproduce the issue:
- $ git clone --recursive -b downstream_27Mar2018 --single-branch https://github.com/ROCmSoftwarePlatform/mxnet.git
- $ cd mxnet
- $ export HIP_PLATFORM=hcc
- $ make -jn (where n = number of cores)
- $ cd python
- $ sudo python setup.py install
- $ cd ../mxnet/example/cnn_text_classification
- $ python text_cnn.py --gpus n (where n=number of gpus)
By following the steps above, I'm getting compilation errors about struct __half and sobol32_state, shown below.
Also, it's using g++ to compile; is that expected?
scchan@97e6e7dfd750:~/code/mxnet$ make VERBOSE=1
Makefile:270: WARNING: Significant performance increases can be achieved by installing and enabling gperftools or jemalloc development packages
Running CUDA_ARCH: --amdgpu-target=gfx801 --amdgpu-target=gfx802 --amdgpu-target=gfx803 --amdgpu-target=gfx900
g++ -std=c++11 -c -D__HIP_PLATFORM_HCC__= -I/opt/rocm/hip/include -I/opt/rocm/hcc/include -DMSHADOW_FORCE_STREAM -Wall -Wsign-compare -O3 -DNDEBUG=1 -I. -I./3rdparty/Thrust -I/opt/rocm/hipblas/include -I/opt/rocm/hiprand/include -I/opt/rocm/hcfft/include -I/opt/rocm/rocblas/include -I/opt/rocm/rocrand/include -I/home/scchan/code/mxnet/3rdparty/mshadow/ -I/home/scchan/code/mxnet/3rdparty/dmlc-core/include -fPIC -I/home/scchan/code/mxnet/3rdparty/nnvm/include -I/home/scchan/code/mxnet/3rdparty/dlpack/include -I/home/scchan/code/mxnet/3rdparty/nnvm/tvm/include -Iinclude -funroll-loops -Wno-unused-parameter -Wno-unknown-pragmas -Wno-unused-local-typedefs -msse3 -DMSHADOW_USE_CBLAS=1 -DMSHADOW_USE_MKL=0 -DMSHADOW_RABIT_PS=0 -DMSHADOW_DIST_PS=0 -DMSHADOW_USE_PASCAL=0 -DMXNET_USE_OPENCV=1 -I/usr/include/opencv -fopenmp -DMXNET_USE_OPERATOR_TUNING=1 -DMXNET_USE_LAPACK -I/home/scchan/code/mxnet/3rdparty/cub-hip -DMXNET_USE_NCCL=0 -DMXNET_USE_LIBJPEG_TURBO=0 -MMD -c src/operator/nn/mkldnn/mkldnn_convolution.cc -o build/src/operator/nn/mkldnn/mkldnn_convolution.o
In file included from /home/scchan/code/mxnet/3rdparty/mshadow/mshadow/./base.h:29:0,
                 from /home/scchan/code/mxnet/3rdparty/mshadow/mshadow/tensor.h:16,
                 from include/mxnet/./base.h:32,
                 from include/mxnet/io.h:34,
                 from src/operator/nn/mkldnn/../convolution-inl.h:30,
                 from src/operator/nn/mkldnn/mkldnn_convolution.cc:26:
./hip-wrappers.h:17:2: error: conflicting declaration 'typedef struct __half __half'
 }__half;
  ^
In file included from /opt/rocm/hip/include/hip/hcc_detail/hip_fp16.h:1635:0,
                 from /opt/rocm/hip/include/hip/hip_fp16.h:29,
                 from ./hip-wrappers.h:11,
                 from /home/scchan/code/mxnet/3rdparty/mshadow/mshadow/./base.h:29,
                 from /home/scchan/code/mxnet/3rdparty/mshadow/mshadow/tensor.h:16,
                 from include/mxnet/./base.h:32,
                 from include/mxnet/io.h:34,
                 from src/operator/nn/mkldnn/../convolution-inl.h:30,
                 from src/operator/nn/mkldnn/mkldnn_convolution.cc:26:
/opt/rocm/hip/include/hip/hcc_detail/hip_fp16_gcc.h:23:12: note: previous declaration as 'struct __half'
 struct __half {
        ^
In file included from /opt/rocm/rocrand/include/rocrand_kernel.h:35:0,
                 from /opt/rocm/hiprand/include/hiprand_kernel_hcc.h:37,
                 from /opt/rocm/hiprand/include/hiprand_kernel.h:58,
                 from include/mxnet/./../../src/common/random_generator.h:33,
                 from include/mxnet/./resource.h:31,
                 from include/mxnet/operator.h:39,
                 from src/operator/nn/mkldnn/../convolution-inl.h:33,
                 from src/operator/nn/mkldnn/mkldnn_convolution.cc:26:
/opt/rocm/rocrand/include/rocrand_xorwow.h: In member function 'void rocrand_device::xorwow_engine::jump(long long unsigned int, const unsigned int (*)[800])':
/opt/rocm/rocrand/include/rocrand_xorwow.h:220:31: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (int i = 0; i < (v & ((1 << XORWOW_JUMP_LOG2) - 1)); i++)
                               ^
In file included from /opt/rocm/rocrand/include/rocrand_kernel.h:36:0,
                 from /opt/rocm/hiprand/include/hiprand_kernel_hcc.h:37,
                 from /opt/rocm/hiprand/include/hiprand_kernel.h:58,
                 from include/mxnet/./../../src/common/random_generator.h:33,
                 from include/mxnet/./resource.h:31,
                 from include/mxnet/operator.h:39,
                 from src/operator/nn/mkldnn/../convolution-inl.h:33,
                 from src/operator/nn/mkldnn/mkldnn_convolution.cc:26:
/opt/rocm/rocrand/include/rocrand_sobol32.h: At global scope:
/opt/rocm/rocrand/include/rocrand_sobol32.h:81:45: error: declaration of 'typedef struct rocrand_device::sobol32_state<UseSharedVectors> rocrand_device::sobol32_engine<UseSharedVectors>::sobol32_state' [-fpermissive]
     typedef sobol32_state<UseSharedVectors> sobol32_state;
                                             ^
/opt/rocm/rocrand/include/rocrand_sobol32.h:37:8: error: changes meaning of 'sobol32_state' from 'struct rocrand_device::sobol32_state<UseSharedVectors>' [-fpermissive]
 struct sobol32_state
        ^
In file included from include/mxnet/./../../src/common/random_generator.h:34:0,
                 from include/mxnet/./resource.h:31,
                 from include/mxnet/operator.h:39,
                 from src/operator/nn/mkldnn/../convolution-inl.h:33,
                 from src/operator/nn/mkldnn/mkldnn_convolution.cc:26:
include/mxnet/./../../src/common/../common/cuda_utils.h: In function 'const char* mxnet::common::cuda::HiprandGetErrorString(hiprandStatus_t)':
include/mxnet/./../../src/common/../common/cuda_utils.h:126:10: warning: enumeration value 'HIPRAND_STATUS_DOUBLE_PRECISION_REQUIRED' not handled in switch [-Wswitch]
   switch (status) {
   ^
include/mxnet/./../../src/common/../common/cuda_utils.h:126:10: warning: enumeration value 'HIPRAND_STATUS_NOT_IMPLEMENTED' not handled in switch [-Wswitch]
Makefile:436: recipe for target 'build/src/operator/nn/mkldnn/mkldnn_convolution.o' failed
make: *** [build/src/operator/nn/mkldnn/mkldnn_convolution.o] Error 1
1) Issue: error: changes meaning of 'sobol32_state' from 'struct rocrand_device::sobol32_state<UseSharedVectors>' [-fpermissive]
Fix: As per https://github.com/ROCmSoftwarePlatform/rocRAND/pull/23, please update line 81 of /opt/rocm/rocrand/include/rocrand_sobol32.h to: typedef struct sobol32_state<UseSharedVectors> sobol32_state;
2) Issue: error: conflicting declaration 'typedef struct __half __half'
Fix: The __half definition was added to hip-wrappers.h as a workaround for a HIP issue; the same __half definition is now part of the HIP release. Please comment out the workaround at hip-wrappers.h, lines 14 to 18.
I wasn't able to reproduce the memory error you reported. I'm using one GPU; how many GPUs are you using? The error message you are getting suggests that memory accesses in that kernel are causing a page fault. Could you verify that the memory buffers are valid and that no out-of-bounds accesses occur?
On my system, it hangs in this kernel:
<<hip-api tid:2.889 2.889 hipLaunchKernel '_ZN7mshadow4cuda13MapPlanKernelINS_2sv6savetoELi8ENS_4expr4PlanINS4_8SliceExpINS_6TensorINS_3gpuELi3EfEES8_fLi3ELi2EEEfEENS5_IS9_fEEEEvT1_jNS_5ShapeILi2EEET2_' gridDim:{5120,1,1} groupDim:{256,1,1} sharedMem:+0 stream:0.1 @1911798253184
We are using a single GPU to run the application. The GPU is an AMD FirePro S9300 x2. GPU details: GPU[0]: GPU ID: 0x7300
Please find attached the full log (cnn_logt.txt), captured by exporting HIP_TRACE_API=7.
As shown below, the MapPlanKernel launches successfully on the FirePro GPU:
<<hip-api tid:2.7 2.7 hipLaunchKernel '_ZN7mshadow4cuda13MapPlanKernelINS_2sv6savetoELi8ENS_4expr4PlanINS_6TensorINS_3gpuELi2EfEEfEENS5_INS4_9ScalarExpIfEEfEEEEvT1_jNS_5ShapeILi2EEET2_' gridDim:{2816,1,1} groupDim:{256,1,1} sharedMem:+0 stream:0.1 @160706913620101
hip-api tid:2.7 hipModuleLaunchKernel ret= 0 (hipSuccess)>> +121962 ns
I was able to reproduce the memory access error in void mxnet::op::col2im_gpu_kernel<float> on a Fiji GPU. As I mentioned earlier, this type of error usually suggests an out-of-bounds access. There are two buffers accessed in this kernel (data_col and data_im). It's unclear to me what the buffer sizes are. Could you check that all the accesses are valid and that the indices don't go out of bounds?
@scchan Thanks for the inputs. I tried debugging the col2im_gpu_kernel parameters as per your inputs using gdb and rocm-gdb, but could not get the symbol data.
With gdb: unable to read some of the symbols for validation.
With rocm-gdb: unable to debug, as it is not hitting breakpoints.
Following is the gdb log; need your inputs to proceed further (validating the symbols/data): gdb-log.txt
We don't generate debug information in the current version of hcc, and I don't think a debugger would help you much here anyway. For simplicity, can you pass in an atomic counter to count how many out-of-bound accesses the kernel may perform?
@scchan
We have noticed that the col2im_gpu_kernel generates the error on every second invocation of the kernel.
The col2im_gpu_kernel definition is as follows.
template <typename DType>
__global__ void col2im_gpu_kernel(const int n, const DType* data_col,
                                  const int channels, const int height, const int width,
                                  const int kernel_h, const int kernel_w,
                                  const int pad_h, const int pad_w,
                                  const int stride_h, const int stride_w,
                                  const int dilation_h, const int dilation_w,
                                  const int height_col, const int width_col,
                                  DType* data_im, OpReqType req) {
  CUDA_KERNEL_LOOP(index, n) {
    DType val = 0;
    const int w_im = index % width + pad_w;
    const int h_im = (index / width) % height + pad_h;
    const int c_im = index / (width * height);
    int kernel_extent_w = (kernel_w - 1) * dilation_w + 1;
    int kernel_extent_h = (kernel_h - 1) * dilation_h + 1;
    // compute the start and end of the output
    const int w_col_start = (w_im < kernel_extent_w) ? 0 : (w_im - kernel_extent_w) / stride_w + 1;
    const int w_col_end = min(w_im / stride_w + 1, width_col);
    const int h_col_start = (h_im < kernel_extent_h) ? 0 : (h_im - kernel_extent_h) / stride_h + 1;
    const int h_col_end = min(h_im / stride_h + 1, height_col);
    // TODO(caffe): use LCM of stride and dilation to avoid unnecessary loops
    for (int h_col = h_col_start; h_col < h_col_end; h_col += 1) {
      for (int w_col = w_col_start; w_col < w_col_end; w_col += 1) {
        int h_k = (h_im - h_col * stride_h);
        int w_k = (w_im - w_col * stride_w);
        if (h_k % dilation_h == 0 && w_k % dilation_w == 0) {
          h_k /= dilation_h;
          w_k /= dilation_w;
          int data_col_index = (((c_im * kernel_h + h_k) * kernel_w + w_k) * height_col + h_col) * width_col + w_col;
          val += data_col[data_col_index];
        }
      }
    }
    KERNEL_ASSIGN(data_im[index], req, val);
  }
}
We understand that for some trivial cases we can use hipMemcpy(P1, P2, nSize, hipMemcpyDeviceToHost) and dump the results, but in the above complex scenario, is there any other approach to identify a tricky issue like an out-of-bounds memory access?