
"Memory access fault by GPU node-1" error in Conv3d.

ghost opened this issue 5 years ago · 8 comments

🐛 Bug

Got "Memory access fault by GPU node-1" when training my model, now I can reproduce the problem in a very simple script. the env is ROCM 2.9.6, Radeon VII, I compiled pytorch from the most recent source on master branch. details as follow.

To Reproduce

import torch
import torch.nn as nn

t = torch.rand(2, 32, 64, 128, 160).to('cuda')
t2 = nn.Conv3d(32, 16, kernel_size=3, stride=1, padding=1, bias=False).to('cuda')(t)  # error occurs
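
For reference, here is a rough sketch (my own arithmetic, not an official explanation) of the GEMM this convolution implies: the HIP trace below shows a vol2col kernel followed by a Tensile GEMM kernel, so the Conv3d appears to go through the unfold-plus-GEMM fallback path, and these are the dimensions that path would produce for the tensor above.

# Sketch only: derive the per-batch-element GEMM shape implied by the repro script.
N, C_in, D, H, W = 2, 32, 64, 128, 160       # input shape from the script above
C_out, kD, kH, kW = 16, 3, 3, 3              # Conv3d(32, 16, kernel_size=3)
stride, pad = 1, 1

# kernel_size=3 with stride=1 and padding=1 preserves the spatial size
D_out = (D + 2 * pad - kD) // stride + 1     # 64
H_out = (H + 2 * pad - kH) // stride + 1     # 128
W_out = (W + 2 * pad - kW) // stride + 1     # 160

k = C_in * kD * kH * kW                      # 864 (GEMM inner dimension)
m = D_out * H_out * W_out                    # 1310720 (columns produced by vol2col)
n = C_out                                    # 16
print(f"batch size N={N}; GEMM per batch element: m={m}, n={n}, k={k}")

These numbers happen to match the m=1310720, n=16, k=864 reported by the rocBLAS logging quoted later in this thread.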

Python 3.7.5 (default, Oct 25 2019, 15:51:11)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torch.nn as nn
>>> t=torch.rand(2,32,64,128,160).to('cuda')
HIP_DB=0x1 [api]
hip-api pid:9748 tid:1:HIP initialized short_tid#1 (maps to full_tid: 0x7fba8044f740)
>>> t2=nn.Conv3d(32, 16, kernel_size=3, stride=1, padding=1, bias=False).to('cuda')(t)
<<hip-api pid:9748 tid:1.63 9748 1.63 hipLaunchKernel '_ZN12_GLOBAL__N_110hip_fill_nILj256EPjmjEEvT0_T1_T2_' gridDim:{163840,1,1} groupDim:{256,1,1} sharedMem:+0 stream:0.0 @5334006293209
<<hip-api pid:9748 tid:1.69 9748 1.69 hipLaunchKernel '_ZN2at6native14vol2col_kernelIfEEviPKT_iiiiiiiiiiiiiiiiiiPS2_' gridDim:{40960,1,1} groupDim:{1024,1,1} sharedMem:+0 stream:0.0 @5340563243577
<<hip-api pid:9748 tid:1.409 9748 1.409 hipLaunchKernel 'Cijk_Ailk_Bljk_SB_MT128x64x8_SE_APM1_AF0EM1_AF1EM1_AMAS3_ASEM1_BL1_DTL0_EPS1_FL1_GRVW4_GSU1_ISA906_IU1_K1_KLA_LPA0_LPB0_LDL1_MGWVW1_NLCA1_NLCB1_PK0_PGR1_PLR1_RK0_SU32_SNLL0_TT8_4_USFGRO0_VAW1_VW4_WG16_16_1_WGM8' gridDim:{10240,1,1} groupDim:{256,1,1} sharedMem:+0 stream:0.0 @5340572207622
Memory access fault by GPU node-1 (Agent handle: 0x55e2fa08a6f0) on address 0x7fb968e02000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Environment

ROCm version: 2.9.6

PyTorch version: 1.4.0a0+21ab112
Is debug build: No
CUDA used to build PyTorch: Could not collect

OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.12.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.3
[pip] torch==1.4.0a0+21ab112
[pip] torchvision==0.2.0
[conda] mkl 2019.4 243
[conda] mkl-include 2019.4 243

ghost avatar Nov 02 '19 13:11 ghost

This looks like an issue with MIOpen. Transferring over.

iotamudelta avatar Nov 04 '19 01:11 iotamudelta

@daniellowell can reproduce the issue. Logging shows it is this rocBLAS call:

# MIOPEN_ENABLE_LOGGING=1 MIOPEN_LOG_LEVEL=7 MIOPEN_ENABLE_LOGGING_CMD=1 ROCBLAS_LAYER=2 python3.6 breakme.py 
./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 1310720 -n 16 -k 864 --alpha 1 --lda 1310720 --ldb 864 --beta 1 --ldc 1310720
./rocblas-bench -f gemm -r f32_r --transposeA N --transposeB N -m 1310720 -n 16 -k 864 --alpha 1 --lda 1310720 --ldb 864 --beta 1 --ldc 1310720
Memory access fault by GPU node-2 (Agent handle: 0x4464cb0) on address 0x7f12f3701000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
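
For context, a quick back-of-the-envelope on the operand sizes this GEMM implies (my own arithmetic, assuming the standard column-major BLAS convention, under which a non-transposed A operand spans lda x k elements):

# Sizing the f32 GEMM from the rocblas-bench line above; context only, not a diagnosis.
m, n, k = 1310720, 16, 864
lda, ldb, ldc = 1310720, 864, 1310720
elem = 4  # bytes per f32_r element

sizes = {
    "A (presumably the vol2col columns)":  lda * k * elem,   # 4320 MiB (~4.2 GiB)
    "B (presumably the reshaped weights)": ldb * n * elem,   # ~0.05 MiB
    "C (output)":                          ldc * n * elem,   # 80 MiB
}
for name, nbytes in sizes.items():
    print(f"{name}: {nbytes / 2**20:.2f} MiB")

Each call therefore reads a roughly 4.2 GiB A operand, which should still fit comfortably in the Radeon VII's 16 GB of HBM2, so the fault looks more like an out-of-bounds access than simple memory exhaustion.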

iotamudelta avatar Nov 04 '19 16:11 iotamudelta

@amcamd Can you test the two configs above? @singvision is seeing a segfault. It points to rocBLAS, but it could be the way MIOpen is configuring the parameters.

daniellowell avatar Nov 04 '19 17:11 daniellowell

/cc @bragadeesh

dagamayank avatar Nov 04 '19 19:11 dagamayank

Any progress on this issue? @daniellowell @amcamd

ghost avatar Nov 11 '19 12:11 ghost

The problem still exists on ROCm 2.10.

ghost avatar Nov 28 '19 07:11 ghost

Is anyone following up? I encountered this error too. Is it a bug?

sugar-mouse avatar Jun 12 '20 08:06 sugar-mouse

Yes, I sent my Radeon VII back to the seller and switched to an RTX 2070 because of this problem.

dodatko avatar Jun 14 '20 09:06 dodatko