[BUG][GFX1030] Random Memory access faults on gfx1030.

Open shurale-nkn opened this issue 2 years ago • 9 comments

[Keywords]: test; gfx1030;

[Description]: Random memory access faults on gfx1030. Five different PRs each failed at a random stage, but always on gfx1030.

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/test-int8-mlir-nonxdlops/2/pipeline

log info
Full Tests I / Fp16 Hip All gfx1030

NODE_NAME = ixt-sjc2-16

27/103 Test  #24: test_gru ..............................................   Passed   20.59 sec

[2022-06-26T20:30:09.726Z]         Start  26: test_handle_test

[2022-06-26T22:23:59.557Z]  28/103 Test  #12: test_conv2d ...........................................***Failed  8388.27 sec

[2022-06-26T22:23:59.557Z] Memory access fault by GPU node-1 (Agent handle: 0xebf2f0) on address 0x7fb9990c8000. Reason: Page not present or supervisor privilege.

[2022-06-26T22:23:59.557Z] CMake Error at test_test_conv2d.cmake:7 (message):

[2022-06-26T22:23:59.557Z]   Test failed

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/int8-perf-config-tuning/16/pipeline

log info
Full Tests I / Fp16 Hip All gfx1030

NODE_NAME = rocm-framework-19.amd.com

3/106 Test  #14: test_conv3d ............................................   Passed  238.85 sec

[2022-06-25T06:40:04.951Z]         Start  45: test_soft_max

[2022-06-25T06:41:12.731Z]   4/106 Test  #12: test_conv2d ............................................***Failed  307.92 sec

[2022-06-25T06:41:12.731Z] Memory access fault by GPU node-2 (Agent handle: 0x1227680) on address 0x7f1e5756a000. Reason: Page not present or supervisor privilege.

[2022-06-25T06:41:12.731Z] CMake Error at test_test_conv2d.cmake:7 (message):

[2022-06-25T06:41:12.731Z]   Test failed

[2022-06-25T06:41:12.731Z]

[2022-06-25T06:41:12.731Z]

[2022-06-25T06:41:12.731Z]

[2022-06-25T06:41:12.731Z]         Start  69: test_conv_for_implicit_gemm

[2022-06-25T06:42:49.280Z]   5/106 Test  #28: test_immed_conv3d ......................................   Passed  401.59 sec

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/int8-perf-config-tuning/17/pipeline/625

log info
Full Tests II / Fp32 OpenCL All gfx1030

NODE_NAME = ixt-sjc2-16

 
34/107 Test  #35: test_mdgraph ...........................................   Passed    0.45 sec

[2022-06-26T10:50:29.529Z]         Start  36: test_na_inference

[2022-06-26T10:50:31.821Z]  35/107 Test  #36: test_na_inference ......................................***Failed    1.99 sec

[2022-06-26T10:50:31.821Z] Memory access fault by GPU node-1 (Agent handle: 0x55e307726530) on address 0x7f3e4319e000. Reason: Page not present or supervisor privilege.

[2022-06-26T10:50:31.821Z] CMake Error at test_test_na_inference.cmake:7 (message):

[2022-06-26T10:50:31.821Z]   Test failed

[2022-06-26T10:50:31.821Z]

[2022-06-26T10:50:31.821Z]

[2022-06-26T10:50:31.821Z]

[2022-06-26T10:50:31.821Z]         Start  37: test_na_train

[2022-06-26T10:52:28.950Z]  36/107 Test  #37: test_na_train ..........................................   Passed  110.72 sec

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/jd%2Fck_integration/64/pipeline/255

log info
Full Tests I / Fp16 Hip All gfx1030

NODE_NAME = rocm-framework-19.amd.com


[2022-06-27T16:32:36.646Z]  61/107 Test  #97: test_conv_igemm_dynamic_dlops_nchwc_chwnc_fwd_fp16x4 ...   Passed  110.25 sec

[2022-06-27T16:32:36.646Z]         Start  99: test_conv_igemm_dynamic_dlops_nchwc_chwnc_fwd_fp16x8

[2022-06-27T16:32:44.922Z]  62/107 Test  #99: test_conv_igemm_dynamic_dlops_nchwc_chwnc_fwd_fp16x8 ...***Failed   12.83 sec


[2022-06-27T16:32:44.922Z] /home/jenkins/workspace/MLLibs_MIOpen_jd_ck_integration/build/bin/test_conv2d --half --cmode convfp16 --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 32 160 73 73 --weights 160 1 1 64 --batch_size 32 --input_channels 160 --output_channels 64 --spatial_dim_elements 73 73 --filter_dims 1 1 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout CHWN --out_layout NCHW --output_type int32 --int8_vectorize 0 --vector_length 8 --tensor_vect 1

[2022-06-27T16:32:44.922Z] error: 0

[2022-06-27T16:32:44.922Z] Max diff: 0

[2022-06-27T16:32:44.922Z] Forward convolution: ConvAsmImplicitGemmGTCDynamicFwdDlopsNCHWC

[2022-06-27T16:32:44.922Z] Input tensor: 32, 20, 73, 73

[2022-06-27T16:32:44.922Z] Weights tensor: 20, 1, 1, 64

[2022-06-27T16:32:44.922Z] Output tensor: 32, 8, 73, 73

[2022-06-27T16:32:44.922Z] Filter: conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},

[2022-06-27T16:32:44.922Z] Memory access fault by GPU node-2 (Agent handle: 0x1d63c00) on address 0x7fb315c7a000. Reason: Page not present or supervisor privilege.

[2022-06-27T16:32:44.922Z] Aborted (core dumped)

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/jd%2Fck_integration/67/pipeline/310

log info
Full Tests II / Fp32 OpenCL All gfx1030

NODE_NAME = ixt-sjc2-16

57/104 Test #72: test_rnn_extra ........................................***Failed 27.72 sec

….

[2022-06-29T18:08:10.070Z] ../bin/test_rnn_vanilla --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-dhy --use-dropout 0 --in-mode 0 --bias-mode 1 --dir-mode 0 --rnn-mode 0 --batch-seq 32 32 32

[2022-06-29T18:08:10.070Z] error: 2.61185e-09

[2022-06-29T18:08:10.070Z] Max diff: 2.98023e-07

[2022-06-29T18:08:10.070Z] Mismatch at 3: 0.0993099 != 0.0993099

[2022-06-29T18:08:10.070Z] ./bin/MIOpenDriver rnn -n 32,32,32 -m relu -k 3 -H 128 -W 128 -l 1 -F 0 -r 0 -b 1 -p 0 -U 0

[2022-06-29T18:08:10.070Z] Backward Weights RNN vanilla:

[2022-06-29T18:08:10.070Z] Memory access fault by GPU node-1 (Agent handle: 0x559742012550) on address 0x7f7768be4000. Reason: Page not present or supervisor privilege.

[2022-06-29T18:08:10.070Z] Aborted (core dumped)

[2022-06-29T18:08:10.070Z] test/CMakeFiles/test_rnn_extra.dir/build.make:57: recipe for target 'test/CMakeFiles/test_rnn_extra' failed

[2022-06-29T18:08:10.070Z] make[7]: *** [test/CMakeFiles/test_rnn_extra] Error 134

[2022-06-29T18:08:10.070Z] CMakeFiles/Makefile2:12913: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/all' failed

[2022-06-29T18:08:10.070Z] make[6]: *** [test/CMakeFiles/test_rnn_extra.dir/all] Error 2

[2022-06-29T18:08:10.070Z] CMakeFiles/Makefile2:12920: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed

[2022-06-29T18:08:10.071Z] make[5]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2

[2022-06-29T18:08:10.071Z] Makefile:2309: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed

[2022-06-29T18:08:10.071Z] make[4]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2

[2022-06-29T18:08:10.071Z]

[2022-06-29T18:08:10.071Z]         Start  73: test_gru_extra

[2022-06-29T18:09:03.794Z]  58/104 Test  #73: test_gru_extra ........................................   Passed   50.99 sec

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/dfeng_int8_quantization_api/2/pipeline/1554

log info
NODE_NAME = rocm-framework-19.amd.com
 

[2022-06-28T20:36:58.595Z]  64/106 Test #101: test_conv_ck_igemm_fwd_v6r1_dlops_nchw .................***Failed   28.99 sec

[2022-06-28T20:36:58.595Z] [  2%] Built target sqlite_memvfs

[2022-06-28T20:36:58.595Z] [  2%] Built target addkernels

[2022-06-28T20:36:58.595Z] [100%] Built target MIOpen

[2022-06-28T20:36:58.595Z] [100%] Built target test_conv2d

[2022-06-28T20:36:58.595Z] Scanning dependencies of target test_conv_ck_igemm_fwd_v6r1_dlops_nchw

[2022-06-28T20:36:58.595Z] /home/jenkins/workspace/Open_dfeng_int8_quantization_api/build/bin/test_conv2d --half --cmode conv --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 128 1024 14 14 --weights 2048 1024 1 1 --batch_size 128 --input_channels 1024 --output_channels 2048 --spatial_dim_elements 14 14 --filter_dims 1 1 --pads_strides_dilations 0 0 2 2 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW --tensor_vect 0 --vector_length 1

[2022-06-28T20:36:58.595Z] Memory access fault by GPU node-2 (Agent handle: 0x80e5a0) on address 0x7f8f694e8000. Reason: Page not present or supervisor privilege.

[2022-06-28T20:36:58.595Z] Aborted (core dumped)

[2022-06-28T20:36:58.595Z] test/CMakeFiles/test_conv_ck_igemm_fwd_v6r1_dlops_nchw.dir/build.make:57: recipe for target 'test/CMakeFiles/test_conv_ck_igemm_fwd_v6r1_dlops_nchw' failed

[2022-06-28T20:36:58.595Z] make[7]: *** [test/CMakeFiles/test_conv_ck_igemm_fwd_v6r1_dlops_nchw] Error 134

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/fix_1549/7/pipeline/1640

log info
NODE_NAME = ixt-sjc2-22

[2022-06-28T10:19:22.347Z] 58/107 Test #72: test_gru_extra .........................................***Failed 13.69 sec
….
[2022-06-28T10:19:22.348Z] ../bin/test_gru --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-hx --no-dhy --use-dropout 0 --in-mode 0 --bias-mode 0 --dir-mode 0 --batch-seq 32 32 32 
[2022-06-28T10:19:22.349Z] error: 4.26209e-08
[2022-06-28T10:19:22.349Z] Max diff: 2.98023e-08
[2022-06-28T10:19:22.349Z] Mismatch at 1: -0.0144987 != -0.0144987
[2022-06-28T10:19:22.349Z] ./bin/MIOpenDriver rnn -n 32,32,32 -m gru -k 3 -H 128 -W 128 -l 1 -F 0 -r 0 -b 0 -p 0
[2022-06-28T10:19:22.349Z] inputMode: 0 biasMode: 0 dirMode: 0
[2022-06-28T10:19:22.349Z] hz: 128 batch_n: 96 seqLength: 3 inputLen: 128 numLayers: 1
[2022-06-28T10:19:22.349Z] Forward Inference GRU: 
[2022-06-28T10:19:22.349Z] Output tensor output failed verification.
[2022-06-28T10:19:22.349Z] Memory access fault by GPU node-1 (Agent handle: 0x55a84333b040) on address 0x7f13174ea000. Reason: Page not present or supervisor privilege.
[2022-06-28T10:19:22.349Z] Aborted (core dumped)

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/fix_1549/9/pipeline/625

log info
NODE_NAME = ixt-sjc2-22

[2022-06-29T06:55:33.965Z] 57/107 Test #71: test_rnn_extra .........................................***Failed 73.52 sec

../bin/test_rnn_vanilla --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-dhx --use-dropout 0 --in-mode 0 --bias-mode 1 --dir-mode 0 --rnn-mode 1 --batch-seq 32 32 32 
[2022-06-29T06:55:33.972Z] error: 4.23637e-09
[2022-06-29T06:55:33.972Z] Max diff: 8.34465e-07
[2022-06-29T06:55:33.972Z] Mismatch at 4: 0.404517 != 0.404518
[2022-06-29T06:55:33.972Z] ./bin/MIOpenDriver rnn -n 32,32,32 -m tanh -k 3 -H 128 -W 128 -l 1 -F 0 -r 0 -b 1 -p 0 -U 0
[2022-06-29T06:55:33.972Z] Backward Weights RNN vanilla: 
[2022-06-29T06:55:33.972Z] Memory access fault by GPU node-1 (Agent handle: 0x560ccf50e640) on address 0x7fb3233b0000. Reason: Page not present or supervisor privilege.
[2022-06-29T06:55:33.972Z] Aborted (core dumped)
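Every fault line in the logs above has the same format, so the failures can be tallied mechanically. The following is a minimal sketch, not part of the MIOpen CI: the regular expression is derived only from the lines quoted in this issue, and the helper names `parse_faults` and `is_page_aligned` are hypothetical. The 4 KiB page size is an assumption.

```python
import re
from collections import Counter

# Pattern derived from the fault lines quoted in this issue (an assumption:
# other ROCm versions may word the message differently).
FAULT_RE = re.compile(
    r"Memory access fault by GPU (?P<node>node-\d+) "
    r"\(Agent handle: (?P<handle>0x[0-9a-fA-F]+)\) "
    r"on address (?P<address>0x[0-9a-fA-F]+)\. "
    r"Reason: (?P<reason>.+?)\s*$"
)


def parse_faults(log_text):
    """Return one dict per memory access fault line found in log_text."""
    faults = []
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            faults.append(match.groupdict())
    return faults


def is_page_aligned(address, page_size=4096):
    """Check whether a hex address string is a multiple of page_size (assumed 4 KiB)."""
    return int(address, 16) % page_size == 0


# Two fault lines and one unrelated line, copied from the logs above.
SAMPLE_LOG = """\
[2022-06-26T22:23:59.557Z] Memory access fault by GPU node-1 (Agent handle: 0xebf2f0) on address 0x7fb9990c8000. Reason: Page not present or supervisor privilege.
[2022-06-26T22:23:59.557Z]   Test failed
[2022-06-25T06:41:12.731Z] Memory access fault by GPU node-2 (Agent handle: 0x1227680) on address 0x7f1e5756a000. Reason: Page not present or supervisor privilege.
"""

faults = parse_faults(SAMPLE_LOG)
per_node = Counter(fault["node"] for fault in faults)
for node, count in sorted(per_node.items()):
    print(node, count)
for fault in faults:
    print(fault["address"], "page-aligned:", is_page_aligned(fault["address"]))
```

Feeding the full console log of each failed stage to `parse_faults` would give a per-node and per-address summary that can be compared across the PRs listed here.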

shurale-nkn • Jun 30 '22 14:06

@atamazov FYI

shurale-nkn • Jun 30 '22 14:06

I'll try to look into this.

atamazov • Jul 01 '22 19:07

http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/develop/688/pipeline/1641

log info
Start  71: test_rnn_extra
[2022-07-05T02:22:03.245Z]  57/107 Test  #71: test_rnn_extra .........................................***Failed    1.04 sec
[2022-07-05T02:22:03.245Z] [  2%] Built target sqlite_memvfs
[2022-07-05T02:22:03.245Z] [  2%] Built target addkernels
[2022-07-05T02:22:03.245Z] [ 97%] Built target MIOpen
[2022-07-05T02:22:03.245Z] [100%] Built target test_rnn_vanilla
[2022-07-05T02:22:03.245Z] Scanning dependencies of target test_rnn_extra
[2022-07-05T02:22:03.245Z] MIOpen(HIP): Info [get_device_name] Raw device name: gfx1030
[2022-07-05T02:22:03.245Z] MIOpen(HIP): Info [Handle] stream: 0, device_id: 0
[2022-07-05T02:22:03.245Z] ../bin/test_rnn_vanilla --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-hx --use-dropout 0 --in-mode 0 --bias-mode 0 --dir-mode 0 --rnn-mode 0 --batch-seq 32 32 32
[2022-07-05T02:22:03.245Z] Memory access fault by GPU node-1 (Agent handle: 0x227a460) on address 0x7f5059ff2000. Reason: Page not present or supervisor privilege.
[2022-07-05T02:22:03.245Z] Aborted (core dumped)
[2022-07-05T02:22:03.245Z] test/CMakeFiles/test_rnn_extra.dir/build.make:57: recipe for target 'test/CMakeFiles/test_rnn_extra' failed
[2022-07-05T02:22:03.245Z] make[7]: *** [test/CMakeFiles/test_rnn_extra] Error 134
[2022-07-05T02:22:03.245Z] CMakeFiles/Makefile2:12926: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/all' failed
[2022-07-05T02:22:03.245Z] make[6]: *** [test/CMakeFiles/test_rnn_extra.dir/all] Error 2
[2022-07-05T02:22:03.245Z] CMakeFiles/Makefile2:12933: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed
[2022-07-05T02:22:03.245Z] make[5]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2
[2022-07-05T02:22:03.245Z] Makefile:2234: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed
[2022-07-05T02:22:03.245Z] make[4]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2
[2022-07-05T02:22:03.245Z]
[2022-07-05T02:22:03.245Z]         Start  72: test_gru_extra
[2022-07-05T02:24:31.186Z]  58/107 Test  #72: test_gru_extra .........................................   Passed  138.00 sec
[2022-07-05T02:24:31.187Z]         Start  73: test_lstm_extra

shurale-nkn • Jul 05 '22 18:07

It looks like the issue has been solved, so I am closing it. Please feel free to reopen it if it is not resolved yet.

aska-0096 • Aug 04 '22 02:08

Not fixed!

shurale-nkn • Aug 04 '22 11:08

> Not fixed!

Sorry about that. I have also pinned it back.

aska-0096 • Aug 04 '22 14:08

So is there a way to solve this problem?

tangerdream • Dec 26 '22 03:12

@shurale-nkn Is it possible to reliably reproduce the issue?
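One way to estimate how reliably the fault reproduces is to run a failing command many times and count the nonzero exits. The sketch below is an illustration under assumptions, not an existing MIOpen tool: `count_failures` is a hypothetical helper, the command is copied from the develop/688 log above, and the relative path `../bin/test_rnn_vanilla` assumes the script is started from the same build subdirectory that CI used.

```python
import os
import subprocess


def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited with a nonzero status.

    A process killed by a signal (such as the abort seen in the logs) is
    reported by subprocess with a negative return code, so it counts as a failure.
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Command copied from the develop/688 log; the path is an assumption about
# the current working directory.
REPRO_CMD = [
    "../bin/test_rnn_vanilla", "--float",
    "--batch-size", "32", "--seq-len", "3", "--vector-len", "128",
    "--hidden-size", "128", "--num-layers", "1", "--no-hx",
    "--use-dropout", "0", "--in-mode", "0", "--bias-mode", "0",
    "--dir-mode", "0", "--rnn-mode", "0", "--batch-seq", "32", "32", "32",
]

if __name__ == "__main__":
    runs = 50  # arbitrary illustrative value
    if os.path.exists(REPRO_CMD[0]):
        failed = count_failures(REPRO_CMD, runs)
        print(f"{failed} of {runs} runs failed")
    else:
        print(f"{REPRO_CMD[0]} not found; start this from the MIOpen build tree")
```

Since the failures in this issue are spread over several tests and nodes, the same loop can be pointed at any of the other command lines quoted in the logs by replacing `REPRO_CMD`.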

atamazov • Dec 26 '22 22:12

@shurale-nkn Is this fixed with the latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!

ppanchad-amd • Apr 16 '24 16:04