MIOpen
[BUG][GFX1030] Random Memory access faults on gfx1030.
[Keywords]: test; gfx1030;
[Description]: Random memory access faults on gfx1030. 5 different PRs failed at random test stages, but always on gfx1030 nodes.
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/test-int8-mlir-nonxdlops/2/pipeline
log info
Full Tests I / Fp16 Hip All gfx1030
NODE_NAME = ixt-sjc2-16
27/103 Test #24: test_gru .............................................. Passed 20.59 sec
[2022-06-26T20:30:09.726Z] Start 26: test_handle_test
[2022-06-26T22:23:59.557Z] 28/103 Test #12: test_conv2d ...........................................***Failed 8388.27 sec
[2022-06-26T22:23:59.557Z] Memory access fault by GPU node-1 (Agent handle: 0xebf2f0) on address 0x7fb9990c8000. Reason: Page not present or supervisor privilege.
[2022-06-26T22:23:59.557Z] CMake Error at test_test_conv2d.cmake:7 (message):
[2022-06-26T22:23:59.557Z] Test failed
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/int8-perf-config-tuning/16/pipeline
log info
Full Tests I / Fp16 Hip All gfx1030
NODE_NAME = rocm-framework-19.amd.com
3/106 Test #14: test_conv3d ............................................ Passed 238.85 sec
[2022-06-25T06:40:04.951Z] Start 45: test_soft_max
[2022-06-25T06:41:12.731Z] 4/106 Test #12: test_conv2d ............................................***Failed 307.92 sec
[2022-06-25T06:41:12.731Z] Memory access fault by GPU node-2 (Agent handle: 0x1227680) on address 0x7f1e5756a000. Reason: Page not present or supervisor privilege.
[2022-06-25T06:41:12.731Z] CMake Error at test_test_conv2d.cmake:7 (message):
[2022-06-25T06:41:12.731Z] Test failed
[2022-06-25T06:41:12.731Z]
[2022-06-25T06:41:12.731Z]
[2022-06-25T06:41:12.731Z]
[2022-06-25T06:41:12.731Z] Start 69: test_conv_for_implicit_gemm
[2022-06-25T06:42:49.280Z] 5/106 Test #28: test_immed_conv3d ...................................... Passed 401.59 sec
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/int8-perf-config-tuning/17/pipeline/625
log info
Full Tests II / Fp32 OpenCL All gfx1030
NODE_NAME = ixt-sjc2-16
34/107 Test #35: test_mdgraph ........................................... Passed 0.45 sec
[2022-06-26T10:50:29.529Z] Start 36: test_na_inference
[2022-06-26T10:50:31.821Z] 35/107 Test #36: test_na_inference ......................................***Failed 1.99 sec
[2022-06-26T10:50:31.821Z] Memory access fault by GPU node-1 (Agent handle: 0x55e307726530) on address 0x7f3e4319e000. Reason: Page not present or supervisor privilege.
[2022-06-26T10:50:31.821Z] CMake Error at test_test_na_inference.cmake:7 (message):
[2022-06-26T10:50:31.821Z] Test failed
[2022-06-26T10:50:31.821Z]
[2022-06-26T10:50:31.821Z]
[2022-06-26T10:50:31.821Z]
[2022-06-26T10:50:31.821Z] Start 37: test_na_train
[2022-06-26T10:52:28.950Z] 36/107 Test #37: test_na_train .......................................... Passed 110.72 sec
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/jd%2Fck_integration/64/pipeline/255
log info
Full Tests I / Fp16 Hip All gfx1030
NODE_NAME = rocm-framework-19.amd.com
[2022-06-27T16:32:36.646Z] 61/107 Test #97: test_conv_igemm_dynamic_dlops_nchwc_chwnc_fwd_fp16x4 ... Passed 110.25 sec
[2022-06-27T16:32:36.646Z] Start 99: test_conv_igemm_dynamic_dlops_nchwc_chwnc_fwd_fp16x8
[2022-06-27T16:32:44.922Z] 62/107 Test #99: test_conv_igemm_dynamic_dlops_nchwc_chwnc_fwd_fp16x8 ...***Failed 12.83 sec
[2022-06-27T16:32:44.922Z] /home/jenkins/workspace/MLLibs_MIOpen_jd_ck_integration/build/bin/test_conv2d --half --cmode convfp16 --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 32 160 73 73 --weights 160 1 1 64 --batch_size 32 --input_channels 160 --output_channels 64 --spatial_dim_elements 73 73 --filter_dims 1 1 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout CHWN --out_layout NCHW --output_type int32 --int8_vectorize 0 --vector_length 8 --tensor_vect 1
[2022-06-27T16:32:44.922Z] error: 0
[2022-06-27T16:32:44.922Z] Max diff: 0
[2022-06-27T16:32:44.922Z] Forward convolution: ConvAsmImplicitGemmGTCDynamicFwdDlopsNCHWC
[2022-06-27T16:32:44.922Z] Input tensor: 32, 20, 73, 73
[2022-06-27T16:32:44.922Z] Weights tensor: 20, 1, 1, 64
[2022-06-27T16:32:44.922Z] Output tensor: 32, 8, 73, 73
[2022-06-27T16:32:44.922Z] Filter: conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},
[2022-06-27T16:32:44.922Z] Memory access fault by GPU node-2 (Agent handle: 0x1d63c00) on address 0x7fb315c7a000. Reason: Page not present or supervisor privilege.
[2022-06-27T16:32:44.922Z] Aborted (core dumped)
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/jd%2Fck_integration/67/pipeline/310
log info
Full Tests II / Fp32 OpenCL All gfx1030
NODE_NAME = ixt-sjc2-16
57/104 Test #72: test_rnn_extra ........................................***Failed 27.72 sec
….
[2022-06-29T18:08:10.070Z] ../bin/test_rnn_vanilla --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-dhy --use-dropout 0 --in-mode 0 --bias-mode 1 --dir-mode 0 --rnn-mode 0 --batch-seq 32 32 32
[2022-06-29T18:08:10.070Z] error: 2.61185e-09
[2022-06-29T18:08:10.070Z] Max diff: 2.98023e-07
[2022-06-29T18:08:10.070Z] Mismatch at 3: 0.0993099 != 0.0993099
[2022-06-29T18:08:10.070Z] ./bin/MIOpenDriver rnn -n 32,32,32 -m relu -k 3 -H 128 -W 128 -l 1 -F 0 -r 0 -b 1 -p 0 -U 0
[2022-06-29T18:08:10.070Z] Backward Weights RNN vanilla:
[2022-06-29T18:08:10.070Z] Memory access fault by GPU node-1 (Agent handle: 0x559742012550) on address 0x7f7768be4000. Reason: Page not present or supervisor privilege.
[2022-06-29T18:08:10.070Z] Aborted (core dumped)
[2022-06-29T18:08:10.070Z] test/CMakeFiles/test_rnn_extra.dir/build.make:57: recipe for target 'test/CMakeFiles/test_rnn_extra' failed
[2022-06-29T18:08:10.070Z] make[7]: *** [test/CMakeFiles/test_rnn_extra] Error 134
[2022-06-29T18:08:10.070Z] CMakeFiles/Makefile2:12913: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/all' failed
[2022-06-29T18:08:10.070Z] make[6]: *** [test/CMakeFiles/test_rnn_extra.dir/all] Error 2
[2022-06-29T18:08:10.070Z] CMakeFiles/Makefile2:12920: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed
[2022-06-29T18:08:10.071Z] make[5]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2
[2022-06-29T18:08:10.071Z] Makefile:2309: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed
[2022-06-29T18:08:10.071Z] make[4]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2
[2022-06-29T18:08:10.071Z]
[2022-06-29T18:08:10.071Z] Start 73: test_gru_extra
[2022-06-29T18:09:03.794Z] 58/104 Test #73: test_gru_extra ........................................ Passed 50.99 sec
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/dfeng_int8_quantization_api/2/pipeline/1554
log info
NODE_NAME = rocm-framework-19.amd.com
[2022-06-28T20:36:58.595Z] 64/106 Test #101: test_conv_ck_igemm_fwd_v6r1_dlops_nchw .................***Failed 28.99 sec
[2022-06-28T20:36:58.595Z] [ 2%] Built target sqlite_memvfs
[2022-06-28T20:36:58.595Z] [ 2%] Built target addkernels
[2022-06-28T20:36:58.595Z] [100%] Built target MIOpen
[2022-06-28T20:36:58.595Z] [100%] Built target test_conv2d
[2022-06-28T20:36:58.595Z] Scanning dependencies of target test_conv_ck_igemm_fwd_v6r1_dlops_nchw
[2022-06-28T20:36:58.595Z] /home/jenkins/workspace/Open_dfeng_int8_quantization_api/build/bin/test_conv2d --half --cmode conv --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 128 1024 14 14 --weights 2048 1024 1 1 --batch_size 128 --input_channels 1024 --output_channels 2048 --spatial_dim_elements 14 14 --filter_dims 1 1 --pads_strides_dilations 0 0 2 2 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW --tensor_vect 0 --vector_length 1
[2022-06-28T20:36:58.595Z] Memory access fault by GPU node-2 (Agent handle: 0x80e5a0) on address 0x7f8f694e8000. Reason: Page not present or supervisor privilege.
[2022-06-28T20:36:58.595Z] Aborted (core dumped)
[2022-06-28T20:36:58.595Z] test/CMakeFiles/test_conv_ck_igemm_fwd_v6r1_dlops_nchw.dir/build.make:57: recipe for target 'test/CMakeFiles/test_conv_ck_igemm_fwd_v6r1_dlops_nchw' failed
[2022-06-28T20:36:58.595Z] make[7]: *** [test/CMakeFiles/test_conv_ck_igemm_fwd_v6r1_dlops_nchw] Error 134
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/fix_1549/7/pipeline/1640
log info
NODE_NAME = ixt-sjc2-22
[2022-06-28T10:19:22.347Z] 58/107 Test #72: test_gru_extra .........................................***Failed 13.69 sec
….
[2022-06-28T10:19:22.348Z] ../bin/test_gru --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-hx --no-dhy --use-dropout 0 --in-mode 0 --bias-mode 0 --dir-mode 0 --batch-seq 32 32 32
[2022-06-28T10:19:22.349Z] error: 4.26209e-08
[2022-06-28T10:19:22.349Z] Max diff: 2.98023e-08
[2022-06-28T10:19:22.349Z] Mismatch at 1: -0.0144987 != -0.0144987
[2022-06-28T10:19:22.349Z] ./bin/MIOpenDriver rnn -n 32,32,32 -m gru -k 3 -H 128 -W 128 -l 1 -F 0 -r 0 -b 0 -p 0
[2022-06-28T10:19:22.349Z] inputMode: 0 biasMode: 0 dirMode: 0
[2022-06-28T10:19:22.349Z] hz: 128 batch_n: 96 seqLength: 3 inputLen: 128 numLayers: 1
[2022-06-28T10:19:22.349Z] Forward Inference GRU:
[2022-06-28T10:19:22.349Z] Output tensor output failed verification.
[2022-06-28T10:19:22.349Z] Memory access fault by GPU node-1 (Agent handle: 0x55a84333b040) on address 0x7f13174ea000. Reason: Page not present or supervisor privilege.
[2022-06-28T10:19:22.349Z] Aborted (core dumped)
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/fix_1549/9/pipeline/625
log info
NODE_NAME = ixt-sjc2-22
[2022-06-29T06:55:33.965Z] 57/107 Test #71: test_rnn_extra .........................................***Failed 73.52 sec
../bin/test_rnn_vanilla --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-dhx --use-dropout 0 --in-mode 0 --bias-mode 1 --dir-mode 0 --rnn-mode 1 --batch-seq 32 32 32
[2022-06-29T06:55:33.972Z] error: 4.23637e-09
[2022-06-29T06:55:33.972Z] Max diff: 8.34465e-07
[2022-06-29T06:55:33.972Z] Mismatch at 4: 0.404517 != 0.404518
[2022-06-29T06:55:33.972Z] ./bin/MIOpenDriver rnn -n 32,32,32 -m tanh -k 3 -H 128 -W 128 -l 1 -F 0 -r 0 -b 1 -p 0 -U 0
[2022-06-29T06:55:33.972Z] Backward Weights RNN vanilla:
[2022-06-29T06:55:33.972Z] Memory access fault by GPU node-1 (Agent handle: 0x560ccf50e640) on address 0x7fb3233b0000. Reason: Page not present or supervisor privilege.
[2022-06-29T06:55:33.972Z] Aborted (core dumped)
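To compare the failures above across nodes and tests, the fault lines can be pulled out of a saved console log mechanically. The following is a minimal sketch, not part of MIOpen or its CI: the function name `summarize` and both regular expressions are assumptions written against the line formats quoted in this report.

```python
import re
from collections import Counter

# Matches the fault line format seen in the logs above (assumed stable).
FAULT_RE = re.compile(
    r"Memory access fault by GPU node-(?P<node>\d+) "
    r"\(Agent handle: (?P<agent>0x[0-9a-fA-F]+)\) "
    r"on address (?P<address>0x[0-9a-fA-F]+)\. "
    r"Reason: (?P<reason>.+?)\s*$"
)
# Matches CTest summary lines such as "Test #12: test_conv2d ....***Failed".
FAILED_RE = re.compile(r"Test\s+#\d+:\s+(?P<name>\S+)\s+\.+\*\*\*Failed")


def summarize(log_text):
    """Return (list of fault records, Counter of failed test names)."""
    faults = []
    failed = Counter()
    for line in log_text.splitlines():
        fault = FAULT_RE.search(line)
        if fault:
            faults.append(fault.groupdict())
        test = FAILED_RE.search(line)
        if test:
            failed[test["name"]] += 1
    return faults, failed


if __name__ == "__main__":
    # Two lines copied from the first log in this report.
    sample = (
        "[2022-06-26T22:23:59.557Z] 28/103 Test #12: test_conv2d "
        "...........................................***Failed 8388.27 sec\n"
        "[2022-06-26T22:23:59.557Z] Memory access fault by GPU node-1 "
        "(Agent handle: 0xebf2f0) on address 0x7fb9990c8000. "
        "Reason: Page not present or supervisor privilege.\n"
    )
    faults, failed = summarize(sample)
    for record in faults:
        print(record["node"], record["address"], record["reason"])
    print(dict(failed))
```

Running this over each job's console output would give one record per fault, which makes it easier to check whether the faults cluster on particular nodes or tests.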
@atamazov FYI
I'll try to look into this.
http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/develop/688/pipeline/1641
log info
Start 71: test_rnn_extra
[2022-07-05T02:22:03.245Z] 57/107 Test #71: test_rnn_extra .........................................***Failed 1.04 sec
[2022-07-05T02:22:03.245Z] [ 2%] Built target sqlite_memvfs
[2022-07-05T02:22:03.245Z] [ 2%] Built target addkernels
[2022-07-05T02:22:03.245Z] [ 97%] Built target MIOpen
[2022-07-05T02:22:03.245Z] [100%] Built target test_rnn_vanilla
[2022-07-05T02:22:03.245Z] Scanning dependencies of target test_rnn_extra
[2022-07-05T02:22:03.245Z] MIOpen(HIP): Info [get_device_name] Raw device name: gfx1030
[2022-07-05T02:22:03.245Z] MIOpen(HIP): Info [Handle] stream: 0, device_id: 0
[2022-07-05T02:22:03.245Z] ../bin/test_rnn_vanilla --float --batch-size 32 --seq-len 3 --vector-len 128 --hidden-size 128 --num-layers 1 --no-hx --use-dropout 0 --in-mode 0 --bias-mode 0 --dir-mode 0 --rnn-mode 0 --batch-seq 32 32 32
[2022-07-05T02:22:03.245Z] Memory access fault by GPU node-1 (Agent handle: 0x227a460) on address 0x7f5059ff2000. Reason: Page not present or supervisor privilege.
[2022-07-05T02:22:03.245Z] Aborted (core dumped)
[2022-07-05T02:22:03.245Z] test/CMakeFiles/test_rnn_extra.dir/build.make:57: recipe for target 'test/CMakeFiles/test_rnn_extra' failed
[2022-07-05T02:22:03.245Z] make[7]: *** [test/CMakeFiles/test_rnn_extra] Error 134
[2022-07-05T02:22:03.245Z] CMakeFiles/Makefile2:12926: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/all' failed
[2022-07-05T02:22:03.245Z] make[6]: *** [test/CMakeFiles/test_rnn_extra.dir/all] Error 2
[2022-07-05T02:22:03.245Z] CMakeFiles/Makefile2:12933: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed
[2022-07-05T02:22:03.245Z] make[5]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2
[2022-07-05T02:22:03.245Z] Makefile:2234: recipe for target 'test/CMakeFiles/test_rnn_extra.dir/rule' failed
[2022-07-05T02:22:03.245Z] make[4]: *** [test/CMakeFiles/test_rnn_extra.dir/rule] Error 2
[2022-07-05T02:22:03.245Z]
[2022-07-05T02:22:03.245Z] Start 72: test_gru_extra
[2022-07-05T02:24:31.186Z] 58/107 Test #72: test_gru_extra ......................................... Passed 138.00 sec
[2022-07-05T02:24:31.187Z] Start 73: test_lstm_extra
It looks like the issue has been solved, so I am closing it. Please feel free to reopen it if it is not resolved yet.
Not fixed!
Sorry about that. I have also pinned it again.
So is there a way to solve this problem?
@shurale-nkn Is it possible to reliably reproduce the issue?
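Since the fault is intermittent, one way to estimate how often it reproduces is to rerun a single failing command many times and count the runs that print the fault message. The sketch below is an assumption-laden helper, not an MIOpen tool: `count_faults` and the script name in the usage note are hypothetical, and the command to run would be one of the test invocations quoted in the logs above.

```python
import subprocess
import sys

# Substring printed in every failure quoted in this issue.
FAULT_MARKER = "Memory access fault"


def count_faults(cmd, runs, timeout=None):
    """Run cmd `runs` times; return how many runs printed the fault marker."""
    faults = 0
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # the fault text may arrive on stderr
            text=True,
            timeout=timeout,
        )
        if FAULT_MARKER in proc.stdout:
            faults += 1
    return faults


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is the command to repeat.
    runs = 100  # arbitrary choice for illustration
    hits = count_faults(sys.argv[1:], runs)
    print(f"{hits} of {runs} runs hit a memory access fault")
```

For example, saved as a hypothetical `repro_loop.py`, it could be invoked as `python repro_loop.py ../bin/test_rnn_vanilla --float --batch-size 32 ...` with the full argument list from the failing log. Whether any single command reproduces the fault reliably is exactly what remains unknown here.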
@shurale-nkn Is this fixed with the latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!