Memory access fault - page not present or supervisor privilege, gfx1031 with HSA_OVERRIDE_GFX_VERSION=10.3.0
On an RX 6850M XT [gfx1031] with HSA_OVERRIDE_GFX_VERSION=10.3.0, on Gentoo; HIP version 6.3.42134, MIOpen version 3.3.0.
Encountered the error by running: https://github.com/HomebrewML/HeavyBall/blob/e8e44c2594230a59508d64830ed9af1732411f8f/examples/soap.py
Minimal reproduction:
MIOPEN_FIND_ENFORCE=3 HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=0 MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
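For reference, the flags above describe a 5x5, stride-1, pad-2 convolution over a 1024x256x32x32 input with a single output channel (-k 1), and -F 2 selects the backward-data direction (the driver log below confirms "Backward Data Conv"). The spatial geometry can be sanity-checked with a small sketch; this is just the standard convolution output-size formula, not MIOpen code:

```python
# Standard convolution output-size formula, applied to the repro's flags:
# -H/-W input size, -y/-x filter size, -p/-q padding, -u/-v stride, -l/-j dilation.
def conv_out_size(size, filt, pad, stride, dilation):
    return (size + 2 * pad - dilation * (filt - 1) - 1) // stride + 1

H = conv_out_size(32, 5, 2, 1, 1)  # -H 32 -y 5 -p 2 -u 1 -l 1
W = conv_out_size(32, 5, 2, 1, 1)  # -W 32 -x 5 -q 2 -v 1 -j 1
print(H, W)  # the output plane stays 32x32, so dy has shape 1024x1x32x32
```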
Error:
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
PRNG seed: 12345678
Memory access fault by GPU node-1 (Agent handle: 0x55724b0d00b0) on address 0x7fca1be00000. Reason: Page not present or supervisor privilege.
Failed to fetch queues snapshot.
GPU core dump failed
[1] 946669 IOT instruction (core dumped) MIOPEN_FIND_ENFORCE=3 HSA_OVERRIDE_GFX_VERSION=10.3.0 HIP_VISIBLE_DEVICES=0
Full error log with debug env variables: https://gist.githubusercontent.com/sozforex/6babbda6cacea2734e225e1a63ee7ae2/raw/c597b59d11062298b61474fb7c77f0b90764bb26/gfx1030_miopen_conv_error
Running the reproduction command with MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_FIND_ENFORCE=3 appears to save a different result in the MIOpen find database and works around the problem.
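For anyone hitting the same fault from PyTorch rather than MIOpenDriver: MIOpen reads these variables from the process environment, so they can be set in the launching shell or, as a sketch, from Python before the first convolution runs (setting them before MIOpen initializes is the safe assumption):

```python
import os

# Workaround from this thread: disable the GEMM-based convolution solvers and
# force a fresh exhaustive search so the find database is repopulated.
# These must be in the environment before MIOpen initializes.
os.environ["MIOPEN_DEBUG_CONV_GEMM"] = "0"
os.environ["MIOPEN_FIND_ENFORCE"] = "3"

# ...then import torch and run soap.py as usual.
```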
Hi @sozforex. An internal ticket has been created to investigate this issue. Thanks!
~This issue is possibly related to https://github.com/ROCm/MIOpen/issues/1431 - I see -mwavefrontsize64 in debug logs, but wavesize should be 32 for my GPU [will try to do something with it and will report the result].~
EDIT: this GPU supports two different modes - WGP mode with -mno-wavefrontsize64 [wavefront size 32] and CU mode with -mwavefrontsize64 -mcumode.
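To double-check the hardware wavefront size against what the compiler flags assume, a quick sketch that pulls the `Wavefront Size` field out of rocminfo-style output (illustrative only; it runs here against a sample line rather than a live rocminfo call):

```python
import re

def wavefront_size(rocminfo_text):
    """Extract the first 'Wavefront Size' value from rocminfo output."""
    m = re.search(r"Wavefront Size:\s*(\d+)", rocminfo_text)
    return int(m.group(1)) if m else None

sample = "Wavefront Size:          32(0x20)"
print(wavefront_size(sample))  # -> 32 on this gfx1031, i.e. native wave32
```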
Running the same soap.py after working around the previous error, I stumbled on a similar error.
Similar to the previous reproduction, but with -k 256 and -F 4:
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
Here is an error log: https://gist.githubusercontent.com/sozforex/65e7fa023e1ba163ff6bdd81094989d8/raw/bfca7cc804d9a84868a732bff5bdcfcec6bbf0fc/gfx1030_miopen_conv_error2.txt
For this error, running the reproduction command with MIOPEN_DEBUG_CONV_DIRECT=0 MIOPEN_FIND_ENFORCE=3 allowed me to work around the problem.
Hi @sozforex. Your GPU is not a gfx1030, it is a gfx1031. It is also not on the list of supported devices [1][2]. Please correct the title. Both GPUs share the same ISA, but they have technical differences. HSA_OVERRIDE_GFX_VERSION allows one device to be detected as another, but it won't add more compute units or missing instructions. This may happen to work in some cases under certain circumstances, but there is no guarantee.
BTW the library has universal kernels that theoretically can run on any hardware. Have you tried running it without HSA_OVERRIDE_GFX_VERSION?
Please also provide rocminfo output.
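As an aside, the override string encodes the major.minor.stepping of the gfx target, so 10.3.0 selects gfx1030. A sketch of that mapping; the hex-stepping detail is my assumption from how gfx names are formed (e.g. 9.0.12 corresponding to gfx90c):

```python
def gfx_from_override(version):
    """Map an HSA_OVERRIDE_GFX_VERSION string to the gfx target it selects.
    Assumes the stepping is rendered in lowercase hex in gfx names."""
    major, minor, step = (int(p) for p in version.split("."))
    return f"gfx{major}{minor}{step:x}"

print(gfx_from_override("10.3.0"))  # gfx1030 - what this thread overrides to
print(gfx_from_override("10.3.1"))  # gfx1031 - the actual die in this GPU
```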
Hi @averinevg, I'm aware that it is not on the list of supported devices - I do not have an AMD Radeon PRO W6800 or AMD Radeon PRO V620 to test whether this memory access fault can be reproduced on them.
I've tried running without HSA_OVERRIDE_GFX_VERSION [with full rocm compiled with both gfx1030 and gfx1031], I get the same errors.
rocminfo output
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 2
*******
Name: gfx1031
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 6800M
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 3072(0xc00) KB
L3: 98304(0x18000) KB
Chip ID: 29663(0x73df)
ASIC Revision: 0(0x0)
Cacheline Size: 128(0x80)
Max Clock Freq. (MHz): 2530
BDFID: 768
Internal Node ID: 1
Compute Unit: 40
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 122
SDMA engine uCode:: 80
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 12566528(0xbfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 12566528(0xbfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1031
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Hi @sozforex. Since your hardware is not officially supported, the only solution in your case is the approach "try and disable everything that doesn't work." The logs show that, in your case, the GEMM and some direct algorithms are not working. To disable them, use the following environment variables:
MIOPEN_DEBUG_CONV_GEMM=0 MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW53=0
As far as I can see, you are already familiar with them, but instead of disabling all direct algorithms, you can disable only those that are failing. Please try this and provide feedback.
Tested this on a W6800 on the rocm-6.3.3 tag of MIOpen and I can't reproduce it.
MIOPEN_FIND_ENFORCE=3 HIP_VISIBLE_DEVICES=0 ./MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
PRNG seed: 12345678
MIOpen(HIP): Warning [GetAllConfigs] ConvBinWinogradRxSf3x2: Searching the best solution among 60...
MIOpen(HIP): Warning [GenericSearch] Done: 60/0/60, best #44 3.93031 43
MIOpen(HIP): Warning [GenericSearch] ...Score: 1.20375 (default time 4.73111)
MIOpen(HIP): Warning [SearchImpl] Searching the best solution in the 9 dim space. Please, be patient...
MIOpen(HIP): Warning [SearchImpl] Runs left: 863, min time so far: 8.77174, curr time: 8.77174 16,16,16,16,1,1,1,1,1
MIOpen(HIP): Warning [SearchImpl] Runs left: 813, min time so far: 3.40147, curr time: 3.95063 8,8,16,16,2,2,1,1,2
MIOpen(HIP): Warning [SearchImpl] Runs left: 744, min time so far: 3.19795, curr time: 3.80195 8,32,16,32,2,1,2,1,1
MIOpen(HIP): Warning [SearchImpl] Runs left: 675, min time so far: 3.19795, curr time: 4.19203 32,8,32,16,1,2,4,1,1
MIOpen(HIP): Warning [SearchImpl] Runs left: 624, min time so far: 3.19795, curr time: 3.55043 8,16,32,16,4,1,4,1,2
MIOpen(HIP): Warning [SearchImpl] Runs left: 523, min time so far: 3.19795, curr time: 3.76435 16,16,32,32,2,2,8,1,2
MIOpen(HIP): Warning [SearchImpl] Runs left: 468, min time so far: 3.19795, curr time: 3.87331 8,8,32,32,4,4,2,1,1
MIOpen(HIP): Warning [SearchImpl] Default run, min time so far: 3.19795, default time: 3.80835 16,16,32,32,2,2,8,2,1
MIOpen(HIP): Warning [SearchImpl] ...Score: 1.19087
MIOpen Backward Data Conv. Algorithm: 3, Solution: 84/ConvBinWinogradRxSf2x3g1
GPU Kernel Time Backward Data Conv. Elapsed: 3.051616 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdd-conv5x5u1, 1024, 256, 5, 5, 1, 32, 32, 13421772800, 1074176, 4194304, 4398, 2, 3.051616
Backward Convolution Data Verifies OK on GPU reference (1.94073e-08 < 1.5e-06)
Have you tried building MIOpen for gfx1031 specifically instead of using the arch override?
@averinevg, thank you. When I looked for env variables to disable a smaller subset of algorithms, I tried only some of those listed in https://github.com/ROCm/MIOpen/blob/develop/docs/how-to/debug-log.rst [and, lacking understanding, missed the last two you listed].
MIOPEN_DEBUG_CONV_GEMM=0
helps with
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0
helps with
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
The above two env variables are sufficient to avoid memory access fault errors when running soap.py on my GPU, thanks.
I have not yet stumbled on a case where MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW53=0 is needed.
Have you tried building MIOpen for gfx1031 specifically instead of using the arch override?
Hi @LunNova, thanks for testing this on an actual gfx1030. I've tried with and without the override using full rocm [including MIOpen] compiled for both gfx1030 and gfx1031 at the same time [either rocm-6.3.1 or rocm-6.3.2], and separately with the override on rocm-6.3.3 compiled for gfx1030.
Oh, not quite full rocm - I'm using llvm/clang-19.1.7 [on Gentoo] instead of AOCC or the LLVM version that ships with official rocm releases.
Just in case, tried this again with rocm-6.3.3 [including rocBLAS, Tensile and MIOpen] compiled only with gfx1031 [without gfx1030].
MIOPEN_FIND_ENFORCE=3 HIP_VISIBLE_DEVICES=0 MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
without MIOPEN_DEBUG_CONV_GEMM=0
results in a rocBLAS/Tensile error:
...
MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file "/home/user/.cache/miopen/3.3.0./gfx1031_20.ukdb"
MIOpen(HIP): Info2 [KernDb] Database created successfully
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: "Conv_Winograd_v30_3_1_fp32_f2x3_stride1.s.o"; args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -Wa,-defsym,FORCE_CACHE_BYPASS_ON_STORE=0 -mcumode -mwavefrontsize64 -mcpu=gfx1031
MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'Conv_Winograd_v30_3_1_fp32_f2x3_stride1.s.o') AND (kernel_args = '-Wa,-defsym,ROCM_METADATA_VERSION=5 -Wa,-defsym,FORCE_CACHE_BYPASS_ON_STORE=0 -mcumode -mwavefrontsize64 -mcpu=gfx1031');
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.738709 ms
MIOpen(HIP): Info2 [LoadBinary] Successfully loaded binary for: "Conv_Winograd_v30_3_1_fp32_f2x3_stride1.s.o"; args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -Wa,-defsym,FORCE_CACHE_BYPASS_ON_STORE=0 -mcumode -mwavefrontsize64 -mcpu=gfx1031
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: MIOpenConvUni
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = MIOpenConvUni, global_work_dim = { 256, 64, 1024 }, local_work_dim = { 128, 1, 1 }
MIOpen(HIP): Info [EvaluateInvokers] ConvOclDirectFwd: MIOpenConvUni: 3.17666 < 3.40282e+38
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: naive_conv_ab_nonpacked_bwd_nchw_float_double_float
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_ab_nonpacked_bwd_nchw_float_double_float, global_work_dim = { 67108864, 1, 1 }, local_work_dim = { 256, 1, 1 }
MIOpen(HIP): Info [EvaluateInvokers] ConvDirectNaiveConvBwd: naive_conv_ab_nonpacked_bwd_nchw_float_double_float: 131.582 >= 3.17666
MIOpen(HIP): Info2 [Register] Invoker registered for algorithm 1x32x32x5x5x256x32x32x1024xNCHWxFP32x2x2x1x1x1x1x1xBxDefault and solver ConvOclDirectFwd
MIOpen(HIP): Info2 [SetAsFound1_0] Solver ConvOclDirectFwd registered as find 1.0 best for miopenConvolutionBwdDataAlgoDirect in 1x32x32x5x5x256x32x32x1024xNCHWxFP32x2x2x1x1x1x1x1xBxDefault
MIOpen(HIP): Info [EvaluateInvokers] Selected: ConvOclDirectFwd: MIOpenConvUni: 3.17666, workspace_sz = 0
MIOpen(HIP): auto miopen::solver::conv::GemmBwdRest::GetSolution(const ExecutionContext &, const ProblemDescription &)::(anonymous class)::operator()(const std::vector<Kernel> &)::(anonymous class)::operator()(const Handle &, const AnyInvokeParams &) const{
MIOpen(HIP): "convolution, non 1x1" = convolution, non 1x1
MIOpen(HIP): }
MIOpen(HIP): Info2 [CallGemm] gemm_desc: {isColMajor 0, transA 1, transB 0, m 6400, n 1024, k 1, lda 6400, ldb 1024, ldc 1024, batch_count 1, strideA 0, strideB 0, strideC 0, alpha 1, beta 0, dataType float, a_cast_type float, b_cast_type float}
MIOpen(HIP): Info2 [CallGemm] rocBLAS
rocBLAS error: Cannot read /usr/lib64/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1030
List of available TensileLibrary Files :
"/usr/lib64/rocblas/library/TensileLibrary_lazy_gfx1031.dat"
[1] 2159995 IOT instruction (core dumped) MIOPEN_FIND_ENFORCE=3 MIOPEN_LOG_LEVEL=6 MIOPEN_ENABLE_LOGGING=1 =1 =0 conv
The second command
MIOPEN_FIND_ENFORCE=3 HIP_VISIBLE_DEVICES=0 MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
without MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 results in the same memory access fault as described previously.
Running it with MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 results in a rocBLAS/Tensile error of the same kind as above.
rocBLAS error: Cannot read /usr/lib64/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1030
List of available TensileLibrary Files :
"/usr/lib64/rocblas/library/TensileLibrary_lazy_gfx1031.dat"
[1] 2160815 IOT instruction (core dumped) MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 MIOPEN_FIND_ENFORCE=3 MIOPEN_LOG_LEVEL=5
With both MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 and MIOPEN_DEBUG_CONV_GEMM=0 it runs to completion without errors.
Tested this on a W6800 on the rocm-6.3.3 tag of MIOpen and I can't reproduce it.
Hi @LunNova, could you please check again with MIOPEN_DEBUG_CONV_WINOGRAD=0?
$ MIOPEN_DEBUG_CONV_WINOGRAD=0 MIOPEN_FIND_ENFORCE=3 ./MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1
PRNG seed: 12345678
MIOpen Backward Data Conv. Algorithm: 1, Solution: 11/ConvOclDirectFwd
GPU Kernel Time Backward Data Conv. Elapsed: 3.381792 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdd-conv5x5u1, 1024, 256, 5, 5, 1, 32, 32, 13421772800, 1074176, 4194304, 3969, 2, 3.381792
Backward Convolution Data Verifies OK on GPU reference (1.74561e-08 < 1.5e-06)
@LunNova Thank you. Could you also please check MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 with the same env variable?
Just in case, tried this again with rocm-6.3.3 [including rocBLAS, Tensile and MIOpen] compiled only with gfx1031 [without gfx1030].
... rocBLAS error: Cannot read /usr/lib64/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1030 List of available TensileLibrary Files : "/usr/lib64/rocblas/library/TensileLibrary_lazy_gfx1031.dat" ...
Hi @sozforex, thank you for your research. This error comes from the depths of rocBLAS, which MIOpen uses for some operations. I see that it is built for gfx1031, but for some reason it detects your GPU as a gfx1030. Did you use HSA_OVERRIDE_GFX_VERSION?
Did you use HSA_OVERRIDE_GFX_VERSION?
Hi @averinevg, I ran it without HSA_OVERRIDE_GFX_VERSION [I checked that it is unset].
You can see -mcpu=gfx1031 in this part of the log above:
... MIOpen(HIP): Info2 [LoadBinary] Loading binary for: "Conv_Winograd_v30_3_1_fp32_f2x3_stride1.s.o"; args: -Wa,-defsym,ROCM_METADATA_VERSION=5 -Wa,-defsym,FORCE_CACHE_BYPASS_ON_STORE=0 -mcumode -mwavefrontsize64 -mcpu=gfx1031 MIOpen(HIP): Info2 [Prepare] SELECT kernel_blob, kernel_hash, uncompressed_size FROM kern_db WHERE (kernel_name = 'Conv_Winograd_v30_3_1_fp32_f2x3_stride1.s.o') AND (kernel_args = '-Wa,-defsym,ROCM_METADATA_VERSION=5 -Wa,-defsym,FORCE_CACHE_BYPASS_ON_STORE=0 -mcumode -mwavefrontsize64 -mcpu=gfx1031'); ...
I remember now that Gentoo patches rocBLAS and Tensile to extend compatibility: https://github.com/gentoo/gentoo/blob/master/sci-libs/rocBLAS/files/rocBLAS-6.0.2-expand-isa-compatibility.patch https://github.com/gentoo/gentoo/blob/master/dev-util/Tensile/files/Tensile-6.0.2-expand-isa-compatibility.patch
These compatibility-extending patches may not work as intended when those packages are compiled for gfx1031 but not gfx1030 [I'm not really sure].
MIOPEN_DEBUG_CONV_WINOGRAD=0 MIOPEN_FIND_ENFORCE=3 ./MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
PRNG seed: 12345678
MIOpen(HIP): Warning [GetAllConfigs] ConvOclBwdWrW2<2>: Searching the best solution among 378...
MIOpen(HIP): Warning [Monitor] 0/0/378 2407, best within recent 1: 2407 #0 1,7,1,1,11, ETA:0 sec.
<snipped lots of similar lines>
MIOpen(HIP): Warning [GenericSearch] Done: 36/0/36, best #7 460.478 64,64,64,16,4,4
MIOpen(HIP): Warning [GenericSearch] ...Score: 1.0087 (default time 464.483)
MIOpen Backward Weights Conv. Algorithm: 1, Solution: 24/ConvOclBwdWrW53
GPU Kernel Time Backward Weights Conv. Elapsed: 272.043365 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv5x5u1, 1024, 256, 32, 32, 5, 5, 256, 3435973836800, 0, 0, 12630, 0, 272.043365
Backward Convolution Weights FAILED: 3.64114e-06 > 3e-06
I had not noticed it previously [as it is not an exception], but I get the same Backward Convolution Weights FAILED: 3.64114e-06 > 3e-06 result when running with or without the override [this time with rocm compiled for both gfx1030 and gfx1031] and with MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0.
MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 MIOPEN_FIND_ENFORCE=3 HIP_VISIBLE_DEVICES=0 MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1
PRNG seed: 12345678
MIOpen Backward Weights Conv. Algorithm: 1, Solution: 24/ConvOclBwdWrW53
GPU Kernel Time Backward Weights Conv. Elapsed: 416.347260 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv5x5u1, 1024, 256, 32, 32, 5, 5, 256, 3435973836800, 0, 0, 8253, 0, 416.347260
Backward Convolution Weights FAILED: 3.64114e-06 > 3e-06
With MIOPEN_DEBUG_CONV_WINOGRAD=0 instead of MIOPEN_DEBUG_CONV_DIRECT_OCL_WRW2=0 I get a memory access fault.
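For context on the FAILED lines above: that is MIOpen's verification step comparing the GPU result against a reference, with the printed error measure checked against a tolerance (3e-06 here). A hedged sketch of such a check; the exact norm MIOpen uses is not shown in these logs, so this is illustrative only:

```python
import math

def rel_error(gpu, ref):
    """Illustrative relative-L2 error between a GPU result and a reference."""
    num = math.sqrt(sum((g - r) ** 2 for g, r in zip(gpu, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref)) or 1.0
    return num / den

tol = 3e-06
err = rel_error([1.0, 2.00001], [1.0, 2.0])
print(err > tol)  # True: a marginal overshoot like 3.64e-06 > 3e-06 fails verification
```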
@sozforex @LunNova Thank you, guys!
We can summarize some results.
MIOPEN_FIND_MODE=normal MIOPEN_DEBUG_FIND_ONLY_SOLVER='GemmBwdRest' MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1 fails with "Memory access fault by GPU node-1" or "No such file or directory for GPU arch" (with a custom build for gfx1031 with those patches). I expect that this command might also fail on the gfx1030.
If you change GemmBwdRest to ConvOclDirectFwd, ConvBinWinogradRxSf3x2, ConvBinWinogradRxSf2x3g1 or ConvDirectNaiveConvBwd in the command above, it will work.
MIOPEN_FIND_MODE=normal MIOPEN_DEBUG_FIND_ONLY_SOLVER='ConvOclBwdWrW2<2>' MIOpenDriver conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 fails with "Memory access fault by GPU node-1". I expect that this command might also fail on the gfx1030.
If you change ConvOclBwdWrW2<2> to ConvOclBwdWrW2<4>, ConvOclBwdWrW2<8> or ConvOclBwdWrW2<16>, it will also fail.
If you change it to ConvOclBwdWrW53, there will be a validation error.
If you change it to ConvBinWinogradRxSf3x2, ConvBinWinogradRxSf2x3g1, ConvDirectNaiveConvWrw, ConvHipImplicitGemmV4R1WrW or ConvHipImplicitGemmV4R4WrW, it will work.
Please check this. I'm waiting for your feedback. If something that is supposed to work suddenly doesn't, please provide the full logs with the maximum logging level.
@sozforex Please also try with HSA_OVERRIDE_GFX_VERSION=10.3.0, in this case, there shouldn't be any difference.
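A hypothetical harness for working through the per-solver checklist above: it only constructs the command lines (actually running them needs the ROCm stack and the hardware), with solver names and driver flags copied from this thread:

```python
# Hypothetical helper that builds the per-solver MIOpenDriver command lines
# from the summary above; solver names and flags are copied from this thread.
BWD_DATA = ("conv -n 1024 -c 256 -H 32 -W 32 -k 1 -y 5 -x 5 -p 2 -q 2 "
            "-u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 2 -t 1")
BWD_WRW = ("conv -n 1024 -c 256 -H 32 -W 32 -k 256 -y 5 -x 5 -p 2 -q 2 "
           "-u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1")

def solver_cmd(solver, driver_args):
    """Build one command line restricting the find step to a single solver."""
    env = f"MIOPEN_FIND_MODE=normal MIOPEN_DEBUG_FIND_ONLY_SOLVER='{solver}'"
    return f"{env} MIOpenDriver {driver_args}"

for s in ("GemmBwdRest", "ConvOclDirectFwd", "ConvBinWinogradRxSf3x2"):
    print(solver_cmd(s, BWD_DATA))
for s in ("ConvOclBwdWrW2<2>", "ConvOclBwdWrW53", "ConvDirectNaiveConvWrw"):
    print(solver_cmd(s, BWD_WRW))
```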
@averinevg tested - everything is as you say in the summary. On rocm compiled for both gfx1030 and gfx1031, the override does not change the results.
Hi @sozforex. Do you still need further assistance with this ticket? If not, please close it. Thanks!