MIOpen
MIOpen copied to clipboard
[BUG][GFX1030] GemmFwdRest: memory access fault
export MIOPEN_FIND_MODE=1
export MIOPEN_DEBUG_FIND_ONLY_SOLVER=GemmFwdRest
./bin/MIOpenDriver convfp16 -W 1760 -H 1760 -c 128 -n 1 -k 16 -x 3 -y 3 -p 0 -q 0 -u 1 -v 1 -F 1 -t 1 -V 0
MIOpen log
MIOpenDriver convfp16 -W 1760 -H 1760 -c 128 -n 1 -k 16 -x 3 -y 3 -p 0 -q 0 -u 1 -v 1 -F 1 -t 1 -V 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1030
MIOpen(HIP): Info [Handle] stream: 0x7e53c0, device_id: 0
MIOpen(HIP): Info [GetFindModeValueImpl] MIOPEN_FIND_MODE = NORMAL(1)
MIOpen(HIP): Info [ForwardGetWorkSpaceSize]
MIOpen(HIP): Info [AmdRocmMetadataVersionDetect] ROCm MD version AMDHSA_COv3, HIP version 5.1.22103, MIOpen version 2.18.0.8b7e20c0d
MIOpen(HIP): Info [GetEnvFindOnlySolverImpl] 91
MIOpen(HIP): Info2 [GetWorkspaceSizes] GemmFwdRest: 7120659456
MIOpen(HIP): Info2 [ForwardGetWorkSpaceSize] 7120659456
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 2, workspace = 7120659456
MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 0.016902 ms
MIOpen(HIP): Info2 [ValidateUnsafe] DB file is older than cache: 8110861832927, 8226172067290
MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 128-1760-1760-3x3-16-1758-1758-1-0x0-1x1-1x1-0-NCHW-FP16-F in cache for file /git/MIOpen/src/kernels/gfx1030_40.HIP.2_18_0.ufdb.txt
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.029677 ms
MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 128x1760x1760x3x3x16x1758x1758x1xNCHWxFP16x0x0x1x1x1x1x1xF and solver ConvOclDirectFwd
MIOpen(HIP): Info2 [LogFindDbItem] Kernel cache entry not found for solver <miopenConvolutionFwdAlgoDirect::ConvOclDirectFwd> at network config: 128-1760-1760-3x3-16-1758-1758-1-0x0-1x1-1x1-0-NCHW-FP16-F and kernel cache key: miopenConvolutionFwdAlgoDirect, <unused>
MIOpen(HIP): Info2 [LogFindDbItem] Find-db record content: <miopenConvolutionFwdAlgoDirect::ConvOclDirectFwd> at network config: <unused> and algorithm name: miopenConvolutionFwdAlgoDirect
MIOpen(HIP): Info [TryLoad] Find-db regenerating.
MIOpen(HIP): Info2 [GetPerfDbPathFile] inexact perf database search
MIOpen(HIP): Info2 [GetPerfDbPathFile] Iterating over perf db directory /git/MIOpen/src/kernels
MIOpen(HIP): Info2 [GetPerfDbPathFile] Checking perf db file: gfx1030_36
MIOpen(HIP): Info2 [GetPerfDbPathFile] Updating best candidate to: /git/MIOpen/src/kernels/gfx1030_36.db
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file /git/MIOpen/src/kernels/gfx1030_36.db
MIOpen(HIP): Trace [Exec] 140260181409984:PRAGMA table_info(config);
MIOpen(HIP): Trace [Exec] 140260181409984:PRAGMA table_info(perf_db);
MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file /git/MIOpen/src/kernels/gfx1030_40_1.1.0.udb
MIOpen(HIP): Trace [Exec] 140260181409984:PRAGMA journal_mode=WAL;
MIOpen(HIP): Trace [Exec] 140260181409984:SELECT name FROM sqlite_master WHERE type = 'table' AND (name = 'config');
MIOpen(HIP): Trace [Exec] 140260181409984:SELECT name FROM sqlite_master WHERE type = 'table' AND (name = 'perf_db');
MIOpen(HIP): Trace [SQLitePerfDb] Database created successfully
MIOpen(HIP): Trace [Exec] 140260181409984:PRAGMA table_info(config);
MIOpen(HIP): Trace [Exec] 140260181409984:PRAGMA table_info(perf_db);
MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable)
MIOpen(HIP): Info2 [SearchForAllSolutions] GemmFwdRest: Success.
MIOpen(HIP): Info2 [Log] PrecompileKernels Compile Time, ms: 0.25293
MIOpen(HIP): Info2 [GetKernels] 0 kernels for key: miopenIm2d2Col "c128i1760_1760w3_3p0_0s1_1d1_1t0"
MIOpen(HIP): Info2 [AddKernel] Key: miopenIm2Col "c128i1760_1760w3_3p0_0s1_1d1_1t0"
MIOpen(HIP): Info [PrintVersionImpl] COMgr v.2.4.0, USE_HIP_PCH: 1
MIOpen(HIP): Info2 [SetIsaName] amdgcn-amd-amdhsa--gfx1030
MIOpen(HIP): Info2 [Log] Kernel MIOpenIm2d2Col.cl Compile Time, ms: 208.224
Memory access fault by GPU node-1 (Agent handle: 0x7a6e70) on address 0x7f8bc0cab000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
@Slimakanzer Does this error occur every time ? Please describe the settings of this system. (ROCm or Docker..)
FYI @atamazov @junliume #1613
MIOpen(HIP): Info2 [GetWorkspaceSizes] GemmFwdRest: 7120659456
Most likely, the reason is that workspace size is > 4 Gb which exceeds the 32-bit addressing used in Im2col kernels.
The simplest proposed fix is limiting applicability of the solvers that use Im2col/col2im machinery.
If rocblas is able to handle >4GB tensors, then it may worth implementing 64-bit Im2col/col2im and use it when necessary.
@Slimakanzer Can we verify if the following W/A works?
The simplest proposed fix is limiting applicability of the solvers that use Im2col/col2im machinery.
@Slimakanzer Can we verify if the following W/A works?
The simplest proposed fix is limiting applicability of the solvers that use Im2col/col2im machinery.
@junliume I can verify and I will create a fix PR.
@junliume The full-blown fix which @shurale-nkn is developing is the best solution for the problem.
@DrizztDoUrden and @atamazov : we are having some issues with Fp32 OpenCL Debug gfx1030
stage
[2022-07-25T20:02:37.066Z] MIOpen(OpenCL): Error [BuildProgram] Build log: /tmp/comgr-45f8cd/input/CompileSource:106:5: error: 'USE_LARGE_BUFFER_INDEX' is not defined, evaluates to 0 [-Werror,-Wundef]
[2022-07-25T20:02:37.066Z] #if USE_LARGE_BUFFER_INDEX
[2022-07-25T20:02:37.066Z] ^
[2022-07-25T20:02:37.066Z] 1 error generated.
[2022-07-25T20:02:37.066Z] Error: Failed to compile source (from CL or HIP source to LLVM IR).
Should be #ifdef
instead of #if
?
@junliume @dmikushin #1644 is draft (that's why I said "is developing") and not expected to work 😄
It looks like the bug fix PR #1644 and extended PR #1657 has been merged, let me close this issue.
@junliume @aska-0096 The regression test is not added in #1644. Please re-open this with testing
, urgency_normal
.
@atamazov Is the regression test completed so we can close this ticket? Thanks!
@ppanchad-amd I do not think so because otherwise this ticket would contain a link to the PR implementing the regression test.
Run the program successfully with ROCm6.2.2, no core dump. @atamazov are we able to close the ticket:
rocm22@rocm22:/opt/rocm-6.2.2/bin$ ./MIOpenDriver convfp16 -W 1760 -H 1760 -c 128 -n 1 -k 16 -x 3 -y 3 -p 0 -q 0 -u 1 -v 1 -F 1 -t 1 -V 0 MIOpenDriver convfp16 -W 1760 -H 1760 -c 128 -n 1 -k 16 -x 3 -y 3 -p 0 -q 0 -u 1 -v 1 -F 1 -t 1 -V 0 PRNG seed: 12345678 MIOpen Forward Conv. Algorithm: 0, Solution: 91/GemmFwdRest GPU Kernel Time Forward Conv. Elapsed: 23.399267 ms (average) stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs stats: fwd-conv3x3u1, 1, 128, 1758, 1758, 3, 3, 16, 113930551296, 793022464, 98898048, 4869, 38, 23.399267