MIOpen icon indicating copy to clipboard operation
MIOpen copied to clipboard

[BUG][GFX1030] GemmFwdRest: memory access fault

Open Slimakanzer opened this issue 2 years ago • 12 comments

export MIOPEN_FIND_MODE=1
export MIOPEN_DEBUG_FIND_ONLY_SOLVER=GemmFwdRest

./bin/MIOpenDriver convfp16 -W 1760 -H 1760 -c 128 -n 1 -k 16 -x 3 -y 3 -p 0 -q 0 -u 1 -v 1 -F 1 -t 1 -V 0
MIOpen log
MIOpenDriver convfp16 -W 1760 -H 1760 -c 128 -n 1 -k 16 -x 3 -y 3 -p 0 -q 0 -u 1 -v 1 -F 1 -t 1 -V 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1030
MIOpen(HIP): Info [Handle] stream: 0x7e53c0, device_id: 0
MIOpen(HIP): Info [GetFindModeValueImpl] MIOPEN_FIND_MODE = NORMAL(1)
MIOpen(HIP): Info [ForwardGetWorkSpaceSize] 
MIOpen(HIP): Info [AmdRocmMetadataVersionDetect] ROCm MD version AMDHSA_COv3, HIP version 5.1.22103, MIOpen version 2.18.0.8b7e20c0d
MIOpen(HIP): Info [GetEnvFindOnlySolverImpl] 91
MIOpen(HIP): Info2 [GetWorkspaceSizes] GemmFwdRest: 7120659456
MIOpen(HIP): Info2 [ForwardGetWorkSpaceSize] 7120659456
MIOpen(HIP): Info [FindConvFwdAlgorithm] requestAlgoCount = 2, workspace = 7120659456
MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 0.016902 ms
MIOpen(HIP): Info2 [ValidateUnsafe] DB file is older than cache: 8110861832927, 8226172067290
MIOpen(HIP): Info2 [FindRecordUnsafe] Looking for key 128-1760-1760-3x3-16-1758-1758-1-0x0-1x1-1x1-0-NCHW-FP16-F in cache for file /git/MIOpen/src/kernels/gfx1030_40.HIP.2_18_0.ufdb.txt
MIOpen(HIP): Info2 [Measure] Db::FindRecord time: 0.029677 ms
MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 128x1760x1760x3x3x16x1758x1758x1xNCHWxFP16x0x0x1x1x1x1x1xF and solver ConvOclDirectFwd
MIOpen(HIP): Info2 [LogFindDbItem] Kernel cache entry not found for solver <miopenConvolutionFwdAlgoDirect::ConvOclDirectFwd> at network config: 128-1760-1760-3x3-16-1758-1758-1-0x0-1x1-1x1-0-NCHW-FP16-F and kernel cache key: miopenConvolutionFwdAlgoDirect, <unused>
MIOpen(HIP): Info2 [LogFindDbItem] Find-db record content: <miopenConvolutionFwdAlgoDirect::ConvOclDirectFwd> at network config: <unused> and algorithm name: miopenConvolutionFwdAlgoDirect
MIOpen(HIP): Info [TryLoad] Find-db regenerating.
MIOpen(HIP): Info2 [GetPerfDbPathFile] inexact perf database search
MIOpen(HIP): Info2 [GetPerfDbPathFile] Iterating over perf db directory /git/MIOpen/src/kernels
MIOpen(HIP): Info2 [GetPerfDbPathFile] Checking perf db file: gfx1030_36
MIOpen(HIP): Info2 [GetPerfDbPathFile] Updating best candidate to: /git/MIOpen/src/kernels/gfx1030_36.db
MIOpen(HIP): Info2 [SQLiteBase] Initializing system database file /git/MIOpen/src/kernels/gfx1030_36.db
MIOpen(HIP): Trace [Exec] 140260181409984:PRAGMA table_info(config);
MIOpen(HIP): Trace [Exec] 140260181409984:PRAGMA table_info(perf_db);
MIOpen(HIP): Info2 [SQLiteBase] Initializing user database file /git/MIOpen/src/kernels/gfx1030_40_1.1.0.udb
MIOpen(HIP): Trace [Exec] 140260181409984:PRAGMA journal_mode=WAL;
MIOpen(HIP): Trace [Exec] 140260181409984:SELECT name FROM sqlite_master WHERE type = 'table' AND (name = 'config');
MIOpen(HIP): Trace [Exec] 140260181409984:SELECT name FROM sqlite_master WHERE type = 'table' AND (name = 'perf_db');
MIOpen(HIP): Trace [SQLitePerfDb] Database created successfully
MIOpen(HIP): Trace [Exec] 140260181409984:PRAGMA table_info(config);
MIOpen(HIP): Trace [Exec] 140260181409984:PRAGMA table_info(perf_db);
MIOpen(HIP): Info [FindSolutionImpl] GemmFwdRest (not searchable)
MIOpen(HIP): Info2 [SearchForAllSolutions] GemmFwdRest: Success.
MIOpen(HIP): Info2 [Log] PrecompileKernels Compile Time, ms: 0.25293
MIOpen(HIP): Info2 [GetKernels] 0 kernels for key: miopenIm2d2Col "c128i1760_1760w3_3p0_0s1_1d1_1t0"
MIOpen(HIP): Info2 [AddKernel] Key: miopenIm2Col "c128i1760_1760w3_3p0_0s1_1d1_1t0"
MIOpen(HIP): Info [PrintVersionImpl] COMgr v.2.4.0, USE_HIP_PCH: 1
MIOpen(HIP): Info2 [SetIsaName] amdgcn-amd-amdhsa--gfx1030
MIOpen(HIP): Info2 [Log] Kernel MIOpenIm2d2Col.cl Compile Time, ms: 208.224
Memory access fault by GPU node-1 (Agent handle: 0x7a6e70) on address 0x7f8bc0cab000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Slimakanzer avatar Jul 05 '22 10:07 Slimakanzer

@Slimakanzer Does this error occur every time ? Please describe the settings of this system. (ROCm or Docker..)

FYI @atamazov @junliume #1613

shurale-nkn avatar Jul 05 '22 18:07 shurale-nkn

MIOpen(HIP): Info2 [GetWorkspaceSizes] GemmFwdRest: 7120659456

Most likely, the reason is that workspace size is > 4 Gb which exceeds the 32-bit addressing used in Im2col kernels.

The simplest proposed fix is limiting applicability of the solvers that use Im2col/col2im machinery.

If rocblas is able to handle >4GB tensors, then it may worth implementing 64-bit Im2col/col2im and use it when necessary.

atamazov avatar Jul 05 '22 19:07 atamazov

@Slimakanzer Can we verify if the following W/A works?

The simplest proposed fix is limiting applicability of the solvers that use Im2col/col2im machinery.

junliume avatar Jul 20 '22 18:07 junliume

@Slimakanzer Can we verify if the following W/A works?

The simplest proposed fix is limiting applicability of the solvers that use Im2col/col2im machinery.

@junliume I can verify and I will create a fix PR.

shurale-nkn avatar Jul 23 '22 14:07 shurale-nkn

@junliume The full-blown fix which @shurale-nkn is developing is the best solution for the problem.

atamazov avatar Jul 25 '22 14:07 atamazov

@DrizztDoUrden and @atamazov : we are having some issues with Fp32 OpenCL Debug gfx1030 stage

[2022-07-25T20:02:37.066Z] MIOpen(OpenCL): Error [BuildProgram] Build log: /tmp/comgr-45f8cd/input/CompileSource:106:5: error: 'USE_LARGE_BUFFER_INDEX' is not defined, evaluates to 0 [-Werror,-Wundef]

[2022-07-25T20:02:37.066Z] #if USE_LARGE_BUFFER_INDEX

[2022-07-25T20:02:37.066Z]     ^

[2022-07-25T20:02:37.066Z] 1 error generated.

[2022-07-25T20:02:37.066Z] Error: Failed to compile source (from CL or HIP source to LLVM IR).

junliume avatar Jul 25 '22 20:07 junliume

Should be #ifdef instead of #if?

dmikushin avatar Jul 25 '22 20:07 dmikushin

@junliume @dmikushin #1644 is draft (that's why I said "is developing") and not expected to work 😄

atamazov avatar Jul 25 '22 20:07 atamazov

It looks like the bug fix PR #1644 and extended PR #1657 has been merged, let me close this issue.

aska-0096 avatar Aug 04 '22 02:08 aska-0096

@junliume @aska-0096 The regression test is not added in #1644. Please re-open this with testing, urgency_normal.

atamazov avatar Aug 04 '22 14:08 atamazov

@atamazov Is the regression test completed so we can close this ticket? Thanks!

ppanchad-amd avatar Apr 16 '24 18:04 ppanchad-amd

@ppanchad-amd I do not think so because otherwise this ticket would contain a link to the PR implementing the regression test.

atamazov avatar Apr 17 '24 12:04 atamazov

Run the program successfully with ROCm6.2.2, no core dump. @atamazov are we able to close the ticket: rocm22@rocm22:/opt/rocm-6.2.2/bin$ ./MIOpenDriver convfp16 -W 1760 -H 1760 -c 128 -n 1 -k 16 -x 3 -y 3 -p 0 -q 0 -u 1 -v 1 -F 1 -t 1 -V 0 MIOpenDriver convfp16 -W 1760 -H 1760 -c 128 -n 1 -k 16 -x 3 -y 3 -p 0 -q 0 -u 1 -v 1 -F 1 -t 1 -V 0 PRNG seed: 12345678 MIOpen Forward Conv. Algorithm: 0, Solution: 91/GemmFwdRest GPU Kernel Time Forward Conv. Elapsed: 23.399267 ms (average) stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs stats: fwd-conv3x3u1, 1, 128, 1758, 1758, 3, 3, 16, 113930551296, 793022464, 98898048, 4869, 38, 23.399267

huanrwan-amd avatar Oct 09 '24 21:10 huanrwan-amd