
[Find Db] 3d convolutions with NCHW layout

Open wenobug opened this issue 4 years ago • 7 comments

The following records in gfx906_64.HIP.fdb.txt are 3D convolutions, but they use the NCHW layout. [image]

I converted the config into an MIOpenDriver command and got the following error:

what():  /xxx/code/miopen-3x3-1x1-verfiy/driver/tensor_driver.hpp:124: unmatched layout and dimension size

wenobug avatar Dec 14 '21 03:12 wenobug

@zpwenjh Is it trivial to reproduce?

atamazov avatar Dec 15 '21 22:12 atamazov

This may affect performance of 3D conv, but who knows which other side effects the defect has?

atamazov avatar Dec 15 '21 22:12 atamazov

@averinevg Please have a look when you have time. This doesn't seem urgent (at least until we know the root cause).

atamazov avatar Dec 17 '21 22:12 atamazov

It seems the system fdb is corrupt: the padding field should look like AxBxC, and NCDHW should be used instead of NCHW.

  1. Reproduce error with MIOpenDriver (decode 1024-14-14-1x1x1-512... into driver options)
  2. Make sure the current develop branch encodes the 3D conv problem config into valid db keys:
    • Export MIOPEN_FIND_MODE=1
    • Run the same MIOpenDriver command; there should be no failures.
    • Look into the user find-db (ufdb.txt) and check that the keys are correct. Fix the library if necessary.
  3. Reassign this to @JehandadKhan and @cderb

How to decode. 3D conv command line example (Forward):

./bin/MIOpenDriver conv -n n -c c --in_d iD -H H -W W -k k --fil_d fD -y Y -x X --pad_d pD -p P -q Q \
--conv_stride_d sD -u U -v V --dilation_d dD -l L -j J --spatial_dim 3 \
--in_layout NCHW --out_layout NCHW --fil_layout NCHW \
-F 1 ...

Db Key format:

c-iD-H-W-fDxYxX-k-oD-oH-oW-n-pDxPxQ-sDxVxU-dDxLxJ-b-NCHW-FP32-F

NCHW seems questionable; maybe it should be NCDHW.
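
As an illustration of the decoding step, here is a minimal Python sketch that splits a key in the format above and builds the matching forward command. The helper names (`decode_3d_key`, `key_to_driver_args`) and the sample key are hypothetical. The field order and flags are copied from the example above, only FP32 forward keys are handled, and the sketch assumes no field is negative, since the key is split on `-`.

```python
from typing import Dict, List


def _triple(field: str, name: str) -> List[str]:
    """Split an AxBxC field into its three components."""
    parts = field.split("x")
    if len(parts) != 3:
        raise ValueError(f"{name} should look like AxBxC, got {field!r}")
    return parts


def decode_3d_key(key: str) -> Dict[str, str]:
    """Decode c-iD-H-W-fDxYxX-k-oD-oH-oW-n-pDxPxQ-sDxVxU-dDxLxJ-b-layout-type-dir."""
    fields = key.split("-")
    if len(fields) != 17:
        raise ValueError(f"expected 17 '-'-separated fields, got {len(fields)}")
    (c, in_d, in_h, in_w, fil, k, out_d, out_h, out_w, n,
     pad, stride, dil, bias, layout, dtype, direction) = fields
    fil_d, y, x = _triple(fil, "filter")
    pad_d, p, q = _triple(pad, "padding")
    # Component order follows the key format as written above: sDxVxU.
    stride_d, v, u = _triple(stride, "stride")
    dil_d, dil_l, dil_j = _triple(dil, "dilation")
    return {
        "n": n, "c": c, "in_d": in_d, "H": in_h, "W": in_w, "k": k,
        "fil_d": fil_d, "y": y, "x": x,
        "pad_d": pad_d, "p": p, "q": q,
        "stride_d": stride_d, "u": u, "v": v,
        "dil_d": dil_d, "l": dil_l, "j": dil_j,
        "out_d": out_d, "out_h": out_h, "out_w": out_w,
        "bias": bias, "layout": layout, "dtype": dtype, "direction": direction,
    }


def key_to_driver_args(key: str) -> List[str]:
    """Build the MIOpenDriver argument list for an FP32 forward 3D key."""
    f = decode_3d_key(key)
    if f["dtype"] != "FP32" or f["direction"] != "F":
        raise NotImplementedError("this sketch only covers FP32 forward keys")
    return [
        "./bin/MIOpenDriver", "conv",
        "-n", f["n"], "-c", f["c"],
        "--in_d", f["in_d"], "-H", f["H"], "-W", f["W"],
        "-k", f["k"], "--fil_d", f["fil_d"], "-y", f["y"], "-x", f["x"],
        "--pad_d", f["pad_d"], "-p", f["p"], "-q", f["q"],
        "--conv_stride_d", f["stride_d"], "-u", f["u"], "-v", f["v"],
        "--dilation_d", f["dil_d"], "-l", f["l"], "-j", f["j"],
        "--spatial_dim", "3",
        # The layout token is passed through unchanged, so an NCHW key
        # yields the same NCHW command that triggers the reported error.
        "--in_layout", f["layout"], "--out_layout", f["layout"],
        "--fil_layout", f["layout"],
        "-F", "1",
    ]


# Hypothetical key, made up for illustration only.
SAMPLE_KEY = "64-16-28-28-3x3x3-128-16-28-28-8-1x1x1-1x1x1-1x1x1-0-NCDHW-FP32-F"
print(" ".join(key_to_driver_args(SAMPLE_KEY)))
```

A key whose padding field is not of the form AxBxC makes `decode_3d_key` raise `ValueError`, which is a quick way to spot malformed records before running the driver.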

atamazov avatar Dec 20 '21 18:12 atamazov

@averinevg Did you have a chance to look at this issue? Thanks!

ppanchad-amd avatar Apr 15 '24 20:04 ppanchad-amd

@atamazov @ppanchad-amd Incorrect records are present only in the databases for gfx803, gfx900 and gfx906. Databases for newer GPUs are fine. The format of the record was changed (fixed) in https://github.com/ROCm/MIOpen/pull/343. It is likely that these records were added to the databases before this change. Now these records are useless as they are simply ignored. We can fix them or just remove them. I'm not sure that it is useful to keep such old records (and databases), since the compiler, runtime and the library have most likely changed a lot during this time.
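
To locate such records, something like the following Python sketch could be run over a database file. It is only a heuristic under stated assumptions: each record is taken to be a `key=value` line, a key is treated as 3D when some field has three `x`-separated numbers, and it is flagged when it also carries the NCHW token or mixes two- and three-component fields. The function names and sample lines are hypothetical.

```python
from typing import Iterable, Iterator, Set


def _component_counts(key: str) -> Set[int]:
    """Return the set of component counts seen in AxB / AxBxC style fields."""
    counts = set()
    for field in key.split("-"):
        parts = field.split("x")
        if len(parts) > 1 and all(p.isdigit() for p in parts):
            counts.add(len(parts))
    return counts


def is_suspect_3d_key(key: str) -> bool:
    """Heuristic: a 3D key that uses NCHW or mixes 2D and 3D fields."""
    counts = _component_counts(key)
    if 3 not in counts:
        return False  # no three-component field, so not treated as 3D
    return "NCHW" in key.split("-") or 2 in counts


def suspect_keys(lines: Iterable[str]) -> Iterator[str]:
    """Yield keys of suspect records from 'key=value' lines."""
    for line in lines:
        # Assumption: the key is everything before the first '='.
        key = line.split("=", 1)[0].strip()
        if key and is_suspect_3d_key(key):
            yield key


# Hypothetical records; the values are placeholders, not real find-db data.
SAMPLE_LINES = [
    "64-16-28-28-3x3x3-128-16-28-28-8-1x1x1-1x1x1-1x1x1-0-NCHW-FP32-F=placeholder",
    "64-16-28-28-3x3x3-128-16-28-28-8-1x1x1-1x1x1-1x1x1-0-NCDHW-FP32-F=placeholder",
    "64-28-28-3x3-128-28-28-8-1x1-1x1-1x1-0-NCHW-FP32-F=placeholder",
]

for key in suspect_keys(SAMPLE_LINES):
    print(key)
```

Passing an open file object for one of the affected databases to `suspect_keys` would list candidate records to fix or remove.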

averinevg avatar Apr 16 '24 12:04 averinevg

> Now these records are useless as they are simply ignored.

So the bug is non-fatal and impacts only the performance of 3D convolutions, especially for configs where ConvHipImplicitGemmV4R4Fwd, ConvHipImplicitGemmBwdDataV4R1 or ConvHipImplicitGemmV4R4WrW are applicable (for other configs we have only two 3D solvers, GEMM and Naive, where heuristics should work fine).

To fix the issue, find-db records for 3D conv should be regenerated for gfx906; gfx803 and gfx900 are deprecated (with limited support), so a fix is not required for them.

I recommend setting https://github.com/ROCm/MIOpen/labels/IMPACTS_FIND_DB and https://github.com/ROCm/MIOpen/labels/performance (other labels look good) and assigning this to the Tuna team.

/cc @junliume @JehandadKhan @alexandraBara @cderb @averinevg @ppanchad-amd

atamazov avatar Apr 17 '24 12:04 atamazov