MIOpen icon indicating copy to clipboard operation
MIOpen copied to clipboard

naive_conv_nonpacked_fwd_nchw_half_double_half in KDB cache breaks after #2863

Open junliume opened this issue 1 year ago • 5 comments

[Observations}: Reproducer:

./bin/MIOpenDriver convfp16 -n 128 -c 3 -H 224 -W 224 -k 96 -y 7 -x 7 -p 0 -q 0 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

and

MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: naive_conv_nonpacked_fwd_nchw_half_double_half
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_nonpacked_fwd_nchw_half_double_half, global_work_dim = { 3145728, 1, 1 }, local_work_dim = { 256, 1, 1 }
GPU core dump failed
:0:rocdevice.cpp            :2958: 129010028118 us: [pid:163363 tid:0x7f35d0dff640] Callback: Queue 0x7ef5c8a00000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
Aborted (core dumped)

However, if we delete the system KDB, this workload can run through

[Analysis] Something in #2863 has caused the incompatibility

  • @bghimireamd mentioned likely https://github.com/ROCm/MIOpen/blob/develop/src/kernels/gpu_reference_kernel/naive_conv.cpp#L147-L148
  • maybe it conflicts with https://github.com/ROCm/MIOpen/blob/9064d096005af004c64d76a8c4c326a6cfd01c0f/src/kernels/gpu_reference_kernel/fp8_naive_conv.cpp#L237-L239
  • However, re-generating KDB using DB-sync is not helping in this case

FYI: @JehandadKhan @atamazov


junliume avatar Jun 05 '24 04:06 junliume

It should be changes in this file which are causing this problem: https://github.com/ROCm/MIOpen/blob/9064d096005af004c64d76a8c4c326a6cfd01c0f/src/solver/conv_direct_naive_conv_fwd.cpp

and similar issues might be in other directions too

junliume avatar Jun 05 '24 06:06 junliume

By removing these lines: https://github.com/ROCm/MIOpen/blob/9064d096005af004c64d76a8c4c326a6cfd01c0f/src/solver/conv_direct_naive_conv.cpp#L441-L442 It starts to work properly again.

So the KDB cache does not accept these parameters. However, why the refreshed KDB still does not accept them? @cderb

junliume avatar Jun 05 '24 07:06 junliume

@atamazov FYI, this issue and the other GPU Target embedded in compiler issues are related. We will branch soon and need to regenerate quite a few KDBs with all fixes merged in develop branch first.

junliume avatar Jun 05 '24 23:06 junliume

@junliume I recommend reverting #2863, if we have this possibility. Adding alpha/beta to the existing convolution primitive is a nontrivial thing and should be carefully analyzed prior implementing.

atamazov avatar Jun 06 '24 13:06 atamazov

@junliume

[Analysis] Something in #2863 has caused the incompatibility

  • @bghimireamd mentioned likely https://github.com/ROCm/MIOpen/blob/develop/src/kernels/gpu_reference_kernel/naive_conv.cpp#L147-L148

Of course! If a PR changes the number of input arguments, or their types, then KDB must be regenerated. And changing the number of input arguments is a substantial change (see https://github.com/ROCm/MIOpen/pull/2863/files#r1641574996).

[Note] Unfortunately, I see that some deeper redesign is required in order to make this bilinear stuff (alpha/beta) working correctly, see https://github.com/ROCm/MIOpen/pull/2863#discussion_r1641574235.

atamazov avatar Jun 15 '24 23:06 atamazov