naive_conv_nonpacked_fwd_nchw_half_double_half in KDB cache breaks after #2863
[Observations}: Reproducer:
./bin/MIOpenDriver convfp16 -n 128 -c 3 -H 224 -W 224 -k 96 -y 7 -x 7 -p 0 -q 0 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
and
MIOpen(HIP): Info2 [PrepareInvoker] Preparing kernel: naive_conv_nonpacked_fwd_nchw_half_double_half
MIOpen(HIP): Info2 [run] kernel_name = naive_conv_nonpacked_fwd_nchw_half_double_half, global_work_dim = { 3145728, 1, 1 }, local_work_dim = { 256, 1, 1 }
GPU core dump failed
:0:rocdevice.cpp :2958: 129010028118 us: [pid:163363 tid:0x7f35d0dff640] Callback: Queue 0x7ef5c8a00000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
Aborted (core dumped)
However, if we delete the system KDB, this workload can run through
[Analysis] Something in #2863 has caused the incompatibility
- @bghimireamd mentioned likely https://github.com/ROCm/MIOpen/blob/develop/src/kernels/gpu_reference_kernel/naive_conv.cpp#L147-L148
- maybe it conflicts with https://github.com/ROCm/MIOpen/blob/9064d096005af004c64d76a8c4c326a6cfd01c0f/src/kernels/gpu_reference_kernel/fp8_naive_conv.cpp#L237-L239
- However, re-generating KDB using DB-sync is not helping in this case
FYI: @JehandadKhan @atamazov
It should be changes in this file which are causing this problem: https://github.com/ROCm/MIOpen/blob/9064d096005af004c64d76a8c4c326a6cfd01c0f/src/solver/conv_direct_naive_conv_fwd.cpp
and similar issues might be in other directions too
By removing these lines: https://github.com/ROCm/MIOpen/blob/9064d096005af004c64d76a8c4c326a6cfd01c0f/src/solver/conv_direct_naive_conv.cpp#L441-L442 It starts to work properly again.
So the KDB cache does not accept these parameters. However, why the refreshed KDB still does not accept them? @cderb
@atamazov FYI, this issue and the other GPU Target embedded in compiler issues are related. We will branch soon and need to regenerate quite a few KDBs with all fixes merged in develop branch first.
@junliume I recommend reverting #2863, if we have this possibility. Adding alpha/beta to the existing convolution primitive is a nontrivial thing and should be carefully analyzed prior implementing.
@junliume
[Analysis] Something in #2863 has caused the incompatibility
- @bghimireamd mentioned likely https://github.com/ROCm/MIOpen/blob/develop/src/kernels/gpu_reference_kernel/naive_conv.cpp#L147-L148
Of course! If a PR changes the number of input arguments, or their types, then KDB must be regenerated. And changing the number of input arguments is a substantial change (see https://github.com/ROCm/MIOpen/pull/2863/files#r1641574996).
[Note] Unfortunately, I see that some deeper redesign is required in order to make this bilinear stuff (alpha/beta) working correctly, see https://github.com/ROCm/MIOpen/pull/2863#discussion_r1641574235.