MIOpen
Batch norm failure on gfx908
I encountered bn configs failing on gfx908 while using the develop branch. Here are sample configs that failed (these are not all):

./bin/MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 1 -b 0 -s 1 -r 1
./bin/MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 0 -b 1 -r 1
./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 1 -b 0 -s 1 -r 1
./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 0 -b 1 -r 1
./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 28 -W 28 -m 1 --forw 0 -b 1 -r 1
Here are sample configs that ran fine:

./bin/MIOpenDriver bnormfp16 -n 256 -c 1024 -H 14 -W 14 -m 1 --forw 2 -b 0 -s 1
./bin/MIOpenDriver bnormfp16 -n 256 -c 2048 -H 7 -W 7 -m 1 --forw 2 -b 0 -s 1
./bin/MIOpenDriver bnormfp16 -n 256 -c 256 -H 56 -W 56 -m 1 --forw 2 -b 0 -s 1
./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 28 -W 28 -m 1 --forw 2 -b 0 -s 1
Attaching a sample log.
@alexandraBara Which docker did you use on the gfx908 machine?
All of the failing tests above pass on my gfx908 machine with this docker (superbench/main:rocm5.0.1-pytorch1.9.0). Is this issue docker-specific?
root@miopen908-1:/testx/MIOpen/build# ./bin/MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 1 -b 0 -s 1 -r 1
MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 1 -b 0 -s 1 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification passed on running variance
Forward train batch norm verification passed on saved mean
Forward train batch norm verification passed on saved inverse variance.
Forward batch norm verification passed on output
Forward Batch Norm Verifies on CPU and GPU.
root@miopen908-1:/testx/MIOpen/build# ./bin/MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 0 -b 1 -r 1
MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 0 -b 1 -r 1
Backwards prop batch norm verification passed on dx.
Backwards prop batch norm verification passed on dscale.
Backwards prop batch norm verification passed on dbias.
Backwards Prop Batch Norm Verifies on CPU and GPU.
root@miopen908-1:/testx/MIOpen/build# ./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 1 -b 0 -s 1 -r 1
MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 1 -b 0 -s 1 -r 1
Forward train batch norm verification passed on running mean.
Forward train batch norm verification passed on running variance
Forward train batch norm verification passed on saved mean
Forward train batch norm verification passed on saved inverse variance.
Forward batch norm verification passed on output
Forward Batch Norm Verifies on CPU and GPU.
root@miopen908-1:/testx/MIOpen/build# ./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 0 -b 1 -r 1
MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 0 -b 1 -r 1
Backwards prop batch norm verification passed on dx.
Backwards prop batch norm verification passed on dscale.
Backwards prop batch norm verification passed on dbias.
Backwards Prop Batch Norm Verifies on CPU and GPU.
root@miopen908-1:/testx/MIOpen/build# ./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 28 -W 28 -m 1 --forw 0 -b 1 -r 1
MIOpenDriver bnormfp16 -n 256 -c 512 -H 28 -W 28 -m 1 --forw 0 -b 1 -r 1
Backwards prop batch norm verification passed on dx.
Backwards prop batch norm verification passed on dscale.
Backwards prop batch norm verification passed on dbias.
Backwards Prop Batch Norm Verifies on CPU and GPU.
@muralinr please see the file above for extra logging. It seems like it might be docker-specific, since @junliume ran a different docker on the same machine and it passed for him. Let me know if you see any indication in the logs above as to what might be causing it.
It seems that the -mcpu argument comes up empty when running BN with the HIP nogpu flag enabled, and this causes the problem:
MIOpen(HIP): Info2 [LoadBinary] Loading binary for: MIOpenBatchNormFwdTrainSpatial.cl.o; args: -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=0 -DMIOPEN_USE_FPMIX=1 -DMIO_SAVE_MEAN_VARIANCE=1 -DMIO_RUNNING_RESULT=1 -DMIO_BN_VARIANT=1 -DMIO_BN_LDS_SIZE=1024 -DMIO_BN_LDSGCN_SIZE=16 -DMIO_BN_N=256 -DMIO_BN_GRP0=1024 -DMIO_BN_GRP1=1 -DMIO_BN_GRP2=1 -DMIO_BN_GFX1030=0 -DMIO_LAYOUT_NHWC=0 -DMIO_BN_C=512 -DMIO_BN_HW=784 -DMIO_BN_NHW=200704 -DMIO_BN_CHW=401408 -DMIO_BN_NCHW=102760448 -mcpu=
I will close this issue since MIOpen does not use the nogpu use case; we will fix it in our internal tool.
Reopening the issue, since we are still seeing it and we found it is not related to the nogpu backend. @daniellowell investigated the failure and was able to reproduce it. @daniellowell, can you please post your findings here?
All issues I've discovered are local to the MIO_BN_VARIANT=1 kernel.
From the error output:
root@hpe-sjc2-05:~/dMIOpen/src/kernels# /opt/rocm/bin/clang-ocl -DMIOPEN_USE_FP16=1 -DMIOPEN_USE_FP32=0 -DMIOPEN_USE_FPMIX=0 -DMIO_BN_USESAVED=1 -DMIO_BN_N=256 -DMIO_BN_C=64 -DMIO_BN_HW=3136 -DMIO_BN_NHW=802816 -DMIO_BN_CHW=200704 -DMIO_BN_NCHW=51380224 -DMIO_BN_LDS_SIZE=1024 -DMIO_BN_LDSGCN_SIZE=16 -DMIO_BN_VARIANT=1 -DMIO_WAVESIZE=64 -DMIO_BN_GRP0=1024 -DMIO_BN_GRP1=1 -DMIO_BN_GRP2=1 -DMIO_LAYOUT_NHWC=0 -DMIO_BN_GFX103X=0 -mcpu=gfx908 MIOpenBatchNormBwdSpatial.cl -o /tmp/miopen-MIOpenBatchNormBwdSpatial.cl-6a0d-dde4-3b3b-860f/MIOpenBatchNormBwdSpatial.cl.o
MIOpenBatchNormBwdSpatial.cl:463:14: error: call to 'mad' is ambiguous
ds = mad(xhat4.x, (_FLOAT_PREC)dyRead4.x, ds);
^~~
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8488:21: note: candidate function
float __ovld __cnfn mad(float a, float b, float c);
^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8495:22: note: candidate function
double __ovld __cnfn mad(double a, double b, double c);
^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8503:20: note: candidate function
half __ovld __cnfn mad(half a, half b, half c);
^
MIOpenBatchNormBwdSpatial.cl:464:14: error: call to 'mad' is ambiguous
ds = mad(xhat4.y, (_FLOAT_PREC)dyRead4.y, ds);
^~~
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8488:21: note: candidate function
float __ovld __cnfn mad(float a, float b, float c);
^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8495:22: note: candidate function
double __ovld __cnfn mad(double a, double b, double c);
^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8503:20: note: candidate function
half __ovld __cnfn mad(half a, half b, half c);
^
MIOpenBatchNormBwdSpatial.cl:465:14: error: call to 'mad' is ambiguous
ds = mad(xhat4.z, (_FLOAT_PREC)dyRead4.z, ds);
^~~
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8488:21: note: candidate function
float __ovld __cnfn mad(float a, float b, float c);
^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8495:22: note: candidate function
double __ovld __cnfn mad(double a, double b, double c);
^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8503:20: note: candidate function
half __ovld __cnfn mad(half a, half b, half c);
^
MIOpenBatchNormBwdSpatial.cl:466:14: error: call to 'mad' is ambiguous
ds = mad(xhat4.w, (_FLOAT_PREC)dyRead4.w, ds);
^~~
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8488:21: note: candidate function
float __ovld __cnfn mad(float a, float b, float c);
^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8495:22: note: candidate function
double __ovld __cnfn mad(double a, double b, double c);
^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8503:20: note: candidate function
half __ovld __cnfn mad(half a, half b, half c);
^
MIOpenBatchNormBwdSpatial.cl:526:30: error: use of undeclared identifier 'temp_db'
*(dbias + grpid) = (temp_db >= (float)MAX_VAL) ? MAX_VAL : db;
^
MIOpenBatchNormBwdSpatial.cl:527:30: error: use of undeclared identifier 'temp_ds'
*(dscale + grpid) = (temp_ds >= (float)MAX_VAL || temp_ds < 0) ? MAX_VAL : ds;
^
MIOpenBatchNormBwdSpatial.cl:527:59: error: use of undeclared identifier 'temp_ds'
*(dscale + grpid) = (temp_ds >= (float)MAX_VAL || temp_ds < 0) ? MAX_VAL : ds;
^
MIOpenBatchNormBwdSpatial.cl:558:64: error: use of undeclared identifier 'temp_db'
float temp_tmp1 = mad((float)NHW, (float)dyvalue, -temp_db);
^
MIOpenBatchNormBwdSpatial.cl:559:48: error: use of undeclared identifier 'temp_ds'
float temp_tmp2 = -((float)xhat) * temp_ds;
^
9 errors generated.
There seem to be two issues that I can see with the failing compile line. The first is in

ds = mad(xhat4.x, (_FLOAT_PREC)dyRead4.x, ds);

One of the three arguments is not the same type as the others, so the compiler does not know which overload to call: all double, all float, or all half. This might have been hiding there for a while; I'm not sure, as this particular kernel variant 1 has been heavily refactored and the macros might need to be tweaked.
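For illustration, a minimal sketch of one possible fix (my assumption, not an actual patch): cast all three arguments to the same type so that overload resolution is unambiguous, reusing the kernel's own _FLOAT_PREC macro:

/* Sketch: give every argument the same type so clang can pick a
   single mad() overload; _FLOAT_PREC is whatever precision type
   this build configuration defines. */
ds = mad((_FLOAT_PREC)xhat4.x, (_FLOAT_PREC)dyRead4.x, (_FLOAT_PREC)ds);
ds = mad((_FLOAT_PREC)xhat4.y, (_FLOAT_PREC)dyRead4.y, (_FLOAT_PREC)ds);
ds = mad((_FLOAT_PREC)xhat4.z, (_FLOAT_PREC)dyRead4.z, (_FLOAT_PREC)ds);
ds = mad((_FLOAT_PREC)xhat4.w, (_FLOAT_PREC)dyRead4.w, (_FLOAT_PREC)ds);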
The second issue is that the variables temp_ds and temp_db are not declared anywhere:

*(dbias + grpid) = (temp_db >= (float)MAX_VAL) ? (_FLOAT_PREC)MAX_VAL : (_FLOAT_PREC)db;
*(dscale + grpid) =
    (temp_ds >= (float)MAX_VAL || temp_ds < 0) ? (_FLOAT_PREC)MAX_VAL : (_FLOAT_PREC)ds;
I tracked git blame back to the refactor; the original kernel actually does have a declaration:
float temp_ds = (float)ds;
regLDSreduce(&temp_ds, lcl_data, lid, (float)1.0);
ds = (_FLOAT)temp_ds;
barrier(CLK_LOCAL_MEM_FENCE);
This is probably for either @muralinr or @ce1adon, since they were the last modifiers.
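For reference, a hypothetical sketch of what restoring the dropped declarations could look like, modeled on the original pattern above; the exact placement within the refactored kernel is an assumption on my part, and regLDSreduce, lcl_data, and lid are taken from the original kernel:

/* Sketch: re-declare the float temporaries before the guarded
   stores, reducing each across the workgroup as the pre-refactor
   kernel did. */
float temp_ds = (float)ds;
regLDSreduce(&temp_ds, lcl_data, lid, (float)1.0);
ds = (_FLOAT)temp_ds;
barrier(CLK_LOCAL_MEM_FENCE);

float temp_db = (float)db;
regLDSreduce(&temp_db, lcl_data, lid, (float)1.0);
db = (_FLOAT)temp_db;
barrier(CLK_LOCAL_MEM_FENCE);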
To reproduce:
docker pull rocm/miopen-private:alex_datatype_fix
Then launch the docker container:
docker run -it --network host -v /home/miopenpdb:/data --device=/dev/kfd --device=/dev/dri --group-add video rocm/miopen-private:alex_datatype_fix
Go to:
cd /root/dFin/_hip
Create a file fin_input.json with these contents:
[
{
"steps": [
"miopen_find_compile"
],
"arch": "gfx908:sram-ecc+:xnack-",
"num_cu": 120,
"config_tuna_id": 28,
"direction": 4,
"dynamic_only": false,
"config" : {
"alpha":1,
"back":0,
"batchsize":256,
"beta":0,
"bias":0,
"cmd":"bnormfp16",
"forw":1,
"id":28,
"in_channels":64,
"in_d":1,
"in_h":56,
"in_layout":"NCHW",
"in_w":56,
"mode":1,
"out_layout":"NCHW",
"run":1,
"save":1,
"verify":1
}
}
]
Build:
make -j
Run the command:
MIOPEN_LOG_LEVEL=6 ./src/fin -i fin_input.json -o my_output_file.json
Enjoy the errors!
To reproduce minimally:
/opt/rocm/bin/clang-ocl -DMIOPEN_USE_FP16=1 -DMIOPEN_USE_FP32=0 -DMIOPEN_USE_FPMIX=0 -DMIO_BN_USESAVED=1 -DMIO_BN_N=256 -DMIO_BN_C=64 -DMIO_BN_HW=3136 -DMIO_BN_NHW=802816 -DMIO_BN_CHW=200704 -DMIO_BN_NCHW=51380224 -DMIO_BN_LDS_SIZE=1024 -DMIO_BN_LDSGCN_SIZE=16 -DMIO_BN_VARIANT=1 -DMIO_WAVESIZE=64 -DMIO_BN_GRP0=1024 -DMIO_BN_GRP1=1 -DMIO_BN_GRP2=1 -DMIO_LAYOUT_NHWC=0 -DMIO_BN_GFX103X=0 -mcpu=gfx908 -Wno-everything -mcode-object-version=4 MIOpenBatchNormBwdSpatial.cl -o /tmp/miopen-MIOpenBatchNormBwdSpatial.cl-31a6-867b-afc8-20da/MIOpenBatchNormBwdSpatial.cl.o
Run that command from the src/kernels folder.
@shurale-nkn could you take a look at this issue?
This kernel uses mixed-precision computation: float for the accumulator and half for the data. It is therefore expected to use V_MAD_MIX_F32, but there is no mixed-precision mad in opencl-c.h.
The other OpenCL kernels prefer to ignore mixed precision and use a single data type (half) for both data and accumulator.
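To make the trade-off concrete, here is a sketch (my illustration, not the kernel's actual code, and assuming xhat4 and dyRead4 are half4 vectors) of the two unambiguous alternatives:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void mixed_mad_demo(__global const half4* xhat,
                             __global const half4* dy,
                             __global float* out)
{
    const size_t i = get_global_id(0);
    const half4 xhat4 = xhat[i];
    const half4 dyRead4 = dy[i];

    /* Option A: float accumulator, promote the half data explicitly;
       mad(float, float, float) is unambiguous, and the backend may
       still fuse it into mixed-precision instructions. */
    float ds = 0.0f;
    ds = mad((float)xhat4.x, (float)dyRead4.x, ds);

    /* Option B: what the other OpenCL kernels do, half throughout;
       mad(half, half, half) is unambiguous but accumulates in half. */
    half ds_h = (half)0.0f;
    ds_h = mad(xhat4.x, dyRead4.x, ds_h);

    out[i] = ds + (float)ds_h;
}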
First, about the conditions under which the error can occur:
- This is the NoGPU runtime.
- This is a system with a GPU; otherwise MIOpenDriver would die earlier, before kernel selection.
- MIOpen is compiled without -DBUILD_DEV=On.
- With -DBUILD_DEV=On, the OpenCL kernel is compiled with -Weverything. In that case Clang checks the correctness of the arguments and returns an error if -mcpu= is not defined: clang-13: error: joined argument expects additional value: '-mcpu=' [-Werror,-Wunused-command-line-argument]
- Without -DBUILD_DEV=On, the OpenCL kernel is compiled with -Wno-everything and Clang ignores the absence of the -mcpu= argument.
I don't know what behavior the library should have when it is using a non-GPU runtime on a system with a GPU. We don't have such tests on Jenkins. @JehandadKhan, can you help?
@JehandadKhan could you take a look, since this is the NoGPU runtime?
MIOpenBatchNormBwdSpatial.cl:526:30: error: use of undeclared identifier 'temp_db'
    *(dbias + grpid) = (temp_db >= (float)MAX_VAL) ? MAX_VAL : db;
                        ^
MIOpenBatchNormBwdSpatial.cl:527:30: error: use of undeclared identifier 'temp_ds'
    *(dscale + grpid) = (temp_ds >= (float)MAX_VAL || temp_ds < 0) ? MAX_VAL : ds;
                         ^
MIOpenBatchNormBwdSpatial.cl:527:59: error: use of undeclared identifier 'temp_ds'
    *(dscale + grpid) = (temp_ds >= (float)MAX_VAL || temp_ds < 0) ? MAX_VAL : ds;
                                                      ^
MIOpenBatchNormBwdSpatial.cl:558:64: error: use of undeclared identifier 'temp_db'
    float temp_tmp1 = mad((float)NHW, (float)dyvalue, -temp_db);
                                                       ^
MIOpenBatchNormBwdSpatial.cl:559:48: error: use of undeclared identifier 'temp_ds'
    float temp_tmp2 = -((float)xhat) * temp_ds;
                                       ^
PR #1603 should resolve these errors, but not the errors from the first message; those are NoGPU-related errors:
error: cannot compile inline asm
:3:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_shr:1 bound_ctrl:0
^
error: cannot compile inline asm
:5:20: error: not a valid operand.
v_add_f32 v5 v5 v5 row_shr:2 bound_ctrl:0
^
error: cannot compile inline asm
:6:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_shr:2 bound_ctrl:0
^
error: cannot compile inline asm
:8:20: error: not a valid operand.
v_add_f32 v5 v5 v5 row_shr:4 bank_mask:0xe
^
error: cannot compile inline asm
:9:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_shr:4 bank_mask:0xe
^
error: cannot compile inline asm
:11:20: error: not a valid operand.
v_add_f32 v5 v5 v5 row_shr:8 bank_mask:0xc
^
error: cannot compile inline asm
:12:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_shr:8 bank_mask:0xc
@junliume
- Now we have a W/A in [WORKAROUND] fix IsApplicable in BnBwdTrainingSpatial solver #1603, so the urgency may be lowered to normal.
- However, we need someone who can properly resolve the problem described above. The description is quite good and gives a brief picture of what needs to be done.
- I second @daniellowell and recommend assigning @muralinr, if he has time.
The current batch norm code will be phased out soon, so I am not sure it's worth the effort at this point in time if we have the W/A fix. I will let @junliume comment on it.
@junliume Is this ticket still relevant? Thanks!