MIOpen Batch norm failure on gfx908

I encountered bn configs failing on gfx908 while using the develop branch. Here are sample configs that failed (these are not all) ./bin/MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 1 -b 0 -s 1 -r 1 ./bin/MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 0 -b 1 -r 1 ./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 1 -b 0 -s 1 -r 1 ./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 0 -b 1 -r 1 ./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 28 -W 28 -m 1 --forw 0 -b 1 -r 1

Here are sample configs that ran fine: ./bin/MIOpenDriver bnormfp16 -n 256 -c 1024 -H 14 -W 14 -m 1 --forw 2 -b 0 -s 1 ./bin/MIOpenDriver bnormfp16 -n 256 -c 2048 -H 7 -W 7 -m 1 --forw 2 -b 0 -s 1 ./bin/MIOpenDriver bnormfp16 -n 256 -c 256 -H 56 -W 56 -m 1 --forw 2 -b 0 -s 1 ./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 28 -W 28 -m 1 --forw 2 -b 0 -s 1

Attaching sample log

May 18 '22 21:05 alexandraBara

bn_fail.txt

May 18 '22 21:05 alexandraBara

@alexandraBara Which docker did you use on gfx908 machine?

May 18 '22 21:05 muralinr

All above failed tests are passed on my gfx908 machine with this docker (superbench/main:rocm5.0.1-pytorch1.9.0). Is this issue docker specific?

root@miopen908-1:/testx/MIOpen/build# ./bin/MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 1 -b 0 -s 1 -r 1 MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 1 -b 0 -s 1 -r 1 Forward train batch norm verification passed on running mean. Forward train batch norm verification passed on running variance Forward train batch norm verification passed on saved mean Forward train batch norm verification passed on saved inverse variance. Forward batch norm verification passed on output Forward Batch Norm Verifies on CPU and GPU. root@miopen908-1:/testx/MIOpen/build# ./bin/MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 0 -b 1 -r 1 MIOpenDriver bnormfp16 -n 256 -c 64 -H 112 -W 112 -m 1 --forw 0 -b 1 -r 1 Backwards prop batch norm verification passed on dx. Backwards prop batch norm verification passed on dscale. Backwards prop batch norm verification passed on dbias. Backwards Prop Batch Norm Verifies on CPU and GPU. root@miopen908-1:/testx/MIOpen/build# ./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 1 -b 0 -s 1 -r 1 MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 1 -b 0 -s 1 -r 1 Forward train batch norm verification passed on running mean. Forward train batch norm verification passed on running variance Forward train batch norm verification passed on saved mean Forward train batch norm verification passed on saved inverse variance. Forward batch norm verification passed on output Forward Batch Norm Verifies on CPU and GPU. root@miopen908-1:/testx/MIOpen/build# ./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 0 -b 1 -r 1 MIOpenDriver bnormfp16 -n 256 -c 512 -H 7 -W 7 -m 1 --forw 0 -b 1 -r 1 Backwards prop batch norm verification passed on dx. Backwards prop batch norm verification passed on dscale. Backwards prop batch norm verification passed on dbias. Backwards Prop Batch Norm Verifies on CPU and GPU. root@miopen908-1:/testx/MIOpen/build# ./bin/MIOpenDriver bnormfp16 -n 256 -c 512 -H 28 -W 28 -m 1 --forw 0 -b 1 -r 1 MIOpenDriver bnormfp16 -n 256 -c 512 -H 28 -W 28 -m 1 --forw 0 -b 1 -r 1 Backwards prop batch norm verification passed on dx. Backwards prop batch norm verification passed on dscale. Backwards prop batch norm verification passed on dbias. Backwards Prop Batch Norm Verifies on CPU and GPU.

May 18 '22 22:05 muralinr

bn_fail_extralog.txt

May 19 '22 15:05 alexandraBara

@muralinr please see above file for extra logging. It seems like it might be docker specific since @junliume ran a different docker on the same machine and it passed for him. Let me know if you see any indication from the above logs as to what might be causing it.

May 19 '22 15:05 alexandraBara

It seems like the -mcpu argument comes up empty when running BN with the hip nogpu flag enabled and this causes the problem :

MIOpen(HIP): Info2 [LoadBinary] Loading binary for: MIOpenBatchNormFwdTrainSpatial.cl.o; args: -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=0 -DMIOPEN_USE_FPMIX=1 -DMIO_SAVE_MEAN_VARIANCE=1 -DMIO_RUNNING_RESULT=1 -DMIO_BN_VARIANT=1 -DMIO_BN_LDS_SIZE=1024 -DMIO_BN_LDSGCN_SIZE=16 -DMIO_BN_N=256 -DMIO_BN_GRP0=1024 -DMIO_BN_GRP1=1 -DMIO_BN_GRP2=1 -DMIO_BN_GFX1030=0 -DMIO_LAYOUT_NHWC=0 -DMIO_BN_C=512 -DMIO_BN_HW=784 -DMIO_BN_NHW=200704 -DMIO_BN_CHW=401408 -DMIO_BN_NCHW=102760448 -mcpu=

May 19 '22 15:05 alexandraBara

I will close this issue since MIOpen is not using the nogpu use case and will fix it in our internal tool.

May 19 '22 15:05 alexandraBara

Reopening the issue, since we can are still seeing this issue and we found it is not related to using the nogpu backend. @daniellowell investigated the failure and was able to reproduce it. @daniellowell can you please post your findings here.

Jun 10 '22 18:06 alexandraBara

All issues I've discovered are local to MIO_BN_VARIANT=1 kernel.

From the error output:

root@hpe-sjc2-05:~/dMIOpen/src/kernels# /opt/rocm/bin/clang-ocl -DMIOPEN_USE_FP16=1 -DMIOPEN_USE_FP32=0 -DMIOPEN_USE_FPMIX=0 -DMIO_BN_USESAVED=1 -DMIO_BN_N=256 -DMIO_BN_C=64 -DMIO_BN_HW=3136 -DMIO_BN_NHW=802816 -DMIO_BN_CHW=200704 -DMIO_BN_NCHW=51380224 -DMIO_BN_LDS_SIZE=1024 -DMIO_BN_LDSGCN_SIZE=16 -DMIO_BN_VARIANT=1 -DMIO_WAVESIZE=64 -DMIO_BN_GRP0=1024 -DMIO_BN_GRP1=1 -DMIO_BN_GRP2=1 -DMIO_LAYOUT_NHWC=0 -DMIO_BN_GFX103X=0 -mcpu=gfx908 MIOpenBatchNormBwdSpatial.cl -o /tmp/miopen-MIOpenBatchNormBwdSpatial.cl-6a0d-dde4-3b3b-860f/MIOpenBatchNormBwdSpatial.cl.o
MIOpenBatchNormBwdSpatial.cl:463:14: error: call to 'mad' is ambiguous
        ds = mad(xhat4.x, (_FLOAT_PREC)dyRead4.x, ds);
             ^~~
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8488:21: note: candidate function
float __ovld __cnfn mad(float a, float b, float c);
                    ^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8495:22: note: candidate function
double __ovld __cnfn mad(double a, double b, double c);
                     ^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8503:20: note: candidate function
half __ovld __cnfn mad(half a, half b, half c);
                   ^
MIOpenBatchNormBwdSpatial.cl:464:14: error: call to 'mad' is ambiguous
        ds = mad(xhat4.y, (_FLOAT_PREC)dyRead4.y, ds);
             ^~~
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8488:21: note: candidate function
float __ovld __cnfn mad(float a, float b, float c);
                    ^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8495:22: note: candidate function
double __ovld __cnfn mad(double a, double b, double c);
                     ^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8503:20: note: candidate function
half __ovld __cnfn mad(half a, half b, half c);
                   ^
MIOpenBatchNormBwdSpatial.cl:465:14: error: call to 'mad' is ambiguous
        ds = mad(xhat4.z, (_FLOAT_PREC)dyRead4.z, ds);
             ^~~
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8488:21: note: candidate function
float __ovld __cnfn mad(float a, float b, float c);
                    ^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8495:22: note: candidate function
double __ovld __cnfn mad(double a, double b, double c);
                     ^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8503:20: note: candidate function
half __ovld __cnfn mad(half a, half b, half c);
                   ^
MIOpenBatchNormBwdSpatial.cl:466:14: error: call to 'mad' is ambiguous
        ds = mad(xhat4.w, (_FLOAT_PREC)dyRead4.w, ds);
             ^~~
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8488:21: note: candidate function
float __ovld __cnfn mad(float a, float b, float c);
                    ^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8495:22: note: candidate function
double __ovld __cnfn mad(double a, double b, double c);
                     ^
/opt/rocm-5.1.0/llvm/lib/clang/14.0.0/include/opencl-c.h:8503:20: note: candidate function
half __ovld __cnfn mad(half a, half b, half c);
                   ^
MIOpenBatchNormBwdSpatial.cl:526:30: error: use of undeclared identifier 'temp_db'
        *(dbias + grpid)  = (temp_db >= (float)MAX_VAL) ? MAX_VAL : db;
                             ^
MIOpenBatchNormBwdSpatial.cl:527:30: error: use of undeclared identifier 'temp_ds'
        *(dscale + grpid) = (temp_ds >= (float)MAX_VAL || temp_ds < 0) ? MAX_VAL : ds;
                             ^
MIOpenBatchNormBwdSpatial.cl:527:59: error: use of undeclared identifier 'temp_ds'
        *(dscale + grpid) = (temp_ds >= (float)MAX_VAL || temp_ds < 0) ? MAX_VAL : ds;
                                                          ^
MIOpenBatchNormBwdSpatial.cl:558:64: error: use of undeclared identifier 'temp_db'
            float temp_tmp1 = mad((float)NHW, (float)dyvalue, -temp_db);
                                                               ^
MIOpenBatchNormBwdSpatial.cl:559:48: error: use of undeclared identifier 'temp_ds'
            float temp_tmp2 = -((float)xhat) * temp_ds;
                                               ^
9 errors generated.

Seems to be 2 issues that I can see with the failing compile line. First is that

ds = mad(xhat4.x, (_FLOAT_PREC)dyRead4.x, ds);

One of the above variables is not the same type as the others. Which means the compiler does not know which function to call, all double, all float, or all half. Might have been hiding there for a while. Not sure really, as this particular kernel variant 1 has been heavily refactored and the macros might need to be tweaked.

The second issue is that the variable temp_ds and temp_db are not declared anywhere:

*(dbias + grpid) = (temp_db >= (float)MAX_VAL) ? (_FLOAT_PREC)MAX_VAL : (_FLOAT_PREC)db;
*(dscale + grpid) =
(temp_ds >= (float)MAX_VAL || temp_ds < 0) ? (_FLOAT_PREC)MAX_VAL : (_FLOAT_PREC)ds;

I tracked gitblame back to the refactor, as the original kernel actually does have a declaration:

float temp_ds = (float)ds;
regLDSreduce(&temp_ds, lcl_data, lid, (float)1.0);
ds = (_FLOAT)temp_ds;
barrier(CLK_LOCAL_MEM_FENCE);

Either @muralinr, or @ce1adon, since those were the last modifiers.

Jun 10 '22 18:06 daniellowell

To reproduce:

docker pull rocm/miopen-private:alex_datatype_fix

Then launch the docker container:

docker run -it --network host -v /home/miopenpdb:/data --device=/dev/kfd --device=/dev/dri --group-add video ocm/miopen-private:alex_datatype_fix

Go to:

cd /root/dFin/_hip

Create a file fin_input.json with this contents:

[
  {
    "steps": [
      "miopen_find_compile"
    ],
    "arch": "gfx908:sram-ecc+:xnack-",
    "num_cu": 120,
    "config_tuna_id": 28,
    "direction": 4,
    "dynamic_only": false,
    "config" : {
      "alpha":1,
      "back":0,
      "batchsize":256,
      "beta":0,
      "bias":0,
      "cmd":"bnormfp16",
      "forw":1,
      "id":28,
      "in_channels":64,
      "in_d":1,
      "in_h":56,
      "in_layout":"NCHW",
      "in_w":56,
      "mode":1,
      "out_layout":"NCHW",
      "run":1,
      "save":1,
      "verify":1
   }
  }
]

build:

make -j

Run the command:

MIOPEN_LOG_LEVEL=6 ./src/fin -i fin_input.json -o my_output_file.json

Enjoy the errors!

Jun 10 '22 18:06 daniellowell

To reproduce minimally:

/opt/rocm/bin/clang-ocl -DMIOPEN_USE_FP16=1 -DMIOPEN_USE_FP32=0 -DMIOPEN_USE_FPMIX=0 -DMIO_BN_USESAVED=1 -DMIO_BN_N=256 -DMIO_BN_C=64 -DMIO_BN_HW=3136 -DMIO_BN_NHW=802816 -DMIO_BN_CHW=200704 -DMIO_BN_NCHW=51380224 -DMIO_BN_LDS_SIZE=1024 -DMIO_BN_LDSGCN_SIZE=16 -DMIO_BN_VARIANT=1 -DMIO_WAVESIZE=64 -DMIO_BN_GRP0=1024 -DMIO_BN_GRP1=1 -DMIO_BN_GRP2=1 -DMIO_LAYOUT_NHWC=0 -DMIO_BN_GFX103X=0 -mcpu=gfx908 -Wno-everything -mcode-object-version=4 MIOpenBatchNormBwdSpatial.cl -o /tmp/miopen-MIOpenBatchNormBwdSpatial.cl-31a6-867b-afc8-20da/MIOpenBatchNormBwdSpatial.cl.o

that is the command to reproduce in the folder: src/kernels

Jun 10 '22 19:06 daniellowell

@shurale-nkn could you take a look at this issue?

Jun 10 '22 22:06 junliume

This kernel uses mixed precision computations, float for acc and half for data. So expected to use V_MAD_MIX_F32, but we don't have mixed math MAD in opencl.h

Other opencl kernels prefer to ignore mixed precision and use only one data type (half) for data and acc.

Jun 11 '22 01:06 shurale-nkn

Well, first about the conditions when error can occur.

This is NoGPU runtime
This is a system with a GPU, otherwise the MIOpenDriver will die earlier, before kernel selection
MIOpen should be compiled without -DBUILD_DEV=On.

With -DBUILD_DEV=On OpenCL kernel will be compiled with -Weverything. In that case Clang should check correctness of arguments and return an error if -mcpu= not defined. clang-13: error: joined argument expects additional value: '-mcpu=' [-Werror,-Wunused-command-line-argument]
Without -DBUILD_DEV=On OpenCL kernel will be compiled with -Wno-everything and clang will ignore the absence of -mcpu= argument

I don't know what behavior the library should have when it using a non-GPU runtime on a system with a GPU. We don't have such tests on Jenkins. @JehandadKhan can you help?

Jun 17 '22 01:06 shurale-nkn

@JehandadKhan could you take a look since it is NoGPU runtime?

Jun 23 '22 22:06 junliume

MIOpenBatchNormBwdSpatial.cl:526:30: error: use of undeclared identifier 'temp_db' *(dbias + grpid) = (temp_db >= (float)MAX_VAL) ? MAX_VAL : db; ^ MIOpenBatchNormBwdSpatial.cl:527:30: error: use of undeclared identifier 'temp_ds' *(dscale + grpid) = (temp_ds >= (float)MAX_VAL || temp_ds < 0) ? MAX_VAL : ds; ^ MIOpenBatchNormBwdSpatial.cl:527:59: error: use of undeclared identifier 'temp_ds' *(dscale + grpid) = (temp_ds >= (float)MAX_VAL || temp_ds < 0) ? MAX_VAL : ds; ^ MIOpenBatchNormBwdSpatial.cl:558:64: error: use of undeclared identifier 'temp_db' float temp_tmp1 = mad((float)NHW, (float)dyvalue, -temp_db); ^ MIOpenBatchNormBwdSpatial.cl:559:48: error: use of undeclared identifier 'temp_ds' float temp_tmp2 = -((float)xhat) * temp_ds;

PR #1603 should resolve this Errors.

But not errors from the first message. This is NoGPU related errors.

error: cannot compile inline asm :3:20: error: not a valid operand. v_add_f32 v3 v3 v3 row_shr:1 bound_ctrl:0 ^ error: cannot compile inline asm :5:20: error: not a valid operand. v_add_f32 v5 v5 v5 row_shr:2 bound_ctrl:0 ^ error: cannot compile inline asm :6:20: error: not a valid operand. v_add_f32 v3 v3 v3 row_shr:2 bound_ctrl:0 ^ error: cannot compile inline asm :8:20: error: not a valid operand. v_add_f32 v5 v5 v5 row_shr:4 bank_mask:0xe ^ error: cannot compile inline asm :9:20: error: not a valid operand. v_add_f32 v3 v3 v3 row_shr:4 bank_mask:0xe ^ error: cannot compile inline asm :11:20: error: not a valid operand. v_add_f32 v5 v5 v5 row_shr:8 bank_mask:0xc ^ error: cannot compile inline asm :12:20: error: not a valid operand. v_add_f32 v3 v3 v3 row_shr:8 bank_mask:0xc

Jun 24 '22 18:06 shurale-nkn

@junliume

Now we have W/A in #1603, so the urgency may be lowered to normal.
However we need someone who can properly resolve the problem described above. The description is quite good and gives a brief picture of when needs to be done.
I am second to @daniellowell and recommend assigning @muralinr, if he has time.

Jul 25 '22 17:07 atamazov

@junliume

Now we have W/A in [WORKAROUND] fix IsApplicable in BnBwdTrainingSpatial solver #1603, so the urgency may be lowered to normal.

However we need someone who can properly resolve the problem described above. The description is quite good and gives a brief picture of when needs to be done.

I am second to @daniellowell and recommend assigning @muralinr, if he has time.

Current Batchnorm code will be phased out soon. I am not sure it's worth the effort at this point of time if we have W/A fix. I will let @junliume to comment on it.

Jul 25 '22 18:07 muralinr

@junliume Is this ticket still relevant? Thanks!

Apr 16 '24 16:04 ppanchad-amd

MIOpen MIOpen copied to clipboard

Batch norm failure on gfx908

MIOpen
MIOpen copied to clipboard