oneDNN icon indicating copy to clipboard operation
oneDNN copied to clipboard

gpu: generic: sycl: lnorm Intel GPU precision issues

Open kala855 opened this issue 1 year ago • 8 comments

Description

When used in Intel PVC, the layer normalization SYCL kernel implementation faces some precision issues in variance computation.

create: --lnorm --engine=gpu --inplace=true 30x300
run: --lnorm --engine=gpu --inplace=true 30x300
[   0][VAR][0] exp_f32:   0.0202823 exp:   0.0202823 got:   0.0202823 diff:1.86265e-09 rdiff:9.1836e-08
[   1][VAR][1] exp_f32:   0.0804944 exp:   0.0804944 got:   0.0804944 diff:7.45058e-09 rdiff:9.25603e-08
[   2][VAR][2] exp_f32:    0.327062 exp:    0.327062 got:    0.327062 diff:2.98023e-08 rdiff:9.11213e-08
[   6][VAR][6] exp_f32:     82.4779 exp:     82.4779 got:     82.4779 diff:7.62939e-06 rdiff:9.25023e-08
[  17][VAR][17] exp_f32:      1.3052 exp:      1.3052 got:      1.3052 diff:1.19209e-07 rdiff:9.1334e-08
[  22][VAR][22] exp_f32:   0.0829537 exp:   0.0829537 got:   0.0829537 diff:7.45058e-09 rdiff:8.98161e-08
[  23][VAR][23] exp_f32:    0.324289 exp:    0.324289 got:    0.324289 diff:2.98023e-08 rdiff:9.19006e-08
[  29][VAR][29] exp_f32:   0.0822034 exp:   0.0822034 got:   0.0822034 diff:7.45058e-09 rdiff:9.0636e-08
[COMPARE_STATS][VAR]: trh=0 err_max_diff:7.62939e-06 err_max_rdiff:9.25603e-08 all_max_diff:7.62939e-06 all_max_rdiff:9.25603e-08
0:FAILED (errors:8 total:9060) __REPRO: --lnorm --engine=gpu --inplace=true 30x300
[  14][VAR][14] exp_f32: 2.68618e-05 exp: 2.68618e-05 got: 2.68618e-05 diff:1.81899e-12 rdiff:6.77165e-08
[COMPARE_STATS][VAR]: trh=0 err_max_diff:1.81899e-12 err_max_rdiff:6.77165e-08 all_max_diff:1.81899e-12 all_max_rdiff:6.77165e-08
8471:FAILED (errors:1 total:75) __REPRO: --lnorm --engine=gpu --dt=f32:bf16 --tag=axb --stat_tag=abx --flags=CH 15x3_n"lnorm_ci_0d:0"

The previous are just a couple of failing examples.

As a proposal, the variance threshold is modified to pass the failing tests.

kala855 avatar Sep 02 '24 13:09 kala855

To give a little bit of additional context. I found that in this line. The division is making v_variance slightly different from the reference that is computed in benchdnn (there is a thr=0 for these cases). Maybe that is happening because of some compiler optimization.

kala855 avatar Sep 02 '24 13:09 kala855

Do you know what is the default setting for fp_config::correctly_rounded_divide_sqrt on this device? We typically disable fast-reciprocal, I wonder if we should do something similar for sycl kernels

mgouicem avatar Sep 03 '24 10:09 mgouicem

@kala855, benchdnn is very sensitive to numerical issues by design. So in cases like this it's important to understand where the difference is coming from before considering the threshold change.

vpirogov avatar Sep 03 '24 16:09 vpirogov

Do you know what is the default setting for fp_config::correctly_rounded_divide_sqrt on this device? We typically disable fast-reciprocal, I wonder if we should do something similar for sycl kernels

We have been checking carefully these days to see what is happening here. Doing a comparison between the assembly generated by icpx on the OCL and SYCL versions of the implementations we found that:

  1. The OCL version generates an instruction math.invm which is IEEE standard compliant and is used to compute the division.
  2. The SYCL version generates an instruction math.inv which is not IEEE standard compliant and is the instruction that generates the precision issues.
  3. We were looking at different compiler flags to try to generate the same behavior on the SYCL side but was not possible.
  4. We are thinking that the problem is due to some compiler bug that does not take into account the flags to generate the expected code. (-fno-fast-math, -ffp-model=precise, etc.)
  5. Info about math.inv and math.invm instructions could be found here pages 24 and 26 respectively.

Any suggestions or feedback will be more than welcome. Thanks.

@mgouicem @t4c1 @sgeor255

kala855 avatar Sep 13 '24 08:09 kala855

Thanks for checking. I would suggest:

  • to open an issue against the sycl compiler with your findings,
  • add a condition on that threshold increase on DNNL_WITH_SYCL
  • add a comment to your patch so that we can revert the threshold as the issue is resolved on sycl compiler side.

mgouicem avatar Sep 16 '24 08:09 mgouicem

Thanks for checking. I would suggest:

  • to open an issue against the sycl compiler with your findings,
  • add a condition on that threshold increase on DNNL_WITH_SYCL
  • add a comment to your patch so that we can revert the threshold as the issue is resolved on sycl compiler side.

Hi @mgouicem I did what you mentioned to try to get the SYCL lnorm benchdnn tests working. The issue was opened and a workaround was pushed in this PR. Thanks for your feedback.

kala855 avatar Sep 30 '24 11:09 kala855

We found that SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt" environment variable strengthens the requirements for floating-point division, making it correctly-rounded. I made some changes to set the variable and unset it just in the operators that need it. Batch normalization was failing because of the same issue. The last commit also fixes the problem in bnorm.

kala855 avatar Oct 09 '24 13:10 kala855

We found that SYCL_PROGRAM_COMPILE_OPTIONS="-cl-fp32-correctly-rounded-divide-sqrt" environment variable strengthens the requirements for floating-point division, making it correctly-rounded. I made some changes to set the variable and unset it just in the operators that need it. Batch normalization was failing because of the same issue. The last commit also fixes the problem in bnorm.

Is there a programmatic way to control the same knob (that does not involve envvar)? It is typically not thread-safe to modify environment variables and this should be avoided as much as possible IMHO.

mgouicem avatar Oct 10 '24 07:10 mgouicem

Closing as stale.

vpirogov avatar Jan 03 '25 23:01 vpirogov