tfjs icon indicating copy to clipboard operation
tfjs copied to clipboard

webgl: Fix NaN issue

Open qjia7 opened this issue 2 years ago • 1 comments

Fix #6822

Problem 1: On some GPUs, even if a and b are both non-NaN, the value of isNaN in vec4 isNaN = min(vec4(isnan(a)) + vec4(isnan(b)), vec4(1.0)); are still larger than 0., which misleads all values become NAN. 2: After resolving NAN issue, the result is still incorrect. It seems that the isnan_custom is not well supported on the problem GPU. After switching back to builtin isnan, everything works well.

Solution: Use the bool type bvec4 instead of float type vec4 to calculate isNaN to avoid the the float precision issue when comparing with zero. Meanwhile, add an env flag WEBGL_ISNAN_CUSTOM to allow user to specify which isnan to use.

To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.


This change is Reviewable

qjia7 avatar Sep 09 '22 07:09 qjia7

@vladmandic @shurshilov Please have the last test with this PR. I also make it ready for review. The verified solution https://github.com/tensorflow/tfjs/issues/6822#issuecomment-1247460709:

vec4 nanValue = a;
bvec4 isNaN = isnan(a);
result.r = isNaN.r ? nanValue.r : result.r;
result.g = isNaN.g ? nanValue.g : result.g;
result.b = isNaN.b ? nanValue.b : result.b;
result.a = isNaN.a ? nanValue.a : result.a;

nanValue = b;
isNaN = isnan(b);
result.r = isNaN.r ? nanValue.r : result.r;
result.g = isNaN.g ? nanValue.g : result.g;
result.b = isNaN.b ? nanValue.b : result.b;
result.a = isNaN.a ? nanValue.a : result.a;

return result;

The solution in this PR:

  bvec4 isNaNA = isnan(a);
  bvec4 isNaNB = isnan(b);
  bvec4 isNaN = bvec4(isNaNA.x || isNaNB.x, isNaNA.y || isNaNB.y, isNaNA.z || isNaNB.z, isNaNA.w || isNaNB.w);
  result.r = isNaN.r ? NAN : result.r;
  result.g = isNaN.g ? NAN : result.g;
  result.b = isNaN.b ? NAN : result.b;
  result.a = isNaN.a ? NAN : result.a;

  return result;

I think the second one should also be workable. But if not, I need to revert this change Use NAN instead of nanValue.xxx

qjia7 avatar Sep 15 '22 06:09 qjia7

Kindly ping you guys. Also add @Linchenn in case you are in vacation.

qjia7 avatar Sep 23 '22 00:09 qjia7

should this default to true to match the previous behavior?

if previous behavior is proven to be broken on some gpus, why?

afaik, only reason for isnan_custom is because webgl1 does not have native isnan, so why not enable it for webgl1 and disable otherwise. we should avoid creating new flags that nobody will remember few months from now.

vladmandic avatar Sep 26 '22 19:09 vladmandic

@qjia7 It sounds the correctness of the builtin/polyfilled isnan can be programmatically probed with a small snippet and then the right one could be picked up automatically without "creating new flags that nobody will remember few months from now", or no?

hujiajie avatar Sep 28 '22 01:09 hujiajie

@qjia7 It sounds the correctness of the builtin/polyfilled isnan can be programmatically probed with a small snippet and then the right one could be picked up automatically without "creating new flags that nobody will remember few months from now", or no?

In fact, It's hard to construct a small case to reflect the isnan issue. The current solution is to always use the builtin isnan. I know you may have concern about the comment here Use custom isnan definition to work across differences between implementations on various platforms.. But I think it's compare to the old custom isnan [return (val > 0. || val < 1. || val == 0.) ? false : true; ]. So I prefer we directly use the builtin isnan for simplification.

qjia7 avatar Sep 28 '22 02:09 qjia7