
[webgl] enabling WEBGL_USE_SHAPES_UNIFORMS results in model execution returning null values

Open vladmandic opened this issue 1 year ago • 32 comments

user reported issue with my library and after a while it was traced down to WEBGL_USE_SHAPES_UNIFORMS
which i have enabled by default for over half a year now due to massive performance advantages

issue itself is that model execution shows no issues and it returns tensors in expected shape,
but tensor data is an array with all null values instead of expected float32 values

what is strange is that on majority of systems there are no issues (plus i cannot reproduce locally)
and there is only single (quite small) affected model out of 10+ used

affected model

very simple code reproduction (again, i cannot reproduce myself)
(it runs a model using predefined image input and checks output validity)

test has tfjs debug enabled and you can see all ops in the browser console
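
For readers who want to run a similar check on their own outputs, here is a minimal sketch of an output validity check. The helper name `summarizeOutput` is hypothetical and is not part of tfjs or of the test page. It also shows one possible explanation for seeing null rather than NaN (this is an assumption about how the values were logged): NaN has no JSON representation, so a serialized array shows nulls.

```typescript
// Hypothetical helper (not part of tfjs): counts unusable entries in a model output.
type OutputSummary = { total: number; nan: number; nulls: number; valid: number };

function summarizeOutput(data: ArrayLike<number | null>): OutputSummary {
  const summary: OutputSummary = { total: data.length, nan: 0, nulls: 0, valid: 0 };
  for (let i = 0; i < data.length; i++) {
    const v = data[i];
    if (v === null) summary.nulls++;
    else if (Number.isNaN(v)) summary.nan++;
    else summary.valid++;
  }
  return summary;
}

// A raw typed array keeps NaN as NaN.
const rawOutput = new Float32Array([0.5, NaN, NaN]);
console.log(summarizeOutput(rawOutput));

// After a JSON round trip the same NaN entries come back as null.
const roundTripped: Array<number | null> = JSON.parse(JSON.stringify(Array.from(rawOutput)));
console.log(summarizeOutput(roundTripped));
```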

environment:

  • tfjs: 3.20.0 (older versions are also affected)
  • browser: chrome 104 and 105
  • os: windows 10 pro

it's reported on two different systems using different graphics adapters (AMD Radeon and nVidia RTX)
both affected systems report no issues looking at chrome://gpu and webgl v2 is working fine

only other thing worth noting is that user is using non-latin os locale

link to original issue: https://github.com/vladmandic/human/issues/291

vladmandic avatar Sep 07 '22 14:09 vladmandic

@vladmandic I also can't reproduce it on my windows machine. @shurshilov I need your help to narrow down this issue to see which op's uniforms have the problem. Can you paste the console info in https://vladmandic.github.io/human/test/issue-faceres.html?default and https://vladmandic.github.io/human/test/issue-faceres.html?uniforms separately so that we may know from which op it becomes abnormal? And I am also curious if all models will be impacted by WEBGL_USE_SHAPES_UNIFORMS in your environment or only faceres?

And please also follow the steps below to verify whether the WebGL uniform APIs work well on your system:

  1. Open https://registry.khronos.org/webgl/sdk/tests/webgl-conformance-tests.html?version=1.0.4 for webgl1 and https://registry.khronos.org/webgl/sdk/tests/webgl-conformance-tests.html?version=2.0.1 for webgl2
  2. Search for all/conformance/uniforms and click the run button in front of it. Then check whether all tests pass.

qjia7 avatar Sep 08 '22 05:09 qjia7

@qjia7 i can confirm that library runs multiple models before this one and all return expected results - this is the only affected model out of several.

vladmandic avatar Sep 08 '22 11:09 vladmandic

https://vladmandic.github.io/human/test/issue-faceres.html?uniforms Hi, sorry for delay vladmandic.github.io-1662639126523.log vladmandic.github.io-1662639083084.log image image

shurshilov avatar Sep 08 '22 12:09 shurshilov

@qjia7 just a quick look - while there are no errors in default run (pass), when uniforms are enabled (fail) there are 160 instances of Found NaN in the result of '<op>'

vladmandic avatar Sep 08 '22 17:09 vladmandic

@vladmandic I see that the first NaN op in the log is Minimum, and I assume it is the first error that happened. The second parameter of Minimum is a scalar, so I made some changes to how scalars are processed in #6828. I also added env flags to specifically enable/disable shape uniforms for some ops (unary ops, binary ops, encoded matrix). I need your help to apply PR #6828 and build it, then use that webgl build in your test and let @shurshilov have another try.

Here I may need you to do several things (all tests below still have WEBGL_USE_SHAPES_UNIFORMS enabled).

  1. Directly apply PR #6828 and try it. This version only makes the scalar changes; all the newly added flags keep their defaults. If it works, the error is caused by the scalar handling.
  2. Based on 1), disable WEBGL_ENABLE_ENCODE_MATRIX_SHAPES_UNIFORMS and test again. If it works, the error is caused by EncodeMatrixPackedProgram.
  3. Based on 2), disable WEBGL_ENABLE_BINARY_SHAPES_UNIFORMS and test again. If it works, the error is caused by binary ops.
  4. Based on 3), disable WEBGL_ENABLE_UNARY_SHAPES_UNIFORMS and test again. If it works, the error is caused by unary ops.

And please dump the test log at each step.
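
The four cumulative steps above can be sketched as a small generator that produces one test URL per step. This is only an illustration: the flag names come from the steps above, but `BASE_URL` is a placeholder and passing flags as query parameters is an assumption about how the test page would be driven.

```typescript
// Flags named in the steps above; they exist only in the PR branch under test.
const STEP_FLAGS: string[] = [
  'WEBGL_ENABLE_ENCODE_MATRIX_SHAPES_UNIFORMS',
  'WEBGL_ENABLE_BINARY_SHAPES_UNIFORMS',
  'WEBGL_ENABLE_UNARY_SHAPES_UNIFORMS',
];

// Placeholder URL (an assumption); substitute the real test page.
const BASE_URL = 'https://example.com/test.html';

// Step 1 only enables WEBGL_USE_SHAPES_UNIFORMS; each later step disables one more flag.
function bisectionUrls(baseUrl: string): string[] {
  const urls: string[] = [];
  for (let step = 0; step <= STEP_FLAGS.length; step++) {
    const params = new URLSearchParams({ WEBGL_USE_SHAPES_UNIFORMS: 'true' });
    STEP_FLAGS.slice(0, step).forEach((flag) => params.set(flag, 'false'));
    urls.push(`${baseUrl}?${params.toString()}`);
  }
  return urls;
}

bisectionUrls(BASE_URL).forEach((url, i) => console.log(`step ${i + 1}: ${url}`));
```

The first step that passes points at the flag disabled in that step, which is the reasoning behind keeping the steps cumulative.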

PS. I will be OOO until Sep, 13.

qjia7 avatar Sep 09 '22 08:09 qjia7

ive built the custom tfjs from @qjia7 branch and created tests at https://github.com/vladmandic/tfjs-fix-uniforms

note that tests use url parser to set tfjs flags so you can try any combo of flags
but what was asked here is listed in the readme
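
a rough sketch of how a url parser like that could work (not the actual code of the test page; `parseFlags`, `applyFlags` and `EnvLike` are hypothetical names). `EnvLike` stands in for the object returned by `tf.env()`, whose `set(name, value)` method is the tfjs call this would map to:

```typescript
// Minimal stand-in for the object returned by tf.env(); only set() is used here.
interface EnvLike {
  set(name: string, value: boolean): void;
}

// Reads boolean flags such as ?WEBGL_PACK=false from a query string.
// Parameters whose value is neither "true" nor "false" are ignored.
function parseFlags(query: string): Record<string, boolean> {
  const flags: Record<string, boolean> = {};
  new URLSearchParams(query).forEach((value, name) => {
    if (value === 'true' || value === 'false') flags[name] = value === 'true';
  });
  return flags;
}

// Forwards every parsed flag to the environment object.
function applyFlags(env: EnvLike, flags: Record<string, boolean>): void {
  Object.keys(flags).forEach((name) => env.set(name, flags[name]));
}

// Usage with a fake env that records calls; in a real page this would be tf.env().
const recorded: Array<[string, boolean]> = [];
const fakeEnv: EnvLike = { set: (name, value) => { recorded.push([name, value]); } };
applyFlags(fakeEnv, parseFlags('?WEBGL_USE_SHAPES_UNIFORMS=true&WEBGL_PACK=false'));
console.log(recorded);
```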

@shurshilov please run tests in live tests section and upload each log from browser console here

vladmandic avatar Sep 09 '22 12:09 vladmandic

  • https://vladmandic.github.io/tfjs-fix-uniforms/index.html
    vladmandic.github.io-1662733013714.log
  • https://vladmandic.github.io/tfjs-fix-uniforms/index.html?WEBGL_USE_SHAPES_UNIFORMS=true
    vladmandic.github.io-1662733064991.log
  • https://vladmandic.github.io/tfjs-fix-uniforms/index.html?WEBGL_USE_SHAPES_UNIFORMS=true&WEBGL_ENABLE_ENCODE_MATRIX_SHAPES_UNIFORMS=false
    vladmandic.github.io-1662733166647.log
  • https://vladmandic.github.io/tfjs-fix-uniforms/index.html?WEBGL_USE_SHAPES_UNIFORMS=true&WEBGL_ENABLE_ENCODE_MATRIX_SHAPES_UNIFORMS=false&WEBGL_ENABLE_BINARY_SHAPES_UNIFORMS=false&WEBGL_ENABLE_UNARY_SHAPES_UNIFORMS=false
    vladmandic.github.io-1662733190344.log

shurshilov avatar Sep 09 '22 14:09 shurshilov

https://vladmandic.github.io/tfjs-fix-uniforms/index.html?WEBGL_USE_SHAPES_UNIFORMS=true&WEBGL_ENABLE_ENCODE_MATRIX_SHAPES_UNIFORMS=false&WEBGL_ENABLE_BINARY_SHAPES_UNIFORMS=false vladmandic.github.io-1662733322570.log

shurshilov avatar Sep 09 '22 14:09 shurshilov

@qjia7 based on cursory look, issue seems to be in BinaryOpPackedProgram,
which relies on pre-set outShape for checks and then calls standard BinaryOpProgram

looking at logs:

  • pass: default
  • fail: uniforms=true
  • fail: uniforms=true, matrix=false
  • pass: uniforms=true, matrix=false, binary=false
  • pass: uniforms=true, matrix=false, binary=false, unary=false

@shurshilov to have a confirmation, can you try with just that single variation disabled?

https://vladmandic.github.io/tfjs-fix-uniforms/index.html?WEBGL_USE_SHAPES_UNIFORMS=true&WEBGL_ENABLE_ENCODE_MATRIX_SHAPES_UNIFORMS=true&WEBGL_ENABLE_BINARY_SHAPES_UNIFORMS=false&WEBGL_ENABLE_UNARY_SHAPES_UNIFORMS=true

vladmandic avatar Sep 09 '22 18:09 vladmandic

@vladmandic @shurshilov Thanks for helping dig into this. Now I have a strong feeling that the NaN data is caused by the isnan check in Minimum/Maximum. In CHECK_NAN_SNIPPET_PACKED, you can see that when isNaN.xxx is true, it directly writes a NAN result.

To confirm this, @vladmandic can you also try disabling WEBGL_PACK? Once WEBGL_PACK is disabled, the unpacked program is called. For Minimum, the unpacked CHECK_NAN_SNIPPET keeps A's or B's value instead of writing NaN. In this case, I suppose the result will not be all NaN but some incorrect values.

If the above assumption is correct, maybe one small Minimum test with the same shapes as in the model (a: 4D 1,112,112,64; b: 0D) can reproduce this issue.
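
For such a test, a CPU-side reference is useful to compare the WebGL result against. The sketch below is an assumption-laden illustration, not tfjs code: `minimumWithScalar` is a hypothetical name, the fill values 7 and 6 are arbitrary, and the assumed semantics are that the result is NaN whenever either operand is NaN.

```typescript
// CPU reference for Minimum with a scalar second operand.
// Assumed semantics: the result is NaN when either operand is NaN.
function minimumWithScalar(a: Float32Array, b: number): Float32Array {
  const out = new Float32Array(a.length);
  a.forEach((x, i) => {
    out[i] = Number.isNaN(x) || Number.isNaN(b) ? NaN : Math.min(x, b);
  });
  return out;
}

// Same element count as a tensor of shape [1, 112, 112, 64].
const input = new Float32Array(1 * 112 * 112 * 64).fill(7);
const reference = minimumWithScalar(input, 6);

// With NaN-free inputs the reference must contain no NaN at all.
let nanCount = 0;
reference.forEach((v) => { if (Number.isNaN(v)) nanCount++; });
console.log(reference.length, nanCount);
```

If the WebGL output for the same inputs contains NaN while this reference does not, the NaN values were introduced by the shader and not by the data.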

qjia7 avatar Sep 10 '22 02:09 qjia7

image 2. image 3. image

shurshilov avatar Sep 13 '22 17:09 shurshilov

@qjia7 seems your hunch is on track, closest isolation is WEBGL_PACK_BINARY_OPERATIONS=false - this results in no errors

  1. all uniforms enabled except binary shapes

test completes without NaN values, but the values are incorrect!
this is a strange one as model completes without errors, but values returned are waaay off.

WEBGL_USE_SHAPES_UNIFORMS=true
WEBGL_ENABLE_ENCODE_MATRIX_SHAPES_UNIFORMS=true
WEBGL_ENABLE_BINARY_SHAPES_UNIFORMS=false
WEBGL_ENABLE_UNARY_SHAPES_UNIFORMS=true

  2. all uniforms enabled with webgl pack disabled

test completes and results are correct

WEBGL_USE_SHAPES_UNIFORMS=true
WEBGL_ENABLE_ENCODE_MATRIX_SHAPES_UNIFORMS=true
WEBGL_ENABLE_BINARY_SHAPES_UNIFORMS=true
WEBGL_ENABLE_UNARY_SHAPES_UNIFORMS=true
WEBGL_PACK=false

  3. all uniforms enabled with webgl pack for binary ops disabled

test completes and results are correct

WEBGL_USE_SHAPES_UNIFORMS=true
WEBGL_ENABLE_ENCODE_MATRIX_SHAPES_UNIFORMS=true
WEBGL_ENABLE_BINARY_SHAPES_UNIFORMS=true
WEBGL_ENABLE_UNARY_SHAPES_UNIFORMS=true
WEBGL_PACK=true
WEBGL_PACK_BINARY_OPERATIONS=false

btw, although we're getting closer to understanding the issue, i still have no idea why its manifesting only on some systems?

vladmandic avatar Sep 13 '22 19:09 vladmandic

I want to remind you that this error only occurs on my device if the hardware acceleration option is enabled in the browser settings. If it is turned off, then everything seems to be working)

shurshilov avatar Sep 13 '22 20:09 shurshilov

if gpu hw acceleration is disabled, then entire webgl v2 codepath gets auto-disabled by default, so that is expected.

vladmandic avatar Sep 13 '22 20:09 vladmandic

@vladmandic I updated PR #6828 with two extra commits: 1) remove NAN checking in binary ops, 2) change the way isNaN is checked for binary ops. Can you help build them separately, with packed ops enabled and all uniform shapes enabled, and let @shurshilov have another try? Hope this is our last experiment.

I suspect the packed NaN checking is not executed correctly on shurshilov's device.

qjia7 avatar Sep 14 '22 07:09 qjia7

i've built 3 variations of tfjs:

  1. commit 1: using original diag code

  2. commits 1 + 2: with NaN check in packed binary op returning original values

  3. commits 1 + 2 + 3: with patched kernel ops with new NaN implementation

all tests are with all uniform shapes and pack ops enabled

@shurshilov please run the tests

@qjia7 note that tfjs variations are cumulative as tfjs fails during execution if building just with patched kernel ops as updated glsl code returns bool when expected value in CHECK_NAN_SNIPPET is float:

'>' : wrong operand types - no operation '>' exists that takes a left-hand operand of type 'bool' and a right operand of type 'const float' (or there is no acceptable conversion)

result.r = isNaN.r > 0. ? NAN : result.r;   

vladmandic avatar Sep 14 '22 14:09 vladmandic

1 image 2 image 3 image

shurshilov avatar Sep 14 '22 22:09 shurshilov

@shurshilov Great! We have found the right place to fix this issue. Here is a quick summary, if you are interested in the reason.

  1. WEBGL_USE_SHAPES_UNIFORMS triggers this issue, but uniforms are not the cause; they work as expected. This can be observed with WEBGL_PACK=false && WEBGL_USE_SHAPES_UNIFORMS=true: we use exactly the same uniforms, and it only fails on packed binary ops.
  2. Comparing packed and unpacked binary ops, the most suspicious difference is the way NAN is checked. That is why our last try was to either remove the NAN check or replace it with a new way of checking. The result shows that both approaches give correct results. So based on the above points, I think the code snippet below doesn't work normally on @shurshilov's device, or has a precision issue.
  vec4 isNaN = min(vec4(isnan(a)) + vec4(isnan(b)), vec4(1.0));
  
  result.r = isNaN.r > 0. ? NAN : result.r;
  result.g = isNaN.g > 0. ? NAN : result.g;
  result.b = isNaN.b > 0. ? NAN : result.b;
  result.a = isNaN.a > 0. ? NAN : result.a;

It seems that each component of isNaN is not exactly equal to 0.; I suspect the value is very close to 0. but larger than it. So every isNaN.xxx > 0. check is true and we get NAN as the result. On other devices the check behaves as expected, which is why we can't reproduce it.

After changing the above code to the following, which uses bool instead of float comparison, we get the correct result.

vec4 nanValue = a;
bvec4 isNaN = isnan(a);
result.r = isNaN.r ? nanValue.r : result.r;
result.g = isNaN.g ? nanValue.g : result.g;
result.b = isNaN.b ? nanValue.b : result.b;
result.a = isNaN.a ? nanValue.a : result.a;

nanValue = b;
isNaN = isnan(b);
result.r = isNaN.r ? nanValue.r : result.r;
result.g = isNaN.g ? nanValue.g : result.g;
result.b = isNaN.b ? nanValue.b : result.b;
result.a = isNaN.a ? nanValue.a : result.a;

return result;
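
To make the suspicion concrete, here is a CPU-side sketch of the two mask styles. It is only a model of the shader logic, not the shader itself: `maskError` is a hypothetical per-lane error introduced to imitate the suspected imprecision, and nothing here proves that this is what happens on the affected GPU.

```typescript
type Vec4 = [number, number, number, number];

// Mirrors the original packed snippet: a float mask compared with "> 0.".
// maskError is a hypothetical per-lane error modelling the suspected imprecision.
function minFloatMask(a: Vec4, b: Vec4, maskError = 0): Vec4 {
  return a.map((x, i) => {
    const mask = Math.min(Number(Number.isNaN(x)) + Number(Number.isNaN(b[i])), 1) + maskError;
    return mask > 0 ? NaN : Math.min(x, b[i]);
  }) as Vec4;
}

// Mirrors the replacement: boolean lanes, no float comparison involved.
function minBoolMask(a: Vec4, b: Vec4): Vec4 {
  return a.map((x, i) => {
    if (Number.isNaN(x)) return x;
    if (Number.isNaN(b[i])) return b[i];
    return Math.min(x, b[i]);
  }) as Vec4;
}

const laneA: Vec4 = [1, 2, 3, 4];
const laneB: Vec4 = [6, 6, 6, 6];
console.log(minFloatMask(laneA, laneB));        // exact mask: values survive
console.log(minFloatMask(laneA, laneB, 1e-7));  // tiny positive error: every lane becomes NaN
console.log(minBoolMask(laneA, laneB));         // boolean mask: values survive
```

In this model any positive error in the float mask, however small, turns every lane into NaN, while the boolean version has no threshold to trip over.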

@vladmandic

  1. all uniforms enabled except binary shapes test completes without NaN values, but the values are incorrect! this is a strange one as model completes without errors, but values returned are waaay off.

I think the reason is that the NAN check is still there regardless of whether binary shape uniforms are enabled or disabled.

note that tfjs variations are cumulative as tfjs fails during execution if building just with patched kernel ops as updated glsl code returns bool when expected value in CHECK_NAN_SNIPPET is float:

Weird. I thought I had changed all the related places. Do you have a case that reproduces it?

qjia7 avatar Sep 15 '22 01:09 qjia7

This is very similar to the truth, because when recognizing faces, the resulting values of the face descriptor are an array of numbers very close to zero but not equal to it...

shurshilov avatar Sep 15 '22 07:09 shurshilov

I'm waiting for a test from @vladmandic

shurshilov avatar Sep 15 '22 07:09 shurshilov

Here is a quick summary, if you are interested in the reason.

@qjia7 always!

It seems that each component of isNaN is not exactly equal to 0.; I suspect the value is very close to 0. but larger than it.

wish i knew why as there is nothing in the setup that looks that different

anyhow, i did a fresh checkout of your branch and it looks good on my system
note that only WEBGL_USE_SHAPES_UNIFORMS is enabled, all other flags are left at defaults
(and code path does use packed binaries)

@shurshilov please test: https://vladmandic.github.io/tfjs-fix-uniforms/index.html?WEBGL_USE_SHAPES_UNIFORMS=true

vladmandic avatar Sep 15 '22 12:09 vladmandic

image

shurshilov avatar Sep 15 '22 17:09 shurshilov

@qjia7 no runtime errors and no NaN values, but values are incorrect.

the results from the model are now the same (both incorrect) as with the test when WEBGL_ENABLE_BINARY_SHAPES_UNIFORMS=false and packed binaries were enabled.

but when you look at the results when WEBGL_PACK_BINARY_OPERATIONS=false, it returned correct values.
same when CHECK_NAN_SNIPPET is set to blank - model returns correct values.

and the results are not even close - for example, one of incorrect values is score for model prediction calculated at 60% instead of expected 93% (given the specific test input).

so we're closer, but there is still a problem with packed binaries when running binary ops.
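
since the failure mode here is "no NaN but wrong numbers", a NaN check alone is not enough and outputs need to be compared against known-good values. a minimal sketch (`maxAbsDiff` and `matchesReference` are hypothetical helpers and the 0.01 tolerance is an arbitrary choice; the 0.93 and 0.60 scores are the ones mentioned above):

```typescript
// Largest absolute element-wise difference; NaN if any element is NaN.
function maxAbsDiff(expected: ArrayLike<number>, actual: ArrayLike<number>): number {
  if (expected.length !== actual.length) throw new Error('length mismatch');
  let max = 0;
  for (let i = 0; i < expected.length; i++) {
    const d = Math.abs(expected[i] - actual[i]);
    if (Number.isNaN(d)) return NaN;
    if (d > max) max = d;
  }
  return max;
}

// A NaN difference compares false against any tolerance, so NaN outputs fail too.
function matchesReference(expected: ArrayLike<number>, actual: ArrayLike<number>, tolerance: number): boolean {
  return maxAbsDiff(expected, actual) <= tolerance;
}

console.log(matchesReference([0.93], [0.93], 0.01)); // same score as the reference
console.log(matchesReference([0.93], [0.60], 0.01)); // score reported by the failing build
console.log(matchesReference([0.93], [NaN], 0.01));  // NaN output
```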

btw, i dont really understand why is CHECK_NAN_SNIPPET used at all by default for packed binaries? its not used for unpacked binaries to start with.

vladmandic avatar Sep 15 '22 20:09 vladmandic

btw, i dont really understand why is CHECK_NAN_SNIPPET used at all by default for packed binaries? its not used for unpacked binaries to start with.

Both packed and unpacked binary ops (not all of them, only Atan2, Minimum, Maximum, Mod, Pow) use CHECK_NAN_SNIPPET. The test model uses only Minimum and Maximum, which is why the first error we see is always in Minimum. The difference between packed and unpacked binary ops is that packed operands are of type vec4 while unpacked operands are float, so the CHECK_NAN_SNIPPET differs between them. For unpacked Minimum:

  if (isnan(a)) return a;
  if (isnan(b)) return b;
  return min(a, b);

packed Minimum

  vec4 result = vec4(min(a, b));
  vec4 isNaN = min(vec4(isnan(a)) + vec4(isnan(b)), vec4(1.0));
  
  result.r = isNaN.r > 0. ? NAN : result.r;
  result.g = isNaN.g > 0. ? NAN : result.g;
  result.b = isNaN.b > 0. ? NAN : result.b;
  result.a = isNaN.a > 0. ? NAN : result.a;

  return result;

So far, two methods work: 1) remove the CHECK_NAN_SNIPPET, or 2) use the check below.

vec4 nanValue = a;
bvec4 isNaN = isnan(a);
result.r = isNaN.r ? nanValue.r : result.r;
result.g = isNaN.g ? nanValue.g : result.g;
result.b = isNaN.b ? nanValue.b : result.b;
result.a = isNaN.a ? nanValue.a : result.a;

nanValue = b;
isNaN = isnan(b);
result.r = isNaN.r ? nanValue.r : result.r;
result.g = isNaN.g ? nanValue.g : result.g;
result.b = isNaN.b ? nanValue.b : result.b;
result.a = isNaN.a ? nanValue.a : result.a;

return result;

In my latest PR, I made a small change to method 2), shown below, and it fails. I am surprised that the change above works but the one below doesn't. Anyway, I will revert my changes back to method 2), and let's give it one last test (hope it's not flaky). I chose method 2) to keep the functionality consistent. As a next step, we may consider putting NaN checking behind a flag.

  bvec4 isNaNA = isnan(a);
  bvec4 isNaNB = isnan(b);
  bvec4 isNaN = bvec4(isNaNA.x || isNaNB.x, isNaNA.y || isNaNB.y, isNaNA.z || isNaNB.z, isNaNA.w || isNaNB.w);
  result.r = isNaN.r ? NAN : result.r;
  result.g = isNaN.g ? NAN : result.g;
  result.b = isNaN.b ? NAN : result.b;
  result.a = isNaN.a ? NAN : result.a;

  return result;
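
As a sanity check of why this is surprising, the two variants can be modelled per lane on the CPU, where they agree. This is a sketch under assumptions: `glslMin` models GLSL `min(x, y)` as "y if y < x, otherwise x", and `method2` / `orVariant` are hypothetical names for the two snippets above. It says nothing about how a particular GPU driver compiles them.

```typescript
// Models GLSL min(x, y): returns y if y < x, otherwise x.
const glslMin = (x: number, y: number): number => (y < x ? y : x);

// Method 2: overwrite the result with whichever operand is NaN (a first, then b).
function method2(a: number, b: number): number {
  let result = glslMin(a, b);
  if (Number.isNaN(a)) result = a;
  if (Number.isNaN(b)) result = b;
  return result;
}

// Variant: OR the two boolean masks and write a NaN constant.
function orVariant(a: number, b: number): number {
  const result = glslMin(a, b);
  return Number.isNaN(a) || Number.isNaN(b) ? NaN : result;
}

// Compare both variants over every pair of sample values, including NaN.
const samples = [0, 1, -2.5, NaN];
let mismatches = 0;
for (const a of samples) {
  for (const b of samples) {
    if (!Object.is(method2(a, b), orVariant(a, b))) mismatches++;
  }
}
console.log(mismatches);
```

Since the two are equivalent under this model, a difference observed on the device would point at something outside the logic itself, such as how the shader is compiled or cached.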

qjia7 avatar Sep 16 '22 01:09 qjia7

@vladmandic #6828 has been restored to method 2). Please give it another try. I expect it to pass, since [commits 1 + 2 + 3: with patched kernel ops with new NaN implementation] passed. If not, our previous test might have had a problem, such as using an old cached npm package, so that we mistakenly believed it passed. In that case, please manually remove the NaN checking here, as below, and test again.

export const CHECK_NAN_SNIPPET_PACKED = `
  result.r = isNaN.r ? nanValue.r : result.r;
  result.g = isNaN.g ? nanValue.g : result.g;
  result.b = isNaN.b ? nanValue.b : result.b;
  result.a = isNaN.a ? nanValue.a : result.a;
`;

becomes

export const CHECK_NAN_SNIPPET_PACKED = '';

qjia7 avatar Sep 16 '22 02:09 qjia7

i've updated the branch and created two tests (both pass on my system):

  1. tfjs from #6828
    https://vladmandic.github.io/tfjs-fix-uniforms/index-v1.html?WEBGL_USE_SHAPES_UNIFORMS=true
  2. tfjs from #6828 with CHECK_NAN_SNIPPET_PACKED = ''
    https://vladmandic.github.io/tfjs-fix-uniforms/index-v2.html?WEBGL_USE_SHAPES_UNIFORMS=true

regarding cache, i don't think its an issue, but just in case make sure that disable cache is set in chrome inspector -> network tab: image

vladmandic avatar Sep 16 '22 12:09 vladmandic

1 image 2 image

shurshilov avatar Sep 16 '22 13:09 shurshilov

@qjia7 as you can see both tests pass without errors, but values are correct only when CHECK_NAN_SNIPPET_PACKED = ''.
otherwise, values are even worse off than before (tensor with feature vector now returns all zeros and tensor with score value is not even within 0..1 range).

vladmandic avatar Sep 16 '22 13:09 vladmandic

I just found a device that can reproduce this issue. I will debug it to see if I can find out more. Please stay tuned.

qjia7 avatar Sep 19 '22 06:09 qjia7