major difference in performance between sync and async methods to download tensor data
when using webgl backend, tensor.dataSync() is about 35% faster than await tensor.data() when returning larger amounts of data
(in my case, model takes 720x720x3 image as input and produces 720x720x3 image as output)
actual timings are ~100ms for tensor.dataSync() and ~140ms for await tensor.data()
note that there is nothing else executing, this is a single active loop
i wouldn't mind using dataSync(), but then again any sync calls are not compatible with webgpu backend, and maintaining two separate codepaths for webgl and webgpu is a non-starter
btw, model in question can be found at https://github.com/vladmandic/anime
environment: tfjs 3.19.0 on chrome 103
We once investigated why there is a big perf difference between WebGL data() and dataSync(). The basic finding is below.
WebGLRenderingContext.readPixels() is a blocking operation: it waits for all previous rendering operations to finish, then reads data from GPU to CPU, which may block the main thread for quite a long time. To mitigate this, in the async path (that's the TFJS implementation of tensor.data()), we insert a fence right after the rendering operations, then use a timer (for example, setTimeout) to poll the fence's status. If the fence is not yet satisfied, we can do something else in parallel; once it is satisfied, readPixels() returns without a long block because the rendering operations have already finished. With an ideal timer, the fence can unblock the UI, hide the latency behind parallel tasks, and read the data back promptly. However, according to the HTML spec (https://html.spec.whatwg.org/multipage/timers-and-user-prompts.html#:~:text=If%20nesting%20level%20is%20greater%20than%205%2C%20and%20timeout%20is%20less%20than%204%2C%20then%20set%20timeout%20to%204), if the nesting level is greater than 5 and the timeout is less than 4ms, the timeout is changed to 4ms, which introduces a delay of at least 4ms per poll. You see about a 40ms difference, so I'm not sure whether there are other reasons too. To mitigate this, there are 3 possibilities:
- We already found a tricky way to introduce a new setTimeout(), which will be implemented in TFJS (Sorry that this work has been delayed for a while).
- In Chrome Canary, the nesting threshold has also been changed from 5 to 100 (https://groups.google.com/a/chromium.org/g/blink-dev/c/-TjeYs7shTQ/m/FhJq0mQyDAAJ?pli=1). You may have a quick try with Chrome Canary (not Stable) to see if your problem is alleviated.
- Another option is to always keep the async path in your code (sorry, but by design WebGPU is mostly async while WebGL is always sync), and wrap tensor.dataSync() in an async function for the WebGL backend, as sketched below.
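For illustration, a minimal sketch of that wrapper idea (readTensor is just an illustrative name, not a TFJS API):
import * as tf from '@tensorflow/tfjs';
// hypothetical helper: one async call site works on both backends
async function readTensor(t: tf.Tensor) {
  // on WebGL, dataSync() avoids the setTimeout polling delay;
  // WebGPU has no sync readback, so fall through to data()
  return tf.getBackend() === 'webgl' ? t.dataSync() : t.data();
}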
More details of our investigation can be found at https://docs.google.com/document/d/13Up4P7Or9uliiU2RZfzOk1aQ5yjFSjG2M6k16BkQn4U/edit. If you get a chance to check with Chrome Canary, please let me know the result. We'll take a further look if it doesn't greatly alleviate your situation. Thanks!
@gyagp
thanks for the explanation! tried with chrome canary 106.0.5206.0:
- await tensor.data() -> average 107ms
- tensor.dataSync() -> average 99ms
difference is definitely smaller, but still noticeable at 10%
i guess this comes down to the polling delay you've mentioned
looking forward to the new timeout method
if you have a pr for that, i can monitor and try once it's ready
Thanks for the quick try! It seems the 4ms delay in setTimeout() contributes most of the gap between async and sync calls. Once that's solved, I think another gap comes from polling, like you mentioned. More specifically, the browser needs to check the result at some frequency; if it checks too frequently, too much power is drained. In the current Chrome implementation, the task scheduler ticks at about 1ms (the exact implementation is more complicated), which is the general cost of async calls. In the WebGPU path, we already found some perf issues related to the GPU task scheduler, and we're fixing them. We will double check whether the WebGL path can be improved. Stay tuned for the new setTimeout :)
@gyagp Thank you for the detailed explanation. We hope to unify the API between WebGL and WebGPU without a major performance discrepancy, so we're looking forward to the new timeout mechanism. Thanks.
@vladmandic I have a PR (under review) at https://github.com/tensorflow/tfjs/pull/6694. May I ask your help to give it a try? Currently the custom setTimeout is gated by an environment variable "USE_SETTIMEOUTWPM", so you need to enable it first with "tf.env().set('USE_SETTIMEOUTWPM', true);". Thanks!
@gyagp
after enabling USE_SETTIMEOUTWPM, i'm getting errors in tfjs-backend-wasm-threaded-simd.worker.js
worker.js received unknown command undefined
and i'm clearly not using wasm here at all, it's just that my tfjs bundle has all backends pre-linked
my guess is the issue is that you're using the browser's window object blindly and registering a method on it:
tfjs-core/src/util_base.ts
(window as any).setTimeoutWPM(tryFn, nextBackoff);
registering a function directly on the global namespace is bad practice,
but referencing the browser's global namespace without even checking that the code is running in a browser context is an outright fail
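a minimal sketch of the kind of guard i'd expect (the assigned function here is a placeholder, not the actual implementation):
// resolve whichever global object exists instead of assuming window
const globalObject: any =
  typeof window !== 'undefined' ? window :
  typeof self !== 'undefined' ? self : globalThis;
// placeholder registration - and even this is better kept module-scoped
globalObject.setTimeoutWPM = (fn: () => void, delay: number) => setTimeout(fn, delay);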
@gyagp
i've removed tfjs-backend-wasm from my tfjs bundle and the test now runs
results show some improvement, but still far from sync data download performance
tensor with shape [1, 720, 720, 3], running model inference 500 times and calculating average time for tensor download:
- using tensor.dataSync() -> 94 ms
- using await tensor.data() -> 124 ms
- using await tensor.data() with USE_SETTIMEOUTWPM = true -> 116 ms
repo used for test: https://github.com/vladmandic/anime
Thank you so much @vladmandic! This is not a dramatic improvement, but the 4ms delay is real and critical for an application that targets 60FPS or cares about every millisecond (at 60FPS the frame budget is only ~16.7ms, so a fixed 4ms delay eats roughly a quarter of it). I will polish my PR and try to get it merged. After this, I will investigate the gap between data() and dataSync() again to see whether it can be narrowed further.
@gyagp
but 4ms delay is real and critical for an application that targets 60FPS, or cares about every millisecond
very true
i've tried USE_SETTIMEOUTWPM on a different use-case:
an app that uses multiple small models, so there are multiple await tensor.data() calls for each input frame
each data result is tiny, it's just that there are quite a few of them
and the performance difference is much bigger - 15-20% overall improvement, and that is huge:
- average on 500 frames: ~122ms
- average on 500 frames with USE_SETTIMEOUTWPM: ~103ms
repo used for test: https://github.com/vladmandic/human
I have updated the PR, and it's under final review. @vladmandic, may I ask you to test it again? I don't want to introduce any regression for your cases. By the way, I changed the flag name, so you now need to call "tf.env().set('USE_SETTIMEOUTCUSTOM', true);" first.
as soon as @tensorflow/tfjs-backend-wasm is loaded and USE_SETTIMEOUTCUSTOM is enabled, there are problems regardless of whether WASM is used or not, as WASM init code happens on module load
worker.js received unknown command undefined
threadPrintErr @ tfjs.esm.js:81524
self.onmessage @ tfjs.esm.js:81594
postMessage (async)
(anonymous) @ tfjs.esm.js:8814
setTimeout (async)
setTimeoutCustom @ tfjs.esm.js:8813
issue is that worker.js in tfjs-backend-wasm listens to all messages and throws an error because it does not know about this particular message
self.onmessage = (e) => {
...
} else {
err('worker.js received unknown command ' + e.data.cmd);
err(e.data);
}
where err() is just a wrapper around console.error(text) via threadPrintErr()
i don't think that is a bug in your code, but rather bad behavior from the wasm worker script, as it should not treat any unknown message as an error
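for illustration, the kind of tolerant handler i mean (a sketch only, not the actual emscripten code):
self.onmessage = (e) => {
  const cmd = e.data && e.data.cmd;
  if (cmd === undefined) {
    // message not addressed to this worker (e.g. someone else's scheduling ping) - ignore silently
    return;
  }
  if (cmd === 'run') {
    // ... handle known commands here
  } else {
    err('worker.js received unknown command ' + cmd);
  }
};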
bigger problem is that the WASM worker script is generated by Emscripten and only lightly modified by the TFJS build process
actual source for worker.js and the offending line is in the Emscripten repo:
https://github.com/emscripten-core/emscripten/blob/5c4b1675ca5ba34e283b21b3c01df092827db628/src/worker.js#L288
and it seems this is directly related: https://github.com/emscripten-core/emscripten/pull/16450
@vladmandic I've created a simple sample to try to reproduce the perf difference, and the perf results show:
- average 'await tensor.data()' on 500 runs: ~53.23ms
- average 'tensor.dataSync()' on 500 runs: ~44.08ms
- average 'await tensor.data()' on 500 runs with USE_SETTIMEOUTWPM: ~39.73ms
If you want to try this, you can follow these steps:
- clone/checkout the branch: https://github.com/shaoboyan/tfjs/tree/settimeout
- Build tfjs-core in the tfjs/link-package folder with command yarn build-deps-for tfjs-core
- Build tfjs-backend-webgl in the tfjs/link-package folder with command yarn build-deps-for tfjs-backend-webgl
- Start a server in the folder above the tfjs folder with command npx http-server
- Start Chrome Canary and access http://127.0.0.1:8080/tfjs/e2e/benchmarks/local-benchmark/index_tfjs.html?backend=webgl&localBuild=webgl,core. Open the console and the result is there.
My code snippet is:
await tf.setBackend("webgl");
await tf.ready();
let value = [];
for (var i = 0; i < 1440 * 1440; ++i) {
value[i] = i / 1440;
}
const tensor = tf.tensor2d(value, [1440, 1440]);
const tensor2 = tf.tensor2d(value, [1440, 1440]);
const result = tf.matMul(tensor, tensor2);
const tensorTest = tensor;
//warm up
for (var i = 0; i < 500; ++i) {
const result = tf.matMul(tensor, tensor2);
result.dataSync();
result.dispose();
}
let start = 0; //performance.now();
let total = 0;
tf.env().set('USE_SETTIMEOUTCUSTOM', true);
start = performance.now();
for (var i = 0; i < 500; ++i) {
const result = tf.matMul(tensor, tensor2);
await result.data();
result.dispose();
}
total = performance.now() - start;
console.log('await tensor.data() readback time origin: ' + total / 500);
total = 0;
start = performance.now();
for (var i = 0; i < 500; ++i) {
const result = tf.matMul(tensor, tensor2);
result.dataSync();
result.dispose();
}
//total += performance.now() - start;
total = performance.now() - start;
console.log('tensor.dataSync() readback time: ' + total / 500);
tf.env().set('USE_SETTIMEOUTCUSTOM', false);
total = 0;
start = performance.now();
for (var i = 0; i < 500; ++i) {
const result = tf.matMul(tensor, tensor2);
await result.data();
result.dispose();
}
total = performance.now() - start;
console.log('await tensor.data() readback time custom: ' + total / 500);
I'm not sure what the difference is between this sample and your sample. Do you have any ideas?
And is it possible for me to run your sample locally so I can do more profiling? If so, your guidance would be appreciated.
@vladmandic For background, I asked @shaoboyan to investigate the diff between data() and dataSync() (116ms vs. 94ms) even when we use the custom setTimeout, as you reported earlier: using tensor.dataSync() -> 94 ms; using await tensor.data() -> 124 ms; using await tensor.data() with USE_SETTIMEOUTWPM = true -> 116 ms.
@gyagp @shaoboyan
i just redid all the tests - seems there is quite a difference depending on a few things:
- timings for initial iterations are all over the place, and not just the 1st inference - results don't stabilize until much later, so i'm now throwing those away and only measuring average timings after the 100th iteration
- timings vary drastically depending on whether the test is synthetic (like your example) or actually involves model inference (my original example)
- if the test includes model inference, the difference in results seems to be due to extra pending ops on the browser's webgl pipeline and nothing to do with tensor download
results:
download tensor data after model.execute():
- await tensor.data(): 175ms
- await tensor.data() with USE_SETTIMEOUTCUSTOM: 170ms
- tensor.dataSync(): 165ms
download tensor data after model.execute() and gl.clientWaitSync:
- await tensor.data(): 114ms
- await tensor.data() with USE_SETTIMEOUTCUSTOM: 101ms
- tensor.dataSync(): 102ms
download tensor data for manually created/modified tensor:
- await tensor.data(): 104ms
- await tensor.data() with USE_SETTIMEOUTCUSTOM: 93ms
- tensor.dataSync(): 94ms
all-in-all, using the webgl backend impacts result measurement a lot if the webgl pipeline is not taken into consideration
(as @pyu10055 once explained, model.execute() is not fully sync on its own)
other than that, USE_SETTIMEOUTCUSTOM does indeed bring async download to the same level of performance as sync
(actually even a touch faster than the sync call, but within the margin of error as the difference is below 1ms)!
btw, this is my code for gl.syncWait:
async function syncWait(gl: WebGL2RenderingContext): Promise<number> {
  // insert a fence after all currently queued GL commands
  const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
  if (!sync) return 0;
  const ts = performance.now();
  return new Promise((resolve, reject) => {
    const loop = () => {
      // poll the fence without blocking (timeout of 0)
      const status = gl.clientWaitSync(sync, gl.SYNC_FLUSH_COMMANDS_BIT, 0);
      if (status === gl.WAIT_FAILED) reject(new Error('clientWaitSync: wait failed'));
      else if (status === gl.ALREADY_SIGNALED || status === gl.CONDITION_SATISFIED) {
        gl.deleteSync(sync);
        resolve(performance.now() - ts);
      } else {
        setTimeout(loop, 0); // still pending - check again on the next tick
      }
    };
    loop();
  });
}
and a simplified code example would be:
const tensor = model.execute(input);
await syncWait(tf.backend().getGPGPUContext().gl);
const timeStamp = performance.now();
const data = await tensor.data();
totalTime += (performance.now() - timeStamp);
log({ avgDataTime: totalTime / ++numFrames });
@vladmandic
Thanks for trying and the detailed data! But I still have a concern about the previous model average times. I suspect the async download might cause the perf diff due to the different scheduling method in the browser.
So would you mind testing again, including model.execute(input) and measuring after several warmup rounds? If the gap still exists, I think a profile to understand it is still worthwhile.
@shaoboyan i'm not sure what the ask is - i've already done that - not just several warmup rounds, but 100, and then measuring the average between rounds 100 and 200. yes, the gap still exists, but a) it's much smaller, to the point of being close enough, and b) it's due to the webgl pipeline also performing timer checks, which interferes with the browser's handling of ticks.
my much bigger concern is the blocking issue of the new code being in conflict with tfjs-backend-wasm.
b) it's due to the webgl pipeline also performing timer checks, which interferes with the browser's handling of ticks.
@vladmandic Thanks. I misunderstood your tests. I think this reason resolves my concern.
as soon as @tensorflow/tfjs-backend-wasm is loaded and USE_SETTIMEOUTCUSTOM is enabled, there are problems regardless of whether WASM is used or not, as WASM init code happens on module load: worker.js received unknown command undefined [...] issue is that worker.js in tfjs-backend-wasm listens to all messages and throws an error because it does not know about this particular message [...] bigger problem is that the WASM worker script is generated by Emscripten and only lightly modified by the TFJS build process
actual source for worker.js and the offending line is in the Emscripten repo: https://github.com/emscripten-core/emscripten/blob/5c4b1675ca5ba34e283b21b3c01df092827db628/src/worker.js#L288 and it seems this is directly related: emscripten-core/emscripten#16450
I tried the e2e benchmark in TFJS on both Windows and macOS, but I couldn't repro the issue you mentioned. I double checked the Emscripten source code (_bazel_user/qpnosdr3/external/emscripten_bin_win/emscripten/src/library_pthread.js) we use now; it's still the problematic one, and I think its version is 3.1.4. Emscripten fixed this problem (https://github.com/emscripten-core/emscripten/pull/16450) on Mar 10, and the first release after the fix was 3.1.8 on Mar 24. So this is still a valid problem. @pyu10055 @mattsoulanille (who should still be on vacation now), do you think we can upgrade the emsdk version now?
@gyagp i'm not sure upgrading emscripten would solve it, as the fix https://github.com/emscripten-core/emscripten/pull/16450 solved it for https://github.com/emscripten-core/emscripten/blob/main/src/library_pthread.js#L290, but the same fix is still needed for https://github.com/emscripten-core/emscripten/blob/main/src/worker.js#L287 - those are two different places in the code where the same problem can occur
@gyagp any progress on the emscripten worker.js issue i've mentioned? it uses a global onmessage handler, so it intercepts all messages including the one you're using.
@vladmandic Sorry, I was occupied by some other things. @mattsoulanille, do you have any suggestion on comment 17?
@gyagp Maybe we can register our listener before Emscripten and suppress propagation of our event? However, I'm not sure if this is possible in a webworker created by Emscripten. Ideally, there would be something similar to CustomEvents that we could use as a unique event (we can't use window.dispatchEvent because it is synchronous).
I also updated Emscripten to 3.1.20, although this might not solve the issue.
@mattsoulanille @gyagp i just created an emscripten issue for this, as it's really something that should be patched there - imo anything you do here is just a workaround for bad emscripten behavior
@mattsoulanille Thanks and let's wait for Emscripten's feedback at https://github.com/emscripten-core/emscripten/issues/17844.
Will MessageChannel fit your needs here?
MessageChannel looks like it will work. Thanks @hujiajie for pointing it out to us! @gyagp, we should try this out as a drop-in replacement for window.postMessage.
A quick test:
let lastTime = new Date().getTime();
let i = 0;
function time() {
  i++;
  if (i % 1000 === 0) {
    const currentTime = new Date().getTime();
    const delta = currentTime - lastTime;
    console.log(`1000 iterations in ${delta} ms. ${delta / 1000} ms per iteration`);
    lastTime = currentTime;
  }
}
const channel = new MessageChannel();
channel.port1.onmessage = () => {
  channel.port2.postMessage('hello');
  time();
};
channel.port2.postMessage('hello');
This looks like it has no 4ms delay (while the same code under a setTimeout 'loop' does):
1000 iterations in 9 ms. 0.009 ms per iteration
1000 iterations in 11 ms. 0.011 ms per iteration
1000 iterations in 9 ms. 0.009 ms per iteration
1000 iterations in 10 ms. 0.01 ms per iteration
1000 iterations in 11 ms. 0.011 ms per iteration
1000 iterations in 9 ms. 0.009 ms per iteration
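For comparison, here is my assumption of the equivalent setTimeout version, which is the variant that hits the clamp:
// same time() and counters as above, but driven by setTimeout instead of MessageChannel
lastTime = new Date().getTime();
i = 0;
function timeoutLoop() {
  time();
  setTimeout(timeoutLoop, 0); // clamped to >= 4ms once the nesting level passes the spec threshold
}
timeoutLoop();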
imo, MessageChannel looks nice, but unlike window.postMessage, it's not in-process - it actually uses sockets on both sides.
perhaps not an issue (although it does seem more heavyweight), but do make sure there is a single reused MessageChannel - you don't want to exhaust ports stuck in a half-closed state. something like the sketch below.
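a minimal sketch of what i mean by a single reused channel (names are illustrative, not from any library):
// one channel created up front and reused for every scheduled callback
const channel = new MessageChannel();
const pending: Array<() => void> = [];
channel.port1.onmessage = () => {
  const fn = pending.shift();
  if (fn) fn();
};
function postTask(fn: () => void) {
  pending.push(fn);
  channel.port2.postMessage(null); // wakes port1 without the 4ms setTimeout clamp
}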
@mattsoulanille @gyagp the emscripten pr has been merged to main, so now it should just be a question of upgrading to the new version once it's out.
Thanks for the update @vladmandic, and for the extra info on MessageChannel. I agree that it would be better to stay in-process. Once emsdk 3.1.22 is published (and it looks like that will be soon, since there's a PR open for it), I'll update it in TFJS.
Sorry for the late response. MessageChannel also works and I updated my test (https://github.com/webatintel/webatintel.github.io/blob/main/timer.html#L44) to double confirm this. Given @vladmandic already fixed the issue in Emscripten upstream (thank you so much for the effort!), let's keep the current solution.