major difference in performance between sync and async methods to download tensor data
when using webgl backend, tensor.dataSync() is about 35% faster than await tensor.data() when returning larger amounts of data
(in my case, model takes 720x720x3 image as input and produces 720x720x3 image as output)
actual timings are ~100ms for tensor.dataSync() and ~140ms for await tensor.data()
note that there is nothing else executing, this is a single active loop
i wouldn't mind using dataSync(), but then again any sync calls are not compatible with webgpu backend, and maintaining two separate codepaths for webgl and webgpu is a non-starter
btw, model in question can be found at https://github.com/vladmandic/anime
environment: tfjs 3.19.0 on chrome 103
We once investigated why there is a big perf difference between WebGL data() and dataSync(). The basic finding is below.
WebGLRenderingContext.readPixels() is a blocking operation: it waits for all previous rendering operations to finish, then reads data from GPU to CPU, which may block the main thread for quite a long time. To mitigate this, in the async path (that's the TFJS implementation of tensor.data()), we insert a fence right after the rendering operations, then use a timer (for example, setTimeout) to poll the fence's status. If the fence is not yet satisfied, we can do something else in parallel; once it is satisfied, readPixels() returns without a long block because the rendering operations have already finished. With an ideal timer, the fence can unblock the UI, hide the latency behind parallel tasks, and read the data back promptly. However, according to the HTML spec (https://html.spec.whatwg.org/multipage/timers-and-user-prompts.html#:~:text=If%20nesting%20level%20is%20greater%20than%205%2C%20and%20timeout%20is%20less%20than%204%2C%20then%20set%20timeout%20to%204), if the nesting level is greater than 5 and the timeout is less than 4ms, the timeout is changed to 4ms, which introduces a delay of at least 4ms per poll. You see about a 40ms difference, so I'm not sure whether there are other reasons too. To mitigate this, there are 3 possibilities:
- We already found a tricky way to introduce a new setTimeout(), which will be implemented in TFJS (Sorry that this work has been delayed for a while).
- In Chrome Canary, the nesting threshold has also been changed from 5 to 100 (https://groups.google.com/a/chromium.org/g/blink-dev/c/-TjeYs7shTQ/m/FhJq0mQyDAAJ?pli=1). You may have a quick try with Chrome Canary (not Stable) to see if your problem is alleviated.
- Another option is to always keep the async path in your code (sorry, but by design WebGPU is mostly async while WebGL is always sync), and wrap tensor.dataSync() in an async function for the WebGL backend, as sketched below.
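For illustration, a minimal sketch of that wrapper idea (readTensor is just an illustrative name, not a TFJS API):
import * as tf from '@tensorflow/tfjs';
// hypothetical helper: one async call site works on both backends
async function readTensor(t: tf.Tensor) {
  // on WebGL, dataSync() avoids the setTimeout polling delay;
  // WebGPU has no sync readback, so fall through to data()
  return tf.getBackend() === 'webgl' ? t.dataSync() : t.data();
}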
More details of our investigation can be found at https://docs.google.com/document/d/13Up4P7Or9uliiU2RZfzOk1aQ5yjFSjG2M6k16BkQn4U/edit. If you get a chance to check with Chrome Canary, please let me know the result. We'll take a further look if it doesn't greatly alleviate your situation. Thanks!
@gyagp
thanks for the explanation! tried with chrome canary 106.0.5206.0:
- await tensor.data() -> average 107ms
- tensor.dataSync() -> average 99ms
difference is definitely smaller, but still noticeable at 10%
i guess this comes down to the polling delay you've mentioned
looking forward to the new timeout method
if you have a pr for that, i can monitor and try once it's ready
Thanks for the quick try! It seems the 4ms delay in setTimeout() contributes most of the gap between async and sync calls. Once that's solved, I think another gap comes from polling, like you mentioned. More specifically, the browser needs to check the result at some frequency; if it checks too frequently, too much power is drained. In the current Chrome implementation, the task scheduler ticks at about 1ms (the exact implementation is more complicated), which is the general cost of async calls. In the WebGPU path, we already found some perf issues related to the GPU task scheduler, and we're fixing them. We will double check whether the WebGL path can be improved. Stay tuned for the new setTimeout :)
@gyagp Thank you for the detailed explanation. We hope to unify the API between WebGL and WebGPU without a major performance discrepancy, so we're looking forward to the new timeout mechanism. Thanks.
@vladmandic I have a PR (under review) at https://github.com/tensorflow/tfjs/pull/6694. May I ask your help to give it a try? Currently the custom setTimeout is gated by an environment variable "USE_SETTIMEOUTWPM", so you need to enable it first with "tf.env().set('USE_SETTIMEOUTWPM', true);". Thanks!
@gyagp
after enabling USE_SETTIMEOUTWPM, i'm getting errors in tfjs-backend-wasm-threaded-simd.worker.js
worker.js received unknown command undefined
and i'm clearly not using wasm here at all, it's just that my tfjs bundle has all backends pre-linked
my guess is the issue is that you're using the browser's window object blindly and registering a method on it:
tfjs-core/src/util_base.ts
(window as any).setTimeoutWPM(tryFn, nextBackoff);
registering a function directly on the global namespace is bad practice,
but referencing the browser's global namespace without even checking that the code is running in a browser context is an outright fail
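a minimal sketch of the kind of guard i'd expect (the assigned function here is a placeholder, not the actual implementation):
// resolve whichever global object exists instead of assuming window
const globalObject: any =
  typeof window !== 'undefined' ? window :
  typeof self !== 'undefined' ? self : globalThis;
// placeholder registration - and even this is better kept module-scoped
globalObject.setTimeoutWPM = (fn: () => void, delay: number) => setTimeout(fn, delay);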
@gyagp
i've removed tfjs-backend-wasm from my tfjs bundle and the test now runs
results show some improvement, but still far from sync data download performance
tensor with shape [1, 720, 720, 3], running model inference 500 times and calculating average time for tensor download:
- using tensor.dataSync() -> 94 ms
- using await tensor.data() -> 124 ms
- using await tensor.data() with USE_SETTIMEOUTWPM = true -> 116 ms
repo used for test: https://github.com/vladmandic/anime
Thank you so much @vladmandic! This is not a dramatic improvement, but the 4ms delay is real and critical for an application that targets 60FPS or cares about every millisecond (at 60FPS the frame budget is only ~16.7ms, so a fixed 4ms delay eats roughly a quarter of it). I will polish my PR and try to get it merged. After this, I will investigate the gap between data() and dataSync() again to see whether it can be narrowed further.
@gyagp
but 4ms delay is real and critical for an application that targets 60FPS, or cares about every millisecond
very true
i've tried USE_SETTIMEOUTWPM on a different use-case:
an app that uses multiple small models, so there are multiple await tensor.data() calls for each input frame
each data result is tiny, it's just that there are quite a few of them
and the performance difference is much bigger - 15-20% overall improvement, and that is huge:
- average on 500 frames: ~122ms
- average on 500 frames with USE_SETTIMEOUTWPM: ~103ms
repo used for test: https://github.com/vladmandic/human
I have updated the PR, and it's under final review. @vladmandic, may I ask you to test it again? I don't want to introduce any regression for your cases. By the way, I changed the flag name, so you now need to call "tf.env().set('USE_SETTIMEOUTCUSTOM', true);" first.
as soon as @tensorflow/tfjs-backend-wasm is loaded and USE_SETTIMEOUTCUSTOM is enabled, there are problems regardless of whether WASM is used or not, as WASM init code happens on module load
worker.js received unknown command undefined
threadPrintErr @ tfjs.esm.js:81524
self.onmessage @ tfjs.esm.js:81594
postMessage (async)
(anonymous) @ tfjs.esm.js:8814
setTimeout (async)
setTimeoutCustom @ tfjs.esm.js:8813
issue is that worker.js in tfjs-backend-wasm listens to all messages and throws an error because it does not know about this particular message
self.onmessage = (e) => {
...
} else {
err('worker.js received unknown command ' + e.data.cmd);
err(e.data);
}
where err() is just a wrapper around console.error(text) via threadPrintErr()
i don't think that is a bug in your code, but rather bad behavior from the wasm worker script, as it should not treat any unknown message as an error
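for illustration, the kind of tolerant handler i mean (a sketch only, not the actual emscripten code):
self.onmessage = (e) => {
  const cmd = e.data && e.data.cmd;
  if (cmd === undefined) {
    // message not addressed to this worker (e.g. someone else's scheduling ping) - ignore silently
    return;
  }
  if (cmd === 'run') {
    // ... handle known commands here
  } else {
    err('worker.js received unknown command ' + cmd);
  }
};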
bigger problem is that the WASM worker script is generated by Emscripten and only lightly modified by the TFJS build process
actual source for worker.js and the offending line is in the Emscripten repo:
https://github.com/emscripten-core/emscripten/blob/5c4b1675ca5ba34e283b21b3c01df092827db628/src/worker.js#L288
and it seems this is directly related: https://github.com/emscripten-core/emscripten/pull/16450
@vladmandic I've created a simple sample to try to reproduce the perf difference, and the perf results show:
- average 'await tensor.data()' on 500 runs: ~53.23ms
- average 'tensor.dataSync()' on 500 runs: ~44.08ms
- average 'await tensor.data()' on 500 runs with USE_SETTIMEOUTWPM: ~39.73ms
If you want to try this, you can follow these steps:
- clone/checkout the branch: https://github.com/shaoboyan/tfjs/tree/settimeout
- Build tfjs-core in the tfjs/link-package folder with command yarn build-deps-for tfjs-core
- Build tfjs-backend-webgl in the tfjs/link-package folder with command yarn build-deps-for tfjs-backend-webgl
- Start a server in the folder above the tfjs folder with command npx http-server
- Start Chrome Canary and access http://127.0.0.1:8080/tfjs/e2e/benchmarks/local-benchmark/index_tfjs.html?backend=webgl&localBuild=webgl,core. Open the console and the result is there.
My code snippet is:
await tf.setBackend("webgl");
await tf.ready();
let value = [];
for (var i = 0; i < 1440 * 1440; ++i) {
value[i] = i / 1440;
}
const tensor = tf.tensor2d(value, [1440, 1440]);
const tensor2 = tf.tensor2d(value, [1440, 1440]);
const result = tf.matMul(tensor, tensor2);
const tensorTest = tensor;
//warm up
for (var i = 0; i < 500; ++i) {
const result = tf.matMul(tensor, tensor2);
result.dataSync();
result.dispose();
}
let start = 0; //performance.now();
let total = 0;
tf.env().set('USE_SETTIMEOUTCUSTOM', true);
start = performance.now();
for (var i = 0; i < 500; ++i) {
const result = tf.matMul(tensor, tensor2);
await result.data();
result.dispose();
}
total = performance.now() - start;
console.log('await tensor.data() readback time origin: ' + total / 500);
total = 0;
start = performance.now();
for (var i = 0; i < 500; ++i) {
const result = tf.matMul(tensor, tensor2);
result.dataSync();
result.dispose();
}
//total += performance.now() - start;
total = performance.now() - start;
console.log('tensor.dataSync() readback time: ' + total / 500);
tf.env().set('USE_SETTIMEOUTCUSTOM', false);
total = 0;
start = performance.now();
for (var i = 0; i < 500; ++i) {
const result = tf.matMul(tensor, tensor2);
await result.data();
result.dispose();
}
total = performance.now() - start;
console.log('await tensor.data() readback time custom: ' + total / 500);
I'm not sure what the difference is between this sample and your sample. Do you have any ideas?
And is it possible for me to run your sample locally so I can do more profiling? If so, your guidance would be appreciated.
@vladmandic For background, I asked @shaoboyan to investigate the diff between data() and dataSync() (116ms vs. 94ms) even when we use the custom setTimeout, as you reported earlier: using tensor.dataSync() -> 94 ms; using await tensor.data() -> 124 ms; using await tensor.data() with USE_SETTIMEOUTWPM = true -> 116 ms.
@gyagp @shaoboyan
i just redid all the tests - seems there is quite a difference depending on a few things:
- timings for initial iterations are all over the place, and not just the 1st inference - results don't stabilize until much later, so i'm now throwing those away and only measuring average timings after the 100th iteration
- timings vary drastically depending on whether the test is synthetic (like your example) or actually involves model inference (my original example)
- if the test includes model inference, the difference in results seems to be due to extra pending ops on the browser's webgl pipeline and nothing to do with tensor download
results:
download tensor data after model.execute():
- await tensor.data(): 175ms
- await tensor.data() with USE_SETTIMEOUTCUSTOM: 170ms
- tensor.dataSync(): 165ms
download tensor data after model.execute() and gl.clientWaitSync:
- await tensor.data(): 114ms
- await tensor.data() with USE_SETTIMEOUTCUSTOM: 101ms
- tensor.dataSync(): 102ms
download tensor data for manually created/modified tensor:
- await tensor.data(): 104ms
- await tensor.data() with USE_SETTIMEOUTCUSTOM: 93ms
- tensor.dataSync(): 94ms
all-in-all, using the webgl backend impacts result measurement a lot if the webgl pipeline is not taken into consideration
(as @pyu10055 once explained, model.execute() is not fully sync on its own)
other than that, USE_SETTIMEOUTCUSTOM does indeed bring async download to the same level of performance as sync
(actually even a touch faster than the sync call, but within the margin of error as the difference is below 1ms)!
btw, this is my code for gl.syncWait:
async function syncWait(gl: WebGL2RenderingContext): Promise<number> {
  // insert a fence after all currently queued GL commands
  const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
  if (!sync) return 0;
  const ts = performance.now();
  return new Promise((resolve, reject) => {
    const loop = () => {
      // poll the fence without blocking (timeout of 0)
      const status = gl.clientWaitSync(sync, gl.SYNC_FLUSH_COMMANDS_BIT, 0);
      if (status === gl.WAIT_FAILED) reject(new Error('clientWaitSync: wait failed'));
      else if (status === gl.ALREADY_SIGNALED || status === gl.CONDITION_SATISFIED) {
        gl.deleteSync(sync);
        resolve(performance.now() - ts);
      } else {
        setTimeout(loop, 0); // still pending - check again on the next tick
      }
    };
    loop();
  });
}
and a simplified code example would be:
const tensor = model.execute(input);
await syncWait(tf.backend().getGPGPUContext().gl);
const timeStamp = performance.now();
const data = await tensor.data();
totalTime += (performance.now() - timeStamp);
log({ avgDataTime: totalTime / ++numFrames });
@vladmandic
Thanks for trying and the detailed data! But I still have a concern about the previous model average times. I suspect the async download might cause the perf diff due to the different scheduling method in the browser.
So would you mind testing again, including model.execute(input) and measuring after several warmup rounds? If the gap still exists, I think a profile to understand it is still worthwhile.
@shaoboyan i'm not sure what the ask is - i've already done that - not just several warmup rounds, but 100, and then measuring the average between rounds 100 and 200. yes, the gap still exists, but a) it's much smaller, to the point of being close enough, and b) it's due to the webgl pipeline also performing timer checks, which interferes with the browser's handling of ticks.
my much bigger concern is the blocking issue of the new code being in conflict with tfjs-backend-wasm.
b) it's due to the webgl pipeline also performing timer checks, which interferes with the browser's handling of ticks.
@vladmandic Thanks. I misunderstood your tests. I think this reason resolves my concern.
as soon as @tensorflow/tfjs-backend-wasm is loaded and USE_SETTIMEOUTCUSTOM is enabled, there are problems regardless of whether WASM is used or not, as WASM init code happens on module load: worker.js received unknown command undefined [...] issue is that worker.js in tfjs-backend-wasm listens to all messages and throws an error because it does not know about this particular message [...] bigger problem is that the WASM worker script is generated by Emscripten and only lightly modified by the TFJS build process
actual source for worker.js and the offending line is in the Emscripten repo: https://github.com/emscripten-core/emscripten/blob/5c4b1675ca5ba34e283b21b3c01df092827db628/src/worker.js#L288 and it seems this is directly related: emscripten-core/emscripten#16450
I tried the e2e benchmark in TFJS on both Windows and macOS, but I couldn't repro the issue you mentioned. I double checked the Emscripten source code (_bazel_user/qpnosdr3/external/emscripten_bin_win/emscripten/src/library_pthread.js) we use now; it's still the problematic one, and I think its version is 3.1.4. Emscripten fixed this problem (https://github.com/emscripten-core/emscripten/pull/16450) on Mar 10, and the first release after the fix was 3.1.8 on Mar 24. So this is still a valid problem. @pyu10055 @mattsoulanille (who should still be on vacation now), do you think we can upgrade the emsdk version now?
@gyagp i'm not sure upgrading emscripten would solve it, as the fix https://github.com/emscripten-core/emscripten/pull/16450 solved it for https://github.com/emscripten-core/emscripten/blob/main/src/library_pthread.js#L290, but the same fix is still needed for https://github.com/emscripten-core/emscripten/blob/main/src/worker.js#L287 - those are two different places in the code where the same problem can occur
@gyagp any progress on the emscripten worker.js issue i've mentioned? it uses a global onmessage handler, so it intercepts all messages including the one you're using.
@vladmandic Sorry, I was occupied by some other things. @mattsoulanille, do you have any suggestion on comment 17?
@gyagp Maybe we can register our listener before Emscripten and suppress propagation of our event? However, I'm not sure if this is possible in a webworker created by Emscripten. Ideally, there would be something similar to CustomEvents that we could use as a unique event (we can't use window.dispatchEvent because it is synchronous).
I also updated Emscripten to 3.1.20, although this might not solve the issue.
@mattsoulanille @gyagp i just created an emscripten issue for this, as it's really something that should be patched there - imo anything you do here is just a workaround for bad emscripten behavior
@mattsoulanille Thanks and let's wait for Emscripten's feedback at https://github.com/emscripten-core/emscripten/issues/17844.
Will MessageChannel fit your needs here?
MessageChannel looks like it will work. Thanks @hujiajie for pointing it out to us! @gyagp, we should try this out as a drop-in replacement for window.postMessage.
A quick test:
let lastTime = new Date().getTime();
let i = 0;
function time() {
  i++;
  if (i % 1000 === 0) {
    const currentTime = new Date().getTime();
    const delta = currentTime - lastTime;
    console.log(`1000 iterations in ${delta} ms. ${delta / 1000} ms per iteration`);
    lastTime = currentTime;
  }
}
const channel = new MessageChannel();
channel.port1.onmessage = () => {
  channel.port2.postMessage('hello');
  time();
};
channel.port2.postMessage('hello');
This looks like it has no 4ms delay (while the same code under a setTimeout 'loop' does):
1000 iterations in 9 ms. 0.009 ms per iteration
1000 iterations in 11 ms. 0.011 ms per iteration
1000 iterations in 9 ms. 0.009 ms per iteration
1000 iterations in 10 ms. 0.01 ms per iteration
1000 iterations in 11 ms. 0.011 ms per iteration
1000 iterations in 9 ms. 0.009 ms per iteration
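For comparison, here is my assumption of the equivalent setTimeout version, which is the variant that hits the clamp:
// same time() and counters as above, but driven by setTimeout instead of MessageChannel
lastTime = new Date().getTime();
i = 0;
function timeoutLoop() {
  time();
  setTimeout(timeoutLoop, 0); // clamped to >= 4ms once the nesting level passes the spec threshold
}
timeoutLoop();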
imo, MessageChannel looks nice, but unlike window.postMessage, it's not in-process - it actually uses sockets on both sides.
perhaps not an issue (although it does seem more heavyweight), but do make sure there is a single reused MessageChannel - you don't want to exhaust ports stuck in a half-closed state. something like the sketch below.
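a minimal sketch of what i mean by a single reused channel (names are illustrative, not from any library):
// one channel created up front and reused for every scheduled callback
const channel = new MessageChannel();
const pending: Array<() => void> = [];
channel.port1.onmessage = () => {
  const fn = pending.shift();
  if (fn) fn();
};
function postTask(fn: () => void) {
  pending.push(fn);
  channel.port2.postMessage(null); // wakes port1 without the 4ms setTimeout clamp
}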
@mattsoulanille @gyagp the emscripten pr has been merged to main, so now it should just be a question of upgrading to the new version once it's out.
Thanks for the update @vladmandic, and for the extra info on MessageChannel. I agree that it would be better to stay in-process. Once emsdk 3.1.22 is published (and it looks like that will be soon, since there's a PR open for it), I'll update it in TFJS.
Sorry for the late response. MessageChannel also works and I updated my test (https://github.com/webatintel/webatintel.github.io/blob/main/timer.html#L44) to double confirm this. Given @vladmandic already fixed the issue in Emscripten upstream (thank you so much for the effort!), let's keep the current solution.