online Use a Worker to decompress and unpremultiply tile data

Resolves: #10052
Target version: master

Summary

This uses a web worker, when available, to decompress and un-premultiply tile image data. In the case of no worker available, or an unexpected error communicating with the worker, it falls back to the previous synchronous path. Anecdotally, I see a significant improvement in keyboard-initiated scrolling performance.

Previous versions of the patch showed significant improvements when profiled, I need to do the same with this version which does differ in ways significant enough that it could impact performance.

TODO

[x] Profile vs master

Checklist

[x] I have run make prettier-write and formatted the code.
[x] All commits have Change-Id
[ ] I have run tests with make check
[x] I have issued make run and manually verified that everything looks okay
[x] Documentation (manuals or wiki) has been updated or is not required

Oct 04 '24 11:10 Cwiiis

Promising results after profiling it vs master, similar to the previous rendition of the patch but with more pronounced results as this version now handles dehydration and never falls back to the synchronous path. I ran a stress test that consisted of the following:

Load a writer document that contains Dostoevsky's "Crime and Punishment" (any similarly large file will do, I believe)
Let the browser settle, then start profiling.
Focus the document and hold Page down.
Release at approximately page 100 and let the browser settle for some seconds.
Hold page up and scroll back to page 1.
Release and let the browser settle for some seconds before ending the profile.

In both cases, this test took me about 30 seconds.

Observations:

On master, 7.5gb of memory is allocated and almost all of it is released. On this branch, 4.5gb of memory is allocated and almost all of it is released, though the retained memory is slightly higher (presumably due to the pre-allocated tile buffers)
On master, there are several long periods of 'jank' (time when the browser misses a screen refresh and is unresponsive to user input) during heavy scrolling. On this branch, there is no registered jank at all.
Both master and this branch have a memory allocation pattern that resembles a jagged, rising edge - this is gradual increased memory allocation up to a point with GC interspersed. Notable that on master branch, the peak memory allocation is much higher (~791MB over the baseline, vs 418MB).
On master, the _queueSlurpEventEmission timeout takes 78% of the time on the profile, it takes 45% on this branch.
Focusing on that timeout handler, on master 54% of time is spent in resumeDrawing, 45% in _onMessage (and 30% in _applyDelta, with half of that 30% spent decompressing). On this branch, 60% of time is spent in resumeDrawing and 38% in _onMessage. Only 4.4% of time is spent in _applyCompressedDelta, which is this branch's equivalent to _applyDelta in this situation. A surprising 13% of time is spent in _onInvalidateCursorMsg, which feels like it might warrant investigation. One might've expected an increase in time spent drawing (54% to 60% of course is still an increase, but not dramatic), but I think these numbers are slightly distorted by the lack of jank meaning that more frames are rendered and there's more time for the browser's idle handler to process other non-drawing events.
There is little difference in the user experience of running this test between both branches. I would say it looks slightly more coherent on the master branch as it blocks for longer and has more of an opportunity to show what its rendered. Another avenue of investigation here is to synchronise page up/down handling with screen updates, it doesn't really make sense to move the scroll position and show blank, unrendered area if a better ordering of events could present a more coherent screen without loss of perceived performance.

Oct 04 '24 14:10 Cwiiis

lets check/see how --enable-bundle affects what the importScripts want to do I guess

Oct 16 '24 10:10 caolanm

As expected, --enable-bundle breaks the worker trying to load the fzstd node module - my current thinking is to use m4 in a similar way to how we generate bundle.js and embed fzstd into CanvasTileWorker.js, as well as in bundle.js. I'm going to work towards this, unless someone tells me it's a terrible idea or there's a better way of doing this :)

[Edited comment to reflect what I meant...]

Oct 16 '24 13:10 Cwiiis

I have a version that works with bundling now (I think), but it seems including fzstd in CanvasTileWorker.js then breaks uglifyjs... Not sure why this is yet, will continue with it.

Oct 16 '24 15:10 Cwiiis

I'm not particularly confident that I've done this the best or cleanest way, but this should now work when bundled (at least, it does locally - verification from someone else would be nice). As I suggested previously, this uses m4 to generate CanvasTileWorker.js and bundles fzstd there.

Oct 16 '24 16:10 Cwiiis

I think there is a glitch in the unrle refactoring, there are various "ERROR: Unknown delta code" errors in the console

Oct 17 '24 16:10 caolanm

I think there is a glitch in the unrle refactoring, there are various "ERROR: Unknown delta code" errors in the console

hmm, I wonder if something changed on the rebase - I didn't get to test it extensively after rebasing, so it could well be broken (but was working fine before, honest!) Hopefully I'll get some time to take a proper look soon. I assume it's easy to reproduce(?)

Oct 17 '24 16:10 Cwiiis

I assume it's easy to reproduce(?)

I can see it by just loading the "test/samples/writer-edit.fodt" that make run prompts and scroll down to the end, and the browser console will have those "ERROR: Unknown delta code" warnings in it.

Oct 22 '24 08:10 caolanm

This wasn't as simple as I thought it'd be but I'm working on it at the moment. The earlier commit is incorrect for not doing offset = unrle, but this was my untested attempt at factoring out that part and the real problem is later on - I'd assumed that keyframe and delta updates are always separate, which is true except when dehydrating, where deltas can follow a keyframe.

The worker path isn't really structured to deal with this due to the pre-allocated buffers - I could double the size of the buffers to allow for this, or separate keyframes and deltas when rehydrating. I think I'm going to attempt the latter as it'll require less memory.

Oct 22 '24 11:10 Cwiiis

I see some weirdness in calc I think

calc-test.webm

Oct 22 '24 13:10 caolanm

I see some weirdness in calc I think

I'm having issues getting calc to work at all at the moment (unrelated to this branch). While I try to debug what's going on there, I can at least suppose about this; assuming there isn't a logic error in the tile update code, the drawing updates are now asynchronous in ways it wasn't before and even in Writer, I notice that you can get a backlog of update events by behaving in a certain way - perhaps we need a way of throttling tile updates so that we don't have too many in flight at one time. Also, the scroll position can update before tiles have finished being processed, we could also do with a way of synchronising that without blocking the DOM thread (maybe a simple callback in CanvasTileLayer would facilitate both?)

I wonder also in the interests of getting this landed and testable, if it could be behind a setting that defaults to off and then we can work towards getting that on by default?

Oct 22 '24 14:10 Cwiiis

I got calc working and can't replicate this exact behaviour... What I think could be happening here is drawing not being paused around updates when the worker buffer queue is empty (i.e. the worker task queue is full) and you end up with updates that should be atomic being done in sequence. I'm going to push a commit that I think will help, but because I can't directly replicate I can't say for sure... Further feedback definitely appreciated.

Oct 22 '24 14:10 Cwiiis

After some discussion and some more stress testing, I realised my reuse of buffers was invalid. I've just pushed a simplification that removes reusable buffers entirely (which is a nice reduction in complexity/size). This has some side-effects:

Some of the GC that would've been pushed to the worker is now back on the main thread
The worker is no longer throttled by the pre-allocated buffer queue (this is probably a good thing, I don't think this was a good place for throttling to happen, though I do think we need some intelligent throttling - more below)

I need to do some profiling to see if that GC has any real effect - if it doesn't it's nice not to have the complication of managing the buffers. Anecdotally, it feels at least as fast as before when scrolling large documents.

With regards to the slow update in calc we see above - I would expect that this latest branch might do a little better, but possibly not. I think we're seeing a side-effect here of freeing up the main thread. My theory is that because before this patch there's so much jank where the main thread is blocked that we request and process fewer tiles, and so when tiles are missing, they're more likely to appear quickly because we request fewer tiles in total. If that's the case, we would need to be more intelligent when requesting tiles (i.e., not requesting tiles while there is a large backlog of to-be-processed tile data, and not requesting tiles while there is a large backlog of unserviced tile requests (the latter being the more likely situation with a remote server)).

There's also the very real possibility that there's just a bug around drawing - either a mismatched pause/resume (I don't think so, this would really show up very quickly) or a resume that isn't actually triggering a draw for some reason (much more likely). Consistent reproduction steps are always good - unfortunately I can't exactly reproduce the above as I only have 1080p monitors (but maybe there's some wayland magic to do virtual super resolution, like you can on Windows with nvidia/AMD drivers?)

Oct 23 '24 14:10 Cwiiis

With my stab at a debugging patch of https://github.com/Cwiiis/cool/compare/canvastileworker...caolanm:cool:canvastileworker to not block in onMessage in the worker thread and make a local queue of work that is processed by a timer it sort of looks like the scenario that the freed up main loop can request decompression of way more tiles than it could before.

e.g. with the above applied and in calc holding down pagedown for a few seconds, and then pageup the queue can grow to thousands of requests

Oct 24 '24 08:10 caolanm

With my stab at a debugging patch of Cwiiis/[email protected]:cool:canvastileworker to not block in onMessage in the worker thread and make a local queue of work that is processed by a timer it sort of looks like the scenario that the freed up main loop can request decompression of way more tiles than it could before.

e.g. with the above applied and in calc holding down pagedown for a few seconds, and then pageup the queue can grow to thousands of requests

Great idea! It would make sense to use an immediate timeout in the worker onmessage and let events accumulate over a tick to process at once. I think there are smarter ways to orchestrate than just a timeout but it's a great first step.

What might be better (or not) would be to accumulate on the main thread side where we have all the context around what the transaction is, then send in one big message - this would remove the need for so much pause/resume logic.

I'm away till next week, will think about it more then.

Oct 24 '24 09:10 Cwiiis

Ok, back on this - my current plan is to alter the flow of events to make it transactional. IIRC, message handling is already transactional in socket.js when emitting events and pauses drawing around that event handling. I would suggest removing pause/resumeDrawing (or at least, augmenting it - I don't know if there are uses elsewhere) and having a begin/endTransaction and a 'complete' callback. That way we can bundle many tile updates together in a single transaction and we can throttle event handling so that we don't end up with a huge queue of messages as the decompression worker struggles to keep up. This would also reduce the comms overhead, though I don't expect that's a big feature in the profile.

This would essentially change the behaviour of this patch queue so that it's the same as it currently is without workers, but the work of decompression/unpremultiply would happen in parallel with whatever else the main thread happens to be doing and not block the UI. Likelihood is the main thread would end up spending some time waiting idle on the worker, but that's ok. We can achieve greater parallelisation by using multiple workers to split the work load, and/or by allowing a certain amount of transactions to be in flight at once (but I expect that this latter suggestion would just end up with the bad behaviour that caolan has highlighted unless we also were smart about re-ordering the work).

Oct 30 '24 11:10 Cwiiis

I have a version of this that uses multiple workers but for whatever reason it isn't working well for me at all - I'll investigate that further, but it would be good to aim for just a single worker to begin with and build from there. This still has most of the benefits of the non-transactional era of the patch with regards to interactive performance.

Oct 30 '24 17:10 Cwiiis

After I load the demo ods then I seem to see a constant load, adding a breakpoint to _queueSlurpEventEmission seems to maybe suggest that when _onWorkerMessage gets something then it triggers _queueSlurpEventEmission which ends up triggering endTransaction and calling postMessage so the sequence never ends.

Maybe a check on this._pendingDeltas having a non-zero length before _worker.postMessage might be enough?

Oct 30 '24 17:10 caolanm

After I load the demo ods then I seem to see a constant load, adding a breakpoint to _queueSlurpEventEmission seems to maybe suggest that when _onWorkerMessage gets something then it triggers _queueSlurpEventEmission which ends up triggering endTransaction and calling postMessage so the sequence never ends.

Maybe a check on this._pendingDeltas having a non-zero length before _worker.postMessage might be enough?

whoops, this was fixed in my multiple-workers branch and I missed the backport - hopefully fine in the most recent push :)

Oct 30 '24 17:10 Cwiiis

I'm just debugging some issues with missed updates on this patch - once I find and solve this issue, I'm satisfied with the general performance of this for now. There are some obvious ways of improving GC on the main thread, but one thing at a time. This still mostly eliminates jank during scrolling.

Oct 31 '24 12:10 Cwiiis

Not quite there yet - took quite a while to track down some subtle behaviour difference with this new transactional approach (tile rehydration happens outside of transactions when scrolling - this affected both worker and non-worker paths), but there's still an issue with the worker path where it seems tiles don't update when they become clear, only when they have content. I've not, so far, tracked down why this is.

Oct 31 '24 17:10 Cwiiis

Very irritatingly, I tracked down the issue (after several hours debugging) and it's a bug in fzstd when using pre-allocated buffers. I'm not exactly sure of the nature of it, but it appears it expects the buffer to be zeroed out - given this is in a worker and we often need to hand off the buffer back to the main thread, I've removed that optimisation. Updated commits incoming.

Oct 31 '24 18:10 Cwiiis

Not quite ready yet, I'm trying to resolve an issue where tiles that haven't been rehydrated try to get immediately drawn (this is orchestrated outside of CanvasTileLayer) - I expect this will need some finessing of the tileNeedsFetch paths to add a kind of tile-needs-local-fetch allowance.

Nov 01 '24 11:11 Cwiiis

The last problem I'm fixing now, I believe, is that sometimes we garbage collect in the middle of drawing, which of course causes havoc when rehydration isn't synchronous. The fact that we forcibly GC things to then immediately recreate them seems like quite bad behaviour to me, but in the interests of this not ballooning more than it has already, I'll see about it just excluding tiles that have pending deltas.

Nov 01 '24 12:11 Cwiiis

This turned out to be a major refactor, but I'm slightly more confident in this approach than previous attempts. The worker and non-worker paths are now broadly equivalent, except that on the worker path, the work happens in another thread and asynchronously. A few things had to change;

We no longer GC tiles that have pending delta updates (while this works in the non-worker case, I believe this would have had a detrimental impact on performance)
We no longer rely on ensureCanvas to rehydrate tiles - they are explicitly rehydrated when they are made current
Tiles are always added to the dirty region when they're updated - I couldn't really work out the logic of how this was meant to work previously, but I don't see why they wouldn't always be added to the dirty region when they've changed With this refactor, I was at least pleased that all the bugs I could manifest on the worker path also manifested on the non-worker path, which made them an awful lot easier to fix.

Nov 01 '24 15:11 Cwiiis

There's still some oddness in calc - I can reproduce it without workers too, though it's easier to reproduce with workers (which isn't ideal). I think I might have to call it a day on this for this week, any further testing/insights appreciated!

Nov 01 '24 16:11 Cwiiis

Constant load is gone now, seems more smoother to scroll around the place in calc, but I still get this sort of an effect:

A3000-pageup-top-top-keep-held.webm

in calc, is that the sort of "oddness in calc" you mean. For me, I can reproduce as above (this recording is with chromium rather than firefox, might be worth checking both browsers if you haven't already) with A4000 in the indicated widget, return, click the cell to its right, and then hold down page up, keep it held even when scrolling has reached the top

Nov 01 '24 16:11 caolanm

Constant load is gone now, seems more smoother to scroll around the place in calc, but I still get this sort of an effect: A3000-pageup-top-top-keep-held.webm

in calc, is that the sort of "oddness in calc" you mean. For me, I can reproduce as above (this recording is with chromium rather than firefox, might be worth checking both browsers if you haven't already) with A4000 in the indicated widget, return, click the cell to its right, and then hold down page up, keep it held even when scrolling has reached the top

Yes, exactly - though that flickering I can reproduce with no workers, it's definitely worse with workers. Sometimes it doesn't settle with the cell visible, though it seems to reappear after a short wait (presumably when the prefetch tiles timeout gets hit). I haven't really managed to reason what's happening here...

I imagine, perhaps, the flickering is off-screen updates being caught up to causing the on-screen tiles to get GC'd? If that was the case, the check in garbageCollect in CanvasTileLayer that excludes tiles with tile.hasPendingDeltas > 0, I wonder if we changed that to checking tile.current instead, if this would go away? This would also explain why I can replicate it even without workers, but why it's much harder to replicate in that case.

Nov 01 '24 16:11 Cwiiis

It looks like my supposition was right, thanks to @caolanm's video reproduction - the flickering is caused by GCing visible tiles while off-screen updates catch up. We could be cleverer about discarding unnecessary work, but also we shouldn't be GC'ing visible tiles. I've updated the patch accordingly.

This was reproduceable in the non-worker path I believe because the server can still catch up later with off-screen updates that we process while we have cached, visible tiles that get GC'd and don't get immediately re-fetched. Harder to trigger, but not impossible.

Nov 01 '24 17:11 Cwiiis

Thanks a lot for your contribution, and congrats on your first pull request merged! Welcome aboard! 🎉🎉🎉

If you haven't decided on your next task yet, take a look at our easy-hacks. We're looking forward to your next pull request! :)

Nov 06 '24 12:11 welcome[bot]

online online copied to clipboard

Use a Worker to decompress and unpremultiply tile data

Summary

TODO

Checklist

online
online copied to clipboard