                        WebGLRenderer: Async Readback API - WIP
Aims to solve #22779
Not meant to be merged in its current state.
I have developed a temporary solution that works for any intended usage of asynchronous readback using three.js. This PR is meant to illustrate the performance gains that come with this feature. Alongside the temporary API, the PR also includes two examples of how users might utilize it.
GPU PICKING
This is the most common use of readPixels: the user simply requests a single asynchronous readback per frame. The use case and API are pretty much identical to what we currently have, with the only exception being that the method returns a Promise, which is resolved once the buffer is ready to be used by the CPU.
if ( renderer.capabilities.isWebGL2 ) {
	renderer.readRenderTargetPixelsAsync( pickingTexture, 0, 0, 1, 1, pixelBuffer )
		.then( () => processPick( pixelBuffer ) );
} else {
	renderer.readRenderTargetPixels( pickingTexture, 0, 0, 1, 1, pixelBuffer );
	processPick( pixelBuffer );
}
GPGPU ASYNC-READBACK
Now for the more complex use case, and the most interesting one imo: asynchronous readback on GPGPU pipelines, where multiple readPixels calls are made within a frame. This use case is also handled correctly; however, I could not think of a proper API that keeps things simple and similar to our current approach. There are many reasons why, but it boils down to the following:
- A single underlying glBuffer is not sufficient to handle multiple calls within a frame.
- Extra care is needed when handling readPixels to determine when a fenceSync is needed and when it's not.
- There's an inherent need to decouple readPixels calls from the actual readback procedure, getBufferSubData.
- WebGLRenderer is no longer the proper place to implement the feature.
With that being said, the proposed API does work and implements exactly the first 3 items:
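At its core, the pattern boils down to the sketch below ( plain WebGL 2 calls, not the PR's actual code; processPixels and the region variables are placeholders, and a framebuffer is assumed to be bound for reading ):
// allocate a PIXEL_PACK_BUFFER sized for the readback region
const pbo = gl.createBuffer();
gl.bindBuffer( gl.PIXEL_PACK_BUFFER, pbo );
gl.bufferData( gl.PIXEL_PACK_BUFFER, pixelBuffer.byteLength, gl.STREAM_READ );

// 1. readPixels writes into the bound PBO ( note the trailing byte offset instead of
//    a TypedArray ), so the call returns without stalling the CPU
gl.readPixels( x, y, width, height, gl.RGBA, gl.UNSIGNED_BYTE, 0 );
gl.bindBuffer( gl.PIXEL_PACK_BUFFER, null );

// 2. fence the GPU queue so we know when the data has actually landed in the PBO
const sync = gl.fenceSync( gl.SYNC_GPU_COMMANDS_COMPLETE, 0 );
gl.flush();

// 3. later, once the fence has signalled, getBufferSubData copies the data out
//    without blocking ( WAIT_FAILED handling omitted for brevity )
function tryRead() {
    if ( gl.clientWaitSync( sync, 0, 0 ) === gl.TIMEOUT_EXPIRED ) return setTimeout( tryRead, 4 );
    gl.deleteSync( sync );
    gl.bindBuffer( gl.PIXEL_PACK_BUFFER, pbo );
    gl.getBufferSubData( gl.PIXEL_PACK_BUFFER, 0, pixelBuffer );
    gl.bindBuffer( gl.PIXEL_PACK_BUFFER, null );
    processPixels( pixelBuffer );
}
setTimeout( tryRead );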
It is pretty obvious how powerful this feature is, so it's likely worth the effort of coming up with a new API, which I hope to find with the help of the community.
I believe most of my difficulty in integrating this feature into our current API stems from the fact that we try to hide the gl context and most of the WebGL objects from direct use by the user. In the end, I just exposed a method for creating and disposing a specialized PIXEL_PACK_BUFFER. However, I do believe it is possible to hide this by adding another layer of abstraction on top, much like we do with a lot of the other objects.
One reason I'm emphasizing an API rework is that this specific feature will go hand-in-hand with the upcoming WebGPU compute pipeline, so I deem it worth the extra effort of elaborating it now in order to have an easier path ahead. A specialized component inside WebGLRenderer - let's call it WebGLTransfer - similar to how we approach WebGLUniforms, WebGLBindingStates and so on, seems like the best course of action. Most of the modern lower-level graphics APIs, like Vulkan, WebGPU and Metal, have analogous concepts.
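Purely to illustrate the idea ( WebGLTransfer does not exist; every name and signature below is hypothetical ), such a component could be wired up like the other internal modules, owning the PIXEL_PACK_BUFFER / fence state and being polled once per frame by the renderer:
function WebGLTransfer( gl ) {

	const pending = []; // one entry per in-flight readback: { sync, buffer, dstBuffer, resolve }

	function enqueue( task ) {

		pending.push( task );

	}

	// polled once per frame: resolve every transfer whose fence has signalled,
	// leave the rest queued ( WAIT_FAILED handling omitted for brevity )
	function update() {

		for ( let i = pending.length - 1; i >= 0; i -- ) {

			const task = pending[ i ];
			if ( gl.clientWaitSync( task.sync, 0, 0 ) === gl.TIMEOUT_EXPIRED ) continue;

			gl.deleteSync( task.sync );
			gl.bindBuffer( gl.PIXEL_PACK_BUFFER, task.buffer );
			gl.getBufferSubData( gl.PIXEL_PACK_BUFFER, 0, task.dstBuffer );
			gl.bindBuffer( gl.PIXEL_PACK_BUFFER, null );

			pending.splice( i, 1 );
			task.resolve( task.dstBuffer );

		}

	}

	function dispose() {

		for ( const task of pending ) {

			gl.deleteSync( task.sync );
			gl.deleteBuffer( task.buffer );

		}

		pending.length = 0;

	}

	return { enqueue, update, dispose };

}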
I don't have as much free time as I wish I had, so if someone wants to patch this according to whatever is decided, please feel free to do it. I'll gladly review the code and comment on it. If not, I'll slowly work on it.
I am happy to help champion compute and readPixels APIs if we can find a reasonable design to work with (related: #14503, #21934). I have experimented with transform feedback (example) as a mechanism for compute in WebGL 2, but it's vertex-shader only. Providing async readPixels APIs via fenceSync seems more reasonable in the near term for GPGPU pixel shaders and as an intermediary step for transform feedback should that be realized.
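For context, the mechanism boils down to the calls below ( a generic WebGL 2 sketch, not the linked example; it assumes program was linked with gl.transformFeedbackVaryings( program, [ 'outValue' ], gl.SEPARATE_ATTRIBS ) and computes count float results in the vertex shader ):
const tf = gl.createTransformFeedback();
const outBuffer = gl.createBuffer();
gl.bindBuffer( gl.ARRAY_BUFFER, outBuffer );
gl.bufferData( gl.ARRAY_BUFFER, count * 4, gl.DYNAMIC_COPY ); // one float per invocation
gl.bindBuffer( gl.ARRAY_BUFFER, null ); // must not stay bound elsewhere while capturing

gl.useProgram( program );
gl.bindTransformFeedback( gl.TRANSFORM_FEEDBACK, tf );
gl.bindBufferBase( gl.TRANSFORM_FEEDBACK_BUFFER, 0, outBuffer );

gl.enable( gl.RASTERIZER_DISCARD );   // skip the fragment stage entirely
gl.beginTransformFeedback( gl.POINTS );
gl.drawArrays( gl.POINTS, 0, count ); // each vertex acts as one "compute" invocation
gl.endTransformFeedback();
gl.disable( gl.RASTERIZER_DISCARD );
gl.bindTransformFeedback( gl.TRANSFORM_FEEDBACK, null );
// outBuffer can then be read back asynchronously with the same fenceSync pattern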
I agree, that's my primary understanding and goal as well, and I have a few ideas...
- I thought of implementing a virtual mapping controller, a WebGLTransfer/Transform component, that would keep track of specialized buffer states, pack/unpack operations, and general GPU/CPU data transfer. In the future, I can see this also being extended to work underneath BufferAttributes and VAOs, which should feel more natural as a whole.
- Secondly, if we want carefully timed asynchronous communication between CPU & GPU - and trust me, we do - we're gonna need an AsyncManager. There's no two ways about it; this is where the magic happens. We're gonna enqueue async calls, handle fenceSync bindings, timed responses, debounces and so on. Basically, we're gonna keep track of active enqueued tasks and dispatch them appropriately, with a fence control similar to the one in this PR:
// `sync` is a WebGLSync created via gl.fenceSync( gl.SYNC_GPU_COMMANDS_COMPLETE, 0 ),
// `interval` is the probe debounce in milliseconds
return new Promise( ( resolveProbe, rejectProbe ) => {
    function probe() {
        switch ( gl.clientWaitSync( sync, gl.SYNC_FLUSH_COMMANDS_BIT, 0 ) ) {
            case gl.WAIT_FAILED:
                rejectProbe(); break;
            case gl.TIMEOUT_EXPIRED:
                setTimeout( probe, interval ); break;
            default:
                resolveProbe();
        }
    }
    setTimeout( probe );
} );
We could explore using microtasks instead of blank setTimeouts, but we would need a Polyfill for now - because of Safari 😠 Again, I can see this becoming a proper async manager API for Three.js in the future, covering things like dynamic loading, compilation and worker spawning.
In the past we shied away from implementing a lot of these features, because they were judged to be in the realm of what users should implement. However, WebGL2 offers many of them from the underlying implementation itself. So in my eyes, we can use this opportunity to introduce how these concepts work on a more niche feature and let them naturally get mixed into the core components.
This is my current view of how we should path these changes. However, if @mrdoob and others would prefer a more immediate PR to get this into production, we could just hide the glBuffer objects from the user and fix the number of parameters that we currently have on readPixels - perhaps by moving most of the regular parameters to a new WebGLSampler that encapsulates the framebuffer scissored view as well as the bound glBuffer methods. That would lead to a fairly simpler API:
renderer.readPixel( computeSampler, typedarray );
renderer.readPixelsAsync( computeSampler, {
        sync: true,           // should enqueue a fence
        interval: 10,         // ms - debounce
        readback: typedarray, // if undefined, just fetch - no copy
} );
I know the full-fledged async pipeline scares some people, but it's honestly not too bad for the end-user.
If you are on a timed running loop ( main thread, with requestAnimationFrame ), you simply attach/register a task with the AsyncQueue subsystem. This is analogous to a WebGPU compute pipeline, where descriptor-instantiated pipelines are attached to the gpu-queue. We would need to handle execution of these with registered yield/debounce options/descriptors and an associated setTimeout and microtask dispatch system.
This means we need to handle it through instantiation parameters, presets and task priority-queue pops. Here's an example of using said API:
// init-time
if ( renderer.capabilities.isWebGL2 ) {
    asyncQueue = new THREE.AsyncQueue();
    computeSampler = new THREE.WebGLSampler( renderTarget, { bounds: new THREE.Box2( /**/ ) } );
    transferSampler = new THREE.WebGLSampler( renderTarget, { bounds: new THREE.Box2( /**/ ) } );
}
// loop time 
renderer.render( scene, camera );
asyncQueue.readPixels( computeSampler, typedarray, {
    // typedarray [==] undefined - no sync/copy, only flush gpu-queue
    yieldable: false,  // [?Boolean/Number] blocks main-thread on callback
    debounce: 6,       // [?Number] milliseconds - sync probe interval
} ).then( () => {  /* . . .  */  } );
//  and/or spread tasks
asyncQueue.fetchPixels( computeSampler, { 
    debounce: 6,
    // what should be the default debounce strategy? very application dependent.
    // provide regular schemes / prediction implementations for ease of use.
    // the world is your oyster on this optional
} ).then( () => {  /* . . .  */  } );
asyncQueue.copyPixels( transferSampler, typedarray, { 
    yieldable: 4,   // [?Number] -  sub-task yield/generator
    // if it predicts an insufficient time window ( less than .yieldable ),
    // copy only partial data to typedarray, hold sampler priority on queue for next frame
} ).then( () => {  /* . . .  */  } );
If the proposed path is alright, starting with an initial PR implementing WebGLSampler and the underlying state handler WebGLTransfer would help me a lot. I can work on the async queue on my own, since I do believe it has more intricate functionality.
Updated Example - debounced & dispatch frequency, mobile-friendly.
We could explore using microtasks instead of blank setTimeouts, but we would need a Polyfill for now - because of Safari 😠
Would a fallback to setTimeout work instead? How much extra code would that add?
Not much - I've seen similar code in React's scheduler, etc.:
// fallback for browsers without queueMicrotask ( e.g. older Safari ):
// piggyback on the Promise microtask queue and re-throw errors in a
// setTimeout so they still surface as uncaught exceptions
if ( typeof window.queueMicrotask !== 'function' ) {
  window.queueMicrotask = function ( callback ) {
    Promise.resolve()
      .then( callback )
      .catch( error => setTimeout( () => { throw error; } ) );
  };
}
We could explore using microtasks instead of blank setTimeouts, but we would need a Polyfill for now - because of Safari 😠
Would a fallback to setTimeout work instead? How much extra code would that add?
I think a fallback could work, but it can't offer the exact same functionality; there are ever so slight differences in the expected behavior. But yeah, I would need to refresh my memory on microtasks as well, to be sure.
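For illustration ( an assumed example, not from the PR ), the difference is mostly about when the callback runs relative to timers and rendering:
queueMicrotask( () => console.log( 'microtask' ) ); // runs as soon as the current script yields
setTimeout( () => console.log( 'macrotask' ), 0 );  // runs on a later task, possibly after a render
console.log( 'sync' );
// native queueMicrotask / the Promise-based polyfill log: sync, microtask, macrotask
// a plain setTimeout fallback would demote the 'microtask' callback to the timer queue instead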
What can never work, however, is using this feature with WebGL1, unfortunately.
@donmccurdy @Mugen87 - sorry for pinging, but any opinions on these changes? I imagine many user-level websites/portfolios/demos would benefit from the use of asynchronous gpu-calls.
I'm really not strong-minded about any of the suggestions, I just wanted to get some traction so a solution path gets going. This is really, really important for enabling efficient out-of-the-box Three.js gpgpu and high-performance applications. I have a Frankenstein solution that works for my projects, but it may not work for others.
Fairly certain @gkjohnson ( also sorry for pinging, btw 😄 ) / three-gpu-pathtracer would also be able to use this and similar fenceSync transform callbacks to accelerate performance, not to mention many other custom builds that use Three.js as a rendering framework.
I'm very supportive of this feature and in general think we should be encouraging asynchronous readback APIs as much as possible. I'd argue we might even strongly consider replacing the synchronous API with an async one entirely. While maybe convenient, reading back pixels before they're ready will always result in unnecessary performance stalls. So it would be nice to see this kind of feature merged at some point for GPGPU work. There are some nice uses for raycasting, and likely three-mesh-bvh raycasting and data generation, that would be neat to see.
In terms of utility for three-gpu-pathtracer the only place that pixel readback is happening is for pre-filtering the environment map which takes ~15-30ms. Not a crazy amount of time but that is up to two frames of lost parallel work.
I don't have the bandwidth at the moment to dig into the code in this PR but I appreciate the work on this!
Thanks for commenting Garrett, appreciate it.
In terms of utility for three-gpu-pathtracer the only place that pixel readback is happening is for pre-filtering the environment map which takes ~15-30ms. Not a crazy amount of time but that is up to two frames of lost parallel work.
Oh, I thought you were making heavier use of data syncing. My mistake then, but still, other aspects of the associated async API (like fenceSync) are certainly useful one way or the other, should we later expand it to things like dynamic compilation, asset loading and data transfer.
I don't have the bandwidth at the moment to dig into the code in this PR but I appreciate the work on this!
That's alright, the code is mostly boilerplate at this point. The main focus is to find the preferred path to implementing gpu readback, and which associated API fits the larger audience best.