Investigation: Programmable Blending


Motivation

There have been quite a few requests for pieces of custom blending, but the analysis hasn’t been linked together into a coherent investigation.

Existing issues:

The use case is being able to have materials in a rendered scene that are not simply blended (or min’ed or max’ed) with the rest of the scene. This isn’t really relevant to physically-based renderers; it matters more for things like cartoon shaders.

This is used in Lumberyard, Just Cause 3, Grid 2, and The Forge.

Similarly, achieving effects like the “vibrancy” effect used all over macOS and iOS would require custom blending. Vibrancy uses a custom formula to make sure the foreground is always visible and readable on top of any possible background. Here’s an example:

[Screenshot: the macOS vibrancy effect]

Programmable blending functionality can’t be emulated by either API-level texture barriers or by adding additional render passes, because there’s nowhere to save the intermediate results of overlapping geometry. This investigation is about additional capabilities, rather than additional performance.

Difficulty

There are two distinct pieces here:

  • Being able to read from (and write to) the rendering destination
  • Because the order of fragment shader execution is undefined, overlapping geometry needs some synchronization so that the read/modify/write cycle is race-free for each pixel.

Unfortunately, support in the various APIs is different for each of these pieces.

Direct3D

Direct3D has no facility for reading from the framebuffer (that I could find). However, you can bind a texture as a RWTexture; if you do this, your reads and writes are unordered.

In Shader Model 5.1, there’s another object which is a drop-in replacement for RWTextures: RasterizerOrderedTextures. These have the guarantee that all operations on this resource, between any two fragment shader invocations which target the same framebuffer location (and level and sample), will be strictly ordered. Moreover, the ordering is guaranteed to match API submission order.

This means that, if you bind the destination texture as a UAV, rather than binding it as a framebuffer, you can do programmable blending on that resource.
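For illustration, here is a minimal HLSL sketch of that pattern (mine, not from any shipping code; the resource name and blend formula are invented, and typed UAV loads for the chosen format are assumed to be supported):

// Destination bound as a rasterizer ordered view (a UAV in slot u0)
// instead of as a render target.
RasterizerOrderedTexture2D<float4> gOutput : register(u0);

void main(float4 position : SV_Position)
{
    uint2 pixel = uint2(position.xy);
    // The read/modify/write below is race-free per pixel: the ROV orders
    // it between overlapping fragments, in submission order.
    float4 dst = gOutput[pixel];
    float4 src = float4(1.0, 0.0, 0.0, 0.5); // stand-in for this fragment's color
    gOutput[pixel] = lerp(dst, src, src.a);  // custom blend formula goes here
}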

It looks like it’s a requirement that all D3D12 devices support Shader Model 5.1. However, support for Rasterizer Ordered Views is optional; to detect support, check the ROVsSupported field of the structure returned by ID3D12Device::CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS).
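The capability check, sketched in C++ (assuming device is a valid ID3D12Device*):

D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
if (SUCCEEDED(device->CheckFeatureSupport(
        D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options))) &&
    options.ROVsSupported) {
    // Rasterizer ordered views are available on this device.
}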

macOS Metal

Similarly to Direct3D, macOS Metal doesn’t have any facility for reading from the framebuffer. However, you can do the same trick of binding the texture as a texture2d<float, access::read_write> instead of binding it to the framebuffer.

Then, you can mark the texture as belonging to a “raster order group”, which gives the same guarantees that RasterizerOrdered resources have in HLSL. You do this by simply annotating the texture argument with [[raster_order_group(0)]].
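A minimal sketch of what that looks like in Metal Shading Language (the names and the blend formula are mine, invented for the example):

#include <metal_stdlib>
using namespace metal;

fragment void blendFragment(
    float4 position [[position]],
    // Destination bound as a read/write texture in raster order group 0;
    // accesses to it are ordered between overlapping fragments.
    texture2d<float, access::read_write> dst [[texture(0), raster_order_group(0)]])
{
    uint2 coord = uint2(position.xy);
    float4 background = dst.read(coord);
    float4 foreground = float4(1.0, 0.0, 0.0, 0.5); // stand-in for this fragment's color
    // Custom blend formula; ordinary "over" blending as a placeholder.
    dst.write(foreground * foreground.a + background * (1.0 - foreground.a), coord);
}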

Unfortunately, not all hardware supports raster order groups, and support isn’t aligned with any of the existing GPU Family demarcations; instead, authors have to check device.areRasterOrderGroupsSupported. Also, not all hardware supports access::read_write textures; authors have to query support via the MTLDevice.readWriteTextureSupport property.
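The two checks might be combined like this (a sketch in Objective-C; which read-write tier you need depends on the texture format):

// device is an id<MTLDevice>. Tier 2 read-write texture support covers
// common color formats; tier 1 only covers a few 32-bit formats.
BOOL canDoProgrammableBlending =
    device.areRasterOrderGroupsSupported &&
    device.readWriteTextureSupport == MTLReadWriteTextureTier2;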

iOS Metal

iOS Metal has the same concept of raster order groups, but extends it to work with the framebuffer. The fragment shader can mark a value as both a framebuffer color and a raster order group by annotating it with [[color(0), raster_order_group(0)]]. It can read this value from the framebuffer by simply taking the same struct as an input parameter to the shader:

// The same struct describes both the input (the current contents of
// color attachment 0) and the output (the value to write back).
struct PixelShaderOutput {
    uint result [[color(0), raster_order_group(0)]];
};

// Taking PixelShaderOutput as a parameter makes the current framebuffer
// value readable inside the fragment shader, ordered by the raster order group.
fragment PixelShaderOutput fragmentShader(PixelShaderOutput pixelShaderOutput) {
    ...
}

This means that programmable blending works naturally.

Vulkan

The story on Vulkan is much more complicated: https://github.com/KhronosGroup/Vulkan-Ecosystem/issues/27. Nothing is present in pure Vulkan, but there are some extensions:

VK_EXT_fragment_shader_interlock (GPUInfo says 8% on Windows, 4% on Linux, and 0% on Android): Adds explicit functions for locking and unlocking an implicit mutex. There’s one mutex per pixel/level/sample in the framebuffer. Given that none of the other APIs expose explicit locking & unlocking, and that the other APIs’ designs are easier to get right than this kind of explicit API, I’d recommend against adding this design to WebGPU. (A shader-side sketch follows after this list.)

GL_EXT_shader_framebuffer_fetch: Lets you read from the framebuffer, but this is a GL extension, not a Vulkan extension.

VK_EXT_blend_operation_advanced: Doesn’t allow true programmable blending, but does allow some pre-canned blend equations. Also, the presence of this extension doesn’t mean that the blend equations actually work for overlapping geometry; the extension exposes an extra bit which indicates whether the blend operations are threadsafe with respect to overlapping fragments (the query for that bit is sketched below).
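For illustration, here is roughly what the shader side of VK_EXT_fragment_shader_interlock looks like, written in GLSL (the image binding and blend math are invented for the example):

#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Interlock granularity is picked with a layout qualifier.
layout(pixel_interlock_ordered) in;

layout(binding = 0, rgba8) coherent uniform image2D dstImage;

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    vec4 src = vec4(1.0, 0.0, 0.0, 0.5); // stand-in for this fragment's color

    beginInvocationInterlockARB();   // enter the per-pixel critical section
    vec4 dst = imageLoad(dstImage, coord);
    imageStore(dstImage, coord, src * src.a + dst * (1.0 - src.a));
    endInvocationInterlockARB();     // leave the critical section
}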
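The “extra bit” for VK_EXT_blend_operation_advanced is the advancedBlendCoherentOperations feature; a sketch of the query in C:

#include <stdbool.h>
#include <vulkan/vulkan.h>

// Returns whether advanced blend ops are ordered for overlapping fragments.
bool hasCoherentAdvancedBlend(VkPhysicalDevice physicalDevice)
{
    VkPhysicalDeviceBlendOperationAdvancedFeaturesEXT advancedBlend = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_BLEND_OPERATION_ADVANCED_FEATURES_EXT,
    };
    VkPhysicalDeviceFeatures2 features2 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2,
        .pNext = &advancedBlend,
    };
    vkGetPhysicalDeviceFeatures2(physicalDevice, &features2);
    return advancedBlend.advancedBlendCoherentOperations == VK_TRUE;
}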

OpenGL (just for fun)

ARB_shader_image_load_store includes a memoryBarrier() GLSL function which can be used to order reads and writes to resources. INTEL_fragment_shader_ordering includes a modal API where you can toggle between “all reads/writes are unordered” and “all reads/writes are ordered” by calling beginFragmentShaderOrderingINTEL() at the boundary.
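A sketch of the INTEL modal style in GLSL (image binding and blend math invented; note there is no matching “end” call):

#version 440
#extension GL_INTEL_fragment_shader_ordering : require

layout(binding = 0, rgba8) coherent uniform image2D dstImage;

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    vec4 src = vec4(0.0, 1.0, 0.0, 0.5);

    // Everything before this call runs unordered; everything after it is
    // ordered with respect to overlapping fragments.
    beginFragmentShaderOrderingINTEL();

    vec4 dst = imageLoad(dstImage, coord);
    imageStore(dstImage, coord, src * src.a + dst * (1.0 - src.a));
}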

litherum avatar Sep 20 '19 06:09 litherum

FYI, DX12 support for ROV can be found in an NVidia presentation, on slide 48: dx12-features

kvark avatar Sep 20 '19 15:09 kvark

See also: https://github.com/gpuweb/gpuweb/issues/439

litherum avatar Sep 20 '19 18:09 litherum

VK_EXT_fragment_shader_interlock (GPUInfo says 8% on Windows, 4% on Linux, and 0% on Android): Adds explicit functions for locking and unlocking an implicit mutex. There’s one mutex per pixel/level/sample in the framebuffer. Given that none of the other APIs expose explicit locking & unlocking, and that the other APIs’ designs are easier to get right than this kind of explicit API, I’d recommend against adding this design to WebGPU.

It's actually the most powerful model of all APIs, and requires literally next to zero support on the non-shader side of things.

Also, support is consistently rising; it’s now at 9% on Windows and 6% on Linux. It’s basically drivers catching up: hardware support should be around 40% on Windows and Linux.

Also, this exists as very old OpenGL extensions:

  • GL_INTEL_fragment_shader_ordering 19.4%
  • GL_NV_fragment_shader_interlock 19.4%
  • GL_ARB_fragment_shader_interlock 22% (all the Nvidias, plus the few Intels whose drivers they bothered to update)

GL_EXT_shader_framebuffer_fetch: Lets you read from the framebuffer, but this is a GL extension, not a Vulkan extension.

AFAIK a poor precursor to GL_INTEL_fragment_shader_ordering, and it’s also only supported on Intel.

ARB_shader_image_load_store includes a memoryBarrier() GLSL function which can be used to order reads and writes to resources.

Nope, it just makes sure that reads and writes of a single invocation don’t get reordered by the compiler (or by out-of-order execution, when GPUs get deep enough shader pipelining). It does absolutely nothing for ordering between threads (except that a different thread will not see writes issued after the barrier land before the ones issued before it). It’s pretty much just like std::atomic_thread_fence.

INTEL_fragment_shader_ordering includes a modal API where you can toggle between “all reads/writes are unordered”

That switch is also present in https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_fragment_shader_interlock.txt, which is the basis for VK_EXT_fragment_shader_interlock.

In Shader Model 5.1, there’s another object which is a drop-in replacement for RWTextures: RasterizerOrderedTextures. These have the guarantee that all operations on this resource, between any two fragment shader invocations which target the same framebuffer location (and level and sample), will be strictly ordered. Moreover, the ordering is guaranteed to match API submission order. This means that, if you bind the destination texture as a UAV, rather than binding it as a framebuffer, you can do programmable blending on that resource.

That seems to me incredibly backward and sub-optimal: you lose color-channel compression and most probably fast on-chip memory. Plus, the shader has absolutely no guarantees, or ways to check, that each fragment location will write to the same buffer location, so all tilers get confused.

Similarly, achieving effects like the “vibrancy” effect used all over macOS and iOS would require custom blending. Vibrancy uses a custom formula to make sure the foreground is always visible and readable on top of any possible background. Here’s an example:

Even if the fragment shader suddenly allowed access to the tiler’s full tile cache and provided mutexes across entire workgroups, there’s no way it would help you achieve that effect.

It’s a convolution kernel; you need all neighbouring pixels to be ready, layer by layer.

Discussed in the October 28th 2019 call https://docs.google.com/document/d/1vjEeT_CO2zlHZ2K5SiNMdROVDk6ag8skSgN-ZEO4evg/edit

litherum avatar Oct 28 '19 19:10 litherum

Looking at the Vulkan specification it seems that it might be technically possible to implement something like raster order views on a framebuffer attachment using subpass self-dependency from VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT to VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT after every draw and having the barrier be VK_DEPENDENCY_BY_REGION_BIT.
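For concreteness, the self-dependency being described would look roughly like this (my untested sketch):

// Self-dependency on subpass 0: per-region, fragment shader reads wait on
// earlier color attachment writes at the same framebuffer location.
const VkSubpassDependency selfDependency = {
    .srcSubpass      = 0,
    .dstSubpass      = 0,  // same subpass, i.e. a self-dependency
    .srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    .srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT,
};

// A matching vkCmdPipelineBarrier with the same stages, accesses, and
// VK_DEPENDENCY_BY_REGION_BIT would then be recorded after every draw.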

That said, this is extremely theoretical and isn’t covered by tests in the Vulkan CTS, the only self-dependency test being one that writes to an indirect buffer and uses it in the same pass. I wouldn’t rely on it working. Also, producing a barrier after every draw is likely to lead to MUCH worse performance on a lot of hardware.

We should rely on Vulkan extensions if we are to implement some form of ROV.

Kangz avatar Oct 29 '19 16:10 Kangz

@Kangz that would give you synchronization between draw calls, but still no guarantee about any order within multiple fragment shader invocations within the same draw.

kvark avatar Oct 29 '19 18:10 kvark

Looking at the Vulkan specification it seems that it might be technically possible to implement something like raster order views on a framebuffer attachment using subpass self-dependency from VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT to VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT after every draw and having the barrier be VK_DEPENDENCY_BY_REGION_BIT.

@Kangz you need to throw in EXT_fragment_shader_interlock, plus put the subpass input VkImageSubresource into the correct layout for both read and write, and then that soup will get you what you want (a ROV).

@kvark true thanks for pointing this out.

@devshgraphicsprogramming I was trying to see if it was possible to implement an ROV in unextended Vulkan but turns out it's not possible.

Kangz avatar Oct 30 '19 09:10 Kangz

This post might serve as an unpopular opinion, but after seeing the direction an IHV is taking with respect to future hardware designs, and the options raised by @litherum in a call referenced in this thread, I believe it is in the interest of many IHVs to choose option 1, which is to shelve the concept of exposing this feature.

Programmable blending via either framebuffer fetch or interlocks is no longer a good idea, from both a hardware design and a performance perspective. I don’t believe that shader blending is an optimal path from a future hardware design standpoint, since it greatly limits parallel shader execution on the GPU by adding sync points inside the shader programs. Even tilers will eventually struggle to cope with the additional synchronization overhead of these mechanisms as hardware design inevitably becomes more parallel, with over a thousand threads in flight.

Interlocks have already proven to be performance disasters on current discrete GPUs, including for advanced use cases like OIT, where per-pixel linked lists are faster on those vendors. So the only practical use case left for programmable blending is OIT on mobile GPUs, but vendors there are already starting to add or expand fixed-function blending, since it is both more power-efficient/performant and avoids the synchronization overhead involved in shader blending.

Doing OIT with programmable blending might very well suit mobile GPUs better right now, but it’s highly questionable whether that path will remain scalable in the foreseeable future, with more parallel hardware designs and higher overdraw factors. There are many other potentially viable methods out there to tackle OIT and transparency, such as Weighted-Blended/Phenomenological OIT from Morgan McGuire, SLAB/Hashed Alpha from Chris Wyman, Moment-Based OIT from Christoph Peters, or even the original implementation of adaptive transparency, which initially used variable memory and was popularized by Marco Salvi at Intel.

Is it truly necessary to expose a highly contentious feature for effectively just a single use case like OIT, and for a specific set of hardware? Especially since, over time, the idea is on the losing side of the future direction of hardware design ...

Degerz avatar Jan 19 '20 11:01 Degerz

The argument that this is already being used and shipped in 4 different games/engines (Lumberyard, Just Cause 3, Grid 2, and The Forge) is a much stronger argument than "I don't believe that shader blending is an optimal path from a future hardware design standpoint." As for the performance concerns, if performance is good enough for these games/engines today, and they are achieving valuable effects for their customers, the feature seems worthy of pursuit.

If we had an official statement from (at least) one of the major hardware vendors about the future direction of their products, that would definitely be helpful when determining whether or not to pursue this feature. (I'm not counting the previous link because I don't understand what the text snippets mean without additional context.)

litherum avatar Jan 19 '20 23:01 litherum

Engine/framework integration alone, as in the case of Lumberyard or The Forge, isn’t going to serve as a good argument for standardizing a specific feature. Unreal Engine 4 uses tons of NVAPI driver extensions which are specific to Nvidia hardware, so my example here might be somewhat different from yours, but my point still stands: it takes more than available functionality to make a feature truly vendor-neutral, in the sense that it won’t trigger the slow paths on other hardware vendors.

With Grid 2 you might have a better argument, but its usage is locked behind Intel driver extensions. As for Just Cause 3, I don’t know if Avalanche actually released these graphical features in the main game, or if they did some experimentation in a private branch that was never released. Coincidentally, both of their findings were sponsored by Intel as well ...

As far as official statements are concerned, here's what Nvidia has to say:

  • Don’t use Raster Order View (ROV) techniques pervasively
  • Guaranteeing order doesn’t come for free
  • Always compare with alternative approaches like advanced blending ops and atomics

As for AMD, when I tried asking them to expose interlocks in their Vulkan drivers, they refused the request. We found a potential explanation for that decision in an Intel sample: their transparent pass was comparatively 20x slower when going from alpha blending to 2-node adaptive transparency.

It is implied from the above statements that interlocks are a slow path on both current discrete GPU vendors. I am not telling you that programmable blending could potentially become the slow path in the future; I am telling you that it is already the slow path on some vendors.

My previous link only furthers my assertion that programmable blending isn’t a fast path anymore, even on tilers. Doing the blending inside the shader isn’t a good idea from a current hardware perspective on discrete GPUs, and potentially from a future hardware design perspective on mobile GPUs as well. Interlocks have proven not to be a scalable or sustainable solution for discrete GPUs, since there’s too much synchronization overhead involved in the ordering, so it’s highly doubtful that framebuffer fetch will remain a good idea on mobile GPUs in the future, for similar reasons. With compute growing at an uncontrollable rate, even tilers will meaningfully feel the impact of additional synchronization overhead ...

Degerz avatar Jan 20 '20 08:01 Degerz

I don't think the fact that using synchronization is slower than not using synchronization is sufficient evidence to not pursue the feature. It's a useful and implementable feature. I've demonstrated multiple use cases in this thread where the effect is worth the loss in performance.

litherum avatar Jan 20 '20 14:01 litherum

Everything has a spec & implementation cost. Just because some games once used geometry shaders, does not mean that we should invest in geometry shaders today. Native APIs are littered with unobvious performance hazards. This might make sense as a vendor extension to WebGPU, but I don't think it makes sense for core. At the very least, it wouldn't be implementable on AMD.

For another data point, Programmable Blending was at one point on the docket for D3D12 features (look back in some of the old D3D12 reveal streams), but it was removed since it wasn't sanely implementable on most discrete HW.

magcius avatar Jan 20 '20 17:01 magcius

@litherum I would argue otherwise, since there is currently both a more performant and a higher-quality method to do OIT on discrete GPUs than using interlocks. As @magcius mentioned, true programmable blending via render-target reads was at one point considered, but strong opposition from AMD and Nvidia prevented Microsoft and Intel from standardizing the feature. Even further in the past, AMD was also exploring (slide 39) the idea of programmable blending, but nothing came of it, since it fundamentally goes against their hardware designs; Nvidia likely came to the same conclusion as well ...

I've demonstrated multiple use cases in this thread where the effect is worth the loss in performance.

Nearly all of the examples you’ve demonstrated have to do with OIT, so I’m not sure I’d consider covering a single topic to be “multiple use cases”.

The synchronization overhead of interlocks is already proving to be unmanageable on current discrete GPUs, and it’ll be interesting to see if Intel’s perspective changes once they dive deeper into discrete GPUs, because even they must realize how perilous programmable blending is to the scalability of larger GPUs. Now, my argument may not yet be all that convincing for mobile GPUs, which are usually tilers, but it’s still highly uncertain whether programmable blending will remain a good idea as mobile GPUs too become more parallel ...

Degerz avatar Jan 20 '20 22:01 Degerz

If you attach both your G-buffers and your destination output buffers to a single render pass, you can implement a simple deferred renderer using raster order groups, all within that one pass.
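A sketch of that idea in Metal Shading Language (the attachment layout, struct names, and toy lighting math are all invented for the example):

#include <metal_stdlib>
using namespace metal;

// All attachments of the single render pass: the G-buffer lives in raster
// order group 0, the lighting accumulator in group 1.
struct FramebufferData {
    half4 lighting [[color(0), raster_order_group(1)]];
    half4 albedo   [[color(1), raster_order_group(0)]];
    half4 normal   [[color(2), raster_order_group(0)]];
};

// Output of the geometry pass: only the G-buffer attachments.
struct GBufferOut {
    half4 albedo [[color(1), raster_order_group(0)]];
    half4 normal [[color(2), raster_order_group(0)]];
};

// Geometry pass: fills the G-buffer.
fragment GBufferOut geometryPass(/* vertex varyings elided */) {
    GBufferOut out;
    out.albedo = half4(1.0h, 1.0h, 1.0h, 1.0h);   // placeholder material
    out.normal = half4(0.0h, 0.0h, 1.0h, 0.0h);
    return out;
}

// Output of the lighting pass: only the accumulation attachment.
struct LightingOut {
    half4 lighting [[color(0), raster_order_group(1)]];
};

// Lighting pass: reads the G-buffer written by earlier fragments at this
// pixel and accumulates lighting, all within the same render pass.
fragment LightingOut lightingPass(FramebufferData fb) {
    LightingOut out;
    half ndotl = max(fb.normal.z, 0.0h);          // toy directional light
    out.lighting = fb.lighting + fb.albedo * ndotl;
    return out;
}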

litherum avatar Feb 28 '20 08:02 litherum

Related: https://github.com/gpuweb/gpuweb/issues/435

litherum avatar Apr 30 '20 17:04 litherum

Linking to another investigation I made specifically for use internally in Chromium but with an eye towards maybe having something in WebGPU one day.

Kangz avatar Sep 26 '23 15:09 Kangz

Btw, we’re submitting a talk and a chapter to GPU Zen 3 about Constructive Solid Geometry using FS-IL (fragment shader interlock); unlike MLAB4, which kinda works with a homebrew spinlock and out-of-order blends, this won’t work without Rasterizer Ordered Views.