Colorspace conversions for GPUExternalTexture are arbitrarily complex
[Migrated from #3288]
Videos can embed arbitrary ICC profile information within themselves to describe the colorspace their texels are in. ICC profiles are not Turing-complete, but are arbitrarily complex - they're described as a vector of transformations of various forms. Each frame of a video can be in a different colorspace (though this is very rare).
When compiling the shader, we don't know which colorspace conversion we'll need to implement in it.
I'm also coming from the perspective that, for the 3D video use case, if a full copy at full resolution is performed, that will be catastrophic enough to break the whole illusion of the video playback (particularly on mobile devices).
Here are some ideas:
1. Have the browser internally support a fixed set of classes-of-colorspaces in the code it emits, and if someone imports a `GPUExternalTexture` using an incompatible colorspace, then that import operation silently involves a copy (at full res, no less). Authors would have no way of knowing which browsers support which colorspaces.
    i. Have the WebGPU spec list which colorspaces must be supported.
2. Encode ICC profile information as variable-length data, and include that with all the other `GPUExternalTexture` data sent to the shader (crop rect, rotation, etc.). Consume the whole ICC profile in the shader.
    i. Do this, but also special-case some of the more common colorspaces (to avoid the variable-length consumption loop in the common case).
3. Recompile the shader at draw-call time.
    i. Use something like `MTLDynamicLibrary` to compile just the sample operation at draw-call time. This only works on some shaders on some devices.
4. Give authors the ability to interact with the zero-copy / one-copy distinction. Add an intentionally one-copy `GPUQueue.writeTexture()` overload that accepts an `HTMLVideoElement` (and possibly a crop rect?). This option comes in a few different flavors:
    i. Make `importExternalTexture()` fail when importing a video frame using an unsupported colorspace. The page can react by using the explicit one-copy `writeTexture()`.
    ii. Add a `failIfMajorPerformanceCaveat` bool switch into `GPUExternalTextureDescriptor`. This is a way for an author to say "give me the fast mode or give me nothing at all".
    iii. Make `failIfMajorPerformanceCaveat` be the default, and add a `dontFailIfMajorPerformanceCaveat` bool switch instead. We'd probably have to add a region-of-interest rect into the `GPUExternalTextureDescriptor` if we did this, so we didn't end up silently copying the whole image.
5. Add the `GPUExternalTextureDescriptor` itself to `createShaderModule()`. This isn't perfect because 1) the video may not be loaded yet, and 2) each frame of the video can technically be in a different colorspace. We may be able to solve (1) by specifying the state the video needs to be in, and (2) may be rare enough that it may not be a problem.
6. Do it like `GPUTextureDescriptor.viewFormats`: give authors a way to describe a colorspace to us, and have them do that at `createShaderModule()` time.
Maybe there are more possibilities!
Right off the bat, I think I favor some sort of hybrid between 2.i and 3.i, but I'd need to implement them to see how fast they are in reality. A variable-length consumption loop while we compile just the sample operation in the background (~once per video), with the most common classes-of-colorspaces pre-compiled, sounds plausibly workable.
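To make that concrete, here is a minimal WGSL sketch of what the variable-length consumption loop could look like, assuming the browser flattens the ICC profile's vector of transformations into a bounded op list at import time. The op encoding, struct names, binding numbers, and the cap of 8 ops are all invented for illustration; real ICC transform elements come in many more forms:

```wgsl
// Hypothetical flattened encoding of an ICC profile's transformation vector.
struct TransformOp {
  kind : u32,                  // 0 = 3x3 matrix, 1 = power curve, ... (illustrative)
  matrix : mat3x3<f32>,
  gamma : f32,
}

struct IccTransforms {
  count : u32,                 // how many ops this frame's profile needs
  ops : array<TransformOp, 8>, // fixed upper bound, for the sketch
}

@group(0) @binding(4) var<uniform> icc : IccTransforms;

// Interpret the op list; this is the variable-length loop in question.
fn applyIccTransforms(c : vec3<f32>) -> vec3<f32> {
  var color = c;
  for (var i = 0u; i < icc.count; i++) {
    switch (icc.ops[i].kind) {
      case 0u: { color = icc.ops[i].matrix * color; }
      case 1u: { color = pow(color, vec3<f32>(icc.ops[i].gamma)); }
      default: { }
    }
  }
  return color;
}
```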
The downside to option 1 is that it's nonportable - content that works great in one browser will fail in another.
The downside to option 1.i is that people who have video files don't often have the ability (or desire) to re-encode them to fit with our spec (the people who work on video production are often not the same people who work on web delivery).
The downside to option 4 is it requires the page to surround each call to importExternalTexture() with an error scope, and it's likely (possibly certain) that the promise will resolve too late to use the external texture in the current frame being drawn. So the zero-copy/one-copy decision will need to be done in the GPU Process, not by web content.
The downside to option 6 is it's kind of a nightmare from a specification point of view. Are WebGPU authors really going to be describing the details of the various colorspaces to us, just so they can render their video?
It's unfortunate how ICC profiles can be arbitrarily complex. Do we know what the complexity is on average in the wild? Looking at WebCodecs, it seems that it could be reasonable in most cases? I believe #3288 argues that the content of the video shouldn't need to be decoded; however, it's probably OK to require that the content timeline knows about the metadata of a frame. This means that for option 4 we wouldn't need push/popErrorScope and could instead throw an exception.
If we believe that ICC profiles are most often simple ones, then option 4 is pretty attractive: it is still simple for developers, has a reasonable fallback, and allows controlling performance and going through copyExternalImageToTexture if need be. Variable-length consumption in shaders is a bit scary, because the complexity of the code could decrease shader occupancy.
Right, the common ICC profiles almost always just have a few transformations in them, and each transformation is fairly simple, so they are relatively fast to compute.
As for the expressivity, though, there are like over a dozen forms each transformation can have, and different transformations have different parameters they need (e.g. sRGB sets the gamma to 2.2, but other colorspaces don’t have a gamma term).
So the concern here isn’t about the performance of running the conversion; it’s about “how do I know which code to emit in the shader at the site of a sample(texture_external), and how do I know how much data is necessary to give the conversion code access to?”
It's also probably worth pointing out that most videos only use one of a few colorspaces. So any kind of "precompute for the situations we know will be common" will get us pretty far.
Although ICC profiles are arbitrarily complex, do we have a chance to list all of the steps an A-to-B colorspace conversion requires? Per my understanding, colorspace conversion always uses a reference colorspace (usually CIE XYZ, though I'm not sure whether it can be configured): transform colorspace A to the reference colorspace first, then transform from the reference colorspace to colorspace B. The conversion might also require gamma decoding, alpha unpremultiplication, HDR-specific work, etc., but I think the set of steps is limited.
So if it is possible to list all of the steps an A-to-B colorspace conversion requires (I think we can, but please correct me if I'm wrong), we could put a step_mask in a uniform to guide how to do the A-to-B conversion with a library-like implementation of all of these steps.
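A minimal WGSL sketch of that idea, assuming the browser packs the step mask (and the constants each step needs) into a uniform at import time; the bit assignments, names, and the simple power-curve stand-in for a real transfer function are all illustrative:

```wgsl
// Hypothetical step flags; the browser would set them per frame at import time.
const STEP_GAMMA_DECODE : u32 = 1u;
const STEP_GAMUT_MATRIX : u32 = 2u;
const STEP_GAMMA_ENCODE : u32 = 4u;

struct ConversionParams {
  step_mask : u32,            // which steps of the pipeline to run
  decode_gamma : f32,         // power-curve stand-in for a real transfer function
  encode_gamma : f32,
  gamut_matrix : mat3x3<f32>, // A -> reference -> B, pre-multiplied into one matrix
}

@group(0) @binding(4) var<uniform> conv : ConversionParams;

fn convertColor(c : vec3<f32>) -> vec3<f32> {
  var color = c;
  // Each bit in step_mask enables one step of the library-like pipeline.
  if ((conv.step_mask & STEP_GAMMA_DECODE) != 0u) {
    color = pow(color, vec3<f32>(conv.decode_gamma));
  }
  if ((conv.step_mask & STEP_GAMUT_MATRIX) != 0u) {
    color = conv.gamut_matrix * color;
  }
  if ((conv.step_mask & STEP_GAMMA_ENCODE) != 0u) {
    color = pow(color, vec3<f32>(1.0 / conv.encode_gamma));
  }
  return color;
}
```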
> Add an intentionally one-copy `GPUQueue.writeTexture()` overload that accepts an `HTMLVideoElement`
@kainino0x just reminded me, we already have GPUQueue.copyExternalImageToTexture().
The problem with GPUQueue.copyExternalImageToTexture() is that we might lose the one-copy chance due to the implementation (assuming it could accept an HTMLVideoElement directly, which it doesn't today). If the VideoFrame lives on a different GPU device from the user-created GPUTexture, we need an intermediate resource:
VideoFrame ---> intermediate resource ---> GPUTexture
The benefit of importExternalTexture() is that it returns an object, so in this situation it can return the intermediate resource directly to achieve the one-copy path.
I generally expect to just emit an uber-function for sampling, that will be able to handle anything without recompilation. The ultimate slowpath of "decode and convert on CPU, upload as pre-converted rgba" will likely still exist to paper over cases where the one-size-fits-most approach breaks down.
It would be useful for authors to tell us what degree of accuracy they expect, though. We're expecting to use a ~30x30x30 LUT for general colorspace conversions for display/screen output in many cases, and we'd normally reuse those facilities for WebGPU. The accuracy is quite good, but if the author wants as-precise-as-possible, we can do the harder thing of supporting the matrix transforms and precise LUTs (e.g. for gamma) from ICC profiles. (But I don't want to do the precise-support thing if authors don't need it, and most don't.)
I suppose this is option 1? I don't expect users to run into many slowpaths in practice, though. I don't think this is something we need to spec.
Oh yes, I forgot we planned out the possibility of using a LUT. Since a LUT is precomputed, that may be enough for arbitrary ICC profiles without any fallbacks.
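For concreteness, a minimal WGSL sketch of the LUT path, assuming a precomputed 30x30x30 texture_3d whose texels hold the already-converted color; the binding numbers and names are made up:

```wgsl
@group(0) @binding(4) var lut : texture_3d<f32>;
@group(0) @binding(5) var lut_smp : sampler; // filterable, clamp-to-edge

fn convertViaLut(color : vec3<f32>) -> vec3<f32> {
  // Remap [0,1] so samples land on texel centers of the 30^3 grid, then let
  // the hardware's trilinear filter interpolate between LUT entries.
  let n = 30.0;
  let coord = color * ((n - 1.0) / n) + vec3<f32>(0.5 / n);
  return textureSampleLevel(lut, lut_smp, coord, 0.0).rgb;
}
```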
I implemented colorspace conversion for GPUExternalTextures in Chromium a few months back according to the process in this document.
My solution was to place all the conversion constants (specific to that individual conversion) into a buffer (created upon the importExternalTexture() call) and bind the buffer to a transformed version of the user's shader, which does the full conversion process (gamma decode, gamut conversion, gamma encode).
When I reached out to the Chrome media team for advice on which colorspaces were important to support, I was told BT.601 (full, limited), BT.709 (full, limited), and BT.2020 (limited, PQ, HLG), which narrowed the scope enough that I could use a genericized transfer function that works for all of these.
I had considered maybe hardcoding the relevant constants - but I found the constants already existed within Chromium for other components that do color space conversion, so I was able to just pull and adapt those to be used in the transformed shader.
In Chromium at shader compile time, this:
```wgsl
@group(0) @binding(0) var s : sampler;
@group(0) @binding(1) var ext_tex : texture_external;

@fragment
fn main(@builtin(position) coord : vec4<f32>) -> @location(0) vec4<f32> {
  return textureSampleLevel(ext_tex, s, coord.xy);
}
```
is transformed to this:
```wgsl
// Parameters of the piecewise transfer function used for gamma
// decode/encode (genericized to cover BT.601/BT.709/BT.2020 etc.).
struct GammaTransferParams {
  G : f32,
  A : f32,
  B : f32,
  C : f32,
  D : f32,
  E : f32,
  F : f32,
  padding : u32,
}

struct ExternalTextureParams {
  numPlanes : u32,
  doYuvToRgbConversionOnly : u32,
  yuvToRgbConversionMatrix : mat3x4<f32>,
  gammaDecodeParams : GammaTransferParams,
  gammaEncodeParams : GammaTransferParams,
  gamutConversionMatrix : mat3x3<f32>,
}

@group(0) @binding(0) var s : sampler;
@group(0) @binding(1) var ext_tex : texture_2d<f32>;
@group(0) @binding(2) var ext_tex_plane_1 : texture_2d<f32>;
@group(0) @binding(3) var<uniform> ext_tex_params : ExternalTextureParams;

// Piecewise transfer function: linear segment below the threshold D,
// power-curve segment above it.
fn gammaCorrection(v : vec3<f32>, params : GammaTransferParams) -> vec3<f32> {
  let cond = (abs(v) < vec3<f32>(params.D));
  let t = (sign(v) * ((params.C * abs(v)) + params.F));
  let f = (sign(v) * (pow(((params.A * abs(v)) + params.B), vec3<f32>(params.G)) + params.E));
  return select(f, t, cond);
}

fn textureSampleExternal(plane0 : texture_2d<f32>, plane1 : texture_2d<f32>, smp : sampler, coord : vec2<f32>, params : ExternalTextureParams) -> vec4<f32> {
  var color : vec3<f32>;
  if ((params.numPlanes == 1u)) {
    // Single-plane (RGBA) source: sample directly.
    color = textureSampleLevel(plane0, smp, coord, 0.0f).rgb;
  } else {
    // Biplanar (e.g. NV12) source: reassemble YUV from the two planes,
    // then convert to RGB.
    color = (vec4<f32>(textureSampleLevel(plane0, smp, coord, 0.0f).r, textureSampleLevel(plane1, smp, coord, 0.0f).rg, 1.0f) * params.yuvToRgbConversionMatrix);
  }
  if ((params.doYuvToRgbConversionOnly == 0u)) {
    // Full conversion: gamma decode, gamut conversion, gamma encode.
    color = gammaCorrection(color, params.gammaDecodeParams);
    color = (params.gamutConversionMatrix * color);
    color = gammaCorrection(color, params.gammaEncodeParams);
  }
  return vec4<f32>(color, 1.0f);
}

@fragment
fn main(@builtin(position) coord : vec4<f32>) -> @location(0) vec4<f32> {
  return textureSampleExternal(ext_tex, ext_tex_plane_1, s, coord.xy, ext_tex_params);
}
```
GPU Web 2022-09-07/08 APAC-timed
- MM: we're actively discussing how we think this should work. I volunteered to put it on the agenda for this week, but our internal discussions haven't concluded yet.
- MM: we are making good progress though.
- KG: you've had good feedback on long running topics like this in the past - appreciate your in-depth thought.
Meeting: Resolved on no change (see minutes, to be posted here later)