KTX-Specification Clarify handling of multi-plane formats

Open lexaknyazev opened this issue 5 years ago • 14 comments

Vulkan 1.1 has introduced multi-planar formats that need special layout. Namely, they consist of 1-3 planes that don't have to have the same dimensions across components. For example:

VK_FORMAT_G10X6_B10X6_R10X6_3PLANE_420_UNORM_3PACK16: each plane is a one-component image with pixel data stored in the top 10 bits of each 16-bit word; the bottom 6 bits are set to 0.

Plane 0: G component, full resolution
Plane 1: B component, half horizontal and half vertical resolution
Plane 2: R component, half horizontal and half vertical resolution
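
To make the example concrete, here is a minimal Python sketch of the plane extents and the 10-in-16-bit packing described above. The function names are hypothetical, and the sketch assumes even image dimensions so that the half-resolution planes divide exactly.

```python
def plane_extents_420_3plane(width, height):
    """Return [(w, h)] for the G, B and R planes of a 3-plane 4:2:0 image."""
    # Assumption: even dimensions, so the chroma planes are exactly half size.
    if width % 2 or height % 2:
        raise ValueError("4:2:0 sketch assumes even width and height")
    half = (width // 2, height // 2)
    return [(width, height), half, half]


def pack_10x6(value):
    """Place a 10-bit value in the top 10 bits of a 16-bit word (low 6 bits zero)."""
    if not 0 <= value < 1024:
        raise ValueError("value must fit in 10 bits")
    return value << 6


def unpack_10x6(word):
    """Recover the 10-bit value from a 16-bit word, ignoring the low 6 bits."""
    return (word & 0xFFFF) >> 6
```

For instance, an 8×4 image would have an 8×4 G plane and two 4×2 planes for B and R.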

KTX2 must do one of:

disallow such formats;
require a specific layout for storing multi-plane images and document it;
explicitly delegate specification of multi-plane layout to DFD.

Oct 27 '18 16:10 lexaknyazev

I favor the 3rd option. Please submit a PR.

Oct 31 '18 15:10 MarkCallow

The DFD describes the interpretation of memory within "a texel" - where a texel is made up of a sequence of consecutive bytes from some number of planes. (In the case of a 4:2:0 format, this is achieved by pointing two "planes" at the Y plane, with an offset and a two-line stride.) It doesn't cover padding or order of storage of planes (or, indeed, tile swizzling to map coordinates to the bytes contributing to texels).

So the layout can be delegated to the DFD (which can also describe the contents of a VkSamplerYcbcrConversionCreateInfo), but KTX2 will still need to describe the stride and location in the file of each plane - where for the sake of the DFD, "plane" is a bit odd. Assuming it can do that, all is good, although the mapping from the Vulkan types isn't all that hard.

Oct 31 '18 18:10 fluppeteer

@fluppeteer please describe the 4:2:0 case a bit more. How are the bits arranged in memory, and how do the offset and two-line stride relate to that? How can the DFD not cover order of storage of the planes? Isn't the offset related to the distance in memory from one plane to the next?

Oct 31 '18 21:10 MarkCallow

@MarkCallow there's an example 4:2:0 data format descriptor at the end of the spec - that particular example assumes the U/Cb and V/Cr planes are stored independently rather than interleaved (in FourCC terms, "I420" rather than "NV12"). The expectation is that a typical implementation will store the Y plane as 8-bit values in some location. For the purposes of discussion, let us assume (although this is not necessary) that the Y values are addressed as "Y_base + x + y×Y_stride" (i.e. linear). Similarly, there is, somewhere, a U plane addressed as "U_base + floor(x/2) + floor(y/2)×U_stride" and a V plane addressed as "V_base + floor(x/2) + floor(y/2)×V_stride".
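
As a rough illustration of the linear addressing assumed above, the three formulas can be written out as follows. This is only a sketch: the function name is hypothetical, and the base and stride parameters are byte offsets chosen for the example, not anything defined by the spec.

```python
def i420_sample_addresses(x, y, y_base, y_stride, u_base, u_stride, v_base, v_stride):
    """Byte addresses of the Y, U and V samples that cover pixel (x, y).

    Assumes 8-bit samples and linear (non-tiled) storage of each plane.
    """
    y_addr = y_base + x + y * y_stride
    # Chroma planes are half resolution in both axes, hence the floor divisions.
    u_addr = u_base + x // 2 + (y // 2) * u_stride
    v_addr = v_base + x // 2 + (y // 2) * v_stride
    return y_addr, u_addr, v_addr
```

Note that the four pixels of any 2×2 block share the same U and V addresses, which is the property the later discussion relies on.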

The data format descriptor does not treat this as having "downsampled planes", because this concept does not extend well to Bayer formats (especially X-Trans); instead it considers a texel block as being a repeating pattern that encompasses some number of coordinates in each axis (currently up to 128; this may be reduced to 16 in a future revision, if there's no counter-example which this would break, so as to allow more precise sub-pixel sample positioning). Typical compressed texel blocks are stored as a consecutive sequence of bytes, covering some area (e.g. 4×4 for the ETC formats). To extend this concept to multi-planar formats, the data format descriptor treats the bytes of each plane addressed at the texel coordinates as though they were concatenated, and then the existing mechanism which applies to RGB formats is used to pluck bits out of the planes as needed. This mechanism allows true bit-planar representations (as supported, for example, by the Amiga).

For some proprietary ways to store YUV (such as 4 bytes of 2×2 Y data plus U and V, which is a not-uncommon way to store all the necessary data with good spatial locality) this encoding "just works" in a single plane. For a true planar format, we could consider 4:2:2 as encoding a 2×1 texel block, with three planes: 2 bytes in the Y plane, 1 byte of U plane, and 1 byte of V plane.

That is, the bytes for the Y plane of 4:2:2 start at:

plane 1 = Y_base + floor(x/2)×2 + y×Y_stride

4:2:0 poses a question: not all the bytes in a "plane" that contribute to a texel block are consecutive in memory. Rather than providing a special case for this format, the solution is to describe the Y plane as two "planes" from the data format descriptor's perspective, each of which contains only consecutive bytes.

That is, rather than:

plane 1 = Y_base + x + y×Y_stride

...we say:

plane 1 = Y_base + floor(x/2)×2 + floor(y/2)×2×Y_stride
plane 2 = Y_base + floor(x/2)×2 + (floor(y/2)×2 + 1)×Y_stride
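
A small sketch may help show that these two "planes" cover exactly the four luma bytes of a 2×2 block. The function names are hypothetical, and 8-bit linear luma storage is assumed as in the earlier discussion.

```python
def y_plane_starts_420(x, y, y_base, y_stride):
    """Start addresses of the two 2-byte runs ("planes") of Y for the block holding (x, y)."""
    bx = (x // 2) * 2          # left edge of the 2x2 texel block
    by = (y // 2) * 2          # top row of the 2x2 texel block
    plane1 = y_base + bx + by * y_stride
    plane2 = y_base + bx + (by + 1) * y_stride
    return plane1, plane2


def block_luma_addresses(x, y, y_base, y_stride):
    """All four Y byte addresses of the block, taken as two consecutive 2-byte runs."""
    plane1, plane2 = y_plane_starts_420(x, y, y_base, y_stride)
    return [plane1, plane1 + 1, plane2, plane2 + 1]
```

The result matches what the conventional "Y_base + x + y×Y_stride" addressing gives for the same four pixels; only the grouping into consecutive runs differs.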

This is not the conventional view of a "plane" (I freely admit that it's "weird"), but it allows the existing mechanism to be extended to arbitrary YUV alignments - so I don't think weirdness is a reason not to do it. Similarly, YUV 4:1:1 (YYYYUV) is a single plane of Y, but the transposed representation takes four Y planes. Depending on how a 6×6 X-Trans output is stored, this may require six planes, addressed by (floor(y/6)×6 + [0..5])×stride - but there is already a (floor(x/6)×6) term in there, so I don't consider this to be such a reach.

If you have a proprietary mapping between coordinates and bytes (such as Morton order), this complicates the relationship between the planes. But the actual relationship is not defined by the data format descriptor, so that's the user's problem. (And it's not that complicated.)

How can the DFD not cover order of storage of the planes?

Because the planes are stored independently in memory. Indeed, many systems allow arbitrary independent Y, U and V planes (which may have been processed separately) to be combined to give a single "YUV" image. The data format descriptor describes "formats"; the memory location of pixels (let alone planes) is independent of this.

Isn't the offset related to the distance in memory from one plane to the next?

And indeed the stride of the planes. I've also met architectures for which the planes are consecutive, but have a defined amount of padding between them (because the different planes are accessed by coordinates, but the data will be given a proprietary tile swizzle before use). Since the "format" doesn't change just because the size of the image changes, this is considered to be outside the remit of the data format descriptor. As mentioned in the 'required concepts not in the "format"' chapter, the intent isn't to provide a complete description required for an image - there's quite enough to worry about with the pure "format".

I do intend to provide a slightly more explicit example of all of this in a forthcoming spec revision - the questions in this discussion help to guide that, so thank you.

Nov 01 '18 10:11 fluppeteer

This is still a blocking issue.

From the spec's vkFormat definition:

It can be any value defined in core Vulkan 1.1

and

Table 1. Prohibited Formats doesn't list these formats.

Apr 29 '19 16:04 lexaknyazev

CTTF TSG Telecon 5/20/19. Given that no hardware correctly supports the transfer functions needed for YUV etc., and that hardware returns RGB to shaders during sampling, are the multi-plane formats really useful as texture storage formats? The main use is texturing video into a scene, and nearest filtering is generally recommended because of the transfer function issues.

Actions:

@dewilkinson to query some devtech folks about the importance of these formats.
@lexaknyazev to give us a list of the extra information needed in a KTX2 file, in case we decide to proceed.

May 20 '19 18:05 MarkCallow

Vulkan's handling of these formats is very explicit and requires some care to get things right.

First of all,

To be used with VkImageView with subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT, sampler Y′CbCr conversion must be enabled for the following formats.

The sampler Y′CbCr conversion is defined by:

- Component swizzling
  - Happens before other sampler operations
  - Mutually exclusive with "regular" swizzling
  - There are quite a few restrictions on valid combinations, such as:

    If the format has a _422 or _420 suffix, then components.r must be VK_COMPONENT_SWIZZLE_IDENTITY or VK_COMPONENT_SWIZZLE_B.
- ycbcrModel
  - RGB (untouched)
  - Y′CbCr (apply only range expansion)
  - Y′CbCr 601/709/2020 (range expansion + conversion to RGB)
- ycbcrRange
  - full
  - narrow
- {x,y}ChromaOffset
  - cositedEven
  - midpoint

AFAIU, the DFD can supply this information.
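
For readers unfamiliar with the range and model steps, the following is a rough Python sketch of narrow/full range expansion followed by conversion to RGB. It is an illustration under stated assumptions, not the Vulkan implementation: the function names are hypothetical, the default Kr and Kb are the BT.709 luma coefficients, and chroma reconstruction (the chroma offsets), swizzling and clamping are left out.

```python
def expand_range(y, cb, cr, bits=8, narrow=True):
    """Map integer codes to Y' in [0, 1] and Cb/Cr centred on 0."""
    scale = 1 << (bits - 8)
    if narrow:
        # Narrow range: 16..235 for luma, 16..240 for chroma (scaled for >8 bits).
        return ((y - 16 * scale) / (219 * scale),
                (cb - 128 * scale) / (224 * scale),
                (cr - 128 * scale) / (224 * scale))
    max_code = (1 << bits) - 1
    mid = 1 << (bits - 1)
    return (y / max_code, (cb - mid) / max_code, (cr - mid) / max_code)


def ycbcr_to_rgb(y, cb, cr, kr=0.2126, kb=0.0722):
    """Convert range-expanded Y'CbCr to R'G'B' (BT.709 coefficients by default)."""
    kg = 1.0 - kr - kb
    r = y + 2.0 * (1.0 - kr) * cr
    b = y + 2.0 * (1.0 - kb) * cb
    # Luma is defined as kr*R + kg*G + kb*B, so solve that for G.
    g = (y - kr * r - kb * b) / kg
    return r, g, b
```

With narrow-range 8-bit input, the code triple (235, 128, 128) expands to (1, 0, 0) and converts to white, while (16, 128, 128) converts to black.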

There are 30 formats in total. They could be grouped like:

Single-plane (10 formats): KTX2 should be able to handle them as is (assuming the correct DFD).
- Single-resolution (2 formats):
  - 10 or 12 bits per color channel with zero-filled extra bits.
- Multi-resolution (8 formats, 2x1 block, red and blue channels are recorded at half the horizontal resolution):
  - 8/10/12/16 bits per color channel
  - GBGR or BGRG order
Multi-planar (20 formats): AFAIU, implementations would have to query the runtime about expected memory locations of each plane. For the KTX2 spec we have, I think, only one option: to store the planes sequentially.
- Full resolution (444, 3 planes)
  - 8/10/12/16 bits per color channel
- RB at half the horizontal resolution (422)
  - 8/10/12/16 bits per color channel
  - RB can be stored together or separately (G_B_R or G_BR)
- RB at half the resolution in both dimensions (420)
  - 8/10/12/16 bits per color channel
  - RB can be stored together or separately (G_B_R or G_BR)
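
To illustrate what "store the planes sequentially" could mean in practice, here is a sketch that computes byte offsets for planes laid out back to back. Everything here is an assumption for the sake of the example: the function name and plane description tuples are hypothetical, rows are tightly packed, and no padding or alignment is inserted between planes. Nothing in this sketch is taken from the KTX2 spec.

```python
def sequential_plane_offsets(width, height, bytes_per_sample, planes):
    """Return [(offset, size)] for planes stored one after another.

    planes: list of (x_divisor, y_divisor, samples_per_texel) tuples, e.g.
    (2, 2, 2) for an interleaved half-resolution BR plane.
    Assumes tightly packed rows and no inter-plane padding.
    """
    layout = []
    offset = 0
    for x_div, y_div, samples in planes:
        plane_w = width // x_div
        plane_h = height // y_div
        size = plane_w * plane_h * samples * bytes_per_sample
        layout.append((offset, size))
        offset += size
    return layout


# Hypothetical descriptions of two of the formats listed above.
G8_B8R8_2PLANE_420 = [(1, 1, 1), (2, 2, 2)]
G10X6_B10X6_R10X6_3PLANE_420 = [(1, 1, 1), (2, 2, 1), (2, 2, 1)]
```

For a 4×4 image, the two-plane 8-bit layout would put the interleaved BR plane at byte 16, directly after the 16 bytes of G.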

May 21 '19 18:05 lexaknyazev

Interestingly, some of these formats can be roughly mapped to other APIs.

The following assumptions should be carefully verified before updating the spec.

Metal

MTLPixelFormatGBGR422 and MTLPixelFormatBGRG422 look very similar to VK_FORMAT_G8B8G8R8_422_UNORM and VK_FORMAT_B8G8R8G8_422_UNORM, although Metal doesn't perform any YUV-to-RGB conversion.
MTLPixelFormatBGRA10_XR and MTLPixelFormatBGRA10_XR_sRGB have a layout that is similar to VK_FORMAT_R10X6G10X6B10X6A10X6_UNORM_4PACK16 with red and blue channels swapped. There's also a linear mapping to [-0.752941 .. 1.25098].
MTLPixelFormatBGR10_XR and MTLPixelFormatBGR10_XR_sRGB also have swapped channels and an additional linear mapping (as above).

Direct3D

DXGI_FORMAT_R8G8_B8G8_UNORM and DXGI_FORMAT_G8R8_G8B8_UNORM map to VK_FORMAT_G8B8G8R8_422_UNORM and VK_FORMAT_B8G8R8G8_422_UNORM. An interesting comment about D3D usage.
DXGI_FORMAT_R10G10B10_XR_BIAS_A2_UNORM looks similar to Metal's MTLPixelFormatBGR10_XR but with 2 bits of alpha and swapped red and blue.

Also, dedicated video formats:

DXGI_FORMAT_AYUV -> VK_FORMAT_B8G8R8A8_UNORM.
DXGI_FORMAT_Y410 -> swizzled VK_FORMAT_A2B10G10R10_UNORM_PACK32.
DXGI_FORMAT_Y416 -> swizzled VK_FORMAT_R16G16B16A16_UNORM.
DXGI_FORMAT_NV12 -> VK_FORMAT_G8_B8R8_2PLANE_420_UNORM.
DXGI_FORMAT_P010 -> VK_FORMAT_G10X6_B10X6R10X6_2PLANE_420_UNORM_3PACK16
DXGI_FORMAT_P016 -> VK_FORMAT_G16_B16R16_2PLANE_420_UNORM
DXGI_FORMAT_YUY2 -> VK_FORMAT_G8B8G8R8_422_UNORM
FourCC UYVY -> VK_FORMAT_B8G8R8G8_422_UNORM
DXGI_FORMAT_Y210 -> VK_FORMAT_G10X6B10X6G10X6R10X6_422_UNORM_4PACK16
DXGI_FORMAT_Y216 -> VK_FORMAT_G16B16G16R16_422_UNORM
DXGI_FORMAT_NV11 -> no Vulkan support for 4:1:1 subsampling
FourCC IMC1/2/3/4 and YV12 -> swizzled VK_FORMAT_G8_B8_R8_3PLANE_420_UNORM
DXGI_FORMAT_P208 -> VK_FORMAT_G8_B8R8_2PLANE_422_UNORM
FourCC P216 -> VK_FORMAT_G16_B16R16_2PLANE_422_UNORM
FourCC P210 -> VK_FORMAT_G10X6_B10X6R10X6_2PLANE_422_UNORM_3PACK16

May 21 '19 21:05 lexaknyazev

Thanks for the thorough info, @lexaknyazev.

May 22 '19 00:05 MarkCallow

Some comments from the author of the Vulkan YUV extensions.

These are regarding the extra information needed in a KTX2 file.

Mostly the issue is that, while the DFD doesn't say what the relationship is between the memory for a pixel in one plane and in another plane, it also doesn't say what the relationship is between coordinates and memory for a single-plane image - so in that sense multi-planar images aren't special.

For decoding using a DFD, the DFD may just say (for the easy example of 8-bit planar 4:2:2) "two bytes from the first plane, one byte from the second plane, one byte from the third plane; of these, Y(x,y) is in the first 8 bits of the first two bytes; Y(x+1,y) is in the second 8 bits of the first two bytes, Cb(x+0.5,y) is in the 8 bits from the second plane, Cr(x+0.5,y) is in the 8 bits from the third plane". How you got those three planes' worth of data together is your problem. And of course this is implicit - just because it's described that way doesn't mean that it'll be the best way to implement a decoder.
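
The 8-bit planar 4:2:2 description above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the planes are assumed to be linear byte buffers with the given strides, and, as the comment says, a real decoder need not be structured this way.

```python
def gather_422_block(x, y, y_plane, y_stride, cb_plane, cb_stride, cr_plane, cr_stride):
    """Concatenate the bytes of one 2x1 texel block: 2 from Y, 1 from Cb, 1 from Cr."""
    bx = x // 2                              # index of the 2x1 block along the row
    y_start = y * y_stride + 2 * bx
    y_bytes = y_plane[y_start:y_start + 2]
    cb_byte = cb_plane[y * cb_stride + bx]
    cr_byte = cr_plane[y * cr_stride + bx]
    return bytes(y_bytes) + bytes([cb_byte, cr_byte])


def decode_422_block(block, x, y):
    """Map the 4 gathered bytes to samples keyed by (channel, sample_x, sample_y)."""
    x0 = (x // 2) * 2                        # left pixel of the block
    y0, y1, cb, cr = block
    return {
        ("Y", x0, y): y0,
        ("Y", x0 + 1, y): y1,
        ("Cb", x0 + 0.5, y): cb,             # chroma sits midway between the two lumas
        ("Cr", x0 + 0.5, y): cr,
    }
```

How the three buffers were located in the first place (file offsets, memory addresses) is deliberately outside this sketch, mirroring the point that the DFD leaves it to the container or application.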

Alexey has proposed sequential. Seems reasonable to me.

Absolutely. Vulkan has a user-controlled DISJOINT bit which lets you decide whether you want to have your planes separate, and that should be the problem of the application, not the format (it's outside KTX2's remit). For a file, you get to choose where you want to put them, just as you don't need to specify a memory address for a single-plane image in the file.

Another tidbit:

The DXGI naming convention seems to be low-to-high-bit, the reverse of Microsoft's previous choice (https://docs.microsoft.com/en-us/windows/desktop/direct3d9/d3dformat) - see the definition of DXGI_FORMAT_R9G9B9A5_SHAREDEXP in https://docs.microsoft.com/en-us/windows/desktop/api/dxgiformat/ne-dxgiformat-dxgi_format (because it's not made obvious for most formats until you get to the remarks at the end). On the other hand, Vulkan packed formats (and GL's) use the D3D9 ordering convention.

As such, DXGI_FORMAT_Y410 is described as having a "view format" of DXGI_FORMAT_R10G10B10A2_*, which in Vulkan speak is VK_FORMAT_A2B10G10R10_*_PACK32.
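
The naming difference can be checked with a short sketch that derives bit positions from the order of fields in a format name. The helper and the two field lists are hypothetical illustrations: one list is read most-significant-field-first (the Vulkan packed convention), the other least-significant-first (the DXGI convention described above).

```python
def field_shifts(fields, first_is_msb):
    """Return {name: (shift, width)} for a packed word described by an ordered field list."""
    order = list(reversed(fields)) if first_is_msb else list(fields)
    shifts = {}
    shift = 0
    for name, width in order:
        shifts[name] = (shift, width)
        shift += width
    return shifts


# A2B10G10R10 packed 32-bit naming: first field is the most significant.
VULKAN_A2B10G10R10 = [("A", 2), ("B", 10), ("G", 10), ("R", 10)]
# R10G10B10A2 naming: first field is the least significant.
DXGI_R10G10B10A2 = [("R", 10), ("G", 10), ("B", 10), ("A", 2)]

vk_layout = field_shifts(VULKAN_A2B10G10R10, first_is_msb=True)
dxgi_layout = field_shifts(DXGI_R10G10B10A2, first_is_msb=False)
```

Both calls yield the same layout, with R in bits 0..9 and A in bits 30..31, which is why the two differently spelled names describe the same 32-bit word.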

May 30 '19 22:05 MarkCallow

@dewilkinson to query some devtech folks about the importance of these formats.

@dewilkinson did you get any information? We still need to determine which of these formats, if any, KTX2 should support.

Nov 13 '19 11:11 MarkCallow
