MoltenVK
PVRTC texture content loading via memory mapping is not a valid solution.
In response to issue https://github.com/KhronosGroup/MoltenVK/issues/359, the MoltenVK documentation now states:
"Image content in 'PVRTC' compressed formats must be loaded directly into a 'VkImage' using host-visible memory mapping. Loading via a staging buffer will result in malformed image content."
(https://github.com/KhronosGroup/MoltenVK/blob/master/Docs/MoltenVK_Runtime_UserGuide.md#limitations)
This statement is incorrect, and this method fails to load textures properly on newer Apple devices. The malformed image content seen with the "standard" Vulkan approach (staging buffer) is caused by the 'MTLBlitOptionRowLinearPVRTC' option, which expects the data to have a linear layout (as discussed in #359), but we can't simply use a direct data copy instead: PVRTC requires VK_IMAGE_TILING_OPTIMAL as the image tiling, which results in an implementation-dependent arrangement of data in GPU memory.
If we use the following "test" texture (original size 2048x2048):
and try to load it using memory mapping, it loads fine on older iPhones and iPads (those that use PowerVR GPUs), but on those that use Apple's own GPU we see this image:
The PVRTC format stores its compressed blocks in Morton order (https://en.wikipedia.org/wiki/Z-order_curve), and I assume this layout is consistent across all PowerVR GPUs, but Apple's approach is different. We can't know the layout of texture data in memory at runtime when using VK_IMAGE_TILING_OPTIMAL, so the only viable solution that works is to "untwiddle" the PVRTC data and use the standard "staging buffer" approach.
This code can be used to "untwiddle" PVRTC data:
#include <algorithm>
#include <cstdint>
#include <cstring>

// Spreads the bits of a 32-bit value into the even bit positions of a 64-bit value.
inline uint64_t SeparateBits32( uint32_t a )
{
    uint64_t x = a;
    x = (x | x << 16) & 0x0000FFFF0000FFFF;
    x = (x | x << 8)  & 0x00FF00FF00FF00FF;
    x = (x | x << 4)  & 0x0F0F0F0F0F0F0F0F;
    x = (x | x << 2)  & 0x3333333333333333;
    x = (x | x << 1)  & 0x5555555555555555;
    return x;
}

// Interleaves X and Y block coordinates into a Morton-order (Z-order) index.
inline uint64_t MortonOrderSquare( uint32_t X , uint32_t Y )
{
    return (SeparateBits32( X ) << 1) | SeparateBits32( Y );
}

inline bool UnTwiddlePVRData( const uint8_t* pInData , size_t InDataSize , uint8_t* pOutData , size_t OutDataSize , uint32_t Width , uint32_t Height , uint8_t BPP )
{
    if( BPP != 2 && BPP != 4 )
        return false;

    auto IsPowerOfTwo = []( uint32_t Val ){ return ((Val - 1) & Val) == 0; };

    // Metal requires PVRTC textures to be square and power of 2
    if( Width != Height || !IsPowerOfTwo( Width ) || !IsPowerOfTwo( Height ) )
        return false;

    const uint32_t XBlockSize = BPP == 4 ? 4 : 8;
    const uint32_t YBlockSize = 4;
    const auto BlockXDim = std::max( 2u , Width / XBlockSize );
    const auto BlockYDim = std::max( 2u , Height / YBlockSize );
    const auto ExpectedSize = size_t(BlockXDim) * size_t(BlockYDim) * 8;
    if( InDataSize != ExpectedSize || OutDataSize != ExpectedSize )
        return false;

    // Each 8-byte block is read from its Morton-order position and written out linearly
    for( uint32_t y = 0; y < BlockYDim ; ++y )
        for( uint32_t x = 0; x < BlockXDim ; ++x )
            memcpy( pOutData + (y * BlockXDim + x) * 8 , pInData + MortonOrderSquare( x , y ) * 8 , 8 );

    return true;
}
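For reference, here is a minimal sketch of how the helper above might be called when preparing a single mip level for upload. The size computation mirrors the checks inside the function; `PrepareLinearPVRData` is a hypothetical wrapper, not code from this project:

#include <vector>

// Hypothetical helper: untwiddle one PVRTC1 mip level into a linear byte vector.
inline bool PrepareLinearPVRData( const uint8_t* pTwiddledData , uint32_t Width , uint32_t Height ,
                                  uint8_t BPP , std::vector<uint8_t>& OutLinear )
{
    const uint32_t XBlockSize = (BPP == 4) ? 4u : 8u;
    const size_t   DataSize   = size_t(std::max( 2u , Width / XBlockSize )) *
                                size_t(std::max( 2u , Height / 4u )) * 8;

    OutLinear.resize( DataSize );
    // Fails for non-square / non-power-of-two dimensions or a size mismatch
    return UnTwiddlePVRData( pTwiddledData , DataSize , OutLinear.data() , DataSize , Width , Height , BPP );
}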
@ONEMK
Thanks for all your research. Very helpful!
Are you indicating that the Metal replaceRegion:... operation will not correctly load a PVR-formatted image in memory for some newer Apple SoCs?
Which Apple SoCs have a GPU for which this will not work?
Apple's documentation for that method indicates that it does still support PVRTC formats. Perhaps we should be filing a bug report with Apple about that, if this is the case?
Thanks for contributing the code patch to untwiddle the PVRTC layout. Are you thinking this operation is something the app should do before loading into the VkBuffer? Or are you thinking MoltenVK could somehow automatically perform this untwiddling operation on the buffer memory when processing a vkCmdCopyBufferToImage() command, before submitting it to Metal?
BTW...can you post the PVR texture you use above, as a PVR file, please? It will prove helpful as a test texture.
Since I don't have any experience with Metal, I can't confirm whether "replaceRegion" will work. We picked MoltenVK because we wanted to avoid writing a completely new renderer backend for only one platform, especially with an API as low-level as Metal or Vulkan. I assume that using 'replaceRegion' would require me to build MoltenVK from source and patch 'vkCmdCopyBufferToImage'?
My previous attempt (which worked on PowerVR) was to:
- Create an image with format VK_FORMAT_PVRTC1_4BPP_UNORM_BLOCK_IMG and tiling VK_IMAGE_TILING_OPTIMAL
- Allocate memory for it
- Map the image memory
- memcpy the PVRTC data into it
- Unmap
This code is a simplified version of what I previously had:
VkImageCreateInfo ImageInfo = {};
ImageInfo.sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
ImageInfo.imageType = VK_IMAGE_TYPE_2D;
ImageInfo.extent.width = Width;
ImageInfo.extent.height = Height;
ImageInfo.extent.depth = 1;
ImageInfo.mipLevels = 1;
ImageInfo.arrayLayers = 1;
ImageInfo.format = VK_FORMAT_PVRTC1_4BPP_UNORM_BLOCK_IMG;
ImageInfo.tiling = VK_IMAGE_TILING_OPTIMAL;
ImageInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
ImageInfo.usage = VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT;
ImageInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
ImageInfo.samples = VK_SAMPLE_COUNT_1_BIT;
if( vkCreateImage( m_pContext->m_Device , &ImageInfo , nullptr, &m_Handle ) != VK_SUCCESS )
return false;
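// Query the image's memory requirements and allocate host-visible memory for it through VMA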
VkMemoryRequirements MemRequirements = {};
vkGetImageMemoryRequirements( m_pContext->m_Device , m_Handle , &MemRequirements );
auto AllocInfo = VmaAllocationCreateInfo{};
AllocInfo.usage = VMA_MEMORY_USAGE_GPU_ONLY;
AllocInfo.requiredFlags = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
AllocInfo.preferredFlags= VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT;
VmaAllocation Allocation = nullptr;
if( vmaAllocateMemory( m_Allocator , &MemRequirements , &AllocInfo , &Allocation , nullptr ) != VK_SUCCESS )
return false;
if( vmaBindImageMemory( m_Allocator , Allocation , m_Handle ) != VK_SUCCESS )
return false;
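// Map the image memory, copy the PVRTC data straight in, then unmap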
{
uint8_t* pMemory = nullptr;
if( vmaMapMemory( m_Allocator , Allocation , (void**)&pMemory ) != VK_SUCCESS )
return false;
memcpy( pMemory , pPVRTC_ImageData , PVRTC_ImageDataSize );
vmaUnmapMemory( m_Allocator , Allocation );
}
This code worked on:
- iPhone 7S+ (2016), CPU: Apple A10 Fusion, iOS: 12.3.1 (Darwin), Metal: 61.1
- iPad Pro (10.5 inch), CPU: Apple A10X, iOS: 13.4.1

and failed on:
- iPad Pro (12.9 inch, 3rd generation, 2018), CPU: Apple A12X Bionic, iOS: 13.3 (Darwin), Metal: 66.6
- iPhone 11, CPU: Apple A13 Bionic, iOS: 13.4.1
I think that untwiddling the data during 'vkCmdCopyBufferToImage' could be a good solution, since it would ensure that loading compressed textures via a staging buffer works on all platforms without any nasty hacks dedicated to iOS. My current solution is to simply untwiddle the data when I fill the staging buffer. This is a simplified version of the code I have now:
#if VKL_PLATFORM_APPLE
if( IsPVRTCFormat( Format ) )
{
    uint8_t* pBufferMemory = nullptr;
    if( vmaMapMemory( m_Allocator , BufferMemory , (void**)&pBufferMemory ) != VK_SUCCESS )
        return false;
    if( !UnTwiddlePVRData( pPVRTC_ImageData , PVRTC_ImageDataSize , pBufferMemory , BufferSize , Width , Height , BPP ) )
    {
        vmaUnmapMemory( m_Allocator , BufferMemory );
        return false;
    }
    vmaUnmapMemory( m_Allocator , BufferMemory );
}
else
#endif
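After the staging buffer has been filled (untwiddled on Apple GPUs, unchanged elsewhere), the upload itself is just an ordinary buffer-to-image copy. A minimal sketch of that step, assuming a command buffer `Cmd` is recording, `StagingBuffer` is the buffer filled above, and the image has already been transitioned to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL (these names are placeholders, not the exact code from this project):

VkBufferImageCopy Region = {};
Region.bufferOffset      = 0;
Region.bufferRowLength   = 0; // tightly packed
Region.bufferImageHeight = 0;
Region.imageSubresource.aspectMask     = VK_IMAGE_ASPECT_COLOR_BIT;
Region.imageSubresource.mipLevel       = 0;
Region.imageSubresource.baseArrayLayer = 0;
Region.imageSubresource.layerCount     = 1;
Region.imageExtent = { Width , Height , 1 };

vkCmdCopyBufferToImage( Cmd , StagingBuffer , m_Handle ,
                        VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL , 1 , &Region );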
@billhollings
Here's how we handle this issue in another layered implementation. Right before calling copyFromBuffer with PVRTC1 formats, do the following:
- If the texture's height is less than 8, skip all these steps and call copyFromBuffer as usual.
- Assert that the texture's width and height in pixels are powers of two (inherent PVRTC1 requirement).
- Assert that the texture's width and height are equal (Metal-specific requirement).
- Assert that the texture's width and height in pixels are not greater than 262144 (current kernel limit, could be lifted later).
- Assert that sourceOffset is a multiple of 8.
- Create a compute pipeline with the following kernel (a CPU-side equivalent is sketched after this list):

kernel void linearizePVRTC(ushort2 position [[thread_position_in_grid]],
                           constant uint2 *blocks [[buffer(0)]],
                           constant uint2 *srcBuffer [[buffer(1)]],
                           device uint2 *dstBuffer [[buffer(2)]])
{
    if (any(uint2(position) >= *blocks))
        return;
    uint2 t = uint2(position);
    t = (t | (t << 8)) & 0x00FF00FF;
    t = (t | (t << 4)) & 0x0F0F0F0F;
    t = (t | (t << 2)) & 0x33333333;
    t = (t | (t << 1)) & 0x55555555;
    dstBuffer[position.y * (*blocks).x + position.x] = srcBuffer[(t.x << 1) | t.y];
}

- Pass the following data as buffer(0):

uint32_t blocks[2]; // buffer(0)
blocks[0] = std::max(width / (_2bpp ? 8 : 4), 2);
blocks[1] = height / 4;

- Attach the original buffer with sourceOffset as buffer(1).
- Allocate a temporary buffer to hold the texture's bytes and attach it as buffer(2).
- Dispatch the kernel using blocks[0] and blocks[1] as the intended grid size.
- Call copyFromBuffer using buffer(2) and zero offset.
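The bit math in the kernel performs the same Morton-order decode as the MortonOrderSquare helper earlier in this thread, just computed per block on the GPU. For reference only (this is not part of either implementation), a CPU-side C++ equivalent of what one kernel thread does:

#include <cstdint>

// Copies the 8-byte PVRTC block for grid position (x, y) from Morton (twiddled)
// order in pSrc to linear row-major order in pDst. BlocksX is the grid width in blocks.
// Valid for block coordinates below 2^16, matching the kernel's 32-bit math.
inline void LinearizeBlock( const uint64_t* pSrc , uint64_t* pDst ,
                            uint32_t x , uint32_t y , uint32_t BlocksX )
{
    uint32_t tx = x, ty = y;
    tx = (tx | (tx << 8)) & 0x00FF00FF;  ty = (ty | (ty << 8)) & 0x00FF00FF;
    tx = (tx | (tx << 4)) & 0x0F0F0F0F;  ty = (ty | (ty << 4)) & 0x0F0F0F0F;
    tx = (tx | (tx << 2)) & 0x33333333;  ty = (ty | (ty << 2)) & 0x33333333;
    tx = (tx | (tx << 1)) & 0x55555555;  ty = (ty | (ty << 1)) & 0x55555555;
    pDst[y * BlocksX + x] = pSrc[(tx << 1) | ty];
}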
Couple more notes:
- 262144 size limit comes from 32-bit math being used for kernel operations. May be increased by rewriting the kernel's body with 64-bit math. Does not really matter in practice because max texture size on all Apple devices is 16384.
- FWIW, mapping MTLBuffer contents to CPU-visible memory and reuploading it with replaceRegion also works for us.
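For completeness, a rough sketch of that last approach using the metal-cpp bindings; `UploadPVRTCLevel`, `pTexture`, and `pPVRData` are hypothetical names, and the data passed is simply the buffer contents as the app supplied them. For PVRTC formats, Metal requires bytesPerRow to be 0:

#include <cstdint>
#include <Metal/Metal.hpp>

// Hypothetical re-upload of one PVRTC level via replaceRegion.
void UploadPVRTCLevel( MTL::Texture* pTexture , const void* pPVRData ,
                       uint32_t Width , uint32_t Height )
{
    pTexture->replaceRegion( MTL::Region::Make2D( 0 , 0 , Width , Height ) ,
                             /*mipmapLevel*/ 0 ,
                             pPVRData ,
                             /*bytesPerRow*/ 0 ); // 0 is required for PVRTC formats
}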
> Here's how we handle this issue in another layered implementation.
Thanks very much for the GPU-based code suggestion!