astc-encoder
Preview of new decompressor
Background
We are working on a new standalone, single-file header library implementation of an ASTC decompressor. The goals for this are:
- Provide a fast fallback on platforms without ASTC available in hardware.
- Be fast enough that single threaded decompression is probably "enough", but allow threading via a user-managed thread pool.
- Be simple to integrate (single-file header with a minimal number of entry points).
Even though this branch is an astcenc integration for demonstration purposes, it is likely we will not productize this using the existing astcenc library API, as it is really too heavy for what this is trying to achieve.
Basic user guide
In exactly one source file in your project, #define ASTC_SIMD_DECODE_IMPLEMENTATION before including astcenc_simd_decode.h to pull in the function definitions.
The basic API usage is:
// Extract some basic image properties
uint64_t decoded_buffer_size = xdim * ydim * zdim * (do_hdr_decode ? 8 : 4);
uint8_t *decoded_buffer = (uint8_t *)malloc( decoded_buffer_size );
intptr_t src_row_stride = xblocks * 16;
intptr_t src_layer_stride = xblocks * yblocks * 16;
intptr_t dest_row_stride = xdim * (do_hdr_decode ? 8 : 4);
intptr_t dest_layer_stride = xdim * ydim * (do_hdr_decode ? 8 : 4);
// Populate the parameters structure
astc_simd_decode_params_t pars;
pars.dst_start_ptr = decoded_buffer;
pars.src_start_ptr = astc_payload;
pars.src_row_stride = src_row_stride;
pars.src_layer_stride = src_layer_stride;
pars.dst_row_stride = dest_row_stride;
pars.dst_layer_stride = dest_layer_stride;
pars.xres = xdim;
pars.yres = ydim;
pars.zres = zdim;
pars.block_xdim = ahdr.blockdim_x;
pars.block_ydim = ahdr.blockdim_y;
pars.block_zdim = ahdr.blockdim_z;
pars.decode_flags = do_hdr_decode ? 1 : 0;
// Generate the pre-processed parameters
// This only needs doing once and can be shared across threads
astc_simd_decode_processed_params_t *bd_args = (astc_simd_decode_processed_params_t *)malloc( astc_simd_decode_processed_params_size );
uint32_t number_of_rows = astc_simd_decode_prepare(bd_args, &pars );
// Iterate the decoder over the line buffers - this can be multithreaded if needed
for (uint32_t y = 0; y < number_of_rows; y++)
{
astc_simd_decode_row_iterate(bd_args, y);
}
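The usage above relies on xblocks and yblocks without showing how they are derived. A minimal sketch of that arithmetic follows, assuming the block counts round up so that partial edge blocks are included, and using the fact that every ASTC block is 128 bits (16 bytes). The names astc_layout_t and astc_compute_layout are hypothetical helpers for illustration and are not part of astcenc_simd_decode.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type for illustration; not part of the decoder API. */
typedef struct
{
    uint32_t xblocks, yblocks, zblocks; /* Block counts per axis */
    size_t src_row_stride;              /* Bytes per row of compressed blocks */
    size_t src_layer_stride;            /* Bytes per layer of compressed blocks */
    size_t src_size;                    /* Total compressed payload bytes */
    size_t dst_row_stride;              /* Bytes per row of decoded texels */
    size_t dst_layer_stride;            /* Bytes per decoded 2D slice */
    size_t dst_size;                    /* Total decoded buffer bytes */
} astc_layout_t;

/* Integer division that rounds up, so partial edge blocks are counted. */
static uint32_t div_round_up(uint32_t n, uint32_t d)
{
    return (n + d - 1) / d;
}

/* Compute block counts, strides, and buffer sizes for one image.
 * hdr != 0 selects 8 bytes per texel (4 x F16), otherwise 4 bytes (4 x U8),
 * matching the (do_hdr_decode ? 8 : 4) expressions in the usage above. */
static astc_layout_t astc_compute_layout(
    uint32_t xdim, uint32_t ydim, uint32_t zdim,
    uint32_t block_x, uint32_t block_y, uint32_t block_z,
    int hdr)
{
    astc_layout_t l;
    const size_t block_bytes = 16; /* Every ASTC block is 128 bits */
    const size_t texel_bytes = hdr ? 8 : 4;

    l.xblocks = div_round_up(xdim, block_x);
    l.yblocks = div_round_up(ydim, block_y);
    l.zblocks = div_round_up(zdim, block_z);

    l.src_row_stride = (size_t)l.xblocks * block_bytes;
    l.src_layer_stride = l.src_row_stride * l.yblocks;
    l.src_size = l.src_layer_stride * l.zblocks;

    l.dst_row_stride = (size_t)xdim * texel_bytes;
    l.dst_layer_stride = l.dst_row_stride * ydim;
    l.dst_size = l.dst_layer_stride * zdim;
    return l;
}
```

The resulting stride fields map directly onto the src_row_stride, src_layer_stride, dst_row_stride, and dst_layer_stride members of the parameters structure, and dst_size is the allocation size for the decoded buffer.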
Status
This branch is an integration of the decompressor into the existing astcenc library, as a preview of what it can do. Note that this is just a preview, and does not provide a complete implementation of all astcenc API functionality. Missing functionality may crash rather than report an error cleanly. Current known issues are:
- Decompressing 3D textures is supported by the new decompressor, but the output image must be N contiguous slices rather than the N disjoint slices in the current API. The output image slice[0] must point to a memory allocation large enough to store all xyz texels. Other slice array index values are ignored.
- Only the identity swizzle is supported. Output image swizzle parameter is ignored.
- Only the U8 and F16 output types are supported. The F32 type does not fail, and will currently return F16 data.
- CPU target MUST support one of SSE3, AVX2 or AArch64 NEON. No "no SIMD" or SSE2 fallback is provided.
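The contiguous-slice requirement for 3D textures can be sketched as follows. This is an illustration only: alloc_contiguous_slices is a hypothetical helper, not part of astcenc or the new decoder. It makes a single allocation large enough for all xyz texels and derives a pointer to each slice from it, so slice[0] is the base of the whole allocation and the remaining pointers are a convenience for the caller when reading results back.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper for illustration; not part of any astcenc API.
 *
 * Allocates one contiguous buffer holding zdim slices of xdim * ydim texels
 * and fills slices[z] with a pointer to the start of slice z.
 * texel_bytes is 4 for U8 RGBA output or 8 for F16 RGBA output.
 * Returns the base pointer (equal to slices[0]), or NULL on failure.
 * The caller releases the whole image with a single free(base). */
static uint8_t *alloc_contiguous_slices(
    uint32_t xdim, uint32_t ydim, uint32_t zdim,
    size_t texel_bytes, uint8_t **slices)
{
    size_t layer_bytes = (size_t)xdim * ydim * texel_bytes;
    uint8_t *base = (uint8_t *)malloc(layer_bytes * zdim);
    if (base == NULL)
    {
        return NULL;
    }

    /* Slice z starts exactly one layer stride after slice z - 1 */
    for (uint32_t z = 0; z < zdim; z++)
    {
        slices[z] = base + (size_t)z * layer_bytes;
    }

    return base;
}
```

Because the decoder in this preview only reads slice[0], the other entries do not affect decompression; they just give the caller a way to address each decoded slice afterwards.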
The code seems functional - extensive fuzz testing against the existing astcenc decompressor shows bit-exact output for everything we have tested against it. The code does need some stylistic cleanups, which I'll be working on next.
Memory usage
Decompression uses full-image intermediate working buffers for performance, so it has a higher working memory footprint than the current astcenc, which only operates on a single block at a time.
Performance
When decompressing a single Kodak suite montage image with AVX2, the existing decompressor (astcenc 3.4) reaches:
- 1 core: 58MT/s
- 4 cores: 180MT/s (peak - adding more cores slows it down)
With AVX2 this decompressor reaches:
- 1 core: 190MT/s
- 4 cores: ~440MT/s
With a sufficiently large image and warmed up caches, a single thread can hit ~500MT/s although that is unlikely in real usage scenarios. We view this performance as sufficiently fast that other aspects of decompression (reading and writing files) are likely to become the more significant issues.
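To put these rates in terms of per-image time, the conversion can be sketched as below, assuming MT/s means millions of texels per second. The function decode_time_ms is a hypothetical helper, and the result is a rough extrapolation from the figures above rather than a measurement, since real throughput depends on image size, block size, and cache state.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only.
 * Estimate the decode time in milliseconds for an image with the given
 * texel count at a sustained rate given in millions of texels per second. */
static double decode_time_ms(uint64_t texels, double mtexels_per_second)
{
    double seconds = (double)texels / (mtexels_per_second * 1.0e6);
    return seconds * 1.0e3;
}
```

For example, decode_time_ms((uint64_t)xdim * ydim, 190.0) gives an estimate for a 2D image at the single-core rate quoted above.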