[SDL3] GPU: Investigate adding a Query API
Not sure if the other supported APIs have an equivalent of Vulkan's VkQueryPool, but it would be great to have this exposed (via some SDL API wrapping) so that GPU information can be queried.
What's your use case? We would have to do some background research to identify a good cross-platform API surface for this.
Mostly timing measurements in the rendering pipelines.
VK_CHECK(vkGetQueryPoolResults(device, queryPool, 0, ARRAYSIZE(queryResults), sizeof(queryResults), queryResults, sizeof(queryResults[0]), VK_QUERY_RESULT_64_BIT));
double frameGpuBegin = double(queryResults[0]) * props.limits.timestampPeriod * 1e-6;
double frameGpuEnd = double(queryResults[1]) * props.limits.timestampPeriod * 1e-6;
Once again, this might be too specific to Vulkan and troublesome or nonexistent in other APIs (I haven't worked with Metal or DX at all).
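For reference, the conversion in the snippet above can be sketched as a small helper. This is only an illustration: `gpu_ticks_to_ms` is a hypothetical name, and it assumes the period is given in nanoseconds per tick, which is how Vulkan defines `timestampPeriod`.

```c
#include <stdint.h>

/* Hypothetical helper: converts the span between two raw GPU timestamps
 * to milliseconds. timestamp_period_ns is nanoseconds per tick, as in
 * VkPhysicalDeviceLimits::timestampPeriod. */
static double gpu_ticks_to_ms(uint64_t begin_ticks, uint64_t end_ticks,
                              float timestamp_period_ns)
{
    /* Subtract in integer space first so that large absolute tick values
     * do not lose precision when converted to double. */
    uint64_t delta_ticks = end_ticks - begin_ticks;

    /* ticks * (ns per tick) = ns; ns * 1e-6 = ms */
    return (double)delta_ticks * (double)timestamp_period_ns * 1e-6;
}
```

Subtracting before scaling is a small design choice: the original snippet scales each timestamp separately, which is fine for display but can drop low bits once the raw counters get large.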
This is maybe doable but Metal only supports very coarse timing queries. I'll flag this for 3.x in case anyone wants to investigate this after 3.2 is released.
There are also occlusion queries. They exist even in D3D9 and are incredibly important for games. They are usually exposed via the same API surface as timing queries.
I've been thinking about this more, and put together an API proposal for this. Looking for feedback before I start writing any code.
I propose adding a new queries type which supports timestamp, occlusion, and binary occlusion queries. Pipeline statistics are not included because support varies too much. Predicated rendering is also left out, since it is not widely supported on Android and does not exist in Metal. Support for the proposed query types is very high, and I think any device that already supports SDL GPU will support these queries.
Vulkan: Timestamp queries: https://vulkan.gpuinfo.org/displaydevicelimit.php?platform=android&name=timestampPeriod Occlusion queries: https://vulkan.gpuinfo.org/listfeaturescore10.php
There were some questions about Metal's timestamp support being too coarse, but I believe using this API will give the precise results we want (MoltenVK now uses it as well): https://developer.apple.com/documentation/metal/sampling-gpu-data-into-counter-sample-buffers
Occlusion queries have a begin / end pair, with draw calls placed between them. Binary queries simply return 1 when ANY pixel is visible for the measured draw calls, whereas the non-binary (precise) query counts the number of visible pixels. For timestamps, either begin or end can be called, since each simply records a timestamp. There is also an additional function which gets the GPU timestamp frequency. This is required to convert the timestamps into something usable.
There are two functions to retrieve the results. One function returns results to the CPU, which can optionally wait for the results to be ready. The other function directly copies the results to a gpu buffer, which can then be used for indirect rendering or as input to compute.
The backends will be very thin wrappers. The most complex detail is that, for the DX12 backend, SDL_GetGPUQueryResults will have to use a fence and staging buffer internally to keep track of result readiness, whereas Vulkan and Metal have APIs that do this automatically.
Here is the proposed API:
typedef struct SDL_GPUQueries SDL_GPUQueries;
typedef enum {
SDL_GPU_QUERY_TIMESTAMP,
SDL_GPU_QUERY_OCCLUSION,
SDL_GPU_QUERY_BINARY_OCCLUSION
} SDL_GPU_QueryType;
/* Create a query pool of some query type with some number of available queries */
SDL_GPUQueries* SDL_CreateGPUQueries(SDL_GPUDevice *device, SDL_GPU_QueryType type, UInt32 query_count);
/* Destroy a query pool */
void SDL_ReleaseGPUQueries(SDL_GPUQueries* queries);
/* Begins an occlusion query at a specific index, or records a timestamp */
void SDL_BeginGPUQuery(SDL_GPUCommandBuffer *command_buffer, SDL_GPUQueries* queries, UInt32 query_index);
/* Ends an occlusion query at a specific index, or records a timestamp */
void SDL_EndGPUQuery(SDL_GPUCommandBuffer *command_buffer, SDL_GPUQueries* queries, UInt32 query_index);
/* Retrieve query results. Put results (uint64 or uint32 for timestamp or occlusion respectively) into results. Optionally block and wait for results to finish. Returns true if results were written, or false if results were not ready */
Bool SDL_GetGPUQueryResults(SDL_GPUQueries* queries, UInt32 first_query, UInt32 count, void* results, Bool wait);
/* Copy query results directly to a GPU buffer */
void SDL_CopyGPUQueryResultsToBuffer(SDL_GPUCopyPass* copy_pass, SDL_GPUQueries* queries, UInt32 first_query, UInt32 count, SDL_GPUBuffer* dest, UInt32 dest_offset);
/* Get GPU timestamp frequency, needed to compute actual wall clock times from timestamps */
UInt64 SDL_GetGPUTimestampFrequency(SDL_GPUDevice* device);
This looks mostly right to me. I have a few notes:
- Rename SDL_GPUQueries to SDL_GPUQueryPool.
- I don't like having two ways to do the same thing, and the API should mirror how data retrieval works for other functions. I think the only way should be to call the copy function and then SDL_DownloadFromGPUBuffer, using a fence to know when the data is ready.
Ok makes sense. Here is the updated proposal:
typedef struct SDL_GPUQueryPool SDL_GPUQueryPool;
typedef enum {
SDL_GPU_QUERY_TIMESTAMP,
SDL_GPU_QUERY_OCCLUSION,
SDL_GPU_QUERY_BINARY_OCCLUSION
} SDL_GPU_QueryType;
/* Create a query pool of some query type with some number of available queries */
SDL_GPUQueryPool* SDL_CreateGPUQueryPool(SDL_GPUDevice *device, SDL_GPU_QueryType type, UInt32 query_count);
/* Destroy a query pool */
void SDL_ReleaseGPUQueryPool(SDL_GPUQueryPool* query_pool);
/* Begins an occlusion query at a specific index, or records a timestamp */
void SDL_BeginGPUQuery(SDL_GPUCommandBuffer *command_buffer, SDL_GPUQueryPool* query_pool, UInt32 query_index);
/* Ends an occlusion query at a specific index, or records a timestamp */
void SDL_EndGPUQuery(SDL_GPUCommandBuffer *command_buffer, SDL_GPUQueryPool* query_pool, UInt32 query_index);
/* Copy query results directly to a GPU buffer */
void SDL_CopyGPUQueryResultsToBuffer(SDL_GPUCopyPass* copy_pass, SDL_GPUQueryPool* query_pool, UInt32 first_query, UInt32 count, SDL_GPUBuffer* dest, UInt32 dest_offset);
/* Get GPU timestamp frequency, needed to compute actual wall clock times from timestamps */
UInt64 SDL_GetGPUTimestampFrequency(SDL_GPUDevice* device);
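As an illustration of how the frequency would be used, here is a minimal sketch of converting two downloaded timestamps into nanoseconds. The proposal does not state the unit of SDL_GetGPUTimestampFrequency, so this assumes ticks per second (as D3D12 reports it); `gpu_ticks_to_ns` is a hypothetical helper and not part of the proposed API.

```c
#include <stdint.h>

/* Hypothetical helper: converts the span between two timestamps to
 * nanoseconds, ASSUMING frequency_hz is ticks per second.
 * Returns 0 if the frequency is 0 (e.g. timestamps unsupported). */
static uint64_t gpu_ticks_to_ns(uint64_t begin_ticks, uint64_t end_ticks,
                                uint64_t frequency_hz)
{
    if (frequency_hz == 0) {
        return 0;
    }

    uint64_t delta_ticks = end_ticks - begin_ticks;

    /* Split into whole seconds and a remainder so the multiplication by
     * 1e9 cannot overflow for long spans. The remainder is always less
     * than frequency_hz, so this is safe for frequencies below ~18 GHz. */
    uint64_t whole_seconds = delta_ticks / frequency_hz;
    uint64_t remainder_ticks = delta_ticks % frequency_hz;

    return whole_seconds * 1000000000ull
         + (remainder_ticks * 1000000000ull) / frequency_hz;
}
```

With a 10 MHz counter, for example, a span of 25000 ticks would come out as 2500000 ns (2.5 ms).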
Looks good.
Made a fork to start working on this. Will make a PR once it's a little further along. Can follow the progress here: https://github.com/savant117/SDL/tree/queries
Got the basics set up, and will start on the DX12 backend next. If anyone wants to help speed it along, feel free to contribute! I could use some help with the Metal backend, since I don't have a Mac.
Maybe taking a look at how Tracy handles things would be good? They already implement profiling for Vulkan, DirectX, and OpenGL.
@savant117, this looks good, how's it going?
Today I was tinkering with counter sample buffers in Metal and threw together a very simple and dirty implementation for my debugging needs. Of course I’m not proposing the resulting “API” or the implementation itself, since I didn’t even try to align with Vulkan or DX, but some observations might still be useful. https://github.com/libsdl-org/SDL/compare/main...mr1name:SDL:metal-counter-samples
Even though Metal’s API formally documents a broad set of boundaries at which you can sample counters, if you run a test query on any Mac with an M-series chip, you’ll find that only atStageBoundary is actually supported.
https://developer.apple.com/documentation/metal/sampling-gpu-data-into-counter-sample-buffers
This is probably important to keep in mind, because it can potentially impose constraints on a generalized API. To sample via atStageBoundary, you have to specify the output buffer indices at compute/render pass initialization time.
e.g. For a render pass:
MTLRenderPassSampleBufferAttachmentDescriptor *attr = passDescriptor.sampleBufferAttachments[0];
attr.sampleBuffer = query.buffer->handle;
attr.startOfVertexSampleIndex = query.startTimeIndex;
attr.endOfVertexSampleIndex = MTLCounterDontSample;
attr.startOfFragmentSampleIndex = MTLCounterDontSample;
attr.endOfFragmentSampleIndex = query.endTimeIndex;
...
metalCommandBuffer->renderEncoder = [metalCommandBuffer->handle renderCommandEncoderWithDescriptor:passDescriptor];
e.g. For a compute pass:
MTLComputePassDescriptor *cpDesc = [MTLComputePassDescriptor computePassDescriptor];
MTLComputePassSampleBufferAttachmentDescriptor *attr = cpDesc.sampleBufferAttachments[0];
attr.sampleBuffer = query.buffer->handle;
attr.startOfEncoderSampleIndex = query.startTimeIndex;
attr.endOfEncoderSampleIndex = query.endTimeIndex;
...
metalCommandBuffer->computeEncoder = [metalCommandBuffer->handle computeCommandEncoderWithDescriptor:cpDesc];
So, for example, an API design with separate Begin/End calls, where the timestamp from End could be redirected into a different buffer slot, likely can’t be represented precisely in Metal. The workarounds would be to treat the Begin/End range as the sum of the samples written in between via atStageBoundary and specify the output pair already at Begin; to record (end_ts - begin_ts) immediately at the driver level and expose only a single value to the user; or to create some extra intermediate buffer.
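To make the first option concrete, here is a rough sketch of how a backend might fold per-pass stage-boundary samples into a single Begin/End duration. Everything here is hypothetical: `PassSample`, `SAMPLE_INVALID`, and `sum_pass_ticks` are illustrative names, and the all-ones sentinel is an assumption about how an unwritten or failed sample would be marked.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for a sample slot that was never written. */
#define SAMPLE_INVALID UINT64_MAX

/* Hypothetical record: one start/end pair sampled at the boundaries of a
 * single render or compute pass inside the Begin/End range. */
typedef struct PassSample {
    uint64_t start_ticks;
    uint64_t end_ticks;
} PassSample;

/* Sums the duration of every valid pass sample. Invalid or reversed
 * pairs are skipped rather than treated as errors. */
static uint64_t sum_pass_ticks(const PassSample *samples, size_t count)
{
    uint64_t total_ticks = 0;

    for (size_t i = 0; i < count; i++) {
        const PassSample *sample = &samples[i];

        if (sample->start_ticks == SAMPLE_INVALID ||
            sample->end_ticks == SAMPLE_INVALID) {
            continue; /* slot was not sampled */
        }
        if (sample->end_ticks < sample->start_ticks) {
            continue; /* ignore out-of-order pairs */
        }
        total_ticks += sample->end_ticks - sample->start_ticks;
    }
    return total_ticks;
}
```

Note that a sum like this only counts time spent inside passes, so any idle gaps between passes would be excluded. That differs from what a plain end-minus-begin timestamp pair would report on Vulkan or DX12, which is part of why the semantics would need to be pinned down.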
In any case, it would still be nice to have at least some kind of API for these purposes, even if it ends up being quite limited in flexibility compared to any of the backends.