Add a test for many timestamp query sets
This test exists because Metal has a limit of 32 timestamp query sets. Implementations are supposed to work around this limit by allocating larger Metal sets and having each WebGPU set be a subset of one of those larger sets.
This is especially important as the limit is 32 per process, so a few pages making a few queries would easily hit the limit and prevent pages from running.
In issue 5261, 65k queries were mentioned, so this test creates 65536 (not sure if that's the same 65k Kai was referring to).
I think that maybe the issue should be reopened and the spec should either state the minimum or defer it to the CTS. Except for a few explicit OOM tests, we don't allow implementations to return out-of-memory for all buffers and textures and still claim to pass the CTS. I think we should do the same here and require implementations to pass this test too. Is 65536 good for this?
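As a rough illustration of the scenario described above, the following sketch keeps more timestamp query sets alive at once than Metal's limit of 32. It is not the test in this PR: `createManyTimestampQuerySets`, `DeviceLike`, and `QuerySetLike` are hypothetical names, and how the total is split between number of sets and queries per set is an assumption. Only `createQuerySet({ type: 'timestamp', count })` mirrors the real WebGPU API.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
// They mirror only the parts of GPUDevice / GPUQuerySet used here.
interface QuerySetLike {
  destroy(): void;
}
interface DeviceLike {
  createQuerySet(descriptor: { type: 'timestamp'; count: number }): QuerySetLike;
}

// Hypothetical helper: create `numSets` timestamp query sets of `countPerSet`
// queries each and keep them all alive at the same time.
function createManyTimestampQuerySets(
  device: DeviceLike,
  numSets: number,
  countPerSet: number
): QuerySetLike[] {
  const querySets: QuerySetLike[] = [];
  for (let i = 0; i < numSets; ++i) {
    querySets.push(device.createQuerySet({ type: 'timestamp', count: countPerSet }));
  }
  return querySets;
}
```

A real `GPUDevice` with the `'timestamp-query'` feature enabled satisfies `DeviceLike` structurally, so something like `createManyTimestampQuerySets(device, 64, 2)` would already exceed 32 live sets.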
Requirements for PR author:
- [X] All missing test coverage is tracked with "TODO" or `.unimplemented()`.
- [X] New helpers are `/** documented */` and new helper files are found in `helper_index.txt`.
- [X] Test behaves as expected in a WebGPU implementation. (If not passing, explain above.)
- [X] Tests have been tested with compatibility mode validation enabled and behave as expected. (If not passing, explain above.)
Requirements for reviewer sign-off:
- [ ] Tests are properly located in the test tree.
- [ ] Test descriptions allow a reader to "read only the test plans and evaluate coverage completeness", and accurately reflect the test code.
- [ ] Tests provide complete coverage (including validation control cases). Missing coverage MUST be covered by TODOs.
- [ ] Helpers and types promote readability and maintainability.
When landing this PR, be sure to make any necessary issue status updates.
> In issue 5261, 65k queries were mentioned, so this test creates 65536 (not sure if that's the same 65k Kai was referring to).
That was based on Mike's comment:
> With a global limit of 32 sample buffers, 65,536 timestamp queries should be able to be in flight before we run into this limitation.
It may be good to (also) test more, while knowing that some implementations may choose not to deal with that, because it's probably vanishingly unlikely to happen in practice.
> It may be good to (also) test more, while knowing that some implementations may choose not to deal with that, because it's probably vanishingly unlikely to happen in practice.
I'm unsure whether it's possible to deal with that in a single command buffer (like this test). Maybe the implementation could split the command buffer.
If it were across multiple command buffers, it would be a test of the implementation's ability to clean up and reuse past slots, I think.
Is an implementation required to split command buffers?
It seems like it should allocate MTLCounterSampleBuffers of 32k (the max size). When the user asks for, say, a GPUQuerySet with `count: N, type: 'timestamp'`, it uses N slots of some 32k MTLCounterSampleBuffer. So the max N is 4k slots. Since you can only use one GPUQuerySet per pass, maybe the test should stop at 4k per pass.
I'm less concerned about an app using 4k slots and more concerned about the case of a few pages each using just a few slots: if you don't virtualize the slots, you run out of MTLCounterSampleBuffers, since you can only have 32 of them.
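To make the virtualization idea concrete, here is a minimal sketch of a slot sub-allocator. It assumes the numbers from this comment (at most 32 backing buffers, 4k slots each) and uses simple bump allocation with no freeing or reuse. `TimestampSlotAllocator` and `SlotRange` are hypothetical names, not taken from any implementation.

```typescript
// Assumed numbers, taken from the discussion: at most 32 Metal sample
// buffers per process, 4096 (4k) timestamp slots in each.
const MAX_BACKING_BUFFERS = 32;
const SLOTS_PER_BUFFER = 4096;

interface SlotRange {
  bufferIndex: number; // which backing buffer holds the range
  firstSlot: number; // offset of the first slot within that buffer
  count: number; // number of contiguous slots
}

class TimestampSlotAllocator {
  // Next unused slot in each backing buffer created so far.
  private nextFree: number[] = [];

  // Returns a contiguous range for a query set of `count` slots, or
  // undefined when the request cannot be satisfied.
  allocate(count: number): SlotRange | undefined {
    if (!Number.isInteger(count) || count < 1 || count > SLOTS_PER_BUFFER) {
      return undefined;
    }
    // First fit among the backing buffers that already exist.
    for (let i = 0; i < this.nextFree.length; ++i) {
      const used = this.nextFree[i]!;
      if (used + count <= SLOTS_PER_BUFFER) {
        this.nextFree[i] = used + count;
        return { bufferIndex: i, firstSlot: used, count };
      }
    }
    // Otherwise start a new backing buffer, if the process-wide limit allows.
    if (this.nextFree.length >= MAX_BACKING_BUFFERS) {
      return undefined;
    }
    this.nextFree.push(count);
    return { bufferIndex: this.nextFree.length - 1, firstSlot: 0, count };
  }
}
```

With this layout, many small query sets from different pages share one backing buffer instead of each consuming one of the 32. A real implementation would also need to return slots when a GPUQuerySet is destroyed, which this sketch omits.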
> Is an implementation required to split command buffers?

Implementations are not required to split command buffers; on Apple Silicon Macs and iOS devices, WebKit never splits command buffers.
I'm not sure if splitting command buffers will resolve it, as we don't know when or how often a site will call resolveQuerySet. Though perhaps if you run out of the 65k counter sample slots, then a UA could implicitly resolve the counter sample buffer.
Of course if the same sample buffer was used in a later pass, then its contents would need to be repopulated. But the MTLCounterSampleBuffer doesn't allow for writing, only reading. So during resolveQuerySet, the UA would need to track which slots were written to and restore previously evicted values.
Certainly seems possible but even more effort for a scenario which is unlikely to occur in practice.
I wonder if we can have a test that uses more than 65536 timestamp queries in a single command buffer and just tests that it doesn't crash or lose the device? Timestamp queries are kind of best-effort anyway, so I'm sure it's fine if they get bad results in this corner case. Maybe implementations could start skipping if there are too many timestamp queries (or do something like aliasing all timestamp queries past the 65535th to the same slot).
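The aliasing fallback suggested here could look like the sketch below. `physicalSlotFor` is a hypothetical name and the capacity of 65536 is the figure from this thread, not a value from any spec. Queries are numbered from 0, so the first 65535 queries keep their own slot and everything after them shares the last one.

```typescript
// Assumed total number of physical timestamp slots (figure from the thread).
const TOTAL_SLOTS = 65536;

// Maps the n-th timestamp query in a command buffer (0-based) to a physical
// slot. Queries beyond capacity all alias to the final slot, so their results
// are meaningless but nothing runs out of bounds.
function physicalSlotFor(queryOrdinal: number, totalSlots: number = TOTAL_SLOTS): number {
  return Math.min(queryOrdinal, totalSlots - 1);
}
```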
I've been assuming that a single GPUQuerySet would be implemented as some contiguous subset of a single MTLCounterSampleBuffer. With that implementation, the max count you can ask for in a single GPUQuerySet for timestamp queries is 4k slots.
With that implementation there are no splitting issues or resolve issues AFAICT.