[hal metal] ray tracing acceleration structures
Connections
Fixes #7402

Description
Implements the missing ray tracing acceleration structures in the HAL Metal backend.
Testing
The examples `ray_scene`, `ray_shadows`, `ray_cube_compute`, `ray_cube_fragment`, and `ray_traced_triangle` all work.
That is, when invoked via `cargo run --bin wgpu-examples ray_traced_triangle`, but not via `cargo xtask test ray_traced_triangle`. Still, the current CI runner is too old to catch that, as it does not support hardware ray tracing.
Squash or Rebase? Squash
Checklist
- [x] Run `cargo fmt`.
- [x] Run `taplo format`.
- [x] Run `cargo clippy --tests`.
- [x] Run `cargo xtask test` to run tests.
- [x] If this contains user-facing changes, add a `CHANGELOG.md` entry.
Glad there didn't need to be any wgpu-core changes
Almost. I had to remove the `Option<>` around the buffers and always pass the dummy zero buffer when computing the size of the acceleration structures and their scratch buffers, because Metal does not like `nil`.
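The pattern can be sketched like this. This is a toy model, not wgpu's actual code: `Buffer`, `DUMMY`, and `resolve` are hypothetical stand-ins for the idea of substituting a shared dummy buffer wherever an optional buffer would otherwise be `nil`.

```rust
// Hypothetical stand-in for a GPU buffer handle.
#[derive(Debug, PartialEq)]
struct Buffer(&'static str);

// A shared dummy buffer used wherever Metal would otherwise see nil.
static DUMMY: Buffer = Buffer("dummy-zero-buffer");

/// Resolve an optional buffer to something concrete, since the size
/// queries reject nil buffers.
fn resolve<'a>(maybe: Option<&'a Buffer>) -> &'a Buffer {
    maybe.unwrap_or(&DUMMY)
}

fn main() {
    let real = Buffer("vertex-data");
    // A present buffer passes through unchanged.
    assert_eq!(resolve(Some(&real)).0, "vertex-data");
    // An absent buffer falls back to the dummy instead of nil.
    assert_eq!(resolve(None).0, "dummy-zero-buffer");
    println!("resolved: {:?}", resolve(None));
}
```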
I can split those first four commits into a separate PR if that helps with the review.
I just remembered that structures have minimum versions, and it seems `MTLIndirectAccelerationStructureInstanceDescriptor` requires macOS 14.0+ (probably should have checked that earlier...).
Bumped the min required version even further up.
I also managed to reduce the issue with the acceleration structure not intersecting any rays to a perfect reproducer, and it is wild:
See the last commit "Bug reproducer", which modifies the ray_cube_fragment example to generate two BLASes: One with 152 triangles and one with 153 triangles.
With Metal on macOS, the instances of the BLAS with 152 triangles (16344 bytes `acceleration_structure_size`) work as expected, but the ones with 153 triangles (16472 bytes `acceleration_structure_size`) suddenly stop intersecting rays after roughly 1.5 seconds, no matter how many frames were rendered until then. 0x4000 = 2^14 = 16384 might be some special boundary being crossed. It also keeps happening even if I stop calling `build_acceleration_structure()` after the initial setup. Using `MTLAccelerationStructureInstanceDescriptor` or `MTLIndirectAccelerationStructureInstanceDescriptor` is also irrelevant, as is calling `encoder.use_resource_at(blas.as_native(), use_info.uses, use_info.stages)` or not.
This also breaks Vulkan on Linux with a SIGSEGV upon Queue::submit: https://github.com/gfx-rs/wgpu/actions/runs/14820911901/job/41607697292?pr=7660
Using an example from metal-rs without wgpu does not reproduce this bug. It seems we are either lacking some validation step or are doing something wrong with our handling of acceleration structures in general.
@Vecvec: What testing hardware do you have available? Can you maybe see why Vulkan is failing this too?
I've got a couple of machines with ray tracing support (plus llvmpipe, which I will also be testing on). I'll have a look and see if I can get any ideas about what the issue might be.
Hits a divide by zero on the Microsoft Basic Render Driver (though it doesn't seem to be related to the memory used, and only on one of my computers). Can't get it to fail on the real GPUs yet. Was able to reproduce the llvmpipe segfault (edit: don't think it's the same problem as the one here); will continue testing.
divide by zero
Might be that it tries to normalize a zero-length vector. The modified example simply duplicates triangles, so that could cause some vectors to become zero.
I narrowed the Metal issue down further, and it is indeed caused by `AccelerationStructureBuildSizes::acceleration_structure_size` being greater than or equal to 0x4000. For example, if I modify only `device.new_acceleration_structure_with_size(descriptor.size.max(0x4000))` in `Device::create_acceleration_structure()` (which is the latest point and makes sure that it is only related to the Metal backend), then all BLAS instances first work fine but disappear after 1.5 seconds. Reading the Metal docs, it appears that 16384 (0x4000) is indeed used as an API limit for other things like the mesh shader output buffer. So maybe there is a bug in the Metal driver, because I cannot imagine that the limit for acceleration structure sizes is supposed to be so low.
Edit: Officially the limits are way higher, see https://developer.apple.com/documentation/metal/mtlaccelerationstructureusage/extendedlimits.
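The boundary arithmetic from the reproducer can be written down directly. This is a trivial illustrative check, not wgpu code; the helper name is made up, and the real sizes come from Metal's build-size query:

```rust
// The suspected misbehavior threshold observed above: 0x4000 = 2^14 = 16384 bytes.
const SUSPECTED_BOUNDARY: u64 = 0x4000;

/// True if an acceleration structure size falls into the range observed
/// to stop intersecting rays (>= 16384 bytes).
fn crosses_suspected_boundary(acceleration_structure_size: u64) -> bool {
    acceleration_structure_size >= SUSPECTED_BOUNDARY
}

fn main() {
    // 152 triangles -> 16344 bytes: observed to work.
    assert!(!crosses_suspected_boundary(16344));
    // 153 triangles -> 16472 bytes: observed to break after ~1.5 seconds.
    assert!(crosses_suspected_boundary(16472));
    println!("suspected boundary at {} bytes", SUSPECTED_BOUNDARY);
}
```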
Most other resources are created with an auto release pool around them, is it possible that that is fixing this issue somehow?
Added one in `Device::create_acceleration_structure()`, but unfortunately that was not it either. There must be some other condition to trigger it, because the metal-rs examples don't, and the wgpu examples only do when called via `cargo xtask test`.
I would say we try to land this PR and then open an issue for it to solve that separately.
BTW, I noticed the CI runner's "Test Mac aarch64" job is not failing. Probably the test runner is too old to support hardware ray tracing and skips the relevant tests.
Added one in `Device::create_acceleration_structure()` but unfortunately that was not it either
That's annoying, I wonder what it could be
I would say we try to land this PR and then open an issue for it to solve that separately.
Yes, though it could be some time before it lands.
I noticed the CI runner "Test Mac aarch64" job is not failing. Probably the test runner is too old to support hardware raytracing and skips the relevant tests
I checked and it does skip.
Although setting TLAS dependencies and only stating that we are using the BLASes contained in them is good for optimization, the current implementation will not work, because encoders can be submitted in a different order than the one they were recorded in. This means that if we record build 1 on encoder 1 and build 2 on encoder 2, but encoder 2 is submitted before encoder 1, the TLAS would still have the BLASes from build 2 as its dependencies while actually using the BLASes from build 1.
I think it might be best if the program just claims to Metal that we are using every BLAS that exists. There are some other possible options, but I think they are too complex for an initial implementation. I can't think of many sensible reasons why people would be keeping large numbers of BLASes around that aren't currently being used anyway.
the current implementation will not work due to it being possible to submit the encoders in a different order to the order they were recorded in
Instead of calling it inside `command_encoder_build_acceleration_structures()`, we could call `DynAccelerationStructure::set_dependencies()` in `Queue::submit()`. There is only one queue per device in wgpu, right? So shouldn't that solve it too? Edit: Seems quite fiddly to wire it all the way through the command encoder and command buffer into the queue.
why people would be using large numbers of BLASes that aren't currently being used
Maybe you have every model in many LOD levels (to avoid high-frequency noise in the distance) or do asset streaming? No idea either; hardware ray tracing is still somewhat new, and I haven't seen that much code around it yet.
Instead of calling it inside command_encoder_build_acceleration_structures() we could call DynAccelerationStructure::set_dependencies() in Queue::submit().
I think there would still be issues where you encode the build after a use of the TLAS because you can't edit encoders after encoding them.
Maybe you have every model in many LOD levels (to avoid high frequency noise in the distance) or do asset streaming?
Yes, I hadn't thought of that. How expensive is the `use_resources` call? If it's cheap, it might still be worthwhile to just call it on all BLASes anyway.
Actually, I've found something called a `MTLResidencySet`, which seems like it could be used. I need to investigate it further, but it seems you could keep one per command buffer and add all indirectly used BLASes to it. When submitted, it could be committed, and when the encoder is reset, it would get cleared. It's very new though, which is inconvenient.
I've found something called a MTLResidencySet
Interesting.
one per command buffer and add all indirectly used BLASes to it
That is essentially where we are right now with the dependency tracking: we add all indirectly used BLASes to the command buffer via `use_resource()`.
Metal attaches all of a command queue’s residency sets to a command buffer from that queue when you call the command buffer’s commit() method.
I think I can simplify your counterexample further: imagine we build the same TLAS in two different command buffers, but we never submit (and thus discard) the second, and then use that TLAS later in a render pass. The actions in the second build of that TLAS should have no effect.
This might already be wrong in other aspects unrelated to this PR, like the validation layer and how it sees the dependencies.
Yep, this has been a pain for me. I've previously reworked a lot of the validation due to this problem. It's possible there is more, but I've been working on fixing this.
That is essentially where we are right now with the dependency tracking. We add all indirectly used BLASes to the command buffer via use_resource().
Except `MTLResidencySet`s can be edited after a command buffer is finished and still make the resources resident before that same command buffer is submitted. The documentation of `use_resources` implies that the resources are only guaranteed to be resident after the command buffer has reached that point.
About cargo run --bin wgpu-examples ray_traced_triangle working, but cargo xtask test ray_traced_triangle not:
I think I found a bug in the Metal driver: acceleration structures don't work in headless mode. That is, if I attach a window to the test process (it does not even have to have its surface linked to wgpu, nor does the window have to be presented / visible in the compositor), the tests suddenly succeed!
Acceleration structures don't work in headless mode
That's an odd driver bug, I wonder how it's caused...
On another, completely unrelated, note, I've been looking at the possible ways to make the BLASes resident. I think there are 3 possible options based on the Metal docs (which feel like they are very out of date):

1. Associate BLASes with their TLAS in the build command, via the `instancedAccelerationStructures` field (though still using `MTLAccelerationStructureInstanceDescriptorType::indirect`).
   - Pros: Keeps most stuff the same.
   - Cons: Unsure if this is allowed. Can't find anything stating otherwise, but this quote suggests maybe not: "Each instance in the instance descriptor buffer has an index into this array."
2. Allocate all acceleration structures from a giant heap.
   - Terrible idea, should only be used if all else fails.
3. Put all indirectly used BLASes into a `MTLResidencySet` kept in the command buffer: add all used BLASes to it just before submit, then submit the command buffer.
   - Pros: Should work.
   - Cons: Requires the latest macOS version (though if whatever this bug is cannot be worked around, it will probably require the latest version anyway).
FWIW, I've never used Metal, so I'm guessing based on docs alone and have probably missed some cool trick that all other implementations use.
What's the current status of this PR? @Vecvec, what would the next steps be to be able to land this?
It's blocked on https://github.com/gfx-rs/metal-rs/pull/361. There also seems to be a driver issue that makes acceleration structures work only when a window is present. @Lichtso would probably be able to give more details.
Edit: there is also a need to keep acceleration structures resident (I've listed potential solutions in https://github.com/gfx-rs/wgpu/pull/7660#issuecomment-2885596743)
Edit 2: The potential solutions were only the ones I found in Metal's docs; I might look into how Metal does its DXR/VKRay conversions.
Re: residency - could you call useResource?
Alright, I'll get that metal PR landed
there is also a need to keep acceleration structures resident (I've listed potential solutions in https://github.com/gfx-rs/wgpu/pull/7660#issuecomment-2885596743)
MTLResidencySet would also have to be exposed in metal-rs first. But I haven't even tried it yet.
The potential solutions were only the ones I found on metal's docs, I might look into how metal does its DXR/VKRay conversions.
MoltenVK has not implemented ray tracing either (see https://github.com/KhronosGroup/MoltenVK/issues/427 and https://github.com/KhronosGroup/MoltenVK/issues/1956). Or were you thinking about another translation layer / project?
Re: residency - could you call useResource?
I think I mentioned this earlier, but it must be in some review comment. We can't, due to allowing out-of-order BLAS builds. Basically:
- Record build in encoder 1 with blas 1 in tlas 1.
- Use tlas 1 in encoder 2.
- Record build in encoder 3 with blas 2 in tlas 1.
- Queue submit with encoder 3 then encoder 2.
Blas 1 would be resident while blas 2 wouldn't be, but blas 2 would need to be resident.
Edit: https://github.com/gfx-rs/wgpu/pull/7660#issuecomment-2874725365
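The sequence above can be modeled as a toy simulation. None of this is wgpu's actual data model; the names (`blas1`, `blas2`, the helper functions) are illustrative. The point is that residency is snapshotted when encoder 2 is recorded, while the BLAS the TLAS actually references is decided by the submit order:

```rust
use std::collections::HashSet;

/// BLASes that a use_resource-style call would mark resident when
/// encoder 2 is *recorded*: at that point tlas 1's dependency is
/// blas 1 (set by the build recorded on encoder 1).
fn resident_at_record_time() -> HashSet<&'static str> {
    ["blas1"].into_iter().collect()
}

/// BLASes tlas 1 actually references when encoder 2 *executes*:
/// encoder 3 (submitted first) rebuilt tlas 1 from blas 2.
fn needed_at_execute_time() -> HashSet<&'static str> {
    ["blas2"].into_iter().collect()
}

/// BLASes needed on the GPU timeline that were never made resident
/// by the record-time tracking.
fn missing_residency() -> HashSet<&'static str> {
    needed_at_execute_time()
        .difference(&resident_at_record_time())
        .copied()
        .collect()
}

fn main() {
    // blas 2 is needed but was never made resident: the failure mode above.
    assert!(missing_residency().contains("blas2"));
    println!("missing residency for: {:?}", missing_residency());
}
```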
MoltenVK has not implemented ray tracing either: (see https://github.com/KhronosGroup/MoltenVK/issues/427 and https://github.com/KhronosGroup/MoltenVK/issues/1956). Or were you thinking about another translation layer / project
I was thinking of the Game Porting Toolkit, ~~but maybe it doesn't support ray tracing either~~.
Edit: at least the shader converter supports this, and I would assume Apple would support all of it. https://developer.apple.com/metal/shader-converter/#changelog
| Version | Changes | Requirements |
| -- | -- | -- |
| 2 | Support for shader debug information, globally-coherent memory access, and SV_CullPrimitive. | Globally-coherent memory access requires targeting macOS 15, iOS 18, or later. |
| 1.1 | Support for ray tracing shaders. | Metal ray tracing support. |
| 1 | Initial release. | Argument buffers tier 2 support. |
@Lichtso did you file a bug with apple for acceleration structures not working w/o a window? It would be good to keep an eye on it in this PR (or when this PR lands, in an issue).
No, I haven't yet. I would have to create a minimized reproducer first and write it in Swift. Also, creating the window makes the difference, but it could be a second-order effect like timing. E.g., creating the window yields to the kernel and the process is resumed later than if it didn't, things like that.
I was thinking of Game Porting Tool Kit
Ah, you mean the D3DMetal.framework, but the source code for that is not public; binary distribution only.
I'd assumed that Apple might provide some way of showing what each call translates to, so that developers could port their own games and wouldn't have to constantly rely on a translation layer. I guess that doesn't exist.
@Lichtso, are you able to use a debugger on the tests? (It looks to be possible, at least under `cargo test`.) If so, could you see what acceleration structure sizes we are getting (in case something in Metal is failing), whether they are different, and also look at what the acceleration structure pointer is; it is just possible that it is running into something similar to gfx-rs/metal#284. If all of those seem fine, could you try looking at the acceleration structures in the Xcode acceleration structure inspector?