force_loop_bounding causes a massive performance regression
Description
Some of my compute pipelines regressed from 0.1 ms to 10 ms after this change (going from wgpu 24 to 25).
This makes my game unplayable.
Well, I can disable this feature, so it's not a blocker for me. I'm just saying that having this as the default behavior is not a good idea.
Repro steps
I have no idea how to properly repro this in an isolated manner (it takes too much time and I don't have a good repro setup), but here is the pipeline:
```wgsl
@compute @workgroup_size(8, 8, 1)
fn update_temperature(@builtin(global_invocation_id) invocation_id: vec3<u32>) {
    let pos = vec2<i32>(invocation_id.xy);
    let size = get_image_size(debug_image);
    if (!is_inside(pos, size)) {
        return;
    }

    var sand = get_sand_with_temperature(pos, size);
    var temperature = sand.temperature;

    if (sand.is_on_fire) {
        temperature = read_physics_rules(FIRE_SAND).temperature;
    } else {
        if (!is_empty(sand)) {
            let thermal_conductivity = read_physics_rules(sand.id).heat_conductivity;
            let conduction_rate = clamp(thermal_conductivity, 0.0, 1.0);
            var diffusion_rate = 0.9 * conduction_rate;
            let current_temp = sand.temperature;
            if (current_temp < 0.0) {
                diffusion_rate *= 0.05;
            }

            var neighbors = get_neighbors(pos, size);
            var count = 0.0;
            var sum = 0.0;
            for (var i = 0; i < 8; i++) {
                if (!is_empty(neighbors[i])) {
                    sum += neighbors[i].temperature;
                    count += 1.0;
                }
            }

            if (count > 0.0) {
                let delta = sum / count - current_temp;
                let new_temperature = current_temp + diffusion_rate * delta;
                temperature = clamp(new_temperature, MIN_TEMPERATURE, MAX_TEMPERATURE);
            }
        } else {
            temperature = ROOM_TEMPERATURE;
        }
    }

    // Assuming this may be needed, because we want threads to finish "reading" before writing.
    storageBarrier();
    write_sand_temperature(pos, size, temperature);
}
```
Commenting out the loop (with force_loop_bounding enabled) brings the timings back to normal, and disabling the checks fixes it as well. Also, just commenting out the temperature = ... assignment does the same, which suggests the compiler ignores the loop entirely if the variable isn't written. In any case, this seems clearly related to this flag and #6929.
I have plenty of loops, so I am unsure why only two out of my ~50 pipelines had this issue.
Expected vs observed behavior
This performance regression is so large that a production-ready library shouldn't make risky changes like this the default, or there should at least be some way to detect such perf regressions. Alternatively, default to the old behavior rather than the new one.

get_neighbors returns a fixed-size array and the loop is clearly bounded, so I'm not sure what force_loop_bounding is supposed to accomplish for code like this.

Platform
Windows 11, GeForce RTX 3080 Ti Laptop GPU, v1.85.0-nightly, Vulkan backend
Thanks for the report! It's known that force_loop_bounding can cause severe perf regressions, but what you're describing is particularly bad.
As described in the docs, it's necessary for running untrusted code. Out of the box, wgpu is supposed to behave like a WebGPU implementation (wgpu-core is used for Firefox's WebGPU implementation), and as such it has to uphold strict safety requirements. Personally, I therefore think it's the right default, since it gives you both the performance characteristics you'd see running against WebGPU and safety closer to what a user of safe Rust code would expect. But it's certainly bad and unexpected if you don't care about any of that (and there are many good reasons not to care). Maybe we should introduce a more explicit and easy-to-spot way at init time to let the user pick up front what kind of characteristics they're after 🤔: staying as close to WebGPU as possible versus allowing some unsafety in exchange for performance.
Btw, I really mean it when I say it's "unsafe" as in Rust-unsafe to run without force_loop_bounding: some shader compilers treat infinite loops as undefined behavior, so they assume that any state leading to one cannot happen, and may therefore omit bounds checks, which can then lead to out-of-bounds reads/writes -> unsafe in the Rust sense of unsafe. 😵
All that said, it's really curious how bad it gets. @jimblandy do you have additional insights there? Do we foresee mitigating this better? ... also I thought this was mostly an MSL concern, but I'm probably not quite up to date?
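For reference, the per-module escape hatch looks roughly like the sketch below. I'm quoting the ShaderRuntimeChecks / create_shader_module_trusted names from the wgpu 25 docs as I remember them, so please double-check the exact fields and signatures for your version; the helper function itself is just illustrative. The important part is that opting out is an explicit unsafe step.

```rust
use wgpu::{Device, ShaderModule, ShaderModuleDescriptor, ShaderRuntimeChecks, ShaderSource};

/// Compile a shader module with naga's runtime checks relaxed.
///
/// SAFETY: only do this for shaders you wrote and audited yourself. With the
/// checks disabled, an out-of-bounds access or a non-terminating loop in the
/// shader is undefined behavior, exactly as described above.
fn create_trusted_module(device: &Device, wgsl: &str) -> ShaderModule {
    let desc = ShaderModuleDescriptor {
        label: Some("update_temperature"),
        source: ShaderSource::Wgsl(wgsl.into()),
    };
    // `ShaderRuntimeChecks::unchecked()` turns off both the index bounds checks
    // and the loop bounding. If you only want to drop the loop bounding, the
    // struct also has individual `bounds_checks` / `force_loop_bounding` fields
    // (names assumed, verify against the docs for your wgpu version).
    unsafe { device.create_shader_module_trusted(desc, ShaderRuntimeChecks::unchecked()) }
}
```

If your engine already exposes a toggle for this (as it apparently does here), that toggle presumably wraps the same mechanism under the hood.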
Thanks for your reply. I'm glad there's a toggle, so nothing is blocking me, and the update to 25 was fine otherwise. The default behavior just gave me a scare.
What's weird is that this pipeline isn't that different from many of my others, and I've got loops elsewhere that don't cause any problem. But two out of some ~100 pipelines hit this issue and start hogging my GPU. There could be more spots; not all of them run all the time.
I use your awesome wgpu-profiler to track the performance of my pipelines.
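As an aside for anyone who wants to catch regressions like this without an extra dependency: such profilers are built on wgpu's timestamp queries, which you can also use directly for coarse per-pass timing. The following is a rough, untested sketch; the type and method names are from recent wgpu as I recall them (with comments where they differ across versions), and the timed_dispatch helper is purely illustrative.

```rust
// Requires wgpu::Features::TIMESTAMP_QUERY when requesting the device.
fn timed_dispatch(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    pipeline: &wgpu::ComputePipeline,
    bind_group: &wgpu::BindGroup,
    workgroups: [u32; 3],
) {
    // Two timestamps: start and end of the compute pass.
    let query_set = device.create_query_set(&wgpu::QuerySetDescriptor {
        label: Some("pass timing"),
        ty: wgpu::QueryType::Timestamp,
        count: 2,
    });
    let resolve_buf = device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("timestamp resolve"),
        size: 16, // 2 x u64 GPU ticks
        usage: wgpu::BufferUsages::QUERY_RESOLVE | wgpu::BufferUsages::COPY_SRC,
        mapped_at_creation: false,
    });
    let readback_buf = device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("timestamp readback"),
        size: 16,
        usage: wgpu::BufferUsages::COPY_DST | wgpu::BufferUsages::MAP_READ,
        mapped_at_creation: false,
    });

    let mut encoder = device.create_command_encoder(&Default::default());
    {
        let mut pass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor {
            label: Some("update_temperature"),
            // Named `ComputePassTimestampWrites` in older wgpu versions.
            timestamp_writes: Some(wgpu::PassTimestampWrites {
                query_set: &query_set,
                beginning_of_pass_write_index: Some(0),
                end_of_pass_write_index: Some(1),
            }),
        });
        pass.set_pipeline(pipeline);
        pass.set_bind_group(0, bind_group, &[]);
        pass.dispatch_workgroups(workgroups[0], workgroups[1], workgroups[2]);
    }
    encoder.resolve_query_set(&query_set, 0..2, &resolve_buf, 0);
    encoder.copy_buffer_to_buffer(&resolve_buf, 0, &readback_buf, 0, 16);
    queue.submit([encoder.finish()]);

    // Block until the GPU is done, then read the two tick values back.
    // (The poll argument is `wgpu::Maintain::Wait` on wgpu versions before 25.)
    readback_buf.slice(..).map_async(wgpu::MapMode::Read, |_| ());
    let _ = device.poll(wgpu::PollType::Wait);
    let data = readback_buf.slice(..).get_mapped_range();
    let t0 = u64::from_le_bytes(data[0..8].try_into().unwrap());
    let t1 = u64::from_le_bytes(data[8..16].try_into().unwrap());
    // get_timestamp_period() gives nanoseconds per tick.
    let ms = (t1 - t0) as f64 * queue.get_timestamp_period() as f64 * 1e-6;
    println!("update_temperature pass: {ms:.3} ms");
}
```

In practice you would reuse the query set and buffers across frames rather than recreating them per dispatch; this version just keeps the sketch self-contained.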