burn
Unit tests are failing on Mac
It appears ever since we disabled Mac CI, we have accumulated issues on Mac.
See the test errors on Mac:
uname -a
Darwin Mac.attlocal.net 24.6.0 Darwin Kernel Version 24.6.0:
Mon Jul 14 11:30:55 PDT 2025; root:xnu-11417.140.69~1/RELEASE_ARM64_T6031 arm64
tch-errors.txt ndarray-errors.txt wgpu-errors.txt
Metal errors when I run:
[burn-wgpu]% cargo test --release --features metal
CC @laggui
I was running cargo run-checks on my M2 Mac, and also got some floating-point mismatches. This was on main.
failures:
---- tests::autodiff::ad_div::tests::test_div_complex_2 stdout ----
thread 'tests::autodiff::ad_div::tests::test_div_complex_2' panicked at crates/burn-ndarray/src/lib.rs:41:5:
Tensors are not approx eq:
=> Position 0: 0.08251953 != 0.08333334
diff (rel = +9.77e-3, abs = +8.14e-4), tol (rel = +5.00e-3, abs = +1.00e-5)
=> Position 2: -0.056803405 != -0.05555558
diff (rel = +2.20e-2, abs = +1.25e-3), tol (rel = +5.00e-3, abs = +1.00e-5)
=> Position 3: -0.06800783 != -0.06714284
diff (rel = +1.27e-2, abs = +8.65e-4), tol (rel = +5.00e-3, abs = +1.00e-5)
---- tests::autodiff::ad_log_sigmoid::tests::should_diff_log_sigmoid stdout ----
thread 'tests::autodiff::ad_log_sigmoid::tests::should_diff_log_sigmoid' panicked at crates/burn-ndarray/src/lib.rs:41:5:
Tensors are not approx eq:
=> Position 3: 0.001953125 != 0
diff (rel = +1.00e0, abs = +1.95e-3), tol (rel = +5.00e-3, abs = +1.00e-5)
---- tests::autodiff_checkpointing::ad_div::tests::test_div_complex_2 stdout ----
thread 'tests::autodiff_checkpointing::ad_div::tests::test_div_complex_2' panicked at crates/burn-ndarray/src/lib.rs:41:5:
Tensors are not approx eq:
=> Position 0: 0.08251953 != 0.08333334
diff (rel = +9.77e-3, abs = +8.14e-4), tol (rel = +5.00e-3, abs = +1.00e-5)
=> Position 2: -0.056803405 != -0.05555558
diff (rel = +2.20e-2, abs = +1.25e-3), tol (rel = +5.00e-3, abs = +1.00e-5)
=> Position 3: -0.06800783 != -0.06714284
diff (rel = +1.27e-2, abs = +8.65e-4), tol (rel = +5.00e-3, abs = +1.00e-5)
---- tests::autodiff_checkpointing::ad_log_sigmoid::tests::should_diff_log_sigmoid stdout ----
thread 'tests::autodiff_checkpointing::ad_log_sigmoid::tests::should_diff_log_sigmoid' panicked at crates/burn-ndarray/src/lib.rs:41:5:
Tensors are not approx eq:
=> Position 3: 0.001953125 != 0
diff (rel = +1.00e0, abs = +1.95e-3), tol (rel = +5.00e-3, abs = +1.00e-5)
failures:
tests::autodiff::ad_div::tests::test_div_complex_2
tests::autodiff::ad_log_sigmoid::tests::should_diff_log_sigmoid
tests::autodiff_checkpointing::ad_div::tests::test_div_complex_2
tests::autodiff_checkpointing::ad_log_sigmoid::tests::should_diff_log_sigmoid
test result: FAILED. 1231 passed; 4 failed; 6 ignored; 0 measured; 0 filtered out; finished in 1.30s
Just to address the OP:
- tch errors are strictly for unimplemented! deform conv2d, so this is expected (though we should mark these tests differently or not run them with tch)
- ndarray errors are floating-point precision errors for the exact tests pointed out in the comment above; we could probably relax the tolerance slightly
- wgpu has different failure types:
  - should_diff_swap_dims: fusion autotune bug, fixed on main
  - is_finite failure: unclear what element is incorrect, we should change the test to use data.assert_eq instead but this might be an is_nan platform issue
  - slice_fill_1d: I think this was fixed with the rev update within your PR in #3776, no?
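For reference, the numbers in the ndarray failures can be reproduced with a small standalone sketch. This is not burn's actual comparison code: the helper names (diffs, approx_eq) are made up for illustration, and two things are assumptions inferred from the log output above: the relative difference is the absolute difference divided by the larger of the two magnitudes, and an element passes when either tolerance is met.

```rust
/// Returns (absolute difference, relative difference).
/// Assumption: the relative difference is normalised by the larger magnitude,
/// which is what the rel values printed in the failure log appear to match.
fn diffs(actual: f32, expected: f32) -> (f32, f32) {
    let abs = (actual - expected).abs();
    let scale = actual.abs().max(expected.abs());
    let rel = if scale == 0.0 { 0.0 } else { abs / scale };
    (abs, rel)
}

/// Assumption: an element is accepted when it is within EITHER tolerance.
fn approx_eq(actual: f32, expected: f32, rel_tol: f32, abs_tol: f32) -> bool {
    let (abs, rel) = diffs(actual, expected);
    abs <= abs_tol || rel <= rel_tol
}

fn main() {
    // (actual, expected) pairs copied from the failure output above.
    let pairs = [
        (0.08251953_f32, 0.08333334_f32),
        (-0.056803405, -0.05555558),
        (-0.06800783, -0.06714284),
        (0.001953125, 0.0),
    ];
    for (actual, expected) in pairs {
        let (abs, rel) = diffs(actual, expected);
        let ok = approx_eq(actual, expected, 5.0e-3, 1.0e-5);
        println!("{actual} vs {expected}: rel = {rel:.2e}, abs = {abs:.2e}, ok = {ok}");
    }
}
```

Under those assumptions, the ad_div positions would pass with a relative tolerance of roughly 2.5e-2, while the log_sigmoid position (0.001953125 vs 0) can only pass by raising the absolute tolerance to about 2e-3, since its relative difference is 1.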
Metal errors seem to be an actual bug. The generated metal kernels don't seem to be loaded correctly via the passthrough:
thread 'tests::cube::autodiff::f16_ty::ad_aggregation::tests::should_diff_mean' panicked at /Users/dilshod/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wgpu-26.0.1/src/backend/wgpu_core.rs:1055:30:
wgpu error: Validation Error
Caused by:
In Device::create_shader_module_passthrough, label = 'reduce_kernel_f16_f16_f16'
Failed to generate the backend-specific code
---- tests::cube_fusion::autodiff::f16_ty::ad_expand::tests::should_diff_expand stdout ----
thread 'tests::cube_fusion::autodiff::f16_ty::ad_expand::tests::should_diff_expand' panicked at /Users/dilshod/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wgpu-26.0.1/src/backend/wgpu_core.rs:1055:30:
wgpu error: Validation Error
Caused by:
In Device::create_shader_module_passthrough, label = 'reduce_kernel'
Failed to generate the backend-specific code
Interestingly, all the failing tests point to the reduce kernel. So it might be an unsupported feature on your setup that is used in the reduce kernel?
TL;DR: I think the only actual issues are in the metal tests, which might be using an unsupported feature for the reduce kernel. Other issues should have been fixed, or are just incorrect test setups (tolerance or not supported) that should be easily addressed.
FWIW, I'm also hitting the failure on macOS with reduce_kernel_f16_f16_f16 as of recent main. This was working about 2 weeks ago, but I haven't had a chance to trace it down to the exact commit yet.
Ahh thanks for the pointer.
I tried to get access to a Mac by stealing my girlfriend's old MacBook, but it's too old and doesn't have any of the devtools, so I would have to update a bunch of stuff 😅 which I didn't have time for.
Maybe enabling the wgpu logs with RUST_LOG=trace (or individually with RUST_LOG=wgpu_core=trace,wgc=trace,naga=trace) would give a bit more context as to why the reduce kernel entry point cannot be found. This requires a logger to be initialized though. Sharing the cubecl-generated kernel for metal would also help.
There have been some changes to runtime feature detection recently, so perhaps my initial hunch is correct and one of the features required by a reduce implementation is not correctly detected. I would have to confirm, though.
The reduce_kernel issue was introduced in commit 8ca52c9abdccd9eb0d449d56511f83a67f5b8132
I tested like this:
cd crates/burn-import/onnx-tests
rm -rf target ; cargo test --features test-metal
rm -rf target is needed to clear cubecl autotune cache each time, otherwise a lot of things break
CC @wingertge
I think this is https://github.com/tracel-ai/cubecl/issues/909. https://github.com/tracel-ai/burn/commit/8ca52c9abdccd9eb0d449d56511f83a67f5b8132 doesn't break it per se, but before that, subgroups were accidentally NOT enabled on metal, and that commit correctly enables them on Mac. However, they've been broken all along.
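To illustrate why a commit that only fixes feature detection can surface a bug that was there all along, here is a minimal sketch. Everything in it is hypothetical: DeviceFeatures, ReduceStrategy and select_reduce_strategy are illustrative names, not cubecl or burn APIs, and the real selection logic may look quite different.

```rust
/// Hypothetical set of runtime-detected device features.
struct DeviceFeatures {
    subgroups: bool,
}

/// Hypothetical reduce implementations a backend could choose between.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ReduceStrategy {
    /// Uses subgroup operations (e.g. simd_sum in the generated MSL).
    Subgroup,
    /// Fallback that does not rely on subgroup operations.
    SharedMemory,
}

/// Picks the subgroup path only when the feature is reported as available.
fn select_reduce_strategy(features: &DeviceFeatures) -> ReduceStrategy {
    if features.subgroups {
        ReduceStrategy::Subgroup
    } else {
        ReduceStrategy::SharedMemory
    }
}

fn main() {
    // Before the commit: subgroups wrongly reported as unavailable on metal,
    // so the (broken) subgroup kernel was never generated.
    let before = DeviceFeatures { subgroups: false };
    // After the commit: detection is correct, so the subgroup kernel is used.
    let after = DeviceFeatures { subgroups: true };
    println!("before: {:?}", select_reduce_strategy(&before));
    println!("after: {:?}", select_reduce_strategy(&after));
}
```

In this picture the detection fix only changes which path is taken; the failure comes from the subgroup path itself.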
Didn't see this cubecl issue, thanks for linking! Given the wgpu logs
MSL: "program_source:123:22: error: no matching function for call to 'simd_sum'\nconst float_4 l_61 = simd_sum(l_60);\n
that seems to be in line with the reduce tests above.