
Unit tests are failing on Mac

antimora opened this issue 2 months ago • 9 comments

It appears that ever since we disabled Mac CI, we have been accumulating issues on Mac.

See test errors on Mac:

uname -a
Darwin Mac.attlocal.net 24.6.0 Darwin Kernel Version 24.6.0: 
Mon Jul 14 11:30:55 PDT 2025; root:xnu-11417.140.69~1/RELEASE_ARM64_T6031 arm64

Attachments: tch-errors.txt, ndarray-errors.txt, wgpu-errors.txt

Metal errors when I run:

[burn-wgpu]% cargo test --release --features metal

Attachment: metal-errors.txt

— antimora, Sep 24 '25

CC @laggui

— antimora, Sep 24 '25

I was running cargo run-checks on my M2 Mac and also got some FP mismatches. This was on main.

failures:

---- tests::autodiff::ad_div::tests::test_div_complex_2 stdout ----

thread 'tests::autodiff::ad_div::tests::test_div_complex_2' panicked at crates/burn-ndarray/src/lib.rs:41:5:
Tensors are not approx eq:
  => Position 0: 0.08251953 != 0.08333334
     diff (rel = +9.77e-3, abs = +8.14e-4), tol (rel = +5.00e-3, abs = +1.00e-5)
  => Position 2: -0.056803405 != -0.05555558
     diff (rel = +2.20e-2, abs = +1.25e-3), tol (rel = +5.00e-3, abs = +1.00e-5)
  => Position 3: -0.06800783 != -0.06714284
     diff (rel = +1.27e-2, abs = +8.65e-4), tol (rel = +5.00e-3, abs = +1.00e-5)

---- tests::autodiff::ad_log_sigmoid::tests::should_diff_log_sigmoid stdout ----

thread 'tests::autodiff::ad_log_sigmoid::tests::should_diff_log_sigmoid' panicked at crates/burn-ndarray/src/lib.rs:41:5:
Tensors are not approx eq:
  => Position 3: 0.001953125 != 0
     diff (rel = +1.00e0, abs = +1.95e-3), tol (rel = +5.00e-3, abs = +1.00e-5)

---- tests::autodiff_checkpointing::ad_div::tests::test_div_complex_2 stdout ----

thread 'tests::autodiff_checkpointing::ad_div::tests::test_div_complex_2' panicked at crates/burn-ndarray/src/lib.rs:41:5:
Tensors are not approx eq:
  => Position 0: 0.08251953 != 0.08333334
     diff (rel = +9.77e-3, abs = +8.14e-4), tol (rel = +5.00e-3, abs = +1.00e-5)
  => Position 2: -0.056803405 != -0.05555558
     diff (rel = +2.20e-2, abs = +1.25e-3), tol (rel = +5.00e-3, abs = +1.00e-5)
  => Position 3: -0.06800783 != -0.06714284
     diff (rel = +1.27e-2, abs = +8.65e-4), tol (rel = +5.00e-3, abs = +1.00e-5)

---- tests::autodiff_checkpointing::ad_log_sigmoid::tests::should_diff_log_sigmoid stdout ----

thread 'tests::autodiff_checkpointing::ad_log_sigmoid::tests::should_diff_log_sigmoid' panicked at crates/burn-ndarray/src/lib.rs:41:5:
Tensors are not approx eq:
  => Position 3: 0.001953125 != 0
     diff (rel = +1.00e0, abs = +1.95e-3), tol (rel = +5.00e-3, abs = +1.00e-5)


failures:
    tests::autodiff::ad_div::tests::test_div_complex_2
    tests::autodiff::ad_log_sigmoid::tests::should_diff_log_sigmoid
    tests::autodiff_checkpointing::ad_div::tests::test_div_complex_2
    tests::autodiff_checkpointing::ad_log_sigmoid::tests::should_diff_log_sigmoid

test result: FAILED. 1231 passed; 4 failed; 6 ignored; 0 measured; 0 filtered out; finished in 1.30s

— TheGhostHuCodes, Sep 26 '25

Just to address the OP:

  • tch errors are strictly for the unimplemented! deform conv2d op, so this is expected (though we should mark these tests differently, or not run them with tch)
  • ndarray errors are floating-point precision errors for the exact tests pointed out in the comment above; we could probably relax the tolerance slightly (a sketch follows the TL;DR below)
  • wgpu has different failure types:
    • should_diff_swap_dims: fusion autotune bug, fixed on main
    • is_finite failure: unclear which element is incorrect; we should change the test to use data.assert_eq instead (see the sketch right after this list), but this might be an is_nan platform issue
    • slice_fill_1d: I think this was fixed with the rev update within your PR in #3776, no?
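
For the is_finite case, here's a minimal sketch of the suggested change, assuming burn's TensorData::assert_eq and an illustrative test body (the input values and the TestTensor alias are placeholders, not the actual test):

    // Hypothetical test body; the real test and its inputs live in burn's suite.
    let tensor = TestTensor::<1>::from([0.0, f32::NAN, f32::INFINITY, 1.0]);
    let output = tensor.is_finite();

    // An exact element-wise comparison reports exactly which position differs,
    // unlike the approximate assertion that currently hides the offending element.
    output
        .into_data()
        .assert_eq(&TensorData::from([true, false, false, true]), false);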

Metal errors seem to be an actual bug. The generated Metal kernels don't seem to be loaded correctly via the passthrough:

thread 'tests::cube::autodiff::f16_ty::ad_aggregation::tests::should_diff_mean' panicked at /Users/dilshod/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wgpu-26.0.1/src/backend/wgpu_core.rs:1055:30:
wgpu error: Validation Error

Caused by:
  In Device::create_shader_module_passthrough, label = 'reduce_kernel_f16_f16_f16'
    Failed to generate the backend-specific code

---- tests::cube_fusion::autodiff::f16_ty::ad_expand::tests::should_diff_expand stdout ----

thread 'tests::cube_fusion::autodiff::f16_ty::ad_expand::tests::should_diff_expand' panicked at /Users/dilshod/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/wgpu-26.0.1/src/backend/wgpu_core.rs:1055:30:
wgpu error: Validation Error

Caused by:
  In Device::create_shader_module_passthrough, label = 'reduce_kernel'
    Failed to generate the backend-specific code

Interestingly, all the failing tests point to the reduce kernel. So it might be that the reduce kernel uses a feature that is unsupported on your setup?

TL;DR: I think the only actual issues are in the Metal tests, which might be using an unsupported feature for the reduce kernel. The other issues should already have been fixed, or are just incorrect test setups (tolerances that are too tight, or ops that aren't supported) that should be easy to address.
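
For reference, here's a rough sketch of what relaxing the tolerance could look like, assuming burn's Tolerance helper (the constructor name and the widened values are illustrative, not a committed fix):

    // Illustrative only: widen the tolerance for the ad_div / ad_log_sigmoid
    // assertions that fail by ~1e-2 relative error on Mac.
    // The current failures report tol (rel = 5.00e-3, abs = 1.00e-5).
    let tolerance = Tolerance::rel_abs(3e-2, 2e-3);
    output
        .into_data()
        .assert_approx_eq::<FT>(&expected, tolerance);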

— laggui, Sep 29 '25

FWIW, I'm also hitting the failure on macOS with reduce_kernel_f16_f16_f16 as of recent main. This was working about two weeks ago, but I haven't had a chance to trace it down to the exact commit yet.

— AdrianEddy, Sep 29 '25

Ahh thanks for the pointer.

I tried to get access to a Mac by stealing my girlfriend's old MacBook, but it's too old and doesn't have any of the devtools, so I'd have to update a bunch of stuff 😅, which I didn't have time to do.

Maybe enabling the wgpu logs with RUST_LOG=trace (or individually, RUST_LOG=wgpu_core=trace,wgc=trace,naga=trace) would give a bit more context as to why the reduce kernel fails to compile. This requires a logger to be initialized, though. Sharing the cubecl-generated Metal kernel would also help.
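
For anyone reproducing this: a minimal sketch of initializing a logger inside a test so the RUST_LOG filters above actually emit output (assuming env_logger is available as a dev-dependency):

    // Call once at the top of the failing test (or in a shared setup helper);
    // is_test(true) routes output through the test harness capture, and
    // try_init() avoids panicking if a logger was already installed.
    let _ = env_logger::builder().is_test(true).try_init();

Then run, e.g.: RUST_LOG=wgpu_core=trace,naga=trace cargo test --release --features metal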

There have been some changes for runtime feature detection recently; perhaps my initial hunch is correct and one of the features required by a reduce implementation is not correctly detected. But I would have to confirm.
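
One way to check that hunch would be to probe what the adapter actually reports. A sketch using plain wgpu + pollster directly (illustrative, outside burn's own runtime setup):

    // Print the adapter info and whether wgpu reports subgroup support.
    // unwrap() covers both the Option and Result return types that
    // request_adapter has had across wgpu versions.
    let instance = wgpu::Instance::default();
    let adapter = pollster::block_on(
        instance.request_adapter(&wgpu::RequestAdapterOptions::default()),
    )
    .unwrap();
    println!("{:?}", adapter.get_info());
    println!(
        "SUBGROUP: {}",
        adapter.features().contains(wgpu::Features::SUBGROUP)
    );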

— laggui, Sep 30 '25

The reduce_kernel issue was introduced in commit 8ca52c9abdccd9eb0d449d56511f83a67f5b8132.

I tested like this:

cd crates/burn-import/onnx-tests
rm -rf target; cargo test --features test-metal

The rm -rf target is needed to clear the cubecl autotune cache each time; otherwise a lot of things break.

— AdrianEddy, Sep 30 '25

CC @wingertge

— antimora, Sep 30 '25

I think this is https://github.com/tracel-ai/cubecl/issues/909. https://github.com/tracel-ai/burn/commit/8ca52c9abdccd9eb0d449d56511f83a67f5b8132 doesn't break it per se, but before that, subgroups were accidentally NOT enabled on Metal, and that commit correctly enables them on Mac. However, they've been broken all along.

— ArthurBrussee, Oct 01 '25

I think this is tracel-ai/cubecl#909. 8ca52c9 doesn't break it per se, but before that, subgroups were accidentally NOT enabled on metal, and that commit correctly enables them on Mac. However, they've been broken all along.

Didn't see this cubecl issue, thanks for linking! Given the wgpu logs

MSL: "program_source:123:22: error: no matching function for call to 'simd_sum'
const float_4 l_61 = simd_sum(l_60);

that seems to be in line with the reduce tests above: the subgroup reduction is emitted as a simd_sum call on a vectorized value (float_4), and the Metal compiler finds no matching overload.

— laggui, Oct 01 '25