YOLO11x model output does not match reference on metal backend
Describe the bug
When running the YOLO11x ONNX model with the ndarray and metal backends, the model output does not match the reference output, even though the output shapes are correct. Large absolute differences are observed.
The model-checks are still being reviewed here: https://github.com/tracel-ai/burn/pull/3599
To Reproduce
Steps to reproduce the behavior:
- Run the following command with the metal backend:

cd crates/burn-import/model-checks/yolo11x
cargo run --release --no-default-features --features metal

Observe the output:

========================================
YOLO11x Burn Model Test
========================================
Initializing YOLO11x model...
Model initialized in 125.63ms
Loading test data from artifacts/test_data.pt...
Data loaded in 5.59ms
Loaded input tensor with shape: [1, 3, 640, 640]
Loaded reference output with shape: [1, 84, 8400]
Running model inference with test input...
Inference completed in 436.38ms
Model output shape: [1, 84, 8400]
✓ Output shape matches expected: [1, 84, 8400]
Comparing model output with reference data...
⚠ Model output differs from reference data!
Maximum absolute difference: 295.533875
Mean absolute difference: 0.461984
Sample values comparison (first 5 elements):
[0] Model: 8.110577, Reference: 6.099144, Diff: 2.011433
[1] Model: 16.759727, Reference: 17.930708, Diff: 1.170980
[2] Model: 23.442308, Reference: 23.449240, Diff: 0.006931
[3] Model: 31.128380, Reference: 34.504433, Diff: 3.376053
[4] Model: 39.129543, Reference: 42.434673, Diff: 3.305130
========================================
Model test completed!
========================================
- Run the following command with the ndarray backend:

cd crates/burn-import/model-checks/yolo11x
cargo run --release --no-default-features --features ndarray

Observe the output:

========================================
YOLO11x Burn Model Test
========================================
Initializing YOLO11x model...
Model initialized in 44.31ms
Loading test data from artifacts/test_data.pt...
Data loaded in 4.77ms
Loaded input tensor with shape: [1, 3, 640, 640]
Loaded reference output with shape: [1, 84, 8400]
Running model inference with test input...
Inference completed in 2.66s
Model output shape: [1, 84, 8400]
✓ Output shape matches expected: [1, 84, 8400]
Comparing model output with reference data...
⚠ Model output differs from reference data!
Maximum absolute difference: 693.744568
Mean absolute difference: 2.343200
Sample values comparison (first 5 elements):
[0] Model: -8.000000, Reference: 6.099144, Diff: 14.099144
[1] Model: 28.000000, Reference: 17.930708, Diff: 10.069292
[2] Model: 32.000000, Reference: 23.449240, Diff: 8.550760
[3] Model: 28.000000, Reference: 34.504433, Diff: 6.504433
[4] Model: 28.000000, Reference: 42.434673, Diff: 14.434673
========================================
Model test completed!
========================================
Expected behavior
Model output should closely match the reference output on both the ndarray and metal backends. Significant output differences are unexpected and may indicate a backend or operator implementation issue.
The Torch backend passes:
Finished `release` profile [optimized] target(s) in 0.33s
Running `target/release/burn-import-model-checks-yolo11x`
========================================
YOLO11x Burn Model Test
========================================
Initializing YOLO11x model...
Model initialized in 68.56ms
Loading test data from artifacts/test_data.pt...
Data loaded in 6.27ms
Loaded input tensor with shape: [1, 3, 640, 640]
Loaded reference output with shape: [1, 84, 8400]
Running model inference with test input...
Inference completed in 261.90ms
Model output shape: [1, 84, 8400]
✓ Output shape matches expected: [1, 84, 8400]
Comparing model output with reference data...
✓ Model output matches reference data within tolerance (1e-4)!
========================================
Model test completed!
========================================
Additional context
- See PR #3599 for reference ONNX model integration and test code.
- This issue may be related to backend-specific operator behavior or model export.
Rust code: yolo11x_opset16.txt
Summary of Findings
We've identified a real bug in the ndarray backend specifically for the SiLU pattern (Conv -> x * sigmoid(x)):
- ✅ Individual operations work correctly in ndarray:
  - Conv2d alone: ✅ Works
  - Sigmoid alone: ✅ Works
  - SiLU (x * sigmoid(x)) alone: ✅ Works
- ❌ Combined Conv2d -> SiLU fails in ndarray:
  - ndarray produces a max diff of 0.174135
  - tch produces an exact match (0.000000 diff)
  - The issue occurs whether weights are loaded from a file or default-initialized
- This explains the YOLO11x issue: YOLO models heavily use SiLU activations after convolutions, so this bug compounds throughout the network, leading to the massive differences (693.744568) reported in issue #3600.
The bug is specifically in how ndarray handles the sequence: Conv2d -> sigmoid(conv_out) -> conv_out * sigmoid_out when they're chained together in a model. This is a critical issue that needs to be fixed in the burn-ndarray backend implementation.
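For reference, a minimal sketch of the failing pattern (my reconstruction under burn's standard tensor API, not the exact test code from the PR; the `conv_silu` and `max_abs_diff` helpers are hypothetical):

```rust
use burn::nn::conv::Conv2d;
use burn::tensor::{activation::sigmoid, backend::Backend, ElementConversion, Tensor};

/// Conv2d followed by SiLU, written exactly as the ONNX graph emits it:
/// x * sigmoid(x) as two chained ops rather than a fused activation.
fn conv_silu<B: Backend>(conv: &Conv2d<B>, input: Tensor<B, 4>) -> Tensor<B, 4> {
    let x = conv.forward(input);
    x.clone() * sigmoid(x)
}

/// Max absolute elementwise difference between two outputs.
fn max_abs_diff<B: Backend>(a: Tensor<B, 4>, b: Tensor<B, 4>) -> f32 {
    (a - b).abs().max().into_scalar().elem::<f32>()
}
```

Running `conv_silu` with identical weights and input on the ndarray and tch backends, then comparing with `max_abs_diff`, is enough to reproduce the 0.174135 deviation described above.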
https://github.com/tracel-ai/burn/pull/3604 fixes the ndarray issue.
But the issue remains with the metal backend.
I ran tests comparing outputs from different layers, and they seem to pass:
========================================
Critical Layer Backend Testing
========================================
Testing layers with uniform weights (triggers Conv2d bug):
📊 Linear Layer:
----------------------------------------
Linear tch vs metal: max_diff = 0.00000000
Linear tch vs ndarray: max_diff = 0.00000000
📊 Conv1d Layer:
----------------------------------------
Conv1d tch vs metal: max_diff = 0.00000000
Conv1d tch vs ndarray: max_diff = 0.00000000
📊 Conv2d Layer (YOLO config):
----------------------------------------
Conv2d tch vs metal: max_diff = 0.00005767
Conv2d tch vs ndarray: max_diff = 0.00000000
📊 ConvTranspose2d Layer:
----------------------------------------
ConvTranspose2d tch vs metal: max_diff = 0.00000000
ConvTranspose2d tch vs ndarray: max_diff = 0.00000072
📊 MaxPool2d Layer:
----------------------------------------
MaxPool2d tch vs metal: max_diff = 0.00000000
MaxPool2d tch vs ndarray: max_diff = 0.00000000
📊 AvgPool2d Layer:
----------------------------------------
AvgPool2d tch vs metal: max_diff = 0.00000000
AvgPool2d tch vs ndarray: max_diff = 0.00000000
📊 Interpolate Layer:
----------------------------------------
Interpolate(Bilinear) tch vs metal: max_diff = 0.00000638
Interpolate(Bilinear) tch vs ndarray: max_diff = 0.00000206
Interpolate(Nearest) tch vs metal: max_diff = 0.00000000
Interpolate(Nearest) tch vs ndarray: max_diff = 0.00000000
📊 Activation Functions:
----------------------------------------
Testing activations tch vs metal:
ReLU: 0.00000000
Sigmoid: 0.00000018
Tanh: 0.00000012
GELU: 0.00000024
SiLU: 0.00000072
Testing activations tch vs ndarray:
ReLU: 0.00000000
Sigmoid: 0.00000012
Tanh: 0.00000000
GELU: 0.00000048
SiLU: 0.00000048
========================================
SUMMARY
========================================
Layer | Metal vs Tch | Ndarray vs Tch | Status
--------------------|-----------------|-----------------|--------
Linear | 0.00000000 | 0.00000000 | ✅
Conv1d | 0.00000000 | 0.00000000 | ✅
Conv2d | 0.00005767 | 0.00000000 | ✅
ConvTranspose2d | 0.00000000 | 0.00000072 | ✅
MaxPool2d | 0.00000000 | 0.00000000 | ✅
AvgPool2d | 0.00000000 | 0.00000000 | ✅
Interpolate(Bilin) | 0.00000638 | 0.00000206 | ✅
Interpolate(Near) | 0.00000000 | 0.00000000 | ✅
⚠️ Threshold for failure: 0.0001
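For context, each comparison above boils down to running the same layer with identical weights and input on the reference backend (tch) and the backend under test, then reporting the max absolute difference. A minimal sketch of that harness, assuming burn's `TensorData` round-trip API (the helper names are mine, not from the test PR):

```rust
use burn::tensor::{backend::Backend, ElementConversion, Tensor};

/// Copies a tensor onto another backend so both backends see
/// bit-identical input data.
fn to_backend<Src: Backend, Dst: Backend>(
    t: Tensor<Src, 4>,
    device: &Dst::Device,
) -> Tensor<Dst, 4> {
    Tensor::from_data(t.into_data(), device)
}

/// Prints the max absolute difference between the reference backend's
/// output and the backend under test, in the format used above.
fn report_max_diff<Ref: Backend, Test: Backend>(
    name: &str,
    reference: Tensor<Ref, 4>,
    tested: Tensor<Test, 4>,
    device: &Test::Device,
) {
    let reference = to_backend::<Ref, Test>(reference, device);
    let max_diff = (reference - tested).abs().max().into_scalar().elem::<f32>();
    println!("{name}: max_diff = {max_diff:.8}");
}
```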
Latest findings. Use PR https://github.com/tracel-ai/burn/pull/3609 to reproduce.
The deep network test reveals interesting patterns:
- MLP (Linear layers): Perfect accuracy across all backends with 30 layers
- Attention Stack: Very stable with minimal error accumulation (< 0.000003)
- CNN: Moderate error growth, especially with the Metal backend
- ResNet: Catastrophic error explosion with the Metal backend (101M max diff!) but perfect agreement with ndarray
The ResNet's extreme error with Metal suggests the residual connections amplify numerical errors exponentially. This is a critical finding for the YOLO11x model, which likely uses similar skip connections.
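To make the suspected mechanism concrete, here is a hedged sketch of a residual block along the lines of the ResNet-like test below (my reconstruction; PR #3609 contains the actual test code):

```rust
use burn::nn::conv::{Conv2d, Conv2dConfig};
use burn::nn::PaddingConfig2d;
use burn::tensor::{activation::relu, backend::Backend, Tensor};

/// A basic residual block: two padded 3x3 convs plus a skip connection.
struct ResidualBlock<B: Backend> {
    conv1: Conv2d<B>,
    conv2: Conv2d<B>,
}

impl<B: Backend> ResidualBlock<B> {
    fn new(channels: usize, device: &B::Device) -> Self {
        // Same-padding keeps spatial dims so the skip add is well-formed.
        let cfg = Conv2dConfig::new([channels, channels], [3, 3])
            .with_padding(PaddingConfig2d::Explicit(1, 1));
        Self {
            conv1: cfg.init(device),
            conv2: cfg.init(device),
        }
    }

    fn forward(&self, input: Tensor<B, 4>) -> Tensor<B, 4> {
        let branch = self.conv2.forward(relu(self.conv1.forward(input.clone())));
        // The skip add carries each block's numerical deviation straight
        // into the next block, so a small per-block error can grow
        // multiplicatively with depth.
        relu(input + branch)
    }
}
```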
========================================
Deep Network Error Accumulation Test
========================================
Testing how errors accumulate through deep networks
==================================================
TEST 1: Deep CNN (20 layers)
==================================================
🏗️ Building 20-layer deep CNN for tch vs metal
Input: [1, 3, 128, 128]
Layer 1 (Conv2d 3->6): max_diff = 0.00000001
Layer 2 (MaxPool2d): max_diff = 0.00000001, size now 64x64
Layer 3 (Conv2d 6->12): max_diff = 0.00000954
Layer 4 (InstanceNorm): max_diff = 0.00055087
Layer 5 (Conv2d 12->24): max_diff = 0.00078678
Layer 6 (MaxPool2d): max_diff = 0.00078678, size now 32x32
Layer 7 (Conv2d 24->48): max_diff = 0.00072074
Layer 8 (InstanceNorm): max_diff = 0.00082397
Layer 9 (Conv2d 48->96): max_diff = 0.00202739
Layer 10 (MaxPool2d): max_diff = 0.00202739, size now 16x16
Layer 11 (Conv2d 96->192): max_diff = 0.00806236
Layer 12 (InstanceNorm): max_diff = 0.00085898
Layer 13 (Conv2d 192->384): max_diff = 0.00719213
Layer 14 (MaxPool2d): max_diff = 0.00710869, size now 8x8
Layer 15 (Conv2d 384->768): max_diff = 0.05778503
Layer 16 (InstanceNorm): max_diff = 0.00021422
Layer 17 (Conv2d 384->768): max_diff = 0.00925446
Layer 19 (Conv2d 384->768): max_diff = 0.55859375
Layer 20 (InstanceNorm): max_diff = 0.00023580
🏗️ Building 20-layer deep CNN for tch vs ndarray
Input: [1, 3, 128, 128]
Layer 1 (Conv2d 3->6): max_diff = 0.00000001
Layer 2 (MaxPool2d): max_diff = 0.00000001, size now 64x64
Layer 3 (Conv2d 6->12): max_diff = 0.00000003
Layer 4 (InstanceNorm): max_diff = 0.00000516
Layer 5 (Conv2d 12->24): max_diff = 0.00000459
Layer 6 (MaxPool2d): max_diff = 0.00000453, size now 32x32
Layer 7 (Conv2d 24->48): max_diff = 0.00000763
Layer 8 (InstanceNorm): max_diff = 0.00001431
Layer 9 (Conv2d 48->96): max_diff = 0.00004864
Layer 10 (MaxPool2d): max_diff = 0.00003529, size now 16x16
Layer 11 (Conv2d 96->192): max_diff = 0.00009918
Layer 12 (InstanceNorm): max_diff = 0.00001144
Layer 13 (Conv2d 192->384): max_diff = 0.00014496
Layer 14 (MaxPool2d): max_diff = 0.00012207, size now 8x8
Layer 15 (Conv2d 384->768): max_diff = 0.00291443
Layer 16 (InstanceNorm): max_diff = 0.00001146
Layer 17 (Conv2d 384->768): max_diff = 0.00043488
Layer 19 (Conv2d 384->768): max_diff = 0.00668335
Layer 20 (InstanceNorm): max_diff = 0.00000364
📈 Error Growth Analysis for CNN (Metal):
Initial error: 0.00000001
Final error: 0.00023580
Total growth: 16878.93x
Max growth: 682.67x at layer 3
⚠️ WARNING: Exponential error growth detected! (avg: 1.67x per layer)
📈 Error Growth Analysis for CNN (Ndarray):
Initial error: 0.00000001
Final error: 0.00000364
Total growth: 300.31x
Max growth: 153.78x at layer 4
⚠️ WARNING: Exponential error growth detected! (avg: 1.35x per layer)
==================================================
TEST 2: ResNet-like (10 residual blocks = ~20 layers)
==================================================
🏗️ Building 10-block ResNet-like network for tch vs metal
Block 1: max_diff = 0.00000030
Block 2: max_diff = 0.00002766
Block 3: max_diff = 0.00126648
Block 4: max_diff = 0.06152344
Block 5: max_diff = 1.56250000
Block 6: max_diff = 56.50000000
Block 7: max_diff = 2160.00000000
Block 8: max_diff = 100864.00000000
Block 9: max_diff = 2998272.00000000
Block 10: max_diff = 101711872.00000000
🏗️ Building 10-block ResNet-like network for tch vs ndarray
Block 1: max_diff = 0.00000000
Block 2: max_diff = 0.00000000
Block 3: max_diff = 0.00000000
Block 4: max_diff = 0.00000000
Block 5: max_diff = 0.00000000
Block 6: max_diff = 0.00000000
Block 7: max_diff = 0.00000000
Block 8: max_diff = 0.00000000
Block 9: max_diff = 0.00000000
Block 10: max_diff = 0.00000000
📈 Error Growth Analysis for ResNet (Metal):
Initial error: 0.00000030
Final error: 101711872.00000000
Total growth: 341288402550784.00x
Max growth: 92.80x at layer 2
⚠️ WARNING: Exponential error growth detected! (avg: 28.40x per layer)
📈 Error Growth Analysis for ResNet (Ndarray):
Initial error: 0.00000000
Final error: 0.00000000
Total growth: 0.00x
Max growth: 0.00x at layer 1
✅ Error is stable or decreasing
==================================================
TEST 3: Deep MLP (30 layers)
==================================================
🏗️ Building 30-layer deep MLP for tch vs metal
Layer 1 (Linear + ReLU): max_diff = 0.00000000
Layer 2 (Linear + GELU): max_diff = 0.00000000
Layer 3 (Linear + ReLU): max_diff = 0.00000000
Layer 4 (Linear + GELU): max_diff = 0.00000000
Layer 5 (Linear + ReLU): max_diff = 0.00000000
Layer 6 (Linear + GELU): max_diff = 0.00000000
Layer 7 (Linear + ReLU): max_diff = 0.00000000
Layer 8 (Linear + GELU): max_diff = 0.00000000
Layer 9 (Linear + ReLU): max_diff = 0.00000000
Layer 10 (Linear + GELU): max_diff = 0.00000000
Layer 11 (Linear + ReLU): max_diff = 0.00000000
Layer 12 (Linear + GELU): max_diff = 0.00000000
Layer 13 (Linear + ReLU): max_diff = 0.00000000
Layer 14 (Linear + GELU): max_diff = 0.00000000
Layer 15 (Linear + ReLU): max_diff = 0.00000000
Layer 16 (Linear + GELU): max_diff = 0.00000000
Layer 17 (Linear + ReLU): max_diff = 0.00000000
Layer 18 (Linear + GELU): max_diff = 0.00000000
Layer 19 (Linear + ReLU): max_diff = 0.00000000
Layer 20 (Linear + GELU): max_diff = 0.00000000
Layer 21 (Linear + ReLU): max_diff = 0.00000000
Layer 22 (Linear + GELU): max_diff = 0.00000000
Layer 23 (Linear + ReLU): max_diff = 0.00000000
Layer 24 (Linear + GELU): max_diff = 0.00000000
Layer 25 (Linear + ReLU): max_diff = 0.00000000
Layer 26 (Linear + GELU): max_diff = 0.00000000
Layer 27 (Linear + ReLU): max_diff = 0.00000000
Layer 28 (Linear + GELU): max_diff = 0.00000000
Layer 29 (Linear + ReLU): max_diff = 0.00000000
Layer 30 (Linear + GELU): max_diff = 0.00000000
🏗️ Building 30-layer deep MLP for tch vs ndarray
Layer 1 (Linear + ReLU): max_diff = 0.00000000
Layer 2 (Linear + GELU): max_diff = 0.00000000
Layer 3 (Linear + ReLU): max_diff = 0.00000000
Layer 4 (Linear + GELU): max_diff = 0.00000000
Layer 5 (Linear + ReLU): max_diff = 0.00000001
Layer 6 (Linear + GELU): max_diff = 0.00000000
Layer 7 (Linear + ReLU): max_diff = 0.00000000
Layer 8 (Linear + GELU): max_diff = 0.00000000
Layer 9 (Linear + ReLU): max_diff = 0.00000002
Layer 10 (Linear + GELU): max_diff = 0.00000001
Layer 11 (Linear + ReLU): max_diff = 0.00000002
Layer 12 (Linear + GELU): max_diff = 0.00000001
Layer 13 (Linear + ReLU): max_diff = 0.00000001
Layer 14 (Linear + GELU): max_diff = 0.00000001
Layer 15 (Linear + ReLU): max_diff = 0.00000001
Layer 16 (Linear + GELU): max_diff = 0.00000001
Layer 17 (Linear + ReLU): max_diff = 0.00000001
Layer 18 (Linear + GELU): max_diff = 0.00000002
Layer 19 (Linear + ReLU): max_diff = 0.00000003
Layer 20 (Linear + GELU): max_diff = 0.00000001
Layer 21 (Linear + ReLU): max_diff = 0.00000000
Layer 22 (Linear + GELU): max_diff = 0.00000000
Layer 23 (Linear + ReLU): max_diff = 0.00000001
Layer 24 (Linear + GELU): max_diff = 0.00000001
Layer 25 (Linear + ReLU): max_diff = 0.00000001
Layer 26 (Linear + GELU): max_diff = 0.00000000
Layer 27 (Linear + ReLU): max_diff = 0.00000001
Layer 28 (Linear + GELU): max_diff = 0.00000000
Layer 29 (Linear + ReLU): max_diff = 0.00000000
Layer 30 (Linear + GELU): max_diff = 0.00000000
📈 Error Growth Analysis for MLP (Metal):
Initial error: 0.00000000
Final error: 0.00000000
Total growth: 0.00x
Max growth: 0.00x at layer 1
✅ Error is stable or decreasing
📈 Error Growth Analysis for MLP (Ndarray):
Initial error: 0.00000000
Final error: 0.00000000
Total growth: 0.00x
Max growth: 48.00x at layer 5
✅ Error is stable or decreasing
==================================================
TEST 4: Attention Stack (12 layers - like BERT)
==================================================
🏗️ Building 12-layer attention stack for tch vs metal
Layer 1 (Self-Attention + LayerNorm): max_diff = 0.00000036
Layer 2 (Self-Attention + LayerNorm): max_diff = 0.00000107
Layer 3 (Self-Attention + LayerNorm): max_diff = 0.00000125
Layer 4 (Self-Attention + LayerNorm): max_diff = 0.00000161
Layer 5 (Self-Attention + LayerNorm): max_diff = 0.00000158
Layer 6 (Self-Attention + LayerNorm): max_diff = 0.00000153
Layer 7 (Self-Attention + LayerNorm): max_diff = 0.00000150
Layer 8 (Self-Attention + LayerNorm): max_diff = 0.00000137
Layer 9 (Self-Attention + LayerNorm): max_diff = 0.00000137
Layer 10 (Self-Attention + LayerNorm): max_diff = 0.00000136
Layer 11 (Self-Attention + LayerNorm): max_diff = 0.00000134
Layer 12 (Self-Attention + LayerNorm): max_diff = 0.00000133
🏗️ Building 12-layer attention stack for tch vs ndarray
Layer 1 (Self-Attention + LayerNorm): max_diff = 0.00000024
Layer 2 (Self-Attention + LayerNorm): max_diff = 0.00000095
Layer 3 (Self-Attention + LayerNorm): max_diff = 0.00000119
Layer 4 (Self-Attention + LayerNorm): max_diff = 0.00000149
Layer 5 (Self-Attention + LayerNorm): max_diff = 0.00000176
Layer 6 (Self-Attention + LayerNorm): max_diff = 0.00000173
Layer 7 (Self-Attention + LayerNorm): max_diff = 0.00000209
Layer 8 (Self-Attention + LayerNorm): max_diff = 0.00000221
Layer 9 (Self-Attention + LayerNorm): max_diff = 0.00000262
Layer 10 (Self-Attention + LayerNorm): max_diff = 0.00000324
Layer 11 (Self-Attention + LayerNorm): max_diff = 0.00000316
Layer 12 (Self-Attention + LayerNorm): max_diff = 0.00000298
📈 Error Growth Analysis for Attention (Metal):
Initial error: 0.00000036
Final error: 0.00000133
Total growth: 3.71x
Max growth: 3.00x at layer 2
⚠️ WARNING: Exponential error growth detected! (avg: 1.12x per layer)
📈 Error Growth Analysis for Attention (Ndarray):
Initial error: 0.00000024
Final error: 0.00000298
Total growth: 12.50x
Max growth: 4.00x at layer 2
⚠️ WARNING: Exponential error growth detected! (avg: 1.23x per layer)
==================================================
FINAL SUMMARY
==================================================
Final errors after full depth:
Architecture | Layers | Metal Error | Ndarray Error | Status
----------------|--------|--------------|---------------|--------
CNN | 20 | 0.00023580 | 0.00000364 | ✅
ResNet | ~20 | 101711872.00000000 | 0.00000000 | ❌
MLP | 30 | 0.00000000 | 0.00000000 | ✅
Attention | 12 | 0.00000133 | 0.00000298 | ✅
⚠️ Error threshold for failure: 1.00%
🎯 Sensitivity Analysis:
Most error-prone: ResNet (max error: 101711872.00000000)
Most stable: MLP (max error: 0.00000000)
========================================
CC: @laggui @nathanielsimard @louisfd @wingertge
I submitted a PR to recreate the issue.