YOLO11x model output does not match reference on metal backend
Describe the bug
When running the YOLO11x ONNX model with the ndarray and metal backends, the model output does not match the reference output, even though the output shapes are correct. Large absolute differences are observed.
The model-checks are still being reviewed here: https://github.com/tracel-ai/burn/pull/3599
To Reproduce
Steps to reproduce the behavior:
- Run the following command with the metal backend:

cd crates/burn-import/model-checks/yolo11x
cargo run --release --no-default-features --features metal

Observe the output:

========================================
YOLO11x Burn Model Test
========================================
Initializing YOLO11x model...
Model initialized in 125.63ms
Loading test data from artifacts/test_data.pt...
Data loaded in 5.59ms
Loaded input tensor with shape: [1, 3, 640, 640]
Loaded reference output with shape: [1, 84, 8400]
Running model inference with test input...
Inference completed in 436.38ms
Model output shape: [1, 84, 8400]
✓ Output shape matches expected: [1, 84, 8400]
Comparing model output with reference data...
⚠ Model output differs from reference data!
Maximum absolute difference: 295.533875
Mean absolute difference: 0.461984
Sample values comparison (first 5 elements):
[0] Model: 8.110577, Reference: 6.099144, Diff: 2.011433
[1] Model: 16.759727, Reference: 17.930708, Diff: 1.170980
[2] Model: 23.442308, Reference: 23.449240, Diff: 0.006931
[3] Model: 31.128380, Reference: 34.504433, Diff: 3.376053
[4] Model: 39.129543, Reference: 42.434673, Diff: 3.305130
========================================
Model test completed!
========================================
- Run the following command with the ndarray backend:

cd crates/burn-import/model-checks/yolo11x
cargo run --release --no-default-features --features ndarray

Observe the output:

========================================
YOLO11x Burn Model Test
========================================
Initializing YOLO11x model...
Model initialized in 44.31ms
Loading test data from artifacts/test_data.pt...
Data loaded in 4.77ms
Loaded input tensor with shape: [1, 3, 640, 640]
Loaded reference output with shape: [1, 84, 8400]
Running model inference with test input...
Inference completed in 2.66s
Model output shape: [1, 84, 8400]
✓ Output shape matches expected: [1, 84, 8400]
Comparing model output with reference data...
⚠ Model output differs from reference data!
Maximum absolute difference: 693.744568
Mean absolute difference: 2.343200
Sample values comparison (first 5 elements):
[0] Model: -8.000000, Reference: 6.099144, Diff: 14.099144
[1] Model: 28.000000, Reference: 17.930708, Diff: 10.069292
[2] Model: 32.000000, Reference: 23.449240, Diff: 8.550760
[3] Model: 28.000000, Reference: 34.504433, Diff: 6.504433
[4] Model: 28.000000, Reference: 42.434673, Diff: 14.434673
========================================
Model test completed!
========================================
Expected behavior
Model output should closely match the reference output on both the ndarray and metal backends. Significant output differences are unexpected and may indicate a backend or operator implementation issue.
The Torch backend passes:
Finished `release` profile [optimized] target(s) in 0.33s
Running `target/release/burn-import-model-checks-yolo11x`
========================================
YOLO11x Burn Model Test
========================================
Initializing YOLO11x model...
Model initialized in 68.56ms
Loading test data from artifacts/test_data.pt...
Data loaded in 6.27ms
Loaded input tensor with shape: [1, 3, 640, 640]
Loaded reference output with shape: [1, 84, 8400]
Running model inference with test input...
Inference completed in 261.90ms
Model output shape: [1, 84, 8400]
✓ Output shape matches expected: [1, 84, 8400]
Comparing model output with reference data...
✓ Model output matches reference data within tolerance (1e-4)!
========================================
Model test completed!
========================================
Additional context
- See PR #3599 for reference ONNX model integration and test code.
- This issue may be related to backend-specific operator behavior or model export.
Rust code: yolo11x_opset16.txt
Summary of Findings
We've identified a real bug in the ndarray backend specifically for the SiLU pattern (Conv -> x * sigmoid(x)):
- ✅ Individual operations work correctly in ndarray:
  - Conv2d alone: ✅ Works
  - Sigmoid alone: ✅ Works
  - SiLU (x * sigmoid(x)) alone: ✅ Works
- ❌ Combined Conv2d -> SiLU fails in ndarray:
  - ndarray produces a max diff of 0.174135
  - tch produces an exact match (0.000000 diff)
  - The issue occurs whether weights are loaded from a file or default-initialized
- This explains the YOLO11x issue: YOLO models heavily use SiLU activations after convolutions, so this bug compounds throughout the network, leading to the massive differences (693.744568) reported in issue #3600.
The bug is specifically in how ndarray handles the sequence: Conv2d -> sigmoid(conv_out) -> conv_out * sigmoid_out when they're chained together in a model. This is a critical issue that needs to be fixed in the burn-ndarray backend implementation.
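For reference, a minimal sketch of the failing pattern (my reconstruction under burn's standard tensor API, not the exact test code from the PR; the `conv_silu` and `max_abs_diff` helpers are hypothetical):

```rust
use burn::nn::conv::Conv2d;
use burn::tensor::{activation::sigmoid, backend::Backend, ElementConversion, Tensor};

/// Conv2d followed by SiLU, written exactly as the ONNX graph emits it:
/// x * sigmoid(x) as two chained ops rather than a fused activation.
fn conv_silu<B: Backend>(conv: &Conv2d<B>, input: Tensor<B, 4>) -> Tensor<B, 4> {
    let x = conv.forward(input);
    x.clone() * sigmoid(x)
}

/// Max absolute elementwise difference between two outputs.
fn max_abs_diff<B: Backend>(a: Tensor<B, 4>, b: Tensor<B, 4>) -> f32 {
    (a - b).abs().max().into_scalar().elem::<f32>()
}
```

Running `conv_silu` with identical weights and input on the ndarray and tch backends, then comparing with `max_abs_diff`, is enough to reproduce the 0.174135 deviation described above.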
https://github.com/tracel-ai/burn/pull/3604 fixes the ndarray issue.
But the issue remains with the metal backend.
I ran tests comparing outputs from different layers, and they seem to pass:
========================================
Critical Layer Backend Testing
========================================
Testing layers with uniform weights (triggers Conv2d bug):
📊 Linear Layer:
----------------------------------------
Linear tch vs metal: max_diff = 0.00000000
Linear tch vs ndarray: max_diff = 0.00000000
📊 Conv1d Layer:
----------------------------------------
Conv1d tch vs metal: max_diff = 0.00000000
Conv1d tch vs ndarray: max_diff = 0.00000000
📊 Conv2d Layer (YOLO config):
----------------------------------------
Conv2d tch vs metal: max_diff = 0.00005767
Conv2d tch vs ndarray: max_diff = 0.00000000
📊 ConvTranspose2d Layer:
----------------------------------------
ConvTranspose2d tch vs metal: max_diff = 0.00000000
ConvTranspose2d tch vs ndarray: max_diff = 0.00000072
📊 MaxPool2d Layer:
----------------------------------------
MaxPool2d tch vs metal: max_diff = 0.00000000
MaxPool2d tch vs ndarray: max_diff = 0.00000000
📊 AvgPool2d Layer:
----------------------------------------
AvgPool2d tch vs metal: max_diff = 0.00000000
AvgPool2d tch vs ndarray: max_diff = 0.00000000
📊 Interpolate Layer:
----------------------------------------
Interpolate(Bilinear) tch vs metal: max_diff = 0.00000638
Interpolate(Bilinear) tch vs ndarray: max_diff = 0.00000206
Interpolate(Nearest) tch vs metal: max_diff = 0.00000000
Interpolate(Nearest) tch vs ndarray: max_diff = 0.00000000
📊 Activation Functions:
----------------------------------------
Testing activations tch vs metal:
ReLU: 0.00000000
Sigmoid: 0.00000018
Tanh: 0.00000012
GELU: 0.00000024
SiLU: 0.00000072
Testing activations tch vs ndarray:
ReLU: 0.00000000
Sigmoid: 0.00000012
Tanh: 0.00000000
GELU: 0.00000048
SiLU: 0.00000048
========================================
SUMMARY
========================================
Layer | Metal vs Tch | Ndarray vs Tch | Status
--------------------|-----------------|-----------------|--------
Linear | 0.00000000 | 0.00000000 | ✅
Conv1d | 0.00000000 | 0.00000000 | ✅
Conv2d | 0.00005767 | 0.00000000 | ✅
ConvTranspose2d | 0.00000000 | 0.00000072 | ✅
MaxPool2d | 0.00000000 | 0.00000000 | ✅
AvgPool2d | 0.00000000 | 0.00000000 | ✅
Interpolate(Bilin) | 0.00000638 | 0.00000206 | ✅
Interpolate(Near) | 0.00000000 | 0.00000000 | ✅
⚠️ Threshold for failure: 0.0001
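For context, each comparison above boils down to running the same layer with identical weights and input on the reference backend (tch) and the backend under test, then reporting the max absolute difference. A minimal sketch of that harness, assuming burn's `TensorData` round-trip API (the helper names are mine, not from the test PR):

```rust
use burn::tensor::{backend::Backend, ElementConversion, Tensor};

/// Copies a tensor onto another backend so both backends see
/// bit-identical input data.
fn to_backend<Src: Backend, Dst: Backend>(
    t: Tensor<Src, 4>,
    device: &Dst::Device,
) -> Tensor<Dst, 4> {
    Tensor::from_data(t.into_data(), device)
}

/// Prints the max absolute difference between the reference backend's
/// output and the backend under test, in the format used above.
fn report_max_diff<Ref: Backend, Test: Backend>(
    name: &str,
    reference: Tensor<Ref, 4>,
    tested: Tensor<Test, 4>,
    device: &Test::Device,
) {
    let reference = to_backend::<Ref, Test>(reference, device);
    let max_diff = (reference - tested).abs().max().into_scalar().elem::<f32>();
    println!("{name}: max_diff = {max_diff:.8}");
}
```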
Latest findings. Use PR https://github.com/tracel-ai/burn/pull/3609 to reproduce.
The deep network test reveals interesting patterns:
- MLP (Linear layers): Perfect accuracy across all backends with 30 layers
- Attention Stack: Very stable with minimal error accumulation (< 0.000003)
- CNN: Moderate error growth, especially with the Metal backend
- ResNet: Catastrophic error explosion with the Metal backend (101M max diff!) but perfect agreement with ndarray
The ResNet's extreme error with Metal suggests the residual connections amplify numerical errors exponentially. This is a critical finding for the YOLO11x model, which likely uses similar skip connections.
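To make the suspected mechanism concrete, here is a hedged sketch of a residual block along the lines of the ResNet-like test below (my reconstruction; PR #3609 contains the actual test code):

```rust
use burn::nn::conv::{Conv2d, Conv2dConfig};
use burn::nn::PaddingConfig2d;
use burn::tensor::{activation::relu, backend::Backend, Tensor};

/// A basic residual block: two padded 3x3 convs plus a skip connection.
struct ResidualBlock<B: Backend> {
    conv1: Conv2d<B>,
    conv2: Conv2d<B>,
}

impl<B: Backend> ResidualBlock<B> {
    fn new(channels: usize, device: &B::Device) -> Self {
        // Same-padding keeps spatial dims so the skip add is well-formed.
        let cfg = Conv2dConfig::new([channels, channels], [3, 3])
            .with_padding(PaddingConfig2d::Explicit(1, 1));
        Self {
            conv1: cfg.init(device),
            conv2: cfg.init(device),
        }
    }

    fn forward(&self, input: Tensor<B, 4>) -> Tensor<B, 4> {
        let branch = self.conv2.forward(relu(self.conv1.forward(input.clone())));
        // The skip add carries each block's numerical deviation straight
        // into the next block, so a small per-block error can grow
        // multiplicatively with depth.
        relu(input + branch)
    }
}
```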
========================================
Deep Network Error Accumulation Test
========================================
Testing how errors accumulate through deep networks
==================================================
TEST 1: Deep CNN (20 layers)
==================================================
🏗️ Building 20-layer deep CNN for tch vs metal
Input: [1, 3, 128, 128]
Layer 1 (Conv2d 3->6): max_diff = 0.00000001
Layer 2 (MaxPool2d): max_diff = 0.00000001, size now 64x64
Layer 3 (Conv2d 6->12): max_diff = 0.00000954
Layer 4 (InstanceNorm): max_diff = 0.00055087
Layer 5 (Conv2d 12->24): max_diff = 0.00078678
Layer 6 (MaxPool2d): max_diff = 0.00078678, size now 32x32
Layer 7 (Conv2d 24->48): max_diff = 0.00072074
Layer 8 (InstanceNorm): max_diff = 0.00082397
Layer 9 (Conv2d 48->96): max_diff = 0.00202739
Layer 10 (MaxPool2d): max_diff = 0.00202739, size now 16x16
Layer 11 (Conv2d 96->192): max_diff = 0.00806236
Layer 12 (InstanceNorm): max_diff = 0.00085898
Layer 13 (Conv2d 192->384): max_diff = 0.00719213
Layer 14 (MaxPool2d): max_diff = 0.00710869, size now 8x8
Layer 15 (Conv2d 384->768): max_diff = 0.05778503
Layer 16 (InstanceNorm): max_diff = 0.00021422
Layer 17 (Conv2d 384->768): max_diff = 0.00925446
Layer 19 (Conv2d 384->768): max_diff = 0.55859375
Layer 20 (InstanceNorm): max_diff = 0.00023580
🏗️ Building 20-layer deep CNN for tch vs ndarray
Input: [1, 3, 128, 128]
Layer 1 (Conv2d 3->6): max_diff = 0.00000001
Layer 2 (MaxPool2d): max_diff = 0.00000001, size now 64x64
Layer 3 (Conv2d 6->12): max_diff = 0.00000003
Layer 4 (InstanceNorm): max_diff = 0.00000516
Layer 5 (Conv2d 12->24): max_diff = 0.00000459
Layer 6 (MaxPool2d): max_diff = 0.00000453, size now 32x32
Layer 7 (Conv2d 24->48): max_diff = 0.00000763
Layer 8 (InstanceNorm): max_diff = 0.00001431
Layer 9 (Conv2d 48->96): max_diff = 0.00004864
Layer 10 (MaxPool2d): max_diff = 0.00003529, size now 16x16
Layer 11 (Conv2d 96->192): max_diff = 0.00009918
Layer 12 (InstanceNorm): max_diff = 0.00001144
Layer 13 (Conv2d 192->384): max_diff = 0.00014496
Layer 14 (MaxPool2d): max_diff = 0.00012207, size now 8x8
Layer 15 (Conv2d 384->768): max_diff = 0.00291443
Layer 16 (InstanceNorm): max_diff = 0.00001146
Layer 17 (Conv2d 384->768): max_diff = 0.00043488
Layer 19 (Conv2d 384->768): max_diff = 0.00668335
Layer 20 (InstanceNorm): max_diff = 0.00000364
📈 Error Growth Analysis for CNN (Metal):
Initial error: 0.00000001
Final error: 0.00023580
Total growth: 16878.93x
Max growth: 682.67x at layer 3
⚠️ WARNING: Exponential error growth detected! (avg: 1.67x per layer)
📈 Error Growth Analysis for CNN (Ndarray):
Initial error: 0.00000001
Final error: 0.00000364
Total growth: 300.31x
Max growth: 153.78x at layer 4
⚠️ WARNING: Exponential error growth detected! (avg: 1.35x per layer)
==================================================
TEST 2: ResNet-like (10 residual blocks = ~20 layers)
==================================================
🏗️ Building 10-block ResNet-like network for tch vs metal
Block 1: max_diff = 0.00000030
Block 2: max_diff = 0.00002766
Block 3: max_diff = 0.00126648
Block 4: max_diff = 0.06152344
Block 5: max_diff = 1.56250000
Block 6: max_diff = 56.50000000
Block 7: max_diff = 2160.00000000
Block 8: max_diff = 100864.00000000
Block 9: max_diff = 2998272.00000000
Block 10: max_diff = 101711872.00000000
🏗️ Building 10-block ResNet-like network for tch vs ndarray
Block 1: max_diff = 0.00000000
Block 2: max_diff = 0.00000000
Block 3: max_diff = 0.00000000
Block 4: max_diff = 0.00000000
Block 5: max_diff = 0.00000000
Block 6: max_diff = 0.00000000
Block 7: max_diff = 0.00000000
Block 8: max_diff = 0.00000000
Block 9: max_diff = 0.00000000
Block 10: max_diff = 0.00000000
📈 Error Growth Analysis for ResNet (Metal):
Initial error: 0.00000030
Final error: 101711872.00000000
Total growth: 341288402550784.00x
Max growth: 92.80x at layer 2
⚠️ WARNING: Exponential error growth detected! (avg: 28.40x per layer)
📈 Error Growth Analysis for ResNet (Ndarray):
Initial error: 0.00000000
Final error: 0.00000000
Total growth: 0.00x
Max growth: 0.00x at layer 1
✅ Error is stable or decreasing
==================================================
TEST 3: Deep MLP (30 layers)
==================================================
🏗️ Building 30-layer deep MLP for tch vs metal
Layer 1 (Linear + ReLU): max_diff = 0.00000000
Layer 2 (Linear + GELU): max_diff = 0.00000000
Layer 3 (Linear + ReLU): max_diff = 0.00000000
Layer 4 (Linear + GELU): max_diff = 0.00000000
Layer 5 (Linear + ReLU): max_diff = 0.00000000
Layer 6 (Linear + GELU): max_diff = 0.00000000
Layer 7 (Linear + ReLU): max_diff = 0.00000000
Layer 8 (Linear + GELU): max_diff = 0.00000000
Layer 9 (Linear + ReLU): max_diff = 0.00000000
Layer 10 (Linear + GELU): max_diff = 0.00000000
Layer 11 (Linear + ReLU): max_diff = 0.00000000
Layer 12 (Linear + GELU): max_diff = 0.00000000
Layer 13 (Linear + ReLU): max_diff = 0.00000000
Layer 14 (Linear + GELU): max_diff = 0.00000000
Layer 15 (Linear + ReLU): max_diff = 0.00000000
Layer 16 (Linear + GELU): max_diff = 0.00000000
Layer 17 (Linear + ReLU): max_diff = 0.00000000
Layer 18 (Linear + GELU): max_diff = 0.00000000
Layer 19 (Linear + ReLU): max_diff = 0.00000000
Layer 20 (Linear + GELU): max_diff = 0.00000000
Layer 21 (Linear + ReLU): max_diff = 0.00000000
Layer 22 (Linear + GELU): max_diff = 0.00000000
Layer 23 (Linear + ReLU): max_diff = 0.00000000
Layer 24 (Linear + GELU): max_diff = 0.00000000
Layer 25 (Linear + ReLU): max_diff = 0.00000000
Layer 26 (Linear + GELU): max_diff = 0.00000000
Layer 27 (Linear + ReLU): max_diff = 0.00000000
Layer 28 (Linear + GELU): max_diff = 0.00000000
Layer 29 (Linear + ReLU): max_diff = 0.00000000
Layer 30 (Linear + GELU): max_diff = 0.00000000
🏗️ Building 30-layer deep MLP for tch vs ndarray
Layer 1 (Linear + ReLU): max_diff = 0.00000000
Layer 2 (Linear + GELU): max_diff = 0.00000000
Layer 3 (Linear + ReLU): max_diff = 0.00000000
Layer 4 (Linear + GELU): max_diff = 0.00000000
Layer 5 (Linear + ReLU): max_diff = 0.00000001
Layer 6 (Linear + GELU): max_diff = 0.00000000
Layer 7 (Linear + ReLU): max_diff = 0.00000000
Layer 8 (Linear + GELU): max_diff = 0.00000000
Layer 9 (Linear + ReLU): max_diff = 0.00000002
Layer 10 (Linear + GELU): max_diff = 0.00000001
Layer 11 (Linear + ReLU): max_diff = 0.00000002
Layer 12 (Linear + GELU): max_diff = 0.00000001
Layer 13 (Linear + ReLU): max_diff = 0.00000001
Layer 14 (Linear + GELU): max_diff = 0.00000001
Layer 15 (Linear + ReLU): max_diff = 0.00000001
Layer 16 (Linear + GELU): max_diff = 0.00000001
Layer 17 (Linear + ReLU): max_diff = 0.00000001
Layer 18 (Linear + GELU): max_diff = 0.00000002
Layer 19 (Linear + ReLU): max_diff = 0.00000003
Layer 20 (Linear + GELU): max_diff = 0.00000001
Layer 21 (Linear + ReLU): max_diff = 0.00000000
Layer 22 (Linear + GELU): max_diff = 0.00000000
Layer 23 (Linear + ReLU): max_diff = 0.00000001
Layer 24 (Linear + GELU): max_diff = 0.00000001
Layer 25 (Linear + ReLU): max_diff = 0.00000001
Layer 26 (Linear + GELU): max_diff = 0.00000000
Layer 27 (Linear + ReLU): max_diff = 0.00000001
Layer 28 (Linear + GELU): max_diff = 0.00000000
Layer 29 (Linear + ReLU): max_diff = 0.00000000
Layer 30 (Linear + GELU): max_diff = 0.00000000
📈 Error Growth Analysis for MLP (Metal):
Initial error: 0.00000000
Final error: 0.00000000
Total growth: 0.00x
Max growth: 0.00x at layer 1
✅ Error is stable or decreasing
📈 Error Growth Analysis for MLP (Ndarray):
Initial error: 0.00000000
Final error: 0.00000000
Total growth: 0.00x
Max growth: 48.00x at layer 5
✅ Error is stable or decreasing
==================================================
TEST 4: Attention Stack (12 layers - like BERT)
==================================================
🏗️ Building 12-layer attention stack for tch vs metal
Layer 1 (Self-Attention + LayerNorm): max_diff = 0.00000036
Layer 2 (Self-Attention + LayerNorm): max_diff = 0.00000107
Layer 3 (Self-Attention + LayerNorm): max_diff = 0.00000125
Layer 4 (Self-Attention + LayerNorm): max_diff = 0.00000161
Layer 5 (Self-Attention + LayerNorm): max_diff = 0.00000158
Layer 6 (Self-Attention + LayerNorm): max_diff = 0.00000153
Layer 7 (Self-Attention + LayerNorm): max_diff = 0.00000150
Layer 8 (Self-Attention + LayerNorm): max_diff = 0.00000137
Layer 9 (Self-Attention + LayerNorm): max_diff = 0.00000137
Layer 10 (Self-Attention + LayerNorm): max_diff = 0.00000136
Layer 11 (Self-Attention + LayerNorm): max_diff = 0.00000134
Layer 12 (Self-Attention + LayerNorm): max_diff = 0.00000133
🏗️ Building 12-layer attention stack for tch vs ndarray
Layer 1 (Self-Attention + LayerNorm): max_diff = 0.00000024
Layer 2 (Self-Attention + LayerNorm): max_diff = 0.00000095
Layer 3 (Self-Attention + LayerNorm): max_diff = 0.00000119
Layer 4 (Self-Attention + LayerNorm): max_diff = 0.00000149
Layer 5 (Self-Attention + LayerNorm): max_diff = 0.00000176
Layer 6 (Self-Attention + LayerNorm): max_diff = 0.00000173
Layer 7 (Self-Attention + LayerNorm): max_diff = 0.00000209
Layer 8 (Self-Attention + LayerNorm): max_diff = 0.00000221
Layer 9 (Self-Attention + LayerNorm): max_diff = 0.00000262
Layer 10 (Self-Attention + LayerNorm): max_diff = 0.00000324
Layer 11 (Self-Attention + LayerNorm): max_diff = 0.00000316
Layer 12 (Self-Attention + LayerNorm): max_diff = 0.00000298
📈 Error Growth Analysis for Attention (Metal):
Initial error: 0.00000036
Final error: 0.00000133
Total growth: 3.71x
Max growth: 3.00x at layer 2
⚠️ WARNING: Exponential error growth detected! (avg: 1.12x per layer)
📈 Error Growth Analysis for Attention (Ndarray):
Initial error: 0.00000024
Final error: 0.00000298
Total growth: 12.50x
Max growth: 4.00x at layer 2
⚠️ WARNING: Exponential error growth detected! (avg: 1.23x per layer)
==================================================
FINAL SUMMARY
==================================================
Final errors after full depth:
Architecture | Layers | Metal Error | Ndarray Error | Status
----------------|--------|--------------|---------------|--------
CNN | 20 | 0.00023580 | 0.00000364 | ✅
ResNet | ~20 | 101711872.00000000 | 0.00000000 | ❌
MLP | 30 | 0.00000000 | 0.00000000 | ✅
Attention | 12 | 0.00000133 | 0.00000298 | ✅
⚠️ Error threshold for failure: 1.00%
🎯 Sensitivity Analysis:
Most error-prone: ResNet (max error: 101711872.00000000)
Most stable: MLP (max error: 0.00000000)
========================================
CC: @laggui @nathanielsimard @louisfd @wingertge
I submitted a PR to recreate the issue.