
Bug with Gather Node

Open akshitgaur2005 opened this issue 8 months ago • 3 comments

I am trying to import Depth-Anything-v2 into Burn through ONNX, but burn-import fails on a particular Gather node.

  DEBUG onnx_ir::from_onnx: renaming node "/blocks.0/attn/Gather_3"    
  ERROR burn_import::logger: PANIC => panicked at /home/akshit/storage/projects/burn/crates/onnx-ir/src/rank_inference.rs:884:31:
  attempt to subtract with overflow   

To Reproduce

  1. Download this model - https://github.com/fabio-sim/Depth-Anything-ONNX/releases/download/v2.0.0/depth_anything_v2_vits_dynamic.onnx
  2. Try to import it in Burn

The error occurs at crates/onnx-ir/src/rank_inference.rs, line 884:

 let output_rank = indices_rank + input_tensor.rank - 1;

The inputs to /blocks.0/attn/Gather_3 are:

Axis: 0

Data:
name: /blocks.0/attn/Transpose_output_0 tensor: float32[3,batch_size,6,floor(height/14)*floor(width/14) + 1,64]

Indices: name: /Constant_5_output_0 category: Initializer tensor: int64 0

akshitgaur2005 avatar Mar 19 '25 17:03 akshitgaur2005

The input_tensor.rank is incorrectly inferred as 0 here. This happens because of a preceding Reshape node.


Reshape takes the output shape as an input, but with the current state of onnx-ir we don't really capture adequate info. The shape field for almost all tensors is not populated (i.e., None). So when trying to infer the rank of the output for a Reshape operation, we need to know the number of elements in the shape input, but we probably don't have it.

And actually, the current implementation just checks for constant inputs. So even if the shape attribute was available, it is not propagated.

https://github.com/tracel-ai/burn/blob/5d16339e6f74b857391da2e44564de6764a07d1a/crates/onnx-ir/src/rank_inference.rs#L348-L368

The result: the output rank is set to 0, which is incorrect.
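
To illustrate why a rank of 0 ends in the panic above: the Gather output rank is `indices_rank + input_rank - 1`, and the indices here are a scalar (rank 0). Assuming the ranks are unsigned integers (`usize`), which the "attempt to subtract with overflow" message suggests, `0 + 0 - 1` underflows. A minimal sketch with checked arithmetic follows; `gather_output_rank` is a hypothetical helper for illustration, not the actual onnx-ir code.

```rust
/// Hypothetical helper (not part of onnx-ir) mirroring the Gather rank
/// formula `indices_rank + input_rank - 1` with checked arithmetic.
/// Returns `None` instead of panicking when the subtraction would underflow,
/// i.e. when both ranks are 0.
fn gather_output_rank(indices_rank: usize, input_rank: usize) -> Option<usize> {
    (indices_rank + input_rank).checked_sub(1)
}
```

With the rank-5 data tensor and scalar indices from this model, `gather_output_rank(0, 5)` gives `Some(4)`. With the wrongly inferred input rank, `gather_output_rank(0, 0)` gives `None`, which a caller could turn into a descriptive error rather than a panic.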

laggui avatar Apr 07 '25 15:04 laggui

OK, so if I understand correctly, this seems to be a deeper issue that would not be easily solved by a few patches.

Should I just get started with writing the pytorch import code then? Or try to solve it?

akshitgaur2005 avatar Apr 08 '25 14:04 akshitgaur2005

> Ohk, so if I understand correctly this seems to be deeper issue that would not be easily solved by a few patches.

Yeah this is somewhat of a limitation with the current IR. The shapes are almost always unnecessary, so they were left empty because only the rank is required for burn tensors. But the Reshape node is an exception that is not accounted for. So it breaks apart because of the previous assumption. The rank of the output tensor from a Reshape operation is determined by the number of elements in the 1D shape input (i.e., the size of that dimension).
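
A rough sketch of that rule, assuming a simplified, hypothetical view of the shape input (`ShapeInput` and `reshape_output_rank` are illustrative names, not onnx-ir types): the output rank can only be determined when either the constant values or the static shape of the 1D shape tensor is known.

```rust
/// Hypothetical, simplified view of the `shape` input of a Reshape node.
/// These fields are assumptions for illustration, not the onnx-ir types.
struct ShapeInput {
    /// Constant values, if the shape comes from an initializer or constant.
    value: Option<Vec<i64>>,
    /// Static shape of the 1D shape tensor itself (e.g. `[4]`), if known.
    static_shape: Option<Vec<usize>>,
}

/// Output rank of Reshape = number of elements in the 1D shape input.
/// Returns `None` when that count is unknown, rather than defaulting to 0.
fn reshape_output_rank(shape: &ShapeInput) -> Option<usize> {
    // Case 1: constant shape input, so the rank is the number of values.
    if let Some(values) = &shape.value {
        return Some(values.len());
    }
    // Case 2: dynamic shape input whose own 1D shape `[len]` is known.
    match shape.static_shape.as_deref() {
        Some([len]) => Some(*len),
        _ => None,
    }
}
```

Returning `None` in the unknown case makes the missing information explicit, so downstream nodes such as Gather do not receive a rank of 0 by accident.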

You might be able to fix the issue for this specific model by ensuring that the shape for that input is available up to this point. But this might require capturing shapes (not just rank) for previous operations. Might be manageable if you can narrow it down to a couple of nodes, but you can see how this doesn't scale 😅

That's a big reason why we'd like to rework how shapes are handled.

> Should I just get started with writing the pytorch import code then? Or try to solve it?

Totally up to you 🙂 two different approaches

laggui avatar Apr 08 '25 14:04 laggui

This has been fixed by https://github.com/tracel-ai/burn/pull/3381

antimora avatar Aug 18 '25 23:08 antimora