Bug in shape inference
I would like to remove the dependency on libonnxruntime, but I can't load the Silero speech-to-text model in tract. The model can be downloaded here and my Rust tool for transcribing audio/video files can be found here.
[2023-10-15T16:03:08.414068604Z ERROR tract] Error at stage analyse
Caused by:
0: Failed analyse for node #395 "Slice_152" StridedSlice
1: Failed analyse for node #395 "Slice_152" StridedSlice
2: Infering facts
3: Applying rule outputs[0].shape == batch,1+(samples+639)/640,512
4: Unifying shapes batch,1+(samples+639)/640,1536 and batch,1+(samples+639)/640,512
5: Impossible to unify Val(1536) with Val(512).
This model presents many challenges, notably tensors that vary in rank (i.e. number of axes).
┣┻ 279 If If_33
┃ [then] ┏ 2 Source 380
┃ [then] ┃ ━━━ batch,1,1,samples+320,F32
┃ [then] ┣┻ 1 Squeeze13 Squeeze_35
┃ [then] ━━━ batch,1,samples+320,F32
┃ [else] ┏ 1 Source 380
┃ [else] ┃ ━━━ batch,1,1,samples+320,F32
┃ [else] ┣ 0 Identity Identity_36
┃ [else] ━━━ batch,1,1,samples+320,F32
The then branch of the If is of rank 3, the else branch is of rank 4. tract (wrongly) assumes the If output is of rank 3, but this may not always be the case. This approach is relatively rare, but it is one of Silero's trademarks. Usually it is relatively easy to re-express the network to avoid varying tensor rank.
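To make the rank problem concrete, here is a minimal Rust sketch of shape unification. It is not tract's actual implementation: the `Dim` enum and the `unify` and `sym` helpers are hypothetical names made up for illustration, and symbolic dimensions are simply compared as strings. It only shows why two branches of different rank cannot share one output shape, and why a 1536 vs 512 mismatch on one axis is reported as an error.

```rust
// Hypothetical illustration only: none of these names come from tract.

/// One axis of a shape: a known size or a symbolic expression kept as text.
#[derive(Clone, Debug, PartialEq)]
enum Dim {
    Val(usize),
    Sym(String),
}

/// Shorthand for building a symbolic dimension.
fn sym(s: &str) -> Dim {
    Dim::Sym(s.to_string())
}

/// Try to unify two shapes. Fails if the ranks differ or if any pair of
/// dimensions differs (no symbolic simplification is attempted here).
fn unify(a: &[Dim], b: &[Dim]) -> Result<Vec<Dim>, String> {
    if a.len() != b.len() {
        return Err(format!("rank mismatch: {} vs {}", a.len(), b.len()));
    }
    a.iter()
        .zip(b)
        .map(|(x, y)| {
            if x == y {
                Ok(x.clone())
            } else {
                Err(format!("Impossible to unify {:?} with {:?}", x, y))
            }
        })
        .collect()
}

fn main() {
    // The two branches of If_33, as shown in the dump above.
    let then_branch = vec![sym("batch"), Dim::Val(1), sym("samples+320")];
    let else_branch = vec![sym("batch"), Dim::Val(1), Dim::Val(1), sym("samples+320")];
    println!("{:?}", unify(&then_branch, &else_branch));

    // The two shapes from the Slice_152 error message.
    let inferred = vec![sym("batch"), sym("1+(samples+639)/640"), Dim::Val(1536)];
    let expected = vec![sym("batch"), sym("1+(samples+639)/640"), Dim::Val(512)];
    println!("{:?}", unify(&inferred, &expected));
}
```

Both calls in `main` return an `Err`: the first because the ranks are 3 and 4, the second because the last axis differs. Once an analyser has settled on one rank for the If output, an incorrect assumption there can surface later as a dimension mismatch like the second case.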
The issue you've encountered is downstream of this If, but I'm not sure what the root cause is. This model is huge and I have no expertise on it. For a small network, I can often figure out myself where the model starts misbehaving, but this is a very big one. Can you look at the tract dump output and try to pinpoint where tract gets the shapes right and where it starts misbehaving?