ONNX Model Fails to Run
Hi, I have exported an ESPnet model trained on my custom dataset using espnet_onnx. The model fails to work properly on some audios. Below is the error I am getting:
```
Non-zero status code returned while running Add node. Name:'/encoders/encoders.0/self_attn/Add' Status Message: /encoders/encoders.0/self_attn/Add: right operand cannot broadcast on dim 3 LeftShape: {1,8,171,171}, RightShape: {1,1,1,127}
```
Any idea what could be the issue here? I have run inference on 1500 audio clips and I get exactly the same error on around 400 of them.
Hi @rajeevbaalwan I would like to confirm some points:
- Would you tell me which encoder you use in your model?
- Did you observe any similarities between them?
Thanks @Masao-Someki for your reply. I used a simple Transformer encoder. I didn't get your question regarding similarity. Do you want to know the similarity between the error outputs, or something else?
@Masao-Someki I have also tried with a Conformer-encoder-based ASR model, but I am getting the same error:
```
2023-10-08 23:12:29.048358681 [E:onnxruntime:, sequential_executor.cc:339 Execute] Non-zero status code returned while running Add node. Name:'/encoders/encoders.0/self_attn/Add_5' Status Message: /encoders/encoders.0/self_attn/Add_5: right operand cannot broadcast on dim 3 LeftShape: {1,8,187,187}, RightShape: {1,1,1,127}
```
@rajeevbaalwan The node /encoders/encoders.0/self_attn/Add is part of the masking process. I think increasing max_seq_len will fix this issue:
```python
from espnet_onnx.export import ASRModelExport

tag_name = 'your model'
m = ASRModelExport()
# Add the following export config so the mask covers longer inputs
m.set_export_config(
    max_seq_len=5000,
)
m.export_from_pretrained(tag_name, quantize=False, optimize=False)
```
In the masking process, your input audio seems to have a frame length of 171, while the mask only has a frame length of 127; this mismatch causes the issue. The frame length is estimated during ONNX inference, but the maximum frame length is capped by the max_seq_len value, so increasing this value should fix the problem.
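To illustrate the mismatch, here is a minimal PyTorch sketch of the failing broadcast, with shapes taken from the error message above:

```python
import torch

# Attention scores for a 171-frame input have shape (1, 8, 171, 171),
# but the mask baked into the exported graph only covers 127 frames,
# so the elementwise add cannot broadcast on the last dimension --
# the same failure onnxruntime reports.
scores = torch.zeros(1, 8, 171, 171)
mask = torch.zeros(1, 1, 1, 127)
try:
    scores + mask
except RuntimeError as e:
    print(e)  # sizes 171 and 127 do not match at dimension 3

# With max_seq_len large enough, the mask spans the full input and the
# broadcast succeeds: (1, 1, 1, 171) against (1, 8, 171, 171).
mask_ok = torch.zeros(1, 1, 1, 171)
out = scores + mask_ok
```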
@Masao-Someki Thanks, it worked for me. But the exported ONNX models do not work with batched input, right? They only work on a single audio clip.
@rajeevbaalwan Yes, it does not work with batched input.
If you want to run batched inference, you need to:
- Add dynamic axes for the batch dimension in the script below (a sketch follows the link).
- Fix the inference function.
https://github.com/espnet/espnet_onnx/blob/7cd0f78ed56b1243005aca671a78e620883bb989/espnet_onnx/export/asr/models/encoders/transformer.py#L105-L106
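For reference, a hypothetical sketch of the first change, assuming the usual torch.onnx.export-style dynamic_axes dict; the tensor names here are illustrative, so adapt them to the names used in the linked transformer.py:

```python
# Hypothetical sketch: mark dim 0 (batch) as dynamic in addition to the
# time dimension. The names 'feats', 'encoder_out', and 'encoder_out_lens'
# are placeholders; use the input/output names from the linked export code.
dynamic_axes = {
    'feats': {0: 'batch', 1: 'feats_length'},
    'encoder_out': {0: 'batch', 1: 'enc_out_length'},
    'encoder_out_lens': {0: 'batch'},
}
```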
@Masao-Someki Thanks for the reply. I have already added the dynamic axes, but that alone won't solve the problem: the forward function only takes feats, not the actual lengths of the inputs in the batch. That's why enc_out_length is always wrong for batched input, because the feature length is calculated as below:
```python
feats_length = torch.ones(feats[:, :, 0].shape).sum(dim=-1).type(torch.long)
```
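For example (a small runnable illustration; the shapes are made up): with zero-padded batched input, this formula returns the padded length for every utterance rather than the true lengths:

```python
import torch

# Two utterances with true lengths 5 and 3, zero-padded to 5 frames, 80 mels.
feats = torch.randn(2, 5, 80)
feats[1, 3:] = 0.0  # padding for the shorter utterance

# The exported forward derives lengths from the padded tensor itself,
# so both items get the padded length:
feats_length = torch.ones(feats[:, :, 0].shape).sum(dim=-1).type(torch.long)
print(feats_length)  # tensor([5, 5]) -- wrong for the second utterance

# Batched inference would instead need the true lengths passed in as an
# explicit model input:
true_lengths = torch.tensor([5, 3])
```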
Is there any plan to handle batch inference during ONNX export in espnet_onnx? The complete inference function needs to be changed. If espnet_onnx is meant to prepare models for production, then batch-inference support in the exported models is a must; single-clip inference won't help in production.
@rajeevbaalwan Sorry for the inconvenience, but currently we have no plan to support batch inference. We investigated the speedup from batched inference in our paper, by trying to apply an ONNX HuBERT model for training, but ONNX seems to be less effective at large batch sizes.
@Masao-Someki You are absolutely right that ONNX export does not give a huge speedup at large batch sizes, but at small batch sizes like 4 or 8 it is better than single-clip inference. So it would be better to have a GPU-based implementation: it would be generic, working for both single and multiple clips, and would give users flexibility. Even a batched implementation doesn't degrade performance for single-clip inference. Can you take this feature into consideration?
@Masao-Someki Is ESPnetLanguageModel supported in ONNX?
@rajeevbaalwan I assume that the typical user of this library is an individual who wants to execute an ESPnet model on a low-resource device, such as a Raspberry Pi. If inference in the ONNX format does not provide enough speedup, then we don't need ESPnet-ONNX; we can just use a GPU. Of course, I know that having a multi-batch inference option may be better, but I don't think it is worth implementing here.
> Is ESPnetLanguageModel supported in ONNX?
Yes, you can include an external language model.
@Masao-Someki I can't find the code to export the language model to ONNX in the repo.
@rajeevbaalwan espnet_onnx has an export function for language models in the following lines: https://github.com/espnet/espnet_onnx/blob/d617487a12e186f5240a74121f88af328fef2f02/espnet_onnx/export/asr/export_asr.py#L113-L126