[runtime] Configurable blank token idx
Following https://github.com/wenet-e2e/wenet/pull/2320, which makes the feature extraction pipeline compatible with Whisper, we realized the blank token idx is hard-coded (https://github.com/wenet-e2e/wenet/issues/2329), resulting in strange decoding results in the previous PR.
This PR aims to fix that. There may be some places I missed; could you folks help me do another pass?
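To illustrate the issue, here is a minimal sketch of why a hard-coded blank breaks CTC decoding when the vocabulary changes. This is not the runtime's actual C++ code; the function name and the example token ids (including the large Whisper-style blank id) are hypothetical, chosen only to show that the blank must be passed in rather than assumed to be 0.

```python
from typing import List

def ctc_greedy_decode(token_ids: List[int], blank_id: int = 0) -> List[int]:
    """Collapse consecutive repeats, then drop blanks.

    If blank_id is hard-coded to 0 but the model (e.g. a Whisper-derived
    vocab) uses a different id for blank, blanks leak into the output and
    real tokens equal to 0 get dropped -- the "weird decoding results".
    """
    out: List[int] = []
    prev = None
    for t in token_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# With the conventional blank id 0:
print(ctc_greedy_decode([0, 3, 3, 0, 4], blank_id=0))          # [3, 4]
# Same logic, but the blank id must be configurable for other vocabs
# (50256 here is a made-up stand-in for a Whisper-style blank):
print(ctc_greedy_decode([50256, 7, 7, 50256, 9], blank_id=50256))  # [7, 9]
```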
Current RTF is 2.xx after warm-up, using one core with a beam size of 8 on my local MacBook. I think the model is probably capable of running in real time with AVX-512 on multiple cores, even with Whisper large, so let's get it to work!
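For anyone reproducing the benchmark, RTF here means processing time divided by audio duration, so values below 1.0 are faster than real time. A trivial sketch (the function name is mine, not part of the runtime):

```python
def real_time_factor(process_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock processing time / audio duration.

    RTF < 1.0 means the decoder keeps up with real time;
    RTF = 2.xx means it currently takes ~2x the audio length to decode.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return process_seconds / audio_seconds

print(real_time_factor(21.0, 10.0))  # 2.1 -- slower than real time
```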
However, there are still some loose ends. Maybe you already know the answers and can share some insights:
- The decoder's main result still differs slightly from transcribe.py imitating streaming inference via attention masks. I don't know exactly why; maybe some default parameters are different?
- To create the decoding TLG graph, some tools might need to be updated to support a flexible blank token id. I haven't checked this yet, but it is not a high priority for us right now.
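On the first loose end: imitating streaming with attention masks usually means restricting each frame to attend only to its own chunk and earlier chunks. A minimal sketch of such a mask, assuming a fixed chunk size and no left-context limit (both simplifications; the real runtime's masking may differ and could be one source of the mismatch):

```python
from typing import List

def chunk_attention_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """mask[i][j] is True if frame i may attend to frame j.

    Each frame sees everything up to the end of its own chunk, which is
    how full-attention training can be reused to imitate chunk-wise
    streaming inference.
    """
    mask = []
    for i in range(num_frames):
        chunk_end = (i // chunk_size + 1) * chunk_size  # exclusive end of i's chunk
        mask.append([j < chunk_end for j in range(num_frames)])
    return mask

# 4 frames, chunk size 2: frames 0-1 see only chunk 0, frames 2-3 see everything.
for row in chunk_attention_mask(4, 2):
    print(row)
```

If the Python and C++ paths build this mask with different chunk boundaries or left-context defaults, slightly different results would be expected.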