Arabic Transcription
I followed all the steps to generate the tflite and bin files, and included the forced decoder IDs:
forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")
Arabic starts to show up, but about 50% of the letters are missing:
Mel spectrogram is calculated...!
2024-12-13 13:00:37.722 17057-17091 WhisperEngineJava com.whispertflite D output_len: 451
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava com.whispertflite D Skipping token: 50258, word: <|startoftranscript|>
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava com.whispertflite D Skipping token: 50272, word: <|ar|>
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava com.whispertflite D It is Transcription...
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava com.whispertflite D Skipping token: 50359, word: <|transcribe|>
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava com.whispertflite D Skipping token: 50363, word: <|notimestamps|>
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava com.whispertflite D Adding token: 21136, word: ĠاÙĦس
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava com.whispertflite D Adding token: 37440, word: ÙĦاÙħ
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava com.whispertflite D Adding token: 25894, word: ĠعÙĦÙĬ
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava com.whispertflite D Adding token: 24793, word: ÙĥÙħ
2024-12-13 13:00:37.725 17057-17091 WhisperEngineJava com.whispertflite D Inference is executed...!
2024-12-13 13:00:37.726 17057-17091 MainActivity com.whispertflite D Result: ?ا�?س�?ا�??ع�?�?�?�?
I asked ChatGPT about the problem and got to this point, but I can't make any more progress. I don't think it is a Unicode issue; more likely the vocabulary file is ignoring about 50% of the Arabic characters. I also tried using the files in Python, but I didn't manage to see any Arabic text at all.
Can you try with the base or small model?
Try the version from this issue: https://github.com/woheller69/whisperIME/issues/52 I think I found the reason for the ignored characters.
There is an issue in the post-processing code of the whisper_java app, but the problem does not occur in whisper_native.
In my Java app it is now fixed in the provided beta. The problem was that the tokens were treated as strings and these strings were concatenated. The solution is to treat the tokens as byte[], concatenate those, and create the result string at the end. This fixes issues with Chinese and Korean, and I guess it also fixes this issue.
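For anyone who wants to check the idea outside the app, here is a rough Python sketch of the byte-level approach. It is not the actual Java code from the beta. It assumes the vocabulary file stores tokens with the GPT-2 style byte-to-unicode table, which the Ġ and Ù characters in the log above suggest. The function names (bytes_to_unicode, token_to_bytes, decode_tokens) are just illustrative, and the token strings are the four "Adding token" entries from the log, written as escapes so they survive copy and paste.

```python
def bytes_to_unicode():
    """GPT-2 style table: maps each of the 256 byte values to one printable character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(0xA1, 0xAC + 1))
          + list(range(0xAE, 0xFF + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points 256, 257, ...
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Reverse table: vocabulary character -> original byte value.
UNICODE_TO_BYTE = {c: b for b, c in bytes_to_unicode().items()}


def token_to_bytes(token):
    """Turn one vocabulary entry back into the raw bytes it stands for."""
    return bytes(UNICODE_TO_BYTE[ch] for ch in token)


def decode_tokens(tokens):
    """Concatenate the bytes of all tokens first, then decode UTF-8 once at the end."""
    raw = b"".join(token_to_bytes(t) for t in tokens)
    return raw.decode("utf-8", errors="replace")


# Tokens 21136, 37440, 25894, 24793 as they appear in the log.
LOGGED_TOKENS = [
    "\u0120\u00d8\u00a7\u00d9\u0126\u00d8\u00b3",
    "\u00d9\u0126\u00d8\u00a7\u00d9\u0127",
    "\u0120\u00d8\u00b9\u00d9\u0126\u00d9\u012c",
    "\u00d9\u0125\u00d9\u0127",
]

if __name__ == "__main__":
    print(decode_tokens(LOGGED_TOKENS))
```

With the four logged tokens this gives the expected greeting (with a leading space, since Ġ stands for the space byte). Decoding once at the end also matters when a single character is split across two tokens, which is probably why the same change helps with Chinese and Korean: each half on its own is not valid UTF-8, but the joined bytes are.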