whisper_android Arabic Transcription

I followed all the steps to generate the tflite and bin files, and set the forced decoder IDs:

forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")

Arabic text started to show up, but with about 50% of the letters missing:

Mel spectrogram is calculated...!
2024-12-13 13:00:37.722 17057-17091 WhisperEngineJava       com.whispertflite                    D  output_len: 451
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50258, word: <|startoftranscript|>
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50272, word: <|ar|>
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  It is Transcription...
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50359, word: <|transcribe|>
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50363, word: <|notimestamps|>
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 21136, word: ĠØ§ÙĦØ³
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 37440, word: ÙĦØ§Ùħ
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 25894, word: ĠØ¹ÙĦÙĬ
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 24793, word: ÙĥÙħ
2024-12-13 13:00:37.725 17057-17091 WhisperEngineJava       com.whispertflite                    D  Inference is executed...!
2024-12-13 13:00:37.726 17057-17091 MainActivity            com.whispertflite                    D  Result: ?ا�?س�?ا�??ع�?�?�?�?

I worked through the problem with ChatGPT and got to this point, but I can't make any more progress. I don't think it's a Unicode issue; more likely the way the vocabulary file is handled drops about 50% of the Arabic characters. I also tried using the files in Python, but I didn't manage to see any Arabic text at all.

Dec 14 '24 14:12 doit-ceo

Can you try with the base or small model?

Dec 18 '24 04:12 vilassn

https://github.com/woheller69/whisperIME/issues/52 Try the version from the issue here. I think I found the reason for the ignored characters

Mar 14 '25 11:03 woheller69

There is an issue in the post-processing code of the whisper_java app. This problem does not occur in whisper_native.

Mar 14 '25 11:03 vilassn

In my Java app it is fixed now in the provided beta. The problem was that the tokens were treated as strings and these strings were concatenated. The solution is to treat the tokens as byte[], concatenate those, and create the result string only at the end. This fixes issues with Chinese and Korean, and I guess it also fixes this issue.

Mar 14 '25 11:03 woheller69
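
For anyone landing here later, the byte-level approach described above can be sketched in Python. This is only an illustration, not the code from either app. It assumes the vocabulary stores tokens in the GPT-2 style byte-to-unicode form, which is what the "word:" values in the log above look like; the helper names (bytes_to_unicode, token_to_bytes, decode_tokens) are made up for this example.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping every byte value 0..255 to one printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes are moved to code points 256 and above.
            table[b] = chr(256 + shift)
            shift += 1
    return table


# Reverse table: vocabulary character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_bytes(token):
    """Turn one vocabulary string back into the raw bytes it stands for."""
    return bytes(UNICODE_TO_BYTE[ch] for ch in token)


def decode_tokens(tokens):
    """Concatenate the raw bytes of all tokens, then decode UTF-8 once at the end."""
    raw = b"".join(token_to_bytes(t) for t in tokens)
    return raw.decode("utf-8", errors="replace")


# Token strings copied from the log above (ids 21136, 37440, 25894, 24793).
tokens = ["ĠØ§ÙĦØ³", "ÙĦØ§Ùħ", "ĠØ¹ÙĦÙĬ", "ÙĥÙħ"]
print(decode_tokens(tokens).strip())
```

Under that assumption, the Arabic letters that went missing in the Result line are the ones whose second UTF-8 byte falls in the non-printable range (for example ل is D9 84), so the vocabulary stores them as characters above U+00FF such as Ħ. Converting per token to a string and concatenating is also unsafe when a multi-byte character is split across two tokens, which is why the bytes are joined first and decoded once.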