
Arabic Transcription

Open doit-ceo opened this issue 1 year ago • 4 comments

I followed all the steps to generate the tflite and bin files, and set the forced decoder IDs:

forced_decoder_ids = processor.get_decoder_prompt_ids(language="ar", task="transcribe")

Arabic text starts to show up, but about 50% of the letters are missing:

Mel spectrogram is calculated...!
2024-12-13 13:00:37.722 17057-17091 WhisperEngineJava       com.whispertflite                    D  output_len: 451
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50258, word: <|startoftranscript|>
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50272, word: <|ar|>
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  It is Transcription...
2024-12-13 13:00:37.723 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50359, word: <|transcribe|>
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Skipping token: 50363, word: <|notimestamps|>
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 21136, word: ĠاÙĦس
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 37440, word: ÙĦاÙħ
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 25894, word: ĠعÙĦÙĬ
2024-12-13 13:00:37.724 17057-17091 WhisperEngineJava       com.whispertflite                    D  Adding token: 24793, word: ÙĥÙħ
2024-12-13 13:00:37.725 17057-17091 WhisperEngineJava       com.whispertflite                    D  Inference is executed...!
2024-12-13 13:00:37.726 17057-17091 MainActivity            com.whispertflite                    D  Result: ?ا�?س�?ا�??ع�?�?�?�?

I asked ChatGPT about the problem and got this far, but I can't make any more progress. I don't think it's a Unicode issue; more likely the way the vocabulary file is handled ignores about 50% of the Arabic characters. I also tried using the files in Python, but I didn't manage to see any Arabic text at all.

doit-ceo avatar Dec 14 '24 14:12 doit-ceo

Can you try with base or small model?

vilassn avatar Dec 18 '24 04:12 vilassn

Try the version from this issue: https://github.com/woheller69/whisperIME/issues/52. I think I found the reason for the ignored characters.

woheller69 avatar Mar 14 '25 11:03 woheller69

There is an issue in the post-processing code of the whisper_java app, but this problem does not occur in whisper_native.

vilassn avatar Mar 14 '25 11:03 vilassn

In my Java app it is now fixed in the provided beta. The problem was that the tokens were treated as strings and these strings were concatenated. The solution is to treat the tokens as byte[], concatenate these, and create the result string only at the end. This fixes issues with Chinese and Korean, and I guess it also fixes this issue.

woheller69 avatar Mar 14 '25 11:03 woheller69
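
For readers who want to see the idea from the last comment in code, here is a minimal sketch. It is not the code from the whisperIME beta or from whisper_java. It assumes the vocabulary stores tokens in the GPT-2 style byte-level form seen in the log above (for example `Ġ` standing for a space), where each character of a token stands for one raw byte. The class name `ByteLevelDecoder` and the method `decode` are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of whisper_android or whisperIME.
final class ByteLevelDecoder {
    // Inverse of the GPT-2 style byte-to-unicode table: vocabulary char -> raw byte value.
    private static final Map<Character, Integer> CHAR_TO_BYTE = buildCharToByte();

    // Bytes in these ranges are represented by the character with the same code point.
    private static boolean mapsToItself(int b) {
        return (b >= 0x21 && b <= 0x7E) || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
    }

    private static Map<Character, Integer> buildCharToByte() {
        Map<Character, Integer> map = new HashMap<>();
        int next = 0; // remaining bytes are assigned U+0100, U+0101, ... in ascending byte order
        for (int b = 0; b < 256; b++) {
            if (mapsToItself(b)) {
                map.put((char) b, b);
            } else {
                map.put((char) (256 + next), b);
                next++;
            }
        }
        return map;
    }

    // Collect the raw bytes of ALL tokens first, then decode UTF-8 exactly once.
    static String decode(List<String> tokens) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (String token : tokens) {
            for (int i = 0; i < token.length(); i++) {
                Integer b = CHAR_TO_BYTE.get(token.charAt(i));
                if (b == null) {
                    throw new IllegalArgumentException("Not a byte-level token: " + token);
                }
                bytes.write(b);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The four tokens from the log (21136, 37440, 25894, 24793), written with
        // Unicode escapes so the source file encoding does not matter.
        List<String> tokens = Arrays.asList(
                "\u0120\u00D8\u00A7\u00D9\u0126\u00D8\u00B3",
                "\u00D9\u0126\u00D8\u00A7\u00D9\u0127",
                "\u0120\u00D8\u00B9\u00D9\u0126\u00D9\u012C",
                "\u00D9\u0125\u00D9\u0127");
        System.out.println(decode(tokens));
    }
}
```

Under that assumption, `decode` returns the Arabic greeting with a leading space for the four tokens in the log. Decoding once at the end also means a character whose UTF-8 bytes are split across two tokens is still reconstructed correctly, which is the case that per-token string concatenation cannot handle.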