Kamil Akesbi comments

Results: 14 comments of Kamil Akesbi

AutomaticSpeechRecognition pipeline cannot predict WORD timestamps for Whisper models finetuned without timestamps prediction

Hi @Hubert-Bonisseur, thanks for sharing this issue!
- Remark 1 was solved in PR #30325.
- Regarding Remark 2: Long-form generation indeed requires timestamps to chunk the audios so this...

Evaluate trainer on Code-Switched Speech fails with "ValueError: Multiple languages detected when trying to predict the most likely target language for transcription."

Hi @sproocht, Thanks for sharing this error! It will be solved with PR #29688.

[WIP] - Using assistant in AutomaticSpeechRecognitionPipeline with different encoder size

I think this PR is ready to be merged! cc @amyeroberts @gante if you want to have a look ;)

Fix WhisperForConditionalGeneration to respect generation_config?

Hi @mizoru, Thanks for iterating on this! Could you please open an issue with a min reproducer of the error you get before making these changes?

Wrong calculation of the step size for the overlapping inference in the distill whisper model

Hi @systemdevart, Thank you for this question! Here, `stride_left` indicates the overlap between the current and left chunk when already considering that `stride_right` samples are not in the left chunk...
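
As a rough illustration of the striding described above, here is a minimal pure-Python sketch. It is not the library's code: the helper name `chunk_spans` and its return format are hypothetical. It also assumes that the step between chunk starts is `chunk_len - stride_left - stride_right`, which is a reading of the comment rather than something this page confirms. Under that assumption, two neighbouring chunks overlap by `stride_left + stride_right` samples, and the regions kept after trimming the strides tile the audio without gaps.

```python
def chunk_spans(n_samples, chunk_len, stride_left, stride_right):
    """Hypothetical helper: list (start, end, left, right) for each chunk.

    `left` and `right` are the numbers of samples to discard on each side
    of the chunk once it has been transcribed.
    """
    # Assumed step size: each chunk advances by the part that is actually kept.
    step = chunk_len - stride_left - stride_right
    if step <= 0:
        raise ValueError("chunk_len must exceed stride_left + stride_right")

    spans = []
    for start in range(0, n_samples, step):
        end = min(start + chunk_len, n_samples)
        is_last = start + chunk_len >= n_samples
        left = 0 if start == 0 else stride_left    # first chunk has no left context
        right = 0 if is_last else stride_right     # last chunk has no right context
        spans.append((start, end, left, right))
        if is_last:
            break
    return spans


spans = chunk_spans(n_samples=100, chunk_len=40, stride_left=5, stride_right=5)
kept = [(start + left, end - right) for start, end, left, right in spans]
print(spans)  # [(0, 40, 0, 5), (30, 70, 5, 5), (60, 100, 5, 0)]
print(kept)   # [(0, 35), (35, 65), (65, 100)]
```

In this toy run, consecutive chunks share 10 samples (`stride_left + stride_right`), while the kept regions meet exactly at samples 35 and 65.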

Whisper Word-level Timestamps broken on some inputs

It was indeed solved with #30325; I'm closing for now!

Whisper do_sample through generation_config and generate() give different results

Hi @udeepam, thanks for this issue and the clear reproducer! On the latest version of the main branch (`transformers 4.40.0.dev0`), I get the same results with and without `generation_config`,...

Fix WhisperForConditionalGeneration to respect generation_config?

It will be solved by PR #31296 :)

The whisper-large-v3 model randomly misses sentences during recognition when return_timestamps="word"

Hi @zxl777, thanks for this issue! The provided audio is longer than 30 seconds. In this case, you can choose to:
- Use batched inference by chunking the input audio...

Whisper - get probability of detected language

Hi @hanif-rt, this should be solved with PR #31572 :)