
Question about Tensor Input Size Changes in Version 1.0.0

Open kale4eat opened this issue 1 year ago • 4 comments

Hello developers. I appreciate all your efforts to improve this software.

Now, I noticed that the transcription behavior has changed a lot in version 1.0.0.

I found that the size of the tensor passed to the model has changed. As a result, the encoder output differs from before, and so does the result of generate. This may affect the quality of the transcription.

The following code from openai's Whisper shows that the last dimension of mel_segment is padded to be N_FRAMES. https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L276

So I wonder whether this repository needs the same processing as the pad_or_trim function.
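To make the question concrete, here is a minimal NumPy sketch of what a pad_or_trim-style step does to a short final chunk. It is modeled on my understanding of the openai/whisper helper, not taken from faster-whisper; the zero padding, the 80 mel bins, and the 1200-frame chunk are illustrative assumptions.

```python
import numpy as np

# 30 s of audio at 100 mel frames per second, the N_FRAMES value used by openai/whisper.
N_FRAMES = 3000


def pad_or_trim(array: np.ndarray, length: int = N_FRAMES, axis: int = -1) -> np.ndarray:
    """Zero-pad or truncate `array` along `axis` so it has exactly `length` entries.

    Illustrative re-implementation (assumption), not the library's own code.
    """
    current = array.shape[axis]
    if current > length:
        # Too long: keep only the first `length` entries along `axis`.
        index = [slice(None)] * array.ndim
        index[axis] = slice(0, length)
        return array[tuple(index)]
    if current < length:
        # Too short: append zeros at the end of `axis`.
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - current)
        return np.pad(array, pad_widths)
    return array


# Hypothetical final chunk: 80 mel bins, only 1200 frames (12 s) of audio left.
last_segment = np.random.rand(80, 1200).astype(np.float32)
padded = pad_or_trim(last_segment)
print(last_segment.shape, "->", padded.shape)
```

Without such a step, the last chunk reaches the encoder with fewer than N_FRAMES frames, which would be consistent with the different encoder output described above.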

Note: I checked this in the following environment. OS: Windows 10, Python: 3.9.13

kale4eat avatar Feb 24 '24 01:02 kale4eat

@kale4eat, hello. Can you check https://github.com/SYSTRAN/faster-whisper/pull/705 as a quick fix? I will try to implement the pad_or_trim function. Thanks.

trungkienbkhn avatar Feb 24 '24 02:02 trungkienbkhn

> Now, I noticed that the transcription behavior has changed a lot in version 1.0.0.

I noticed it happening only in the last chunk (I tested with the old PyAV and CTranslate2). Maybe the differences you observed come from the new CTranslate2 or the new PyAV. (read there about PyAV)

> The following code from openai's Whisper shows that the last dimension of mel_segment is padded to be N_FRAMES. https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L276

Not sure what it does, but it comes from the word-level timestamps implementation; maybe it's not needed here.

Maybe you actually meant this line: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L274 . That change was incorporated in 1.0.0 together with the "clip_timestamps" option.

Purfview avatar Feb 24 '24 02:02 Purfview

@kale4eat Actually, I forgot that https://github.com/SYSTRAN/faster-whisper/commit/00efce1696c21310bbdfd58433adfc8d44c2edbc & https://github.com/SYSTRAN/faster-whisper/commit/ebcfd6b9646f5176fba8b7f3429d0de28a70192c bugfixes were made after the 0.10.0 version.

So the differences can come from these too, if you are comparing released versions of the repo.

Purfview avatar Feb 24 '24 04:02 Purfview

@Purfview Thanks for the info. It seems the situation is more complicated than I had imagined. With #705, the inference results are close to those of the previous version. (In my case, I don't specify many options and I use short, pre-processed audio files.) I'll keep a close eye on the impact of the other changes.

kale4eat avatar Feb 24 '24 05:02 kale4eat