Question about Tensor Input Size Changes in Version 1.0.0
Hello developers. I appreciate all your efforts to improve this software.
Now, I noticed that the transcription behavior has changed a lot in version 1.0.0.
I found that the size of the tensor input to the model is different. In other words, the encode output differs from the previous version, so the result of generate also differs. This may be affecting the quality of the transcription.
The following code from openai's Whisper shows that the last dimension of mel_segment is padded to N_FRAMES: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L276
Therefore, I wonder whether the same processing as the pad_or_trim function is needed in this repository.
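For reference, here is a minimal sketch of the padding I have in mind, written as a standalone NumPy function. This is only my rough understanding of what openai/whisper's pad_or_trim does; N_FRAMES = 3000 is the number of mel frames in a 30-second window.

```python
import numpy as np

N_FRAMES = 3000  # mel frames in a 30-second window (10 ms hop length)

def pad_or_trim(array: np.ndarray, length: int = N_FRAMES, axis: int = -1) -> np.ndarray:
    """Zero-pad or trim the given axis so it has exactly `length` elements."""
    if array.shape[axis] > length:
        # Trim: keep only the first `length` frames along `axis`.
        array = array.take(indices=range(length), axis=axis)
    elif array.shape[axis] < length:
        # Pad: append zeros at the end of `axis` up to `length` frames.
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - array.shape[axis])
        array = np.pad(array, pad_widths)
    return array

# Example: a mel segment of shape (n_mels, n_frames) shorter than 30 seconds
mel_segment = np.random.randn(80, 1200).astype(np.float32)
padded = pad_or_trim(mel_segment)  # -> shape (80, 3000)
```

In the linked code, this is applied to every mel_segment right before it is passed to the model, so the encoder always receives a fixed-size input.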
Note: The environment I checked is as follows.
OS: Windows 10
Python: 3.9.13
@kale4eat, hello. Can you check https://github.com/SYSTRAN/faster-whisper/pull/705 for a quick fix? I will try to implement the pad_or_trim function. Thanks.
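Roughly, the idea would be to pad the last, shorter-than-30-seconds mel chunk before it is encoded. This is only a sketch under my assumptions; the names segment, features, and model.encode below are placeholders, not the actual faster-whisper call site.

```python
import numpy as np

N_FRAMES = 3000  # mel frames per 30-second window

def pad_segment(segment: np.ndarray, n_frames: int = N_FRAMES) -> np.ndarray:
    """Zero-pad the time axis of a (n_mels, n_frames) mel segment up to n_frames."""
    missing = n_frames - segment.shape[-1]
    if missing > 0:
        segment = np.pad(segment, [(0, 0), (0, missing)])
    return segment[..., :n_frames]

# Hypothetical call site: the last chunk of `features` can be shorter than
# N_FRAMES, so pad it before encoding so the encoder always sees a fixed size.
# segment = features[:, seek : seek + N_FRAMES]
# encoder_output = model.encode(pad_segment(segment))
```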
> Now, I noticed that the transcription behavior has changed a lot in version 1.0.0.
I noticed it happening only in the last chunk. [I tested with the old PyAV and CTranslate2.] Maybe your observed differences come from the new CTranslate2 or the new PyAV (read there about PyAV).
> The following code from openai's Whisper shows that the last dimension of mel_segment is padded to N_FRAMES: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L276
Not sure what it does, but it has been there since the word-level timestamps implementation; maybe it's not needed here.
Maybe you actually meant this line: https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/transcribe.py#L274. It was a change incorporated in 1.0.0 with the "clip_timestamps" option.
@kale4eat Actually, I forgot that the bugfixes in https://github.com/SYSTRAN/faster-whisper/commit/00efce1696c21310bbdfd58433adfc8d44c2edbc and https://github.com/SYSTRAN/faster-whisper/commit/ebcfd6b9646f5176fba8b7f3429d0de28a70192c were made after version 0.10.0.
So the differences can also come from these if you are comparing released versions of the repo.
@Purfview Thanks for the info. It seems the situation is more complicated than I had imagined. #705 would return inference results very close to those of the previous version. (In my case, I don't specify many options and I use short pre-processed audio files.) I'll keep a close eye on the impact of the other changes.