#1823 whisper transcription
When finished, this will close #1823.
Already tested on CPU. I still need to test on GPU, test the remote service and verify Wav2Vec2 backwards compatibility.
I think this is finished. @marcus6n, I would very much appreciate it if you could test this on Monday. Thank you.
@lfcnassif Yes, I can test it!
@lfcnassif I've run the tests and everything is working properly.
I was waiting for this PR, thank you! I will test it with CUDA on GPU. I also found some audios that had encoding problems in the transcription; I'll test them too.
@lfcnassif, a suggestion: I had also suggested calculating the final score in Python with `numpy.average(probs)`. Note that `numpy.average` is a weighted average, but since no weights are passed as a parameter, it behaves the same as `numpy.mean`. Maybe `numpy.mean` is a little faster...
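A quick check of that equivalence (illustrative only):

```python
import numpy as np

probs = np.array([0.91, 0.85, 0.78, 0.96])

# With no weights passed, numpy.average reduces to the arithmetic mean,
# so both calls return the same final score.
assert np.average(probs) == np.mean(probs)
print(np.mean(probs))  # 0.875
```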
Another thing, does this PR also close issue #1335?
> I will test it with CUDA on GPU. I also found some audios that had encoding problems in the transcription; I'll test them too.
Hi @gfd2020! Additional tests will be very welcome!
> @lfcnassif, a suggestion: I had also suggested calculating the final score in Python with `numpy.average(probs)`. Note that `numpy.average` is a weighted average, but since no weights are passed as a parameter, it behaves the same as `numpy.mean`. Maybe `numpy.mean` is a little faster...
I took the final score computation from your previous code suggestion, thank you! Good to know, we can replace the function, but I think the time difference will not be noticeable.
> Another thing, does this PR also close issue https://github.com/sepinf-inc/IPED/issues/1335?
No, I'll keep it open, since I didn't finish all my planned tests. I'm integrating this because some users asked for it.

Beyond Whisper.cpp, which improved a lot in the last months and added full CUDA support, I also found WhisperX (which uses Faster-Whisper under the hood) and Insanely-Fast-Whisper. Those last 2 libs break long audios into 30s parts and execute batch inference on the audio segments simultaneously, resulting in up to a 10x speed up thanks to batching, at the cost of increased GPU memory usage. I did a quick test with them and they are really fast for long audios indeed! But their approach can decrease the final accuracy, since the default Whisper algorithm uses previously transcribed tokens to help transcribe the next ones, while AFAIK those libraries break the audio into parts and transcribe the 30s segments independently. As I haven't measured WER for those libraries yet, I'm concerned about integrating them.

If they could accept many different audios as input and transcribe them using batch inference, instead of breaking each audio into segments, that would be a safer approach. But that would require more work on our side: grouping audios of similar duration before transcription, deciding whether to wait to group audios, signaling the last audio, etc.
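For illustration only, a minimal sketch of that grouping idea (none of these helpers exist in IPED or in those libraries):

```python
# Hypothetical sketch: group audios of similar duration into batches
# before transcription, so batch inference wastes less padding.

def group_by_duration(audios, batch_size=16, max_gap_s=5.0):
    """audios: list of (path, duration_in_seconds) tuples."""
    batches, current = [], []
    for path, dur in sorted(audios, key=lambda a: a[1]):
        # Start a new batch if full or if durations drift too far apart.
        if current and (len(current) == batch_size
                        or dur - current[0][1] > max_gap_s):
            batches.append(current)
            current = []
        current.append((path, dur))
    if current:
        batches.append(current)
    return batches

batches = group_by_duration([("a.wav", 3.2), ("b.wav", 3.9), ("c.wav", 60.0)])
# -> [[("a.wav", 3.2), ("b.wav", 3.9)], [("c.wav", 60.0)]]
```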
Using float16 precision instead of int8 gave almost a 50% speed up on RTX 3090.
> Using float16 precision instead of int8 gave almost a 50% speed up on RTX 3090.
On CPU too?
> On CPU too?
Possibly not, I'll check and report back.
@gfd2020 thanks for asking about the effect of float16 on CPU. Actually it doesn't work on CPU at all; I just pushed a commit fixing it. About float32 vs int8 speed on CPU, testing with ~160 audios on a 48-thread CPU, medium Whisper model:
- float32 took 1287s
- int8 took 1134s
Speed numbers of other implementations over a single 442s audio using 1 RTX 3090, medium model, float16 precision (except for Whisper.cpp, where the precision couldn't be set):
- Faster-Whisper took ~36s
- Whisper.cpp took ~31s
- Insanely-Fast-Whisper took ~7s
- WhisperX took ~5s
Running over the dataset of ~160 short real-world audios above (total duration of 2758s):
- Faster-Whisper took 220s
- Whisper.cpp took 185s
- Insanely-Fast-Whisper took 358s
- WhisperX took 171s
PS: Whisper.cpp seems to parallelize better than the others when using multiple processes, so its last number could be improved.

PS2: For inference on CPU, Whisper.cpp is faster than Faster-Whisper by ~35%; not sure if I will time all of them on CPU...

PS3: Using the large-v3 model within Whisper.cpp produced hallucinations (repeated texts and a few non-existing texts); this was also observed with Faster-Whisper, to a lesser extent.
Hi, @lfcnassif
I don't have a very powerful GPU, but it has tensor cores, and the following error occurred: "Requested float16 compute type, but the target device or backend does not support efficient float16 computation."
So I changed it to float32 and it gave the following error: "CUDA failed with error out of memory".
Finally, I changed it to int8 and it worked fine on the GPU.
So, I have two suggestions:
- Print an error message if it falls back to computing on the CPU (a rough sketch follows below).
- Leave int8 as the default and expose the compute type as a parameter in audiotranscripttask.txt.
I'm still doing other tests.
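A minimal sketch of that fallback idea, assuming Faster-Whisper's `WhisperModel` API (the caught exception type and message are assumptions based on the errors above):

```python
# Minimal sketch, not IPED's actual code: try the requested precision
# and fall back to int8, telling the user why.
from faster_whisper import WhisperModel

def load_model_with_fallback(model_name, device, compute_type="float16"):
    try:
        return WhisperModel(model_name, device=device, compute_type=compute_type)
    except ValueError as e:
        # e.g. "Requested float16 compute type, but the target device or
        # backend does not support efficient float16 computation."
        print(f"compute_type={compute_type} not supported ({e}); falling back to int8")
        return WhisperModel(model_name, device=device, compute_type="int8")
```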
> So, I have two suggestions:
>
> - Print an error message if it falls back to computing on the CPU.
> - Leave int8 as the default and expose the compute type as a parameter in audiotranscripttask.txt.
Thanks for testing @gfd2020! Both are good suggestions, and I was already planning to externalize the compute_type (precision) parameter, and also the batch_size if we switch to WhisperX. I'm running accuracy tests and should post the results soon. About float16 not being supported, what is your installed CUDA Toolkit version?
> Thanks for testing @gfd2020! Both are good suggestions, and I was already planning to externalize the compute_type (precision) parameter, and also the batch_size if we switch to WhisperX. I'm running accuracy tests and should post the results soon. About float16 not being supported, what is your installed CUDA Toolkit version?
NVIDIA CUDA 11.7.99 driver on a Quadro P620, with torch==1.12.1+cu116, torchvision==0.13.1+cu116, torchaudio==0.12.1.
This was the only version I managed to make work on these weaker GPUs (Quadro P620 and T400).
I don't know if it's a proxy failure, but I couldn't download the 'medium' model: it starts to download and fails at around 150 MB. So I'm using the medium model created in https://github.com/sepinf-inc/IPED/issues/1335#issuecomment-1645622285. I placed the model in the IPED models folder and changed the whisperModel variable to 'models/models--dwhoelz--whisper-medium-pt-ct2'.
> I don't know if it's a proxy failure, but I couldn't download the 'medium' model: it starts to download and fails at around 150 MB.
Seems like a network issue. You can try cleaning the local cache; AFAIK it is located in user_home/.cache
> I'm using the medium model created in #1335 (comment).
Based on my tests, if possible, I would suggest the default medium model over that one. That one was significantly better on Common Voice, but it was fine-tuned on it and worse on other datasets, so possibly there is a bias here. It also returns inconsistent results for numbers, sometimes Arabic numerals and sometimes numbers written out as text, while default Whisper always returns Arabic numerals; that seems like a fine-tuning issue to me. And fine-tuning, if not properly done, can make the model's generalization to unseen audios worse.
PS: Jonatas Grosman's fine-tuning, for example, always returns numbers written out as text. It was fine-tuned on Common Voice, but results improved on many other datasets.
I finished my tests here and everything works great, including remote transcription.
I was only able to download the medium model directly from the website. I compared each one manually and each has advantages and disadvantages.
@lfcnassif, will this PR be part of version 4.2?
Thank you @gfd2020 for testing!
> @lfcnassif, will this PR be part of version 4.2?
For sure. And also #1341.
Just updated the code to use WhisperX, since it is much, much faster for long audios when using a GPU and its accuracy is very similar to Faster-Whisper's (see #1335). It is also faster than Faster-Whisper on CPU, because it uses a VAD filter to ignore audio parts without human speech.
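For reference, basic WhisperX usage looks roughly like this (a sketch based on its README; the model, device, and batch size values are just examples):

```python
import whisperx

# Load the model on GPU; batch_size controls how many 30s segments
# are transcribed simultaneously (higher = faster, more VRAM).
model = whisperx.load_model("medium", device="cuda", compute_type="float16")

audio = whisperx.load_audio("audio.wav")
result = model.transcribe(audio, batch_size=16)
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```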
But to make it return probability scores, I had to apply this pending PR to WhisperX: https://github.com/m-bain/whisperX/pull/413
To use the patched version, install our WhisperX fork with the command below inside IPED's embedded Python:

```
pip install git+https://github.com/sepinf-inc/whisperx.git@confidence_score
```
You must also put FFmpeg on the PATH and install gputil inside IPED's Python:

```
pip install gputil
```
Before merging this, I'll externalize the `compute_type` and `batch_size` params to the config file.
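For illustration, reading those params on the Python side could look like the sketch below (the file name, keys, and defaults are assumptions, not the final implementation):

```python
# Hypothetical sketch: parse compute_type and batch_size from a
# key=value config file; the names here are not IPED's final ones.
def read_transcript_config(path="conf/AudioTranscriptConfig.txt"):
    params = {"compute_type": "int8", "batch_size": 16}  # assumed defaults
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            params[key.strip()] = value.strip()
    params["batch_size"] = int(params["batch_size"])
    return params
```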
In the future, we should patch WhisperX even more to transcribe many audios at the same time using batches (see #1539).
I just finished my planned changes and pushed a new Python package that includes the docopt-0.6.2 lib, which was causing issues with the WhisperX installation; now it should be easier.
Since I made several changes, I would really appreciate it if you @marcus6n and @gfd2020 could test this again before merging. Thanks in advance!
> Since I made several changes, I would really appreciate it if you @marcus6n and @gfd2020 could test this again before merging. Thanks in advance!
My tests will take a while because I'm having trouble installing WhisperX on GPU. On CPU it works. The torch version must be greater than 2.0; I'm working around it, but it will take a while.
I think some additional tips could be added to the wiki for installing on the GPU. I believe that for an average user it will be very difficult to install...
> My tests will take a while because I'm having trouble installing WhisperX on GPU. On CPU it works. The torch version must be greater than 2.0; I'm working around it, but it will take a while.
>
> I think some additional tips could be added to the wiki for installing on the GPU. I believe that for an average user it will be very difficult to install...
Thanks @gfd2020!
The only issue I had was with the docopt dependency not installing into IPED's embedded Python. Did you face it before I included it in the package?
The GPU steps were the same as I did for faster-whisper: after installing whisperx, I just overwrote PyTorch with one of the commands at https://pytorch.org/get-started/locally/, because I already had the CUDA Toolkit and cuDNN installed and set on PATH.
Is your issue related to some incompatibility between Torch 2 and your GPU card?
> The only issue I had was with the docopt dependency not installing into IPED's embedded Python. Did you face it before I included it in the package?
Yes. I was only able to install it with the Python from this PR; with the Python that comes with that separate package, I couldn't.
> The GPU steps were the same as I did for faster-whisper: after installing whisperx, I just overwrote PyTorch with one of the commands at https://pytorch.org/get-started/locally/, because I already had the CUDA Toolkit and cuDNN installed and set on PATH.
I have to do more tests here. I'll try to install it with this link you sent me.
> Is your issue related to some incompatibility between Torch 2 and your GPU card?
I managed to get it to work with torch 2.0, even with an old GPU. WhisperX works OK after fixing the first error below.
Now I want to report some things that came up.
- It seems that ffmpeg is a prerequisite for WhisperX to work. I ran it on a computer without ffmpeg on the PATH and it gave the following error (the Portuguese message means "The system cannot find the file specified"):

```
[ERROR] [task.transcript.AbstractTranscriptTask] Unexpected exception while transcribing: audios/9.wav
java.lang.RuntimeException: Transcription failed, returned: FileNotFoundError(2, 'O sistema não pode encontrar o arquivo especificado', None, 2, None)
    at iped.engine.task.transcript.Wav2Vec2TranscriptTask.transcribeWavPart(Wav2Vec2TranscriptTask.java:271) ~[iped-engine-4.2-snapshot.jar:?]
```
I found this link with a tip, and it really worked after putting ffmpeg on the PATH: https://stackoverflow.com/questions/73845566/openai-whisper-filenotfounderror-winerror-2-the-system-cannot-find-the-file
- These two warnings appeared with the medium model:

```
[WARN] [task.transcript.WhisperTranscriptTask$1] Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
[WARN] [task.transcript.WhisperTranscriptTask$1] Model was trained with torch 1.10.0+cu102, yours is 2.3.0+cu118. Bad things might happen unless you revert torch to 1.x.
```
> It seems that ffmpeg is a prerequisite for WhisperX to work. I ran it on a computer without ffmpeg on the PATH and it gave the following error: [...]
That's sad news. I removed FFmpeg from the PATH and just reproduced it. Yesterday I tested both faster-whisper and WhisperX in a VM with a fresh Windows 10 install and, strangely, that error didn't happen; the only dependency needed was the MS Visual C++ Redistributable 2015-2019 package.
> These two warnings appeared with the medium model: [...]
Those warnings also happen here. But all tests done on #1335 were with those warnings present.
> That's sad news. I removed FFmpeg from the PATH and just reproduced it. Yesterday I tested both faster-whisper and WhisperX in a VM with a fresh Windows 10 install and, strangely, that error didn't happen; the only dependency needed was the MS Visual C++ Redistributable 2015-2019 package.
Couldn't you put ffmpeg.exe in the IPED tools folder? Is the problem putting it on the PATH?
> Those warnings also happen here. But all tests done on #1335 were with those warnings present.
Ok, just to let you know about them.
> Couldn't you put ffmpeg.exe in the IPED tools folder? Is the problem putting it on the PATH?
It's possible, but on #1267 @wladimirleite did a good job removing ffmpeg as a dependency, since we already use mplayer for video-related stuff...
> Ok, just to let you know about them.
Thanks!
> Yesterday I tested both faster-whisper and WhisperX in a VM with a fresh Windows 10 install and, strangely, that error didn't happen
My fault: I tested again in the VM and WhisperX does return an error without FFmpeg. I just added an explicit check and a better error message for the user if it is not found.
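Such a check can be as simple as the sketch below (illustrative; not necessarily the exact code added):

```python
# Minimal sketch of an explicit FFmpeg availability check.
import shutil
import sys

if shutil.which("ffmpeg") is None:
    sys.exit("FFmpeg not found on the PATH. WhisperX needs FFmpeg to decode "
             "audio; please install it and add it to the PATH.")
```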
> My fault: I tested again in the VM and WhisperX does return an error without FFmpeg. I just added an explicit check and a better error message for the user if it is not found.
Is there no way to modify the Python code to search for ffmpeg in a relative path within IPED?
> Is there no way to modify the Python code to search for ffmpeg in a relative path within IPED?
We can set the PATH env var of the main IPED process from the startup process and point it to an embedded ffmpeg. But I'm not sure we should embed ffmpeg; actually, I'm thinking about offering both faster-whisper and whisperx, as suggested by @rafael844, because faster-whisper doesn't have the ffmpeg dependency, while whisperx has many dependencies that may cause conflicts with other modules (now or in the future).
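For illustration, prepending a hypothetical embedded FFmpeg folder to the PATH from Python would look like this (the tools path is an assumption, not IPED's actual layout):

```python
# Hypothetical sketch: point the current process (and its children)
# to an FFmpeg binary shipped inside IPED's tools folder.
import os

ffmpeg_dir = os.path.join("tools", "ffmpeg")  # hypothetical location
os.environ["PATH"] = ffmpeg_dir + os.pathsep + os.environ.get("PATH", "")
```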
Can I write a small step-by-step guide to install the requirements on the GPU? I did some tests here and everything worked.
I had to make some modifications to the code to be able to use it in an environment without an internet connection and point it to a local model. So the modelName parameter now accepts a model name, a relative path (inside the IPED folder), or an absolute path.
Examples:

```
whisperModel = medium
whisperModel = models/my_model
whisperModel = C:/my_model
```
```python
# Resolve modelName as an absolute path, a path relative to the IPED
# folder, or a plain model name to be downloaded.
# (This replaces the model loading block; the original try/except
# wrapper around it is omitted here.)
import os

localModel = False
localPath = os.path.join(os.getcwd(), modelName)
if os.path.exists(modelName) and os.path.isabs(modelName):
    localModel = True
    localPath = modelName
elif os.path.exists(localPath):
    localModel = True

if localModel:
    # Load the VAD model from the local folder too, so no download is needed.
    import torch
    from whisperx.vad import load_vad_model
    model_fp = os.path.join(localPath, "whisperx-vad-segmentation.bin")
    vad_model = load_vad_model(torch.device(deviceNum), vad_onset=0.500,
                               vad_offset=0.363, use_auth_token=None,
                               model_fp=model_fp)
    model = whisperx.load_model(localPath, device=deviceId, device_index=deviceNum,
                                threads=threads, compute_type=compute_type,
                                language=language, vad_model=vad_model)
else:
    model = whisperx.load_model(modelName, device=deviceId, device_index=deviceNum,
                                threads=threads, compute_type=compute_type,
                                language=language)
```
> Can I write a small step-by-step guide to install the requirements on the GPU?
If it is independent of user environment or hardware, for sure! The wiki is publicly editable.
Maybe the above code won't work if IPED is executed from outside its folder. For that, we use `System.getProperty('iped.root')` to get IPED's root folder.
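A possible adjustment along those lines, assuming the IPED root folder is exposed to the script as a variable (the name `ipedRoot` is hypothetical):

```python
# Hypothetical sketch: resolve relative model paths against IPED's root
# folder instead of the current working directory.
import os

def resolve_model_path(modelName, ipedRoot):
    if os.path.isabs(modelName) and os.path.exists(modelName):
        return modelName
    candidate = os.path.join(ipedRoot, modelName)
    # Return None when the model is not local, so the caller can fall
    # back to downloading it by name.
    return candidate if os.path.exists(candidate) else None
```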