whisper.cpp

What is the best way to transcribe millions of files

Open · xiabingquan opened this issue 1 year ago · 1 comment

Hello, everyone. I have a question about using whisper.cpp. I have millions of audio files to transcribe and multiple GPUs available. What is the most effective and convenient way to transcribe all of these files as fast as possible?

My current solution is to split the audio files into batches of, say, 500 files each, and to transcribe each batch on a single GPU selected with CUDA_VISIBLE_DEVICES. For example:

CUDA_VISIBLE_DEVICES=0 ./main -m MODEL_PATH --threads 1 --processors 1 file1 file2 ... file500
CUDA_VISIBLE_DEVICES=1 ./main -m MODEL_PATH --threads 1 --processors 1 file501 file502 ... file1000

Then I write all of these commands into a text file, one command per line, and run them with GNU parallel (parallel -j NUM_JOB --lb).
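The batching step could be scripted roughly as follows. This is only a minimal sketch, not something from whisper.cpp itself: the function name make_jobs and the file names files.txt and jobs.txt are hypothetical, and it assumes a text file listing one audio path per line. The ./main flags are the ones used in the commands above. GPUs are assigned round-robin by batch index.

```shell
#!/usr/bin/env bash
# Sketch: print one whisper.cpp command line per batch of audio files.
# make_jobs is a hypothetical helper, not part of whisper.cpp.
# Usage: make_jobs FILE_LIST BATCH_SIZE NUM_GPUS MODEL_PATH > jobs.txt
make_jobs() {
  local list=$1 batch_size=$2 num_gpus=$3 model=$4
  local -a batch=()
  local batch_idx=0 f

  while IFS= read -r f || [ -n "$f" ]; do
    [ -z "$f" ] && continue                 # skip empty lines
    batch+=("$(printf '%q' "$f")")          # shell-quote each path
    if (( ${#batch[@]} == batch_size )); then
      # Batch is full: emit a command pinned to GPU (batch_idx mod num_gpus).
      printf 'CUDA_VISIBLE_DEVICES=%d ./main -m %q --threads 1 --processors 1 %s\n' \
        "$(( batch_idx % num_gpus ))" "$model" "${batch[*]}"
      batch=()
      batch_idx=$(( batch_idx + 1 ))
    fi
  done < "$list"

  # Emit the final, possibly smaller, batch.
  if (( ${#batch[@]} > 0 )); then
    printf 'CUDA_VISIBLE_DEVICES=%d ./main -m %q --threads 1 --processors 1 %s\n' \
      "$(( batch_idx % num_gpus ))" "$model" "${batch[*]}"
  fi
}
```

For instance, make_jobs files.txt 500 4 MODEL_PATH > jobs.txt would write the command file, which can then be run with parallel -j 4 --lb < jobs.txt. Note that round-robin assignment does not guarantee that the jobs running at any given moment are on different GPUs if batches finish at different speeds, so this balances load only approximately.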

This solution works for me, and I am wondering whether there are better alternatives. Thanks!

xiabingquan · Dec 14 '23 05:12

The "main" sample recently gained the ability to take a "response file" as a parameter. This is commonly used under MS Windows to work around command-line length limitations. I used this to transcribe a batch of several thousand audio files recently.

ulatekh · Jun 04 '24 21:06