whisper.cpp
What is the best way to transcribe millions of files?
Hello, everyone. I have a question about using whisper.cpp. Here's my situation: I have millions of audio files to transcribe and multiple GPUs available. What is the most effective and convenient way to transcribe all of these files as fast as possible?
My current solution is to split the audio files into batches of, say, 500 files each, and assign each batch to a specific GPU. For example:
CUDA_VISIBLE_DEVICES=0 ./main -m MODEL_PATH --threads 1 --processors 1 file1 file2 ... file500
CUDA_VISIBLE_DEVICES=1 ./main -m MODEL_PATH --threads 1 --processors 1 file1 file2 ... file500
Then, I write all of these commands into a text file and use GNU parallel (parallel -j NUM_JOBS --lb) to run them.
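The batching scheme described above can be scripted. Below is a minimal sketch: it splits a file list into fixed-size batches, round-robins the batches across GPUs, and writes one command per line for GNU parallel to consume. MODEL_PATH, the audio directory, the GPU count, and the batch size are all placeholders; the script creates a few dummy .wav files only so the sketch runs end-to-end, so point it at your real files instead.

```shell
#!/bin/sh
# Placeholders -- adjust for your setup.
MODEL_PATH=models/ggml-base.en.bin
NUM_GPUS=2
BATCH_SIZE=500

# Dummy inputs so this sketch is runnable; replace with your real audio dir.
mkdir -p audio
touch audio/file1.wav audio/file2.wav audio/file3.wav

# Collect the audio paths and split them into fixed-size batches
# (split names the pieces batch_aa, batch_ab, ...).
ls audio/*.wav > all_files.txt
rm -f batch_*
split -l "$BATCH_SIZE" all_files.txt batch_

# Emit one transcription command per batch, cycling through the GPUs.
gpu=0
: > commands.txt
for batch in batch_*; do
    files=$(tr '\n' ' ' < "$batch")
    printf 'CUDA_VISIBLE_DEVICES=%s ./main -m %s --threads 1 --processors 1 %s\n' \
        "$gpu" "$MODEL_PATH" "$files" >> commands.txt
    gpu=$(( (gpu + 1) % NUM_GPUS ))
done

# Then run one job per GPU:
#   parallel -j "$NUM_GPUS" --lb < commands.txt
```

Keeping the job count equal to the GPU count means each GPU works through one batch at a time, which matches the manual scheme in the post.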
This solution works for me, and I am wondering whether there are better alternatives. Thanks!
The "main" sample recently gained the ability to take a "response file" as a parameter. This is commonly used under MS Windows to work around command-line length limitations. I recently used this to transcribe a batch of several thousand audio files.
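A response file is just a text file listing the arguments (here, the audio paths) one per line, which sidesteps the shell's argument-length limit. A sketch of building one is below; note that the exact syntax for passing it to ./main is an assumption here (an @-prefixed argument is the common convention), so check ./main --help for your build.

```shell
#!/bin/sh
# Dummy inputs so this sketch is runnable; use your real audio dir.
mkdir -p audio
touch audio/a.wav audio/b.wav

# Build the response file: one audio path per line.
find audio -name '*.wav' | sort > files.txt

# Hypothetical invocation (syntax is an assumption -- verify with ./main --help):
#   ./main -m models/ggml-base.en.bin @files.txt
```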