faster-whisper

Best strategy for low-latency, high-throughput serving in multi-GPU setups

Open developer-cade opened this issue 2 years ago • 3 comments

Hi all, I'm dealing with a scenario where I receive simultaneous requests for processing short audio clips (5-7 seconds) from multiple users, making both latency and throughput crucial.

In this context, which approach would be more effective?

  1. Launching N containers, each with a WhisperModel instance configured for a single GPU.
  2. Running a single container with one WhisperModel instance, but configured for N GPUs.
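(For context on option 1: the application would still need a front-end that spreads incoming clips across the per-GPU workers. A minimal round-robin sketch — the worker endpoints here are made up for illustration, not part of faster-whisper:)

```python
import itertools

class RoundRobinDispatcher:
    """Cycle through per-GPU worker endpoints, one WhisperModel instance each."""

    def __init__(self, workers):
        self._cycle = itertools.cycle(workers)

    def next_worker(self):
        # Return the endpoint that should receive the next audio clip.
        return next(self._cycle)

# Hypothetical endpoints: one container per GPU.
dispatcher = RoundRobinDispatcher(["whisper-gpu0:8000", "whisper-gpu1:8000"])
```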

I'm leaning towards the first option because, when using multiple GPUs, the encoder output must be transferred to the CPU, since it is not known which GPU will handle the next job. This transfer might introduce significant overhead, especially since I frequently need to run inference on short audio clips. Could this overhead be a concern in my scenario? (refer to code below)

def encode(self, features: np.ndarray) -> ctranslate2.StorageView:
    # When the model is running on multiple GPUs, the encoder output should be moved
    # to the CPU since we don't know which GPU will handle the next job.
    to_cpu = self.model.device == "cuda" and len(self.model.device_index) > 1

    features = np.expand_dims(features, 0)
    features = get_ctranslate2_storage(features)

    return self.model.encode(features, to_cpu=to_cpu)
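(To make the condition in that snippet concrete, here is a small stand-alone mirror of the `to_cpu` check — pure Python, no GPU required; the helper name is mine, not from the library:)

```python
def needs_cpu_copy(device, device_index):
    """Mirror of the to_cpu condition above: only a CUDA model spread over
    more than one GPU forces the encoder output back to host memory."""
    # device_index may be a single int or a list of GPU indices.
    indices = device_index if isinstance(device_index, (list, tuple)) else [device_index]
    return device == "cuda" and len(indices) > 1

print(needs_cpu_copy("cuda", [0, 1]))  # multi-GPU CUDA: copy to CPU
print(needs_cpu_copy("cuda", 0))       # single GPU: output stays on device
print(needs_cpu_copy("cpu", [0, 1]))   # not CUDA: no copy needed
```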

P.S: Is it the same with decoding, where the results have to be moved to the CPU because we don't know which GPU will process the next task? I'm not familiar with C++, so it's challenging to understand. Can you tell me where the code specifies how tasks are distributed across multiple GPUs?

Any insights or suggestions are welcome.

developer-cade avatar Mar 30 '24 02:03 developer-cade

@developer-cade Option 1 is better and simplest. If I understand correctly, unlike what you state, encode and decode run on the same GPU, so you don't need to worry about that data-transfer overhead. Using two different GPUs (one for encoding, the other for decoding) is only recommended if your GPU memory cannot hold the model end to end. With each audio being 5 seconds and roughly 60x acceleration on a GPU such as a 4080 (24gb for reference), you can expect to transcribe tens of audios per second per GPU.
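(The throughput figure above is easy to sanity-check: at a 60x real-time factor, one GPU processes 60 seconds of audio per wall-clock second, so 5-second clips come out to about 12 clips per second, ignoring batching and transfer overhead:)

```python
realtime_factor = 60      # claimed ~60x acceleration
clip_seconds = 5          # length of each incoming clip
clips_per_second = realtime_factor / clip_seconds
print(clips_per_second)   # 12.0 clips per second per GPU
```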

gongouveia avatar Apr 03 '24 17:04 gongouveia

Whisper is designed to decode long audio files; it processes audio in 30-second chunks. If your input clips are shorter than 30 seconds, you'd be better off with another neural network architecture such as Nvidia Conformer. You'll get the same accuracy and a 10x speedup. If you still want to use Whisper, your best solution would be to combine clips into 30-second batches.
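(The "combine clips into 30 seconds" idea can be sketched as a greedy packer. This is my own illustration, not code from any library — the function name and offset bookkeeping are assumptions, and in practice you would also need to map transcript timestamps back to the source clips and avoid mixing speakers without separators:)

```python
import numpy as np

SAMPLE_RATE = 16000   # Whisper's expected input rate
MAX_SECONDS = 30      # Whisper's chunk length

def pack_clips(clips, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    """Greedily concatenate short clips into buffers of at most `max_seconds`,
    returning the packed buffers plus (start, end) sample offsets so each
    transcript segment can be mapped back to its source clip."""
    limit = max_seconds * sample_rate
    buffers, offsets = [], []
    current, current_offsets, pos = [], [], 0
    for clip in clips:
        if pos + len(clip) > limit and current:
            # Current buffer is full: flush it and start a new one.
            buffers.append(np.concatenate(current))
            offsets.append(current_offsets)
            current, current_offsets, pos = [], [], 0
        current.append(clip)
        current_offsets.append((pos, pos + len(clip)))
        pos += len(clip)
    if current:
        buffers.append(np.concatenate(current))
        offsets.append(current_offsets)
    return buffers, offsets
```

For example, seven 5-second clips would pack into one 30-second buffer (six clips) plus a second buffer holding the remaining clip.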

nshmyrev avatar Apr 03 '24 17:04 nshmyrev

I believe you are wrong: the audio gets padded to 30 seconds, and with VAD enabled the padding is removed.

Accuracy is more related to the speech length and sentence length than to the audio clip length. That is how the decoder works.

gongouveia avatar Apr 03 '24 17:04 gongouveia