
Added multiprocessing for cpu processing

Open · joiemoie opened this pull request 1 year ago · 6 comments

Because of the Python GIL, the preprocessing doesn't make efficient use of all the CPU cores. By spawning the CPU-bound tasks in their own processes, requests arriving on different threads can fully utilize the CPU cores.
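
As a rough illustration (not the PR's actual code), the idea looks something like this: hand the CPU-bound decode to a process pool, where each worker has its own interpreter and GIL. `decode_audio` is faster-whisper's real helper; the pool wiring and the `preprocess_in_subprocess` name are illustrative.

```python
# Minimal sketch of the idea: run the CPU-bound audio decode in worker
# processes so that concurrent request threads are not serialized by the
# GIL of the serving process.
from concurrent.futures import ProcessPoolExecutor

from faster_whisper import decode_audio

pool = ProcessPoolExecutor()  # defaults to one worker per CPU core


def preprocess_in_subprocess(path, sampling_rate=16000):
    # Each call runs in a separate process with its own GIL, so several
    # request threads can decode audio on different cores at once.
    return pool.submit(decode_audio, path, sampling_rate).result()
```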

joiemoie · Jan 19 '24 04:01

Does this have any actual impact on performance? Do you have benchmarks?

Purfview · Jan 19 '24 09:01

Yes! I can send my data and test case later today.


joiemoie · Jan 19 '24 18:01

> Does this have any actual impact on performance? Do you have benchmarks?

Testing code:

```python
import threading
import time

import nvtx
from faster_whisper import WhisperModel, decode_audio


def preprocess_audio(filename):
    with nvtx.annotate("Decode audio"):
        return decode_audio(filename)


model = WhisperModel(
    "large-v3",
    device="cuda",
    device_index=[0],
    compute_type="bfloat16",
    cpu_threads=2,
    num_workers=2,
)


def transcribe(model_to_use):
    start_time = time.time()
    with nvtx.annotate("Transcribe"):
        segments, info = model_to_use.transcribe(
            "test.wav",
            language=None,
            vad_filter=True,
            word_timestamps=False,
            vad_parameters={"window_size_samples": 1024},
            preprocess_on_multiple_cores=True,
        )
    print(f"Single Request Elapsed time: {time.time() - start_time}. Audio duration: {info.duration}")


# Warm up: run twice to load the model and clear out GPU memory before timing.
transcribe(model)
transcribe(model)

if __name__ == "__main__":
    threads = []
    for i in range(20):
        threads.append(threading.Thread(target=transcribe, args=(model,)))

    start_time = time.time()

    for thread in threads:
        thread.start()

    for thread in threads:
        thread.join()

    print(f"Total Elapsed time: {time.time() - start_time}")
```

Results:

Overall time to preprocess 20 requests without multicore:

2.7506766319274902 seconds

Overall time to preprocess 20 requests with multicore:

1.9269721508026123 seconds

Now to test the overhead for a single request.

Overall time to preprocess 1 request without multicore:

0.21215391159057617 seconds

Overall time to preprocess 1 request with multicore: 0.21257996559143066 seconds

So there's a tradeoff: spawning and dispatching to the worker processes adds overhead that, for a single request, roughly cancels out the gain.
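
That suggests the dispatch should stay opt-in per call, which is what the `preprocess_on_multiple_cores` flag in the test script above does. A minimal sketch of the pattern (`pool` and `preprocess` are illustrative names, not the PR's code):

```python
from concurrent.futures import ProcessPoolExecutor

from faster_whisper import decode_audio

pool = ProcessPoolExecutor()


def preprocess(path, use_pool=False):
    # For a single request, pool dispatch adds spawn/IPC overhead that
    # roughly cancels the gain (0.2126s vs. 0.2122s above), so multicore
    # preprocessing is only worth enabling under concurrent load.
    if use_pool:
        return pool.submit(decode_audio, path).result()
    return decode_audio(path)
```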

joiemoie · Jan 22 '24 03:01

@joiemoie, hello. Thanks for an interesting pull request. From my test (20 requests, device=cpu, model=tiny, cpu_threads=8), I measured the overall times below:

  • without multicore: 14.506s
  • with multicore: 10.917s

That's a pretty significant improvement! But I think we can improve further. I tried adding this logic to the cpu_preprocessing function:

```python
# Move decoding and VAD option parsing into cpu_preprocessing, so this
# CPU-bound work also runs in the worker process:
if not isinstance(audio, np.ndarray):
    audio = decode_audio(
        audio, sampling_rate=feature_extractor.sampling_rate
    )

if vad_filter:
    if vad_parameters is None:
        vad_parameters = VadOptions()
    elif isinstance(vad_parameters, dict):
        vad_parameters = VadOptions(**vad_parameters)
```
After my change, the overall time was 9.633s. I think the logic in the decode_audio function also takes up a significant share of the computation time. What do you think about this idea? And should we move more logic into the cpu_preprocessing function?
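
One constraint to keep in mind if more logic moves into `cpu_preprocessing`: the worker must be a plain module-level function so the multiprocessing machinery can pickle it. A hedged sketch of what the combined worker could look like, reusing the names from the snippet above (the PR's exact signature may differ):

```python
import numpy as np

from faster_whisper.audio import decode_audio
from faster_whisper.vad import VadOptions


def cpu_preprocessing(audio, sampling_rate, vad_filter, vad_parameters):
    # Decode in the worker process if the caller passed a path or
    # file-like object rather than a ready NumPy array.
    if not isinstance(audio, np.ndarray):
        audio = decode_audio(audio, sampling_rate=sampling_rate)

    # Normalize the VAD options in the worker as well.
    if vad_filter:
        if vad_parameters is None:
            vad_parameters = VadOptions()
        elif isinstance(vad_parameters, dict):
            vad_parameters = VadOptions(**vad_parameters)

    return audio, vad_parameters
```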

trungkienbkhn · Jan 26 '24 08:01

Nice! That's not a bad idea. Please don't merge this in for now, though: I noticed some memory inefficiency, and the pool size needs to be capped or made configurable via a parameter. I'm investigating the memory inefficiency.
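
For reference, the cap being described could be as simple as bounding the executor, assuming a hypothetical `preprocess_workers` parameter (not something the PR has merged):

```python
from concurrent.futures import ProcessPoolExecutor


def make_preprocess_pool(preprocess_workers=2):
    # Cap the worker count rather than defaulting to one process per core:
    # each worker holds its own decoded audio buffers in flight, so an
    # unbounded pool multiplies peak memory use under load.
    return ProcessPoolExecutor(max_workers=preprocess_workers)
```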

joiemoie · Jan 27 '24 08:01

@joiemoie, hello. Have you finished your work yet? :smiley:

trungkienbkhn · Apr 03 '24 04:04