Added multiprocessing for CPU preprocessing
Because of the Python GIL, the preprocessing doesn't make efficient use of all the CPU cores. By running the CPU-bound preprocessing in its own worker process, requests that arrive on different threads can fully utilize the CPU cores.
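As a rough illustration of the approach (a minimal sketch, not the exact code in this PR; the pool object, its size, and the helper name are assumptions), the CPU-bound decode can be handed to a process pool so the GIL no longer serializes it across request threads:

```python
# Sketch only: offload the CPU-bound decode to worker processes so concurrent
# request threads are not serialized by the GIL. The pool size is illustrative.
from concurrent.futures import ProcessPoolExecutor

from faster_whisper import decode_audio

preprocess_pool = ProcessPoolExecutor(max_workers=4)


def decode_audio_in_worker(path, sampling_rate=16000):
    # decode_audio runs in a separate process; only the resulting numpy
    # array is sent back over IPC to the requesting thread.
    return preprocess_pool.submit(decode_audio, path, sampling_rate=sampling_rate).result()
```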
Does this have any actual impact on performance? Do you have benchmarks?
Yes! I can send my data and test case later today.
> Does this have any actual impact on performance? Do you have benchmarks?
Testing code:
```python
import threading
import time

import nvtx
from faster_whisper import WhisperModel, decode_audio


def preprocess_audio(filename):
    with nvtx.annotate("Decode audio"):
        return decode_audio(filename)


model = WhisperModel("large-v3", device="cuda", device_index=[0],
                     compute_type="bfloat16", cpu_threads=2, num_workers=2)


def transcribe(model_to_use):
    start_time = time.time()
    with nvtx.annotate("Transcribe"):
        segments, info = model_to_use.transcribe(
            "test.wav", language=None, vad_filter=True, word_timestamps=False,
            vad_parameters={"window_size_samples": 1024},
            preprocess_on_multiple_cores=True)
    print(f"Single Request Elapsed time: {time.time() - start_time}. Audio duration: {info.duration}")


# Warm-up runs to clear out GPU memory before timing
transcribe(model)
transcribe(model)

if __name__ == "__main__":
    threads = [threading.Thread(target=transcribe, args=(model,)) for _ in range(20)]

    start_time = time.time()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    print(f"Total Elapsed time: {time.time() - start_time}")
```
Results:
Overall time to pre-process 20 requests without multicore:
2.7506766319274902 seconds
Overall time to pre-process 20 requests with multicore:
1.9269721508026123 seconds

Now to test the overhead for a single request:

Overall time to pre-process 1 request without multicore:
0.21215391159057617 seconds
Overall time to pre-process 1 request with multicore:
0.21257996559143066 seconds
So there's a tradeoff: spawning the worker processes adds overhead, which roughly cancels out the gain when only a single request is being processed.
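One hypothetical way to handle that tradeoff (a sketch only; the `in_flight_requests` counter and the threshold are assumptions, not part of this PR or faster-whisper's API) is to take the multiprocess path only when there is enough concurrent work to amortize the spawn/IPC cost:

```python
# Hypothetical: skip the process pool when only one request is in flight,
# since the measurements above show the spawn/IPC overhead cancels the gain.
from concurrent.futures import ProcessPoolExecutor

from faster_whisper import decode_audio

pool = ProcessPoolExecutor(max_workers=4)


def preprocess(path, in_flight_requests):
    if in_flight_requests > 1:
        # Enough concurrent work to benefit from a separate process.
        return pool.submit(decode_audio, path).result()
    # Single request: decode in-process to avoid the pool overhead.
    return decode_audio(path)
```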
@joiemoie, hello. Thanks for an interesting pull request. From my test (20 requests, device=cpu, model=tiny, cpu_threads=8), I got the following overall times:
- without multicore: 14.506s
- with multicore: 10.917s

That's a pretty significant improvement!
But I think we can improve further. I tried adding this logic to the cpu_preprocessing function:

```python
if not isinstance(audio, np.ndarray):
    audio = decode_audio(
        audio, sampling_rate=feature_extractor.sampling_rate
    )

if vad_filter:
    if vad_parameters is None:
        vad_parameters = VadOptions()
    elif isinstance(vad_parameters, dict):
        vad_parameters = VadOptions(**vad_parameters)
```
The overall time was 9.633s after my change. I think the logic in the decode_audio function also takes up a significant amount of computation time.
What do you think about this idea? And should we move more of the preprocessing logic into the cpu_preprocessing function?
Nice! That's not a bad idea. Please don't merge this in for now. I noticed some memory inefficiency, and the pool size needs to be capped or made configurable via a parameter. I'm still investigating the memory inefficiency.
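For the pool-size cap, something along these lines might work (a sketch only; the `num_preprocess_workers` name and the default cap are assumptions, not an agreed design):

```python
import os
from concurrent.futures import ProcessPoolExecutor


def make_preprocess_pool(num_preprocess_workers=None):
    # Cap the number of worker processes to bound memory use: each worker
    # holds its own copies of the audio buffers it decodes.
    if num_preprocess_workers is None:
        num_preprocess_workers = min(4, os.cpu_count() or 1)
    return ProcessPoolExecutor(max_workers=num_preprocess_workers)
```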
@joiemoie, hello. Have you finished your work yet? :smiley: