Strange behavior of num_workers and cpu_threads on an AMD CPU with NVIDIA GPUs
I'm testing an AMD EPYC 7302 with an NVIDIA A4000 and an A5000. One thing I notice is that with a single request, putting both GPUs in the device list and setting num_workers greater than 1 drops throughput. Changing the number of threads for a single worker is also strange: when I profiled on an i9 machine with a 4090 GPU, it cleanly used 8 cores when I set cpu_threads to 8, but on the EPYC the core usage looks almost random. I also noticed that on the i9 with the 4090 the GPU is kept almost fully utilized by the kernels, while on the EPYC the kernels leave the A4000 and A5000 underutilized. What can I do hardware-wise to speed everything up?
More strange behavior: transcribing a 60 second clip takes about half the time on the EPYC with the A4000, but a 12 second clip is about twice as fast on the 4090.
For context, I am trying to figure out how to speed up the workstation that has an AMD EPYC 7302 CPU.
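One thing worth ruling out for the random-looking core usage on the EPYC is threads migrating across CCDs/NUMA nodes. A minimal Linux-only experiment, assuming the first 8 logical cores sit on one node (verify the real layout with `lscpu -e` or `numactl --hardware` first):

```python
import os

# Assumption: logical cores 0-7 share a CCD/NUMA node on this EPYC 7302;
# check the actual topology before pinning.
os.sched_setaffinity(0, set(range(8)))  # pin this process and its threads

from faster_whisper import WhisperModel

# Same settings as the benchmark script below, now with a fixed core set.
model = WhisperModel("large-v3", device="cuda", device_index=[0],
                     compute_type="int8_float16", cpu_threads=8)
```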
Are you comparing with the same quantization type?
Yes, I am. Perhaps I can provide screenshots and code examples when I get back home.
Code:
```python
import threading
import time

import nvtx
from faster_whisper import WhisperModel, decode_audio


def preprocess_audio(filename):
    with nvtx.annotate("Decode audio"):
        return decode_audio(filename)


model = WhisperModel(
    "large-v3",
    device="cuda",
    device_index=[0],
    compute_type="int8_float16",
    cpu_threads=8,
    num_workers=1,
)


def transcribe():
    start_time = time.time()
    with nvtx.annotate("Transcribe"):
        segments, info = model.transcribe(
            "test.wav",
            language=None,
            vad_filter=False,
            word_timestamps=False,
            vad_parameters={"window_size_samples": 1024},
        )
        # segments is a generator; consuming it is what actually runs the model
        list(segments)
    print(f"Elapsed time: {time.time() - start_time}. Audio duration: {info.duration}")


# Warm-up runs so model loading / CUDA initialization doesn't skew the timings
transcribe()
transcribe()

if __name__ == "__main__":
    threads = []
    for i in range(2):
        threads.append(threading.Thread(target=transcribe))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```
The test audio is a 12 second clip.
Tests (all with cpu_threads=8):

| Concurrent threads | device_index | num_workers | Elapsed time (s) |
| --- | --- | --- | --- |
| 1 | [0] | 1 | 0.879 |
| 1 | [0] | 2 | 0.872 |
| 1 | [0, 1] | 1 | 1.110 |
| 1 | [0, 1] | 2 | 0.945 |
| 2 | [0] | 1 | 1.839 |
| 2 | [0] | 2 | 1.689 |
| 2 | [0, 1] | 1 | 1.461 |
| 2 | [0, 1] | 2 | 1.528 |
| 3 | [0] | 1 | 2.818 |
| 3 | [0] | 2 | 2.861 |
| 3 | [0] | 3 | 3.152 |
| 3 | [0, 1] | 1 | 2.336 |
| 3 | [0, 1] | 2 | 2.662 |
| 3 | [0, 1] | 3 | 2.755 |
It seems inconsistent, but a few patterns emerge: with one GPU, num_workers=2 beat num_workers=1 for one and two concurrent threads. For a single thread, a single GPU was fastest overall. For two and three concurrent threads, two GPUs with one worker was the best.
I think cpu_threads doesn't do anything when running on CUDA, and compute_type="bfloat16" should be fastest for your GPUs.
Dunno much about your multithreading issue.
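For reference, the compute type is a one-line change on the model constructor (a minimal sketch reusing the settings from the script above; whether bfloat16 actually beats int8_float16 on these cards is worth benchmarking rather than assuming):

```python
from faster_whisper import WhisperModel

# Ampere-generation GPUs (A4000/A5000) support bfloat16 natively,
# so CTranslate2 can route these matmuls through Tensor Cores.
model = WhisperModel("large-v3", device="cuda", device_index=[0],
                     compute_type="bfloat16", cpu_threads=8)
```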
Do you have a sense of why the transcriptions are not really using Tensor Cores but only CUDA cores? Is there any way to improve utilization here?
> Do you have a sense of why the transcriptions are not really using Tensor Cores but only CUDA cores?
With "bfloat16"?
Thanks! Do you have a sense of what the bottlenecks in the GPU computation are? Would it be the number of CPU cores? I suspect that's probably not it. Or the CPU clock speed? Or memory bandwidth, etc.? I still notice the average Tensor Core utilization hovering around 30%.
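One way to narrow this down is to sample GPU utilization while a transcription runs. A rough sketch using the pynvml bindings (note NVML reports overall SM utilization, not Tensor Core activity specifically; for per-pipe Tensor Core metrics you'd need Nsight Compute):

```python
import threading
import time

import pynvml
from faster_whisper import WhisperModel

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust per card

samples = []
done = threading.Event()

def sample_utilization():
    # Poll SM utilization every 50 ms while the transcription runs.
    while not done.is_set():
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        samples.append(util.gpu)
        time.sleep(0.05)

model = WhisperModel("large-v3", device="cuda", compute_type="bfloat16")

sampler = threading.Thread(target=sample_utilization)
sampler.start()
segments, info = model.transcribe("test.wav")
list(segments)  # consume the generator to actually run the model
done.set()
sampler.join()

print(f"Mean GPU utilization: {sum(samples) / len(samples):.0f}%")
pynvml.nvmlShutdown()
```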
On Windows, "Hardware-accelerated GPU scheduling" is a great option to try; dunno if there's an equivalent on Linux.
Did you find any answers or the best params? I have a task to transcribe millions of files (~5-30 min long) and I'm looking for any information.
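Based on the numbers above (num_workers and a multi-GPU device_index scale poorly within one process), one pattern worth trying for a bulk job like that is one process per GPU, each with its own model. A rough sketch, with the file locations and compute type as placeholders:

```python
import multiprocessing as mp
from pathlib import Path

from faster_whisper import WhisperModel

GPUS = [0, 1]  # one worker process per GPU

def worker(gpu_index, files):
    # Each process loads its own model on its own GPU, sidestepping the
    # in-process contention seen in the benchmarks above.
    model = WhisperModel("large-v3", device="cuda",
                         device_index=gpu_index, compute_type="float16")
    for path in files:
        segments, _info = model.transcribe(str(path))
        text = " ".join(segment.text for segment in segments)
        path.with_suffix(".txt").write_text(text)

if __name__ == "__main__":
    mp.set_start_method("spawn")  # safer than fork once CUDA is involved
    files = sorted(Path("audio/").glob("*.wav"))  # placeholder location
    shards = [files[i::len(GPUS)] for i in range(len(GPUS))]
    procs = [mp.Process(target=worker, args=(gpu, shard))
             for gpu, shard in zip(GPUS, shards)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```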