
Experiments with GPU CUDA acceleration...sort of

Open Topping1 opened this issue 1 year ago • 17 comments

The CUDA Toolkit documentation states that NVBLAS is a drop-in BLAS replacement: "The NVBLAS Library is a GPU-accelerated Library that implements BLAS (Basic Linear Algebra Subprograms). It can accelerate most BLAS Level-3 routines by dynamically routing BLAS calls to one or more NVIDIA GPUs present in the system, when the characteristics of the call make it speed up on a GPU." One of those Level-3 routines is sgemm (single-precision matrix multiplication), which ggml.c uses extensively. In theory, IF CORRECTLY CONFIGURED, NVBLAS can intercept calls to the OpenBLAS function cblas_sgemm and accelerate them using a CUDA-compatible graphics card installed in the system. There is not much information about the specific steps needed to enable it, but I could piece together this step-by-step guide:
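Before changing any configuration, it is worth confirming that the toolkit actually shipped NVBLAS. A minimal check, assuming a default CUDA install under /usr/local/cuda (adjust the path for your system):

```shell
# NVBLAS ships with the CUDA toolkit, not with the driver alone.
# The path below is an assumption for a default toolkit install.
ls /usr/local/cuda/lib64/libnvblas.so* 2>/dev/null \
  || echo "libnvblas.so not found - is the CUDA toolkit installed?"
```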

1-Install the CUDA toolkit from the official link

2-Create the file /etc/nvblas.conf with the following contents:

NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /usr/lib/x86_64-linux-gnu/libopenblas.so
NVBLAS_GPU_LIST ALL

/usr/lib/x86_64-linux-gnu/libopenblas.so is the location of libopenblas.so on my system; you have to point it to the correct location on yours (it should not be very different).

3-Create an environment variable pointing to nvblas.conf: export NVBLAS_CONFIG_FILE=/etc/nvblas.conf

4-Create an environment variable pointing to the location of libnvblas.so: export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.11. It is not clear which .so file is needed; for example, on my system I can find the following:

/usr/local/cuda/lib64/libnvblas.so
/usr/local/cuda/lib64/libnvblas.so.11
/usr/local/cuda/lib64/libnvblas.so.11.11.3.6
/usr/local/cuda-11.8/lib64/libnvblas.so
/usr/local/cuda-11.8/lib64/libnvblas.so.11
/usr/local/cuda-11.8/lib64/libnvblas.so.11.11.3.6
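Putting steps 2-4 together, a session might look like this (a sketch using the paths from my system; both variables must be exported in the same shell that later runs the transcription):

```shell
# Point NVBLAS at its config file and interpose it ahead of OpenBLAS.
# Both paths are from my system; adjust them to match yours.
export NVBLAS_CONFIG_FILE=/etc/nvblas.conf
export LD_PRELOAD=/usr/local/cuda/lib64/libnvblas.so.11

# Sanity-check that both variables are visible to child processes:
env | grep -E 'NVBLAS_CONFIG_FILE|LD_PRELOAD'
```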

5-Download source code of whisper.cpp with git clone https://github.com/ggerganov/whisper.cpp

6-Inside the whisper.cpp folder, execute cmake -DWHISPER_SUPPORT_OPENBLAS=ON .

7-Inside the whisper.cpp folder, execute make. You should now have a compiled main executable with BLAS support turned on.

8-Now, at least in my case, when I run a test transcription, the program confirms that it is using BLAS (BLAS = 1), but NVBLAS does not seem to be intercepting the calls: NVTOP shows no GPU usage and no nvblas.log is created.
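One thing worth ruling out (a guess on my part, not verified): LD_PRELOAD interposition only works for symbols that are resolved dynamically at run time, so if the build linked OpenBLAS statically into the binary, NVBLAS can never see the cblas_sgemm calls at all. A quick check, assuming the executable is named main:

```shell
# If a BLAS library shows up here, the symbols are resolved dynamically and
# interposition is at least possible; if nothing shows up, a static link
# would explain why nvblas.log never appears and NVTOP shows no GPU activity.
ldd ./main 2>/dev/null | grep -i blas || echo "no dynamic BLAS dependency found"
```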

If someone can figure out how to make this work, it has the potential to substantially accelerate transcription speed on x64.

Topping1 · Dec 06 '22 02:12