CTranslate2 Feature request: AMD GPU support with oneDNN AMD support

Hi, CTranslate2 uses oneDNN. The latest versions of oneDNN support AMD GPUs, which requires Intel oneAPI DPC++. The same approach could potentially enable NVIDIA GPU support too.

It would help with running MT models on AMD GPUs. Together with ROCm, this would be a fully open-source way to run MT models on GPUs.

Thanks

Feb 09 '23 10:02 santhoshtr

Hello,

Currently we only use oneDNN for specific operators such as matrix multiplications and convolutions, but a full MT model contains many other operators (softmax, layer norm, gather, concat, etc.). Even though some of them are available in oneDNN, it would require quite some work to specialize all operations for AMD GPUs.
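To make concrete what those "other operators" compute, here is a minimal pure-Python reference for two of them, softmax and layer normalization, over a single vector. It is only a sketch of the math each backend kernel has to reproduce; the function names and the `eps` default are my own choices and not CTranslate2's API. A GPU port would need a tuned kernel for every operator of this kind.

```python
import math
from typing import List


def softmax(x: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x: List[float], gamma: List[float], beta: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize x to zero mean and unit variance, then apply a per-element scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n  # biased variance, as in the usual LayerNorm definition
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


if __name__ == "__main__":
    print(softmax([1.0, 2.0, 3.0]))
    print(layer_norm([1.0, 2.0, 3.0], gamma=[1.0] * 3, beta=[0.0] * 3))
```

Even these two small operators involve reductions (max, sum, mean, variance) that must be parallelized differently per GPU architecture, which is where most of the porting effort goes.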

At this time I don't plan to work on this feature, but it would indeed be a nice one to have!

Feb 09 '23 13:02 guillaumekln

I wanted to try faster-whisper on an Intel Arc A770 16GB dGPU. Fully using oneDNN could also enable support for that hardware.

Apr 04 '23 20:04 leuc

I'm migrating a transcription component to faster-whisper and use an AMD GPU, so I'd also greatly appreciate ROCm support in faster-whisper.

Apr 05 '23 12:04 towel

@towel did you manage to get faster-whisper working on AMD ?

Apr 11 '23 20:04 phineas-pta

any way to run with an amd gpu?

May 15 '23 19:05 CristianPi

Any update on this?

Aug 29 '23 05:08 MidnightKittenCat

I still don't plan to work on this at this time, and as far as I know no one else is working on this. I expect it would be quite some work to have a complete ROCm support.

Aug 29 '23 08:08 guillaumekln

I had a go at converting the existing CUDA code to ROCm a few months ago but could never get it to build, which is not surprising as I have zero C++ or CMakeLists skills.

curand, cublas, cudnn, cuda and cub appear to map to HIP with minor adjustments, but I could never get CMakeLists to include thrust (the version supplied by ROCm), and compilation always halted after producing too many errors.
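The "map to HIP with minor adjustments" observation can be illustrated with a toy, regex-based identifier rename in the spirit of what hipify does. This is only a sketch: the rule table is a small hand-picked subset (CUDA runtime, cuBLAS, cuRAND and CUB to their HIP, hipBLAS, hipRAND and hipCUB counterparts), it ignores `#include` paths, and it deliberately leaves cuDNN out because MIOpen's API is not a one-to-one rename. The real hipify-perl and hipify-clang tools cover far more cases.

```python
import re

# Hypothetical, deliberately small rule table. Each pattern starts with \b so only
# identifiers that *begin* with the CUDA prefix are rewritten (e.g. "mycudaVar" is left alone).
RENAMES = [
    (r"\bCUBLAS_", "HIPBLAS_"),
    (r"\bCURAND_", "HIPRAND_"),
    (r"\bcublas", "hipblas"),
    (r"\bcurand", "hiprand"),
    (r"\bcub::", "hipcub::"),
    (r"\bcuda", "hip"),
]


def naive_hipify(source: str) -> str:
    """Rewrite a few CUDA identifiers to their HIP equivalents with word-boundary regexes."""
    for pattern, replacement in RENAMES:
        source = re.sub(pattern, replacement, source)
    return source


if __name__ == "__main__":
    print(naive_hipify("cudaMalloc(&p, n); cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &a, A, m, B, k, &b, C, m);"))
```

A pure rename like this handles the runtime, BLAS and RNG calls, but not build-system issues such as locating ROCm's thrust, which is where the attempt above got stuck.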

Sep 10 '23 05:09 lenhone

I started trying to port CTranslate2 to ROCm last weekend and decided to share my (non-working) results here. The code is available in the rocm Branch of my fork.

Basically, hipify was able to convert most of the code automatically. I added a new CMake config option to enable compiling with ROCm. So far, calling the HIP compiler works; however, it breaks the other options and requires a CMake version recent enough to have HIP language support.

The current issues are some CUDA library dependencies I have not looked at yet, and the use of the bfloat16 data type. While the latest ROCm has a drop-in replacement for the CUDA bf16 type (according to this GH issue: https://github.com/RadeonOpenCompute/ROCm/issues/2534), it is currently missing some operators. Therefore, I'm trying to completely disable bf16 for now, but without luck so far.

For now, the only goal of this work is to make it run, not to integrate HIP/ROCm into the (CMake) infrastructure.

In case someone wants to have a look at the code and help porting, feel free to look at my fork. Unfortunately, I don't expect to have much time in the near future for this project.

Oct 26 '23 23:10 TheJKM

This is awesome dude. Wish I had programming experience to help with this, but alas I don't. I've been looking for ways to enable GPU acceleration for AMD GPUs using ctranslate2. Let me know if I can help in any way, whether by testing or otherwise.

Oct 26 '23 23:10 BBC-Esq

Have you gotten it to work at all yet?

Oct 26 '23 23:10 BBC-Esq

> Have you gotten it to work at all yet?

“I started trying to port CTranslate2 to ROCm last weekend and decided to share my (non-working) results here”

I believe that should answer your question.

Oct 26 '23 23:10 MidnightKittenCat

> I started trying to port CTranslate2 to ROCm last weekend and decided to share my (non-working) results here. The code is available in the rocm Branch of my fork.

Thanks for sharing. I'm very interested in the ripple effects this port may have for other projects.

ROCm 6.0 is now available, which I believe specifically addresses what you're referencing.

FYI: https://repo.radeon.com/amdgpu/6.0/ubuntu/dists/jammy/

I've tried all kinds of dumb, uninformed things to get LibreTranslate to work with ROCm, to no avail. It depends on too recent a CUDA version to be tricked by ROCm. The latest PyTorch + ROCm 5.7 did not work out well either.

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/3rd-party/pytorch-install.html

Dec 31 '23 01:12 commandline-be

So, would it make sense to create an experimental version of ctranslate2 using a more recent oneDNN that does have AMD GPU support?

from https://github.com/oneapi-src/oneDNN oneAPI Deep Neural Network Library (oneDNN) is an open-source cross-platform performance library of basic building blocks for deep learning applications. oneDNN is part of oneAPI. The library is optimized for Intel(R) Architecture Processors, Intel Graphics, and Arm* 64-bit Architecture (AArch64)-based processors. oneDNN has experimental support for the following architectures: NVIDIA* GPU, AMD* GPU, OpenPOWER* Power ISA (PPC64), IBMz* (s390x), and RISC-V.

https://github.com/oneapi-src/oneDNN?tab=readme-ov-file#system-requirements SYCL runtime with AMD GPU support requires:

- oneAPI DPC++ Compiler with support for HIP AMD
- AMD ROCm, version 5.3 or later
- MIOpen, version 2.18 or later (optional if AMD ROCm includes the required version of MIOpen)
- rocBLAS, version 2.45.0 or later (optional if AMD ROCm includes the required version of rocBLAS)
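The minimum versions quoted above can be checked mechanically before attempting a build. A minimal sketch, assuming you have already collected the installed versions as plain strings yourself: the `installed` dict passed in is hypothetical and this code does not query the system. It also simplifies the requirements by treating MIOpen and rocBLAS as always required (ignoring the "optional if ROCm includes them" clause) and does not check the DPC++ compiler.

```python
from typing import Dict, List, Tuple

# Minimum versions from the oneDNN system requirements quoted above.
MINIMUMS: Dict[str, str] = {"ROCm": "5.3", "MIOpen": "2.18", "rocBLAS": "2.45.0"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.45.0' into (2, 45, 0); suffixes such as '-rc1' are not handled."""
    return tuple(int(part) for part in text.split("."))


def at_least(found: str, required: str) -> bool:
    """Compare dotted versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(found), parse_version(required)
    width = max(len(a), len(b))
    # Without padding, (5, 3) < (5, 3, 0) in Python, so "5.3" would wrongly fail "5.3.0".
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


def missing_requirements(installed: Dict[str, str]) -> List[str]:
    """Return one message for every component that is absent or too old."""
    problems = []
    for name, required in MINIMUMS.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not found (need >= {required})")
        elif not at_least(found, required):
            problems.append(f"{name}: {found} is older than {required}")
    return problems


if __name__ == "__main__":
    print(missing_requirements({"ROCm": "5.7.1", "MIOpen": "2.20.0", "rocBLAS": "3.1.0"}))
```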

https://github.com/oneapi-src/oneDNN/blob/main/src/gpu/amd/README.md Support for AMD backend is implemented via SYCL HIP backend. The feature is disabled by default. Users must enable it at build time with a CMake option DNNL_GPU_VENDOR=AMD. The AMD GPUs can be used via oneDNN engine abstraction. The engine should be created using dnnl::engine::kind::gpu engine kind or the user can provide a sycl::device objects that corresponds to AMD GPUs.

Jan 03 '24 07:01 commandline-be

As said in the Feb 2023 comment "Even though some of them are available in oneDNN, it would require quite some work to specialize all operations for AMD GPUs." Since no one is making those changes, it won't move on.

Jan 03 '24 18:01 vince62s

I am not a developer but I work at AMD and handle developer relationships. We would like to assist with the effort to enable CTranslate2 for AMD dGPUs and iGPU. We will have engineers investigate, but we may also be able to provide hardware to the lead contributors of this effort. Please contact me via michael dot katz at amd dot com if this would help.

Mar 21 '24 13:03 katzmike

is there any update on this?

Jul 01 '24 17:07 radna0

> is there any update on this?

I suspect Lisa and Jensen have a deal that AMD only gets the crumbs of the AI pie, so there is nothing left for us peasants but to keep paying the Nvidia tax.

Jul 05 '24 12:07 kvrban

> So, would it make sense to create an experimental version of ctranslate2 using a more recent oneDNN that does have AMD GPU support?

Does that mean Intel Arc GPUs can also be supported?

Jul 13 '24 19:07 DDXDB

https://bbs.archlinux.org/viewtopic.php?pid=2183865#p2183865

Jul 14 '24 13:07 chboishabba

This has just been released: https://docs.scale-lang.com/

Could someone more technical see whether this toolkit would make running ctranslate2 on AMD possible?

Jul 30 '24 18:07 yeetmanpat

For whisper.cpp at least, it now supports vulkan as a gpu backend. With home assistant this is working well for me through https://github.com/ser/wyoming-whisper-api-client

Aug 02 '24 02:08 genehand

> For whisper.cpp at least, it now supports vulkan as a gpu backend. With home assistant this is working well for me through https://github.com/ser/wyoming-whisper-api-client

Personally, on my hardware, even with GPU acceleration, whisper.cpp is way slower than faster-whisper using the same model and CPU, and the transcription time is also very unpredictable.

Aug 02 '24 04:08 tannisroot

Try WhisperX if you are able to use faster-whisper; it does diarisation and has a better VAD...

> For whisper.cpp at least, it now supports vulkan as a gpu backend (https://github.com/ggerganov/whisper.cpp/pull/2302). With home assistant this is working well for me through https://github.com/ser/wyoming-whisper-api-client
>
> Personally, on my hardware, even with GPU acceleration, whisper.cpp is way slower than faster-whisper using the same model and CPU, and the transcription time is also very unpredictable.

Aug 02 '24 05:08 chboishabba

Diarisation


Aug 02 '24 05:08 chboishabba

> whisperx

I don't believe there is a way to hook it up to the Wyoming protocol, which is my sole use case for it.

Aug 02 '24 07:08 tannisroot

> Personally, on my hardware, even with GPU acceleration, whisper.cpp is way slower than faster-whisper using the same model and CPU, and the transcription time is also very unpredictable.

Alright, with ROCm 6.2 now supporting my GPU, I was curious to do a quick test. Using the medium model and this test file, here's what I'm seeing:

| Project | Backend | Beam size | Transcribe time |
|---|---|---|---|
| faster-whisper | cpu | 5 | 1m26.002s |
| whisper.cpp | vulkan | 5 | 1m24.906s |
| whisper.cpp | rocm | 5 | 59.834s |
| faster-whisper | rocm | 5 | 37.649s |

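This is with an i5-10400F and an RX 5700, using code adapted from the README:

```python
from faster_whisper import WhisperModel
model_size = "medium"
model = WhisperModel(model_size, device="cpu", compute_type="int8", cpu_threads=12)
segments, info = model.transcribe("tests/data/physicsworks.wav", beam_size=5, language="en")
for segment in segments:  # segments is lazy: transcription runs while iterating
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```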

Edit: Lowered cpu_threads from the default 16, with improved results.
Edit 2: Added faster-whisper with @arlo-phoenix's fork.
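To put the table in relative terms, the sketch below converts the `time`-style durations (e.g. `1m26.002s`) to seconds and divides the CPU baseline by each row. It only re-derives ratios from the numbers reported above for the i5-10400F / RX 5700 setup; the parsing helper is my own and only handles the `XmY.YYYs` and `Y.YYYs` forms that appear in the table.

```python
import re
from typing import Dict

# Transcribe times reported in the table above (medium model, beam size 5).
RESULTS: Dict[str, str] = {
    "faster-whisper / cpu": "1m26.002s",
    "whisper.cpp / vulkan": "1m24.906s",
    "whisper.cpp / rocm": "59.834s",
    "faster-whisper / rocm": "37.649s",
}


def to_seconds(duration: str) -> float:
    """Parse '1m26.002s' or '59.834s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes = int(match.group(1)) if match.group(1) else 0
    return minutes * 60 + float(match.group(2))


def speedups(results: Dict[str, str], baseline: str) -> Dict[str, float]:
    """Speedup of each entry relative to the baseline entry (higher is faster)."""
    base = to_seconds(results[baseline])
    return {name: base / to_seconds(t) for name, t in results.items()}


if __name__ == "__main__":
    for name, factor in speedups(RESULTS, "faster-whisper / cpu").items():
        print(f"{name:24s} {factor:.2f}x")
```

On these numbers, faster-whisper on ROCm comes out at roughly 2.3x the CPU run, whisper.cpp on ROCm at roughly 1.4x, and whisper.cpp on Vulkan is essentially on par with the CPU baseline.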

Aug 03 '24 21:08 genehand

> I am not a developer but I work at AMD and handle developer relationships. We would like to assist with the effort to enable CTranslate2 for AMD dGPUs and iGPU. We will have engineers investigate, but we may also be able to provide hardware to the lead contributors of this effort. Please contact me via michael dot katz at amd dot com if this would help.

Could a first step be to test this against ZLUDA (https://github.com/vosen/ZLUDA), which runs CUDA on AMD?

I look forward to being able to run ctranslate2 with GPU acceleration without having to buy an Nvidia card.

Aug 05 '24 11:08 commandline-be

I ported CTranslate2 over to ROCm. My fork is here: https://github.com/arlo-phoenix/CTranslate2-rocm and install instructions can be found under README_ROCM.md. I also wrote about the issues I had and the libraries using CT2 I tested.

Status Tracker

[x] faster whisper
[x] whisperX
[ ] bfloat16 (main blocker for upstreaming imo)
[ ] sync with upstream (I intentionally went back a couple commits to avoid having to deal with fa2 and AWQ)

Instead of using oneDNN I just hipified the repo and extracted HIP to CUDA function mapping to create a preprocessor solution similar to projects like llama.cpp. Besides the listed stuff it is feature complete and works very well. I included some benchmark scripts with the file from https://github.com/OpenNMT/CTranslate2/issues/1072#issuecomment-2267170398 (@genehand would be nice if you could try this and add the numbers to a table!). On my RX6800 I'm getting 11s-12s with faster_whisper and 4.2s with whisperX. For RDNA this should now be the fastest working whisper inference solution :)
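The preprocessor approach described above (keep the CUDA spellings in the source and alias them to HIP at compile time, as llama.cpp does) can be sketched as a small generator that emits a compatibility header of `#define` lines. This is a hypothetical sketch, not the fork's actual tooling: the mapping is a tiny sample of real CUDA/HIP name pairs, the header layout is made up for illustration, and a real header would also need the library include swaps (e.g. for hipBLAS) and any APIs without a one-to-one HIP counterpart.

```python
from typing import Dict, Iterable

# Hypothetical sample of CUDA -> HIP name pairs; a real mapping would be much larger.
CUDA_TO_HIP: Dict[str, str] = {
    "cudaError_t": "hipError_t",
    "cudaFree": "hipFree",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpyAsync": "hipMemcpyAsync",
    "cudaStreamSynchronize": "hipStreamSynchronize",
    "cudaStream_t": "hipStream_t",
    "cublasHandle_t": "hipblasHandle_t",  # needs the hipBLAS header in a real build
    "cublasSgemm": "hipblasSgemm",
}


def make_header(mapping: Dict[str, str], includes: Iterable[str] = ("hip/hip_runtime.h",)) -> str:
    """Emit a header that lets CUDA-spelled source compile against HIP via #define aliases."""
    lines = ["// Generated: CUDA names aliased to HIP names for HIP builds.", "#pragma once"]
    lines += [f"#include <{header}>" for header in includes]
    # Sorting keeps the generated file stable across runs, which gives cleaner diffs.
    lines += [f"#define {cuda} {hip}" for cuda, hip in sorted(mapping.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_header(CUDA_TO_HIP), end="")
```

Compared with rewriting the sources the way hipify does, aliasing leaves the CUDA code path untouched, which presumably makes it easier to keep in sync with upstream; anything without a direct HIP equivalent (here, the conv1d operator mentioned below) still has to be written by hand.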

Btw, should we split this issue up? It is two requests combined into one. I personally believe porting all operators to oneDNN is far too much effort and might not even lead to good performance. This repo hipified quite well; I was able to use simple defines from HIP to CUDA functions for the majority of the project. I only had to rewrite the conv1d operator from scratch since hipDNN isn't maintained anymore.

Aug 06 '24 17:08 arlo-phoenix

@arlo-phoenix Can you add the "issues" tab on your github so we can communicate that way? I'm possibly interested in incorporating this into my projects.

Aug 06 '24 18:08 BBC-Esq
