
Incorporating flash-attention2 [SOLVED] and subsequent testing [ONGOING]

Open BBC-Esq opened this issue 2 years ago • 21 comments

Hello all. Just thought I'd post a question about Flash Attention 2 here:

https://github.com/Dao-AILab/flash-attention

Apparently it's making big waves and seems very powerful. Does anyone plan on seeing if it's something that could be included?

BBC-Esq avatar Nov 30 '23 12:11 BBC-Esq

This isn't a faster-whisper issue.

It would require adding a lot of new C++ code to ctranslate2, so I highly doubt the devs have time.

phineas-pta avatar Nov 30 '23 20:11 phineas-pta

ctranslate2 is the backend of faster-whisper; all heavy computation is done in ctranslate2.

Similarly, pytorch is the backend of openai-whisper.

phineas-pta avatar Nov 30 '23 20:11 phineas-pta

Before talking about adopting flash attention, it is necessary to understand what flash attention v2 does and how it helps. Flash attention is a library that reorganises the attention computation, parallelises it, and reduces the amount of data movement in GPU memory. Implementing it requires detailed knowledge of GPU memory management, and even the original author of flash attention v2 has only managed to make it compatible with certain GPUs so far. Rather than focusing on making the flash attention library compatible with ctranslate2, which may or may not yield a speedup, it would make more sense to improve ctranslate2 further.
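
To make the "less data movement" point concrete, here is a toy NumPy sketch (purely illustrative; this is not CTranslate2 or flash-attention code): naive attention materialises the full score matrix, while a blocked version with an online softmax only ever holds one tile of scores at a time, which is the core idea flash attention applies to GPU SRAM/HBM traffic.

import numpy as np

def naive_attention(Q, K, V):
    # Materialises the full (N, N) score matrix before the softmax.
    S = (Q @ K.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def blocked_attention(Q, K, V, block=64):
    # Processes K/V in tiles with an online softmax, never forming the full (N, N) matrix.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, V.shape[-1]))
    row_max = np.full(N, -np.inf)
    row_sum = np.zeros(N)
    for start in range(0, N, block):
        S = (Q @ K[start:start + block].T) * scale            # scores for one tile only
        new_max = np.maximum(row_max, S.max(axis=-1))
        rescale = np.exp(row_max - new_max)                    # correct previous accumulators
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * rescale + P.sum(axis=-1)
        out = out * rescale[:, None] + P @ V[start:start + block]
        row_max = new_max
    return out / row_sum[:, None]

Both functions produce the same output; the only difference is how much intermediate data has to be kept live at once.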

blackpolarz avatar Dec 01 '23 11:12 blackpolarz

In some ways, you can think of faster-whisper as the customer support desk that routes your queries to the relevant departments. While you can make customer support more streamlined or more all-round, the actual technical work still depends on the technical team behind the scenes. In this case, ctranslate2 is the technical team. I won't say that faster-whisper doesn't add any functionality: it does a bit of preprocessing to put the audio into a format ctranslate2 can understand, then gathers whatever ctranslate2 returns and puts it into a format we as users can understand. This is different from the insanely-fast-whisper repo, which only adds CLI support to make it easier for users to use and understand.

blackpolarz avatar Dec 01 '23 12:12 blackpolarz

there is some discussion here on ctranslate2: https://github.com/OpenNMT/CTranslate2/issues/1300

junchen6072 avatar Dec 08 '23 08:12 junchen6072

As an update - the latest versions of ctranslate2 do support this. As such, it's simply a matter of adding a flash_attention flag to https://github.com/SYSTRAN/faster-whisper/blob/4acdb5c619711eb9c0e1779e6fb1a6ff3d68d61b/faster_whisper/transcribe.py#L144

As per https://github.com/OpenNMT/CTranslate2/issues/1300#issuecomment-2047318118

I have verified this locally on my own copy of faster-whisper and it works just fine!

jet082 avatar May 17 '24 21:05 jet082

Is this completed @trungkienbkhn? I do not see the relevant commit. It seems like faster whisper needs an argument added to model construction.

jet082 avatar May 20 '24 04:05 jet082

@jet082 , It's supported from version 1.0.2 and is related to this PR. This PR allows adding new model args from ctranslate2 to the FW, even if they have not been defined in the FW. You can try this implementation:

from faster_whisper import WhisperModel
model = WhisperModel('large-v3', device="cuda", flash_attention=True)

Ctranslate2 docs: https://opennmt.net/CTranslate2/python/ctranslate2.models.Whisper.html
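
For completeness, a minimal end-to-end sketch (the audio file name audio.wav and the printed fields are illustrative placeholders):

from faster_whisper import WhisperModel

# Load large-v3 on GPU with CTranslate2's flash attention kernels enabled.
model = WhisperModel("large-v3", device="cuda", compute_type="float16", flash_attention=True)

# segments is a lazy generator; decoding happens as you iterate over it.
segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")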

trungkienbkhn avatar May 20 '24 06:05 trungkienbkhn

@jet082 After enabling flash_attention, how much of a speed boost did you notice?

twicer-is-coder avatar May 20 '24 17:05 twicer-is-coder

@jet082 , It's supported from version 1.0.2 and is related to this PR. This PR allows adding new model args from ctranslate2 to the FW, even if they have not been defined in the FW. You can try this implementation:

from faster_whisper import WhisperModel
model = WhisperModel('large-v3', device="cuda", flash_attention=True)

Ctranslate2 docs: https://opennmt.net/CTranslate2/python/ctranslate2.models.Whisper.html

My apologies, I was a version behind - this works great!

jet082 avatar May 21 '24 08:05 jet082

Can you share the performance improvement on Flash Attention?

twicer-is-coder avatar May 21 '24 09:05 twicer-is-coder

Regarding benchmarking faster-whisper with FA: I did not see any speed boost when loading the faster-whisper model with FA. I shared the full details, including a code snippet, in the issue that originated in the ctranslate2 repo: https://github.com/OpenNMT/CTranslate2/issues/1300.

In short, when transcribing ~50 sec of audio with the default faster-whisper params (beam_size=5) there is no difference whether you use FA or not; the inference times measured were almost identical. Has anyone noticed the same behavior? If not, what am I doing wrong?
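
For reference, the comparison looks roughly like this (a minimal sketch; the file name audio.wav is a placeholder and the exact script is in the linked CTranslate2 issue):

import time
from faster_whisper import WhisperModel

def timed_transcribe(flash_attention, audio="audio.wav"):
    # Load the model with or without CTranslate2's flash attention kernels.
    model = WhisperModel("large-v3", device="cuda", compute_type="float16",
                         flash_attention=flash_attention)
    start = time.perf_counter()
    segments, _ = model.transcribe(audio, beam_size=5)
    _ = [s.text for s in segments]  # segments is lazy; iterate to force decoding
    return time.perf_counter() - start

print("FA off:", timed_transcribe(False))
print("FA on: ", timed_transcribe(True))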

Thanks in advance,

@Purfview @BBC-Esq

AvivSham avatar Jun 06 '24 10:06 AvivSham

Yes, I didn't notice any difference in speed either when FA was enabled.

twicer-is-coder avatar Jun 06 '24 11:06 twicer-is-coder

@Napuh maybe you can help with this?

AvivSham avatar Jun 06 '24 19:06 AvivSham

@BBC-Esq thanks! I'm using a single sample for inference; I just wonder whether the lack of a performance gain is fundamental or originates in the implementation.

@Purfview @Napuh @trungkienbkhn Could you please assist here?

AvivSham avatar Jun 09 '24 11:06 AvivSham

For information, I have executed a benchmark for FlashAttention with an NVIDIA H100 GPU and the large-v3 model. Below are the results:

1. Speed benchmark: audio duration 13:19.231, detected language 'fr' with probability 1.00

System                        Beam_size=5    Beam_size=1
Faster-Whisper                34.512 s       27.190 s
FW with FlashAttention        33.751 s       26.607 s

2. WER benchmark: dataset librispeech_asr, 500 audio files used for evaluation

System                        Beam_size=5    Beam_size=1
Faster-Whisper                2.649          2.325
FW with FlashAttention        2.774          2.252

3. Memory benchmark: GPU NVIDIA H100 PCIe, device index 0

System                                   Max RAM increase   Max GPU memory usage    Max GPU power usage
Faster-Whisper (beam_size=5)             1257 MiB           5178 MiB / 81559 MiB    157 W / 350 W
FW with FlashAttention (beam_size=5)     1251 MiB           4954 MiB / 81559 MiB    153 W / 350 W
Faster-Whisper (beam_size=1)             1243 MiB           4634 MiB / 81559 MiB    164 W / 350 W
FW with FlashAttention (beam_size=1)     1327 MiB           4602 MiB / 81559 MiB    164 W / 350 W

=> Speed improved a bit with FlashAttention. Note that FlashAttention is currently only supported on Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).

trungkienbkhn avatar Jun 10 '24 11:06 trungkienbkhn

@trungkienbkhn Thank you for running this benchmark! Since we see marginal improvement (which can be explained by environment inconsistency), I wonder if this is the expected behavior or if there is an implementation issue.

AvivSham avatar Jun 10 '24 13:06 AvivSham

Correct me if I'm wrong, but all of those results seem within the margin of error. Just to confirm, you're running ctranslate2 4.3? 4.3.1 just came out, but I don't think it'd make a difference.

I ran it with ctranslate2 4.2.1, but after discussing with @minhthuc2502, he confirmed that there is no difference when running FW between 4.2.1 and 4.3.0.

Since we see marginal improvement (which can be explained by environment inconsistency), I wonder if this is the expected behavior or if there is an implementation issue.

Yes, it's expected behavior. There has been an improvement, but not much. I think it might be because the number of input tokens is not large enough (a maximum of only 30 seconds per segment) and the FW large-v3 model is not very large (3 GB).
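
As a rough back-of-envelope (illustrative numbers, not a measurement): a 30-second segment corresponds to about 1500 encoder positions in Whisper, so a single fp16 attention score matrix per head is only a few MiB, which limits how much the reduced memory traffic from flash attention can matter.

frames = 1500            # encoder positions for one 30 s Whisper segment
bytes_per_score = 2      # fp16
print(frames * frames * bytes_per_score / 2**20, "MiB per head")  # ~4.3 MiB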

trungkienbkhn avatar Jun 10 '24 15:06 trungkienbkhn

I tried to test Phi3-128k to see whether FA works for extremely long prompt lengths, but I couldn't get it to run correctly. Does anyone have a conversion of the Phi3-mini-128k or Phi3-small-8k models by chance? It seems like I can convert the 128k mini Phi3, even though it still throws an error when I try to run inference on it, but I can't even convert the small-8k version. Here's my conversion script for anyone's enjoyment. Please let me know. convert_ctranslate2.txt

Just FYI, converting these models requires ctranslate2==4.3+

You should post this question in the issues of the ctranslate2 repo.

trungkienbkhn avatar Jun 16 '24 15:06 trungkienbkhn

Hi guys,

I'm currently trying to use whisper with ct2 and flash attention, following @trungkienbkhn's response above. However, I always get the line "Flash attention 2 is not supported" when trying to run inference on some samples. Here is my environment:

  • A6000, CUDA 12.3, cuDNN 9.0, Python 3.10
  • Flash attention version 2.7.0.post2 (installed using the default install command).

And these are my steps to run inference:

  • Load the whisper model using huggingface
  • Convert it to ct2 with: ct2-transformers-converter --model models/whisper-large-v3-turbo --output_dir custom-faster-whisper-large-v3-turbo --copy_files tokenizer.json preprocessor_config.json --quantization float16
  • Finally, load it with: from faster_whisper import WhisperModel; model = WhisperModel('./models/whisper-large-v3-turbo', device="cuda", compute_type='float16', flash_attention=True)

What could I have done incorrectly? Please help! Thank you in advance <3
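
For reference, a quick environment check sketch (assumptions noted in the comments: it only confirms the ctranslate2 version and that the GPU generation is Ampere or newer, which flash attention requires; it cannot tell whether the installed ctranslate2 build was actually compiled with flash attention support):

import ctranslate2
import torch

print("ctranslate2 version:", ctranslate2.__version__)    # flash_attention requires a recent 4.x release
major, minor = torch.cuda.get_device_capability(0)         # A6000 reports (8, 6), i.e. Ampere
print("GPU compute capability:", f"{major}.{minor}")       # flash attention 2 needs >= 8.0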

davidan208 avatar Nov 17 '24 04:11 davidan208

Same problem here.

virtualmartire avatar Dec 04 '24 18:12 virtualmartire