Incorporating flash-attention2 [SOLVED] and subsequent testing [ONGOING]
Hello all. Just thought I'd post a question about Flash Attention 2 here:
https://github.com/Dao-AILab/flash-attention
Apparently it's making big waves and seems very powerful. Does anyone plan on seeing if it's something that could be included?
This is not a faster-whisper issue.
It would require a lot of new C++ code in CTranslate2, so I highly doubt the devs have time.
CTranslate2 is the backend of faster-whisper; all heavy computation is done in CTranslate2.
Likewise, PyTorch is the backend of openai-whisper.
Before talking about adopting flash attention, it is necessary to understand what FlashAttention v2 does and how it helps. FlashAttention is a library that reorganises the attention computation, parallelises the workflow, and reduces the amount of data movement in GPU memory. This requires the implementer to have deep knowledge of GPU memory management, and even the original implementation of FlashAttention v2 is only compatible with certain GPUs as of today. Rather than focusing on making the FlashAttention library compatible with ctranslate2, which may or may not help with the speedup, it would make more sense to improve ctranslate2 further.
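As a rough illustration (a minimal sketch, not ctranslate2 code, assuming PyTorch 2.x and a CUDA GPU), the difference is between materialising the full attention score matrix in GPU memory and letting a fused kernel such as FlashAttention stream over it in tiles; PyTorch's `scaled_dot_product_attention` can dispatch to such a fused kernel on supported GPUs:

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materialises the full (seq_len x seq_len) score matrix in GPU memory;
    # this extra memory traffic is exactly what FlashAttention avoids.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Shapes chosen arbitrarily for illustration: (batch, heads, seq_len, head_dim)
q = k = v = torch.randn(1, 8, 1500, 64, device="cuda", dtype=torch.float16)

# On supported GPUs PyTorch can dispatch this call to a fused attention kernel;
# numerically it matches the naive version up to fp16 tolerance.
fused = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(fused, naive_attention(q, k, v), atol=1e-2))
```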
In some way, you can think of faster-whisper as the customer support desk that routes your queries to the relevant departments. While you can make the customer support more streamlined or all-round, the actual technical work still depends on the technical team behind the scenes. In this case, ctranslate2 is the technical team. I won't say that faster-whisper doesn't add any functionality: it does a bit of preprocessing to put the input into a format ctranslate2 can understand, then gathers whatever ctranslate2 returns and puts it into a format we as users can understand. This is different from the insanely faster whisper repo, where they only added CLI support to make it easier for users to use.
there is some discussion here on ctranslate2: https://github.com/OpenNMT/CTranslate2/issues/1300
As an update - the latest versions of ctranslate2 do support this. As such, it's simply a matter of adding a flash_attention flag to https://github.com/SYSTRAN/faster-whisper/blob/4acdb5c619711eb9c0e1779e6fb1a6ff3d68d61b/faster_whisper/transcribe.py#L144
As per https://github.com/OpenNMT/CTranslate2/issues/1300#issuecomment-2047318118
I have verified this locally on my own copy of faster-whisper and it works just fine!
Is this completed @trungkienbkhn? I do not see the relevant commit. It seems like faster whisper needs an argument added to model construction.
@jet082 , It's supported from version 1.0.2 and is related to this PR. This PR allows adding new model args from ctranslate2 to the FW, even if they have not been defined in the FW. You can try this implementation:
```python
from faster_whisper import WhisperModel

model = WhisperModel('large-v3', device="cuda", flash_attention=True)
```
Ctranslate2 docs: https://opennmt.net/CTranslate2/python/ctranslate2.models.Whisper.html
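For completeness, a minimal end-to-end call might look like this (the audio path is a placeholder):

```python
from faster_whisper import WhisperModel

# flash_attention is forwarded to the underlying ctranslate2.models.Whisper
model = WhisperModel("large-v3", device="cuda", compute_type="float16", flash_attention=True)

# "audio.wav" is a placeholder; use any local audio file
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```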
@jet082 After enabling flash_attention, how much of a speed boost did you notice?
My apologies, I was a version behind - this works great!
Can you share the performance improvement with Flash Attention?
Regarding benchmarking faster-whisper with FA: I did not see any speed boost when loading the faster-whisper model with FA. I shared the full details, including a code snippet, in this issue in the ctranslate2 repo: https://github.com/OpenNMT/CTranslate2/issues/1300.
In short, when transcribing a ~50-second audio clip with the default faster-whisper params (beam_size=5), there is no difference whether FA is used or not; the measured inference times were almost identical. Has anyone noticed the same behavior? If not, what am I doing wrong?
Thanks in advance,
@Purfview @BBC-Esq
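A minimal timing harness along these lines (a sketch; the audio path and repetition count are placeholders, not the exact script from the linked issue) could be:

```python
import time
from faster_whisper import WhisperModel

def time_transcription(flash_attention: bool, audio="sample.wav", runs=3):
    model = WhisperModel("large-v3", device="cuda", compute_type="float16",
                         flash_attention=flash_attention)
    list(model.transcribe(audio, beam_size=5)[0])  # warm-up run, not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        segments, _ = model.transcribe(audio, beam_size=5)
        list(segments)  # transcription is lazy; consume the generator to time it
        timings.append(time.perf_counter() - start)
    return min(timings)

print("FA off:", time_transcription(False))
print("FA on :", time_transcription(True))
```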
Yes, I didn't notice any difference in speed either when FA was enabled.
@Napuh maybe you can help with this?
@BBC-Esq thanks! I'm using a single sample for inference; I just wonder whether the lack of speedup is fundamental or originates in the implementation.
@Purfview @Napuh @trungkienbkhn Could you please assist here?
For information, I have executed a benchmark for FlashAttention with an NVIDIA H100 GPU and the large-v3 model. Below are the results:
1. Speed benchmark. Audio duration: 13:19.231s. Detected language 'fr' with probability 1.00.
| System | Beam_size=5 | Beam_size=1 |
|---|---|---|
| Faster-Whisper | 34.512s | 27.190s |
| FW with FlashAttention | 33.751s | 26.607s |
2. WER benchmark. Dataset: librispeech_asr. Number of audio files used for evaluation: 500.
| System | Beam_size=5 | Beam_size=1 |
|---|---|---|
| Faster-Whisper | 2.649 | 2.325 |
| FW with FlashAttention | 2.774 | 2.252 |
3. Memory benchmark. GPU name: NVIDIA H100 PCIe. GPU device index: 0.
| System | Maximum increase of RAM | Maximum GPU memory usage | Maximum GPU power usage |
|---|---|---|---|
| Faster-Whisper (beam_size=5) | 1257 MiB | 5178MiB / 81559MiB | 157W / 350W |
| FW with FlashAttention (beam_size=5) | 1251 MiB | 4954MiB / 81559MiB | 153W / 350W |
| Faster-Whisper (beam_size=1) | 1243 MiB | 4634MiB / 81559MiB | 164W / 350W |
| FW with FlashAttention (beam_size=1) | 1327 MiB | 4602MiB / 81559MiB | 164W / 350W |
=> Speed improved a bit with FlashAttention. Note that currently, FlashAttention is only supported on Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).
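The benchmark scripts themselves are not shown; for reference, GPU memory and power numbers like the ones above can be sampled with pynvml (a sketch, assuming the pynvml / nvidia-ml-py package is installed):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU device index 0

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)       # bytes
power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # milliwatts

print(f"GPU memory: {mem.used // 2**20}MiB / {mem.total // 2**20}MiB")
print(f"GPU power : {power_mw / 1000:.0f}W")

pynvml.nvmlShutdown()
```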
@trungkienbkhn Thank you for running this benchmark! Since we see marginal improvement (which can be explained by environment inconsistency), I wonder if this is the expected behavior or if there is an implementation issue.
Correct me if I'm wrong, but all of those results seem within the margin of error. Just to confirm, you're running ctranslate2 4.3? 4.3.1 just came out, but I don't think it'd make a difference.
I executed with ctranslate2 4.2.1, but after discussing with @minhthuc2502, he confirmed that there is no difference when running FW between 4.2.1 and 4.3.0.
Since we see marginal improvement (which can be explained by environment inconsistency), I wonder if this is the expected behavior or if there is an implementation issue.
Yes, it's expected behavior. There has been an improvement, but not much. I think it might be because the number of input tokens is not large enough (each segment is at most 30 seconds) and the FW large-v3 model is not too large in size (about 3 GB).
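For intuition, the sequence each 30-second segment feeds through the encoder is fairly short, which limits how much a fused attention kernel can save (rough arithmetic using Whisper's standard constants):

```python
# Rough arithmetic for one Whisper segment (standard model constants)
sample_rate = 16_000      # Hz
segment_seconds = 30
hop_length = 160          # audio samples per mel frame
conv_stride = 2           # the encoder's second conv layer downsamples by 2

mel_frames = segment_seconds * sample_rate // hop_length  # 3000
encoder_tokens = mel_frames // conv_stride                # 1500
print(encoder_tokens)  # only ~1500 positions per segment, so attention matrices stay small
```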
I tried Phi3-128k to test whether FA works for extremely long prompt lengths; however, I couldn't get it to run correctly. Does anyone have a conversion of the Phi3-mini-128k or Phi3-small-8k models by chance? It seems like I can convert the 128k mini Phi3, although it still throws an error when I try to run inference on it, but I can't even convert the small-8k version. Here's my conversion script for anyone's enjoyment. Please let me know. convert_ctranslate2.txt
Just FYI, converting these models requires ctranslate2==4.3+
You should post this question in the issues of the ctranslate2 repo.
Hi guys,
I'm currently trying to use Whisper with ct2 and flash attention as in @trungkienbkhn's response. However, I always get the line "Flash attention 2 is not supported" when trying to run inference on some samples. Here is my environment:
- A6000, CUDA 12.3, cuDNN 9.0, Python 3.10
- Flash attention version 2.7.0.post2 (after using the default install line).
And these are my steps to run inference:
- Load whisper model using huggingface
- Convert to ct2 with this line:
```
ct2-transformers-converter --model models/whisper-large-v3-turbo --output_dir custom-faster-whisper-large-v3-turbo \
    --copy_files tokenizer.json preprocessor_config.json --quantization float16
```
- Finally, I use:
```python
from faster_whisper import WhisperModel

model = WhisperModel('./models/whisper-large-v3-turbo', device="cuda", compute_type='float16', flash_attention=True)
```
What could I have done incorrectly? Please help! Thank you in advance <3
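One quick sanity check (a sketch, assuming PyTorch is installed alongside faster-whisper) is to confirm the GPU meets FlashAttention's hardware requirement (Ampere, Ada, or Hopper, i.e. compute capability 8.0+) and to print the installed ctranslate2 version, since support also depends on how the ctranslate2 binary was built:

```python
import torch
import ctranslate2

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")             # needs >= 8.0 for FlashAttention
print(f"ctranslate2 version: {ctranslate2.__version__}")  # FA support also depends on the build
```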
same problem here