
Incorporating flash-attention2 [SOLVED] and subsequent testing [ONGOING]

Open BBC-Esq opened this issue 2 years ago • 21 comments

Hello all. Just thought I'd post a question about Flash Attention 2 here:

https://github.com/Dao-AILab/flash-attention

Apparently it's making big waves and seems very powerful. Does anyone plan on seeing if it's something that could be included?

BBC-Esq avatar Nov 30 '23 12:11 BBC-Esq

This isn't a faster-whisper issue.

It would require adding a lot of new C++ code to ctranslate2, so I highly doubt the devs have time.

phineas-pta avatar Nov 30 '23 20:11 phineas-pta

ctranslate2 is the backend of faster-whisper; all heavy computation is done in ctranslate2.

Similarly, pytorch is the backend of openai-whisper.

phineas-pta avatar Nov 30 '23 20:11 phineas-pta

Before talking about adopting flash attention, it is necessary to understand what flash attention v2 does and how it helps. Flash attention is a library that reorganises the attention computation, parallelises it, and reduces the amount of data movement in GPU memory. Implementing it requires detailed knowledge of GPU memory management, and even the original author of flash attention v2 has only managed to make it compatible with certain GPUs so far. Rather than focusing on making the flash attention library compatible with ctranslate2, which may or may not yield a speedup, it would make more sense to improve ctranslate2 further.
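
To make the "less data movement" point concrete, here is a toy NumPy sketch (purely illustrative; this is not CTranslate2 or flash-attention code): naive attention materialises the full score matrix, while a blocked version with an online softmax only ever holds one tile of scores at a time, which is the core idea flash attention applies to GPU SRAM/HBM traffic.

import numpy as np

def naive_attention(Q, K, V):
    # Materialises the full (N, N) score matrix before the softmax.
    S = (Q @ K.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def blocked_attention(Q, K, V, block=64):
    # Processes K/V in tiles with an online softmax, never forming the full (N, N) matrix.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((N, V.shape[-1]))
    row_max = np.full(N, -np.inf)
    row_sum = np.zeros(N)
    for start in range(0, N, block):
        S = (Q @ K[start:start + block].T) * scale            # scores for one tile only
        new_max = np.maximum(row_max, S.max(axis=-1))
        rescale = np.exp(row_max - new_max)                    # correct previous accumulators
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * rescale + P.sum(axis=-1)
        out = out * rescale[:, None] + P @ V[start:start + block]
        row_max = new_max
    return out / row_sum[:, None]

Both functions produce the same output; the only difference is how much intermediate data has to be kept live at once.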

blackpolarz avatar Dec 01 '23 11:12 blackpolarz

In some ways, you can think of faster-whisper as the customer support desk that routes your queries to the relevant departments. While you can make customer support more streamlined or more all-round, the actual technical work still depends on the technical team behind the scenes. In this case, ctranslate2 is the technical team. I won't say that faster-whisper doesn't add any functionality: it does a bit of preprocessing to put the audio into a format ctranslate2 can understand, then gathers whatever ctranslate2 returns and puts it into a format we as users can understand. This is different from the insanely-fast-whisper repo, which only adds CLI support to make it easier for users to use and understand.

blackpolarz avatar Dec 01 '23 12:12 blackpolarz

there is some discussion here on ctranslate2: https://github.com/OpenNMT/CTranslate2/issues/1300

junchen6072 avatar Dec 08 '23 08:12 junchen6072

As an update - the latest versions of ctranslate2 do support this. As such, it's simply a matter of adding a flash_attention flag to https://github.com/SYSTRAN/faster-whisper/blob/4acdb5c619711eb9c0e1779e6fb1a6ff3d68d61b/faster_whisper/transcribe.py#L144

As per https://github.com/OpenNMT/CTranslate2/issues/1300#issuecomment-2047318118

I have verified this locally on my own copy of faster-whisper and it works just fine!

jet082 avatar May 17 '24 21:05 jet082

Is this completed @trungkienbkhn? I do not see the relevant commit. It seems like faster whisper needs an argument added to model construction.

jet082 avatar May 20 '24 04:05 jet082

@jet082 , It's supported from version 1.0.2 and is related to this PR. This PR allows adding new model args from ctranslate2 to the FW, even if they have not been defined in the FW. You can try this implementation:

from faster_whisper import WhisperModel
model = WhisperModel('large-v3', device="cuda", flash_attention=True)

Ctranslate2 docs: https://opennmt.net/CTranslate2/python/ctranslate2.models.Whisper.html
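
For completeness, a minimal end-to-end sketch (the audio file name audio.wav and the printed fields are illustrative placeholders):

from faster_whisper import WhisperModel

# Load large-v3 on GPU with CTranslate2's flash attention kernels enabled.
model = WhisperModel("large-v3", device="cuda", compute_type="float16", flash_attention=True)

# segments is a lazy generator; decoding happens as you iterate over it.
segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")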

trungkienbkhn avatar May 20 '24 06:05 trungkienbkhn

@jet082 After enabling flash_attention, how much of a speed boost did you notice?

twicer-is-coder avatar May 20 '24 17:05 twicer-is-coder

@jet082 , It's supported from version 1.0.2 and is related to this PR. This PR allows adding new model args from ctranslate2 to the FW, even if they have not been defined in the FW. You can try this implementation:

from faster_whisper import WhisperModel
model = WhisperModel('large-v3', device="cuda", flash_attention=True)

Ctranslate2 docs: https://opennmt.net/CTranslate2/python/ctranslate2.models.Whisper.html

My apologies, I was a version behind - this works great!

jet082 avatar May 21 '24 08:05 jet082

Can you share the performance improvement on Flash Attention?

twicer-is-coder avatar May 21 '24 09:05 twicer-is-coder

Regarding benchmarking faster-whisper with FA: I did not see any speed boost when loading the faster-whisper model with FA. I shared the full details, including a code snippet, in the issue that originated in the ctranslate2 repo: https://github.com/OpenNMT/CTranslate2/issues/1300.

In short, when transcribing ~50 sec of audio with the default faster-whisper params (beam_size=5) there is no difference whether you use FA or not; the inference times measured were almost identical. Has anyone noticed the same behavior? If not, what am I doing wrong?
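
For reference, the comparison looks roughly like this (a minimal sketch; the file name audio.wav is a placeholder and the exact script is in the linked CTranslate2 issue):

import time
from faster_whisper import WhisperModel

def timed_transcribe(flash_attention, audio="audio.wav"):
    # Load the model with or without CTranslate2's flash attention kernels.
    model = WhisperModel("large-v3", device="cuda", compute_type="float16",
                         flash_attention=flash_attention)
    start = time.perf_counter()
    segments, _ = model.transcribe(audio, beam_size=5)
    _ = [s.text for s in segments]  # segments is lazy; iterate to force decoding
    return time.perf_counter() - start

print("FA off:", timed_transcribe(False))
print("FA on: ", timed_transcribe(True))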

Thanks in advance,

@Purfview @BBC-Esq

AvivSham avatar Jun 06 '24 10:06 AvivSham

Yes, I didn't notice any difference in speed either when FA was enabled.

twicer-is-coder avatar Jun 06 '24 11:06 twicer-is-coder

@Napuh maybe you can help with this?

AvivSham avatar Jun 06 '24 19:06 AvivSham

@BBC-Esq thanks! I'm using a single sample for inference; I just wonder whether the lack of a performance gain is fundamental or originates in the implementation.

@Purfview @Napuh @trungkienbkhn Could you please assist here?

AvivSham avatar Jun 09 '24 11:06 AvivSham

For information, I have executed a benchmark for FlashAttention with an NVIDIA H100 GPU and the large-v3 model. Below are the results:

1. Speed benchmark: audio duration 13:19.231, detected language 'fr' with probability 1.00

System                        Beam_size=5    Beam_size=1
Faster-Whisper                34.512 s       27.190 s
FW with FlashAttention        33.751 s       26.607 s

2. WER benchmark: dataset librispeech_asr, 500 audio files used for evaluation

System                        Beam_size=5    Beam_size=1
Faster-Whisper                2.649          2.325
FW with FlashAttention        2.774          2.252

3. Memory benchmark: GPU NVIDIA H100 PCIe, device index 0

System                                   Max RAM increase   Max GPU memory usage    Max GPU power usage
Faster-Whisper (beam_size=5)             1257 MiB           5178 MiB / 81559 MiB    157 W / 350 W
FW with FlashAttention (beam_size=5)     1251 MiB           4954 MiB / 81559 MiB    153 W / 350 W
Faster-Whisper (beam_size=1)             1243 MiB           4634 MiB / 81559 MiB    164 W / 350 W
FW with FlashAttention (beam_size=1)     1327 MiB           4602 MiB / 81559 MiB    164 W / 350 W

=> Speed improved a bit with FlashAttention. Note that FlashAttention is currently only supported on Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).

trungkienbkhn avatar Jun 10 '24 11:06 trungkienbkhn

@trungkienbkhn Thank you for running this benchmark! Since we see marginal improvement (which can be explained by environment inconsistency), I wonder if this is the expected behavior or if there is an implementation issue.

AvivSham avatar Jun 10 '24 13:06 AvivSham

Correct me if I'm wrong, but all of those results seem within the margin of error. Just to confirm, you're running ctranslate2 4.3? 4.3.1 just came out, but I don't think it'd make a difference.

I ran it with ctranslate2 4.2.1, but after discussing with @minhthuc2502, he confirmed that there is no difference when running FW between 4.2.1 and 4.3.0.

Since we see marginal improvement (which can be explained by environment inconsistency), I wonder if this is the expected behavior or if there is an implementation issue.

Yes, it's expected behavior. There has been an improvement, but not much. I think it might be because the number of input tokens is not large enough (a maximum of only 30 seconds per segment) and the FW large-v3 model is not very large (3 GB).
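
As a rough back-of-envelope (illustrative numbers, not a measurement): a 30-second segment corresponds to about 1500 encoder positions in Whisper, so a single fp16 attention score matrix per head is only a few MiB, which limits how much the reduced memory traffic from flash attention can matter.

frames = 1500            # encoder positions for one 30 s Whisper segment
bytes_per_score = 2      # fp16
print(frames * frames * bytes_per_score / 2**20, "MiB per head")  # ~4.3 MiB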

trungkienbkhn avatar Jun 10 '24 15:06 trungkienbkhn

I tried to test Phi3-128k to see whether FA works for extremely long prompt lengths, but I couldn't get it to run correctly. Does anyone have a conversion of the Phi3-mini-128k or Phi3-small-8k models by chance? It seems like I can convert the 128k mini Phi3, even though it still throws an error when I try to run inference on it, but I can't even convert the small-8k version. Here's my conversion script for anyone's enjoyment. Please let me know. convert_ctranslate2.txt

Just FYI, converting these models requires ctranslate2==4.3+

You should post this question in the issues of the ctranslate2 repo.

trungkienbkhn avatar Jun 16 '24 15:06 trungkienbkhn

Hi guys,

I'm currently trying to use whisper with ct2 and flash attention, following @trungkienbkhn's response above. However, I always get the line "Flash attention 2 is not supported" when trying to run inference on some samples. Here is my environment:

  • A6000, CUDA 12.3, cuDNN 9.0, Python 3.10
  • Flash attention version 2.7.0.post2 (installed using the default install command).

And these are my steps to run inference:

  • Load the whisper model using huggingface
  • Convert it to ct2 with: ct2-transformers-converter --model models/whisper-large-v3-turbo --output_dir custom-faster-whisper-large-v3-turbo --copy_files tokenizer.json preprocessor_config.json --quantization float16
  • Finally, load it with: from faster_whisper import WhisperModel; model = WhisperModel('./models/whisper-large-v3-turbo', device="cuda", compute_type='float16', flash_attention=True)

What could I have done incorrectly? Please help! Thank you in advance <3
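
For reference, a quick environment check sketch (assumptions noted in the comments: it only confirms the ctranslate2 version and that the GPU generation is Ampere or newer, which flash attention requires; it cannot tell whether the installed ctranslate2 build was actually compiled with flash attention support):

import ctranslate2
import torch

print("ctranslate2 version:", ctranslate2.__version__)    # flash_attention requires a recent 4.x release
major, minor = torch.cuda.get_device_capability(0)         # A6000 reports (8, 6), i.e. Ampere
print("GPU compute capability:", f"{major}.{minor}")       # flash attention 2 needs >= 8.0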

davidan208 avatar Nov 17 '24 04:11 davidan208

Same problem here.

virtualmartire avatar Dec 04 '24 18:12 virtualmartire