BBC-Esq


BTW, I just haven't had the time to update my whispers2t batch repo with this bad boy, so stay tuned. ;-) ![image](https://github.com/OpenNMT/CTranslate2/assets/108230321/5d51a770-d412-49e7-b1db-fc58e10aff4c) It allows you to specify the task, choose any ctranslate2...

Last post, I promise...but here's my analysis of ```WhisperS2T```. I believe my repo uses a traditional "loop" to process files with WhisperS2T...but you can also send a batch directly, as sketched below...
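
For reference, here is a minimal sketch of the "send a batch directly" path, based on the parameter names in the WhisperS2T README; the file paths and batch size are just placeholders:

```python
import whisper_s2t

# Load a Whisper model on the CTranslate2 backend.
model = whisper_s2t.load_model(model_identifier="large-v2", backend="CTranslate2")

# Instead of looping file by file, hand the whole batch over in one call.
files = ["audio_1.wav", "audio_2.wav", "audio_3.wav"]   # placeholder paths
lang_codes = ["en"] * len(files)                        # one language code per file
tasks = ["transcribe"] * len(files)                     # or "translate"
initial_prompts = [None] * len(files)

out = model.transcribe_with_vad(
    files,
    lang_codes=lang_codes,
    tasks=tasks,
    initial_prompts=initial_prompts,
    batch_size=16,
)

# out[i] holds the utterances for files[i]; each utterance dict includes a "text" key.
print(out[0][0]["text"])
```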

WhisperS2T basically uses CTranslate2 directly.
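
To illustrate what "directly" means, here is a rough sketch of CTranslate2's own Whisper API, following the pattern in the CTranslate2 docs; the model directory and audio path are placeholders:

```python
import ctranslate2
import librosa
import transformers

# Load a ctranslate2-converted Whisper model on the GPU.
model = ctranslate2.models.Whisper("whisper-large-v2-ct2", device="cuda")  # placeholder dir

# Compute the log-mel features with the Hugging Face processor.
processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-large-v2")
audio, _ = librosa.load("audio_1.wav", sr=16000, mono=True)  # placeholder path
inputs = processor(audio, return_tensors="np", sampling_rate=16000)
features = ctranslate2.StorageView.from_array(inputs.input_features)

# The prompt tokens encode language and task; then generate and decode.
prompt = processor.tokenizer.convert_tokens_to_ids(
    ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
)
results = model.generate(features, [prompt])
print(processor.decode(results[0].sequences_ids[0]))
```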

Updated graph here including llama.cpp, which, apparently, is faster but uses slightly more VRAM...except for the 13B model, where it's about 3 GB higher. Plus, the numbers changed somewhat because I ran each...
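
For context, one common way to run the llama.cpp side of a comparison like this from Python is llama-cpp-python; a minimal sketch, with the GGUF path and settings as assumptions rather than the exact setup behind the graph:

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads as many layers as possible to the GPU.
llm = Llama(
    model_path="llama-2-13b-chat.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,
    n_ctx=4096,
)

output = llm("Explain flash attention in one paragraph.", max_tokens=256)
print(output["choices"][0]["text"])
```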

Completed graphs below. A few highlights: 1) For unknown reasons, neural chat without flash attention posts unexpectedly high tokens per second at beam size 4, although its VRAM usage...
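
For anyone reproducing numbers like these, here is a rough sketch of how tokens per second and peak VRAM can be measured with ```transformers```; this is an assumption about the method, not the exact harness behind the graphs, and the model id is a guess at the "neural chat" checkpoint:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/neural-chat-7b-v3-1"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Write a short story about a benchmark.", return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=512, num_beams=4)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```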

> Hello, what is the average seq_length in your benchmark? Flash attention only performs better for long prompts. Great question! I don't know if by ```seq_length```...

I tested all of the models yet again this morning, one right after the other, without opening/closing other programs, etc. Here are the results...They seem to confirm yet again the unique...

To further illustrate...here is a chart for ```transformers``` + ```bitsandbytes``` running in 4-bit mode, which can use a ```beam_size``` parameter (GGUF cannot). Overall, you see the same behavior regarding less...
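
A minimal sketch of that ```transformers``` + ```bitsandbytes``` 4-bit setup with beam search, assuming the standard ```BitsAndBytesConfig``` route (the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Summarize flash attention.", return_tensors="pt").to(model.device)

# Unlike GGUF runners, generate() exposes beam search via num_beams.
output = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```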

> I mean the number of input tokens. It would be great to compare with and without FA2 with prompt sizes from 1000 to 3000 tokens. I think the...
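
Below is a rough sketch of the comparison being requested: the same model timed with and without FA2 at several prompt lengths. The ```attn_implementation``` flag is the standard ```transformers``` switch; the model id and token counts are illustrative:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)

for attn in ("eager", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        attn_implementation=attn,
        device_map="cuda",
    )
    for prompt_tokens in (1000, 2000, 3000):
        # Build a prompt of roughly the target length, then truncate to it exactly.
        prompt = "hello " * prompt_tokens
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                           max_length=prompt_tokens).to("cuda")
        start = time.perf_counter()
        output = model.generate(**inputs, max_new_tokens=200)
        elapsed = time.perf_counter() - start
        new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
        print(f"{attn:18s} prompt={prompt_tokens:5d}  {new_tokens / elapsed:.1f} tok/s")
    del model
    torch.cuda.empty_cache()
```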

Running llama2-13b with flash attention on and off led to the same result as with llama2-7b...only a minuscule advantage from flash attention, nowhere near the advantage with mistral-based models...