BBC-Esq


BTW, I just haven't had the time to update my whispers2t batch repo with this bad boy, so stay tuned. ;-) ![image](https://github.com/OpenNMT/CTranslate2/assets/108230321/5d51a770-d412-49e7-b1db-fc58e10aff4c) It allows you to specify the task, choose any ctranslate2...

Last post, I promise...but here's my analysis of ```WhisperS2T```. I believe my repo uses a traditional "loop" to process files with WhisperS2T...but you can also send a batch directly, as sketched below...
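
For reference, here is a minimal sketch of the "send a batch directly" path, based on the parameter names in the WhisperS2T README; the file paths and batch size are just placeholders:

```python
import whisper_s2t

# Load a Whisper model on the CTranslate2 backend.
model = whisper_s2t.load_model(model_identifier="large-v2", backend="CTranslate2")

# Instead of looping file by file, hand the whole batch over in one call.
files = ["audio_1.wav", "audio_2.wav", "audio_3.wav"]   # placeholder paths
lang_codes = ["en"] * len(files)                        # one language code per file
tasks = ["transcribe"] * len(files)                     # or "translate"
initial_prompts = [None] * len(files)

out = model.transcribe_with_vad(
    files,
    lang_codes=lang_codes,
    tasks=tasks,
    initial_prompts=initial_prompts,
    batch_size=16,
)

# out[i] holds the utterances for files[i]; each utterance dict includes a "text" key.
print(out[0][0]["text"])
```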

WhisperS2T basically uses CTranslate2 directly.
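
To illustrate what "directly" means, here is a rough sketch of CTranslate2's own Whisper API, following the pattern in the CTranslate2 docs; the model directory and audio path are placeholders:

```python
import ctranslate2
import librosa
import transformers

# Load a ctranslate2-converted Whisper model on the GPU.
model = ctranslate2.models.Whisper("whisper-large-v2-ct2", device="cuda")  # placeholder dir

# Compute the log-mel features with the Hugging Face processor.
processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-large-v2")
audio, _ = librosa.load("audio_1.wav", sr=16000, mono=True)  # placeholder path
inputs = processor(audio, return_tensors="np", sampling_rate=16000)
features = ctranslate2.StorageView.from_array(inputs.input_features)

# The prompt tokens encode language and task; then generate and decode.
prompt = processor.tokenizer.convert_tokens_to_ids(
    ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
)
results = model.generate(features, [prompt])
print(processor.decode(results[0].sequences_ids[0]))
```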

Updated graph here including llama.cpp, which, apparently, is faster but uses slightly more VRAM...except for the 13B model, where it's about 3 GB higher. Plus, the numbers changed somewhat because I ran each...
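
For context, one common way to run the llama.cpp side of a comparison like this from Python is llama-cpp-python; a minimal sketch, with the GGUF path and settings as assumptions rather than the exact setup behind the graph:

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads as many layers as possible to the GPU.
llm = Llama(
    model_path="llama-2-13b-chat.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,
    n_ctx=4096,
)

output = llm("Explain flash attention in one paragraph.", max_tokens=256)
print(output["choices"][0]["text"])
```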

Completed graphs below. A few highlights: 1) For unknown reasons, neural chat without flash attention posts unexpectedly high tokens per second at beam size 4, although its VRAM usage...
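
For anyone reproducing numbers like these, here is a rough sketch of how tokens per second and peak VRAM can be measured with ```transformers```; this is an assumption about the method, not the exact harness behind the graphs, and the model id is a guess at the "neural chat" checkpoint:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/neural-chat-7b-v3-1"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Write a short story about a benchmark.", return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=512, num_beams=4)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```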

> Hello, what is the average seq_length in your benchmark? Flash attention only performs better for long prompts. Great question! I don't know if by ```seq_length```...

I tested all of the models yet again this morning, one right after the other, without opening/closing other programs, etc. Here are the results...They seem to confirm yet again the unique...

To further illustrate...here is a chart for ```transformers``` + ```bitsandbytes``` running in 4-bit mode, which can use a ```beam_size``` parameter (GGUF cannot). Overall, you see the same behavior regarding less...
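
A minimal sketch of that ```transformers``` + ```bitsandbytes``` 4-bit setup with beam search, assuming the standard ```BitsAndBytesConfig``` route (the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Summarize flash attention.", return_tensors="pt").to(model.device)

# Unlike GGUF runners, generate() exposes beam search via num_beams.
output = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```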

> I mean the number of input tokens. It would be great to compare with and without FA2 with prompt sizes from 1000 to 3000 tokens. I think the...
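
Below is a rough sketch of the comparison being requested: the same model timed with and without FA2 at several prompt lengths. The ```attn_implementation``` flag is the standard ```transformers``` switch; the model id and token counts are illustrative:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)

for attn in ("eager", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        attn_implementation=attn,
        device_map="cuda",
    )
    for prompt_tokens in (1000, 2000, 3000):
        # Build a prompt of roughly the target length, then truncate to it exactly.
        prompt = "hello " * prompt_tokens
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                           max_length=prompt_tokens).to("cuda")
        start = time.perf_counter()
        output = model.generate(**inputs, max_new_tokens=200)
        elapsed = time.perf_counter() - start
        new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
        print(f"{attn:18s} prompt={prompt_tokens:5d}  {new_tokens / elapsed:.1f} tok/s")
    del model
    torch.cuda.empty_cache()
```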

Running llama2-13b with flash attention on and off led to the same result as with llama2-7b...only a minuscule advantage from flash attention, nowhere near the advantage with mistral-based models...