BBC-Esq
Interested in this as an attorney and extracting numerous legal citations...
Quick question for you, sir. You realize that Flash Attention 2 only works on certain GPUs, correct? If a person tries to run the model without the required GPU, will the...
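The thread never shows the actual check, but the usual pattern is to test the GPU's compute capability before requesting Flash Attention 2 and fall back to PyTorch's SDPA otherwise. A minimal sketch, assuming an NVIDIA card (the helper name `pick_attn_implementation` is mine, not a library API; FA2 kernels target Ampere, compute capability 8.0, and newer):

```python
def pick_attn_implementation(compute_capability):
    """Choose an attn_implementation value to pass to transformers' from_pretrained().

    Flash Attention 2 only ships kernels for NVIDIA Ampere (compute
    capability 8.0) and newer; on older GPUs, fall back to PyTorch SDPA.
    """
    major, _minor = compute_capability
    return "flash_attention_2" if major >= 8 else "sdpa"


# On a real system the tuple would come from torch.cuda.get_device_capability();
# hard-coded here for illustration.
print(pick_attn_implementation((8, 6)))  # Ampere-class card
print(pick_attn_implementation((7, 5)))  # Turing-class card
```

The returned string can then be passed as `attn_implementation=...` when loading the model, so the same script runs on both supported and unsupported GPUs.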
Hey @sanchit-gandhi, here are the updated comparisons. Feel free to let me know how to cast to float16/bfloat16 if you want, and/or use bitsandbytes or whatever this type of model is compatible...
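The reason casting to float16/bfloat16 matters for these benchmarks is simple arithmetic: weights stored in 16-bit types take half the bytes of float32. A rough sketch of the weight-only math (the helper `weight_memory_gib` is illustrative; it ignores activations and KV cache, which also consume VRAM):

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def weight_memory_gib(n_params, dtype="float32"):
    """Rough weight-only memory footprint in GiB (ignores activations, KV cache)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3

# e.g. a 1.5B-parameter model: fp16/bf16 halves the weight footprint vs fp32
fp32 = weight_memory_gib(1_500_000_000, "float32")
fp16 = weight_memory_gib(1_500_000_000, "float16")
print(f"{fp32:.2f} GiB vs {fp16:.2f} GiB")
```

In transformers this is typically requested at load time with `torch_dtype=torch.float16` (or `torch.bfloat16`) in `from_pretrained`.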
The WhisperSpeech library uses two types of models, s2a and t2s, and there are multiple checkpoints of each, so this benchmark tests every combination.
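Testing "every combination" of the two model types is just the Cartesian product of the two checkpoint lists. A sketch of how the benchmark grid can be generated (the model names below are placeholders, not WhisperSpeech's actual checkpoint names):

```python
from itertools import product

# Hypothetical checkpoint lists; substitute the real WhisperSpeech names
# when actually benchmarking.
t2s_models = ["t2s-tiny", "t2s-small", "t2s-base"]
s2a_models = ["s2a-tiny", "s2a-small"]

# Every (t2s, s2a) pairing, i.e. the full benchmark grid
combos = list(product(t2s_models, s2a_models))
for t2s, s2a in combos:
    print(f"benchmark {t2s} + {s2a}")
```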
Awesome, thanks dude! Don't know why I didn't realize that, lol. Anyway, here's the updated bench: about the same processing time but about 30% less VRAM used. At a certain...
@ylacombe here's the updated benchmark including the new Large model (congratulations on that, Hugging Face, BTW). A quick disclaimer: I'm giving two charts - one showing VRAM usage and...
OK, let me retry it...thanks.
Strange... it did the same thing again. Below I'm including (1) the full response, (2) the command I used to run the script, and (3) a modified script I created that...
I solved the issue by using the "end_token" parameter. Here's the script for people's benefit:

```python
class Llama38BInstructModel:
    def __init__(self, user_prompt="PLACEHOLDER_FOR_USER_PROMPT", system_prompt="You are a helpful assistant who answers questions in...
```
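The general idea behind an end-token fix is to stop (or trim) generation at the model's end-of-turn marker so the output doesn't run on into junk. A minimal post-processing sketch of that idea (the helper `trim_at_end_token` is mine, not the library's API; `<|eot_id|>` is Llama 3 Instruct's end-of-turn token):

```python
def trim_at_end_token(text, end_token="<|eot_id|>"):
    """Cut generated text at the first end token so trailing output is dropped.

    Illustrative helper only; the real fix in the script above passes the
    token to the generator's "end_token" parameter so decoding stops early.
    """
    idx = text.find(end_token)
    return text if idx == -1 else text[:idx]


print(trim_at_end_token("The answer is 42.<|eot_id|>assistant rambling..."))
```

Passing the token to the generator itself is preferable when supported, since it stops decoding early instead of wasting compute on tokens you then throw away.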
Absolutely! So glad you asked! lol. CTranslate2 actually does support true batching, but at the C++ level. I'll give you my repository that uses it via the amazing `WhisperS2T` as...
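True batching just means the backend processes several inputs per forward pass instead of one at a time. While CTranslate2 handles that internally in C++, the Python side only needs to hand over inputs in groups; a sketch of that grouping step (the helper `make_batches` and the file names are illustrative, not WhisperS2T's API):

```python
def make_batches(items, batch_size):
    """Group inputs into fixed-size batches (the last batch may be short)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical audio files to transcribe in batches of 2
files = [f"clip_{n}.wav" for n in range(5)]
for batch in make_batches(files, batch_size=2):
    print(batch)
```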