jgcb00
Yes, but at least it's running: 3/60 layers converted. Also, can we specify which precision? Is it `int4` by default, and can it be `int8`?
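For reference, a minimal sketch of how the precision could be chosen at conversion time, assuming the CTranslate2 Transformers converter is the tool in use here (the model name and output directory are placeholders):

```python
import ctranslate2

# Convert a Hugging Face checkpoint, explicitly requesting int8 weights
# instead of relying on the converter's default precision.
converter = ctranslate2.converters.TransformersConverter("tiiuae/falcon-7b")
converter.convert("falcon-7b-ct2-int8", quantization="int8")
```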
Hi, for anyone passing by and having this error: another way to solve it is to add the argument `async_mode=True` to your guidance and use it as...
Same error here, and when I try to build flash attention, it literally takes days...
Hi, a new implementation was released: https://tridao.me/publications/flash2/flash2.pdf with a 50% TFLOPS improvement on the forward pass compared to the old FlashAttention implementation, and a massive improvement compared to the vanilla attention mechanism...
Hi, I think V2 will be much simpler to implement as it comes with a higher-level library and broader GPU compatibility. It might also restrict the GPUs...
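As a rough illustration, a minimal sketch of the kind of call the FlashAttention-2 package exposes (`flash_attn_func` from `flash-attn`); the tensor shapes and the CUDA/fp16 requirement are assumptions based on that library's documentation:

```python
import torch
from flash_attn import flash_attn_func

# FlashAttention-2 expects (batch, seq_len, num_heads, head_dim) tensors
# in fp16/bf16 on a supported CUDA GPU.
batch, seq_len, num_heads, head_dim = 2, 1024, 16, 64
q = torch.randn(batch, seq_len, num_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal self-attention in a single fused kernel call.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seq_len, num_heads, head_dim)
```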
Hi, my thoughts on this: there are some major pros and some cons. Pros: reduced VRAM usage; Flash-decoding improves speed on long-sequence generation (don't know...
So I also have unexpected results. Here I'm testing with `num_hypotheses` of 1 and increasing the batch size, with several different GPUs and several different Llama 2 models; I can...
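For context, a minimal sketch of the kind of test being described, assuming a CTranslate2 `Generator` is what is being benchmarked (suggested by `num_hypotheses`); the model path, prompt, and batch sizes are placeholders:

```python
import time
import ctranslate2
import transformers

model_dir = "llama-2-7b-ct2"  # placeholder: a converted Llama 2 model
tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
generator = ctranslate2.Generator(model_dir, device="cuda")

prompt = "Explain the difference between int8 and float16 quantization."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# Measure generation time while only the batch size changes.
for batch_size in (1, 2, 4, 8, 16):
    batch = [tokens] * batch_size
    start = time.time()
    generator.generate_batch(batch, max_length=128, num_hypotheses=1)
    elapsed = time.time() - start
    print(f"batch_size={batch_size}: {elapsed:.2f}s")
```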
Hi, the Falcon model is pretty bad when given very short prompts, like hi, hello, etc.; you often get exactly that kind of output. If you ask a longer question,...
Using only Hugging Face, I got the same result with `load_in_8bit=True`:
```
Question: hi
Answer: (4). 'I don't think I'll ever be able to forget you.'
```
or...
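For completeness, a minimal sketch of the Hugging Face side of that comparison, assuming the `falcon-7b-instruct` checkpoint and the `load_in_8bit` loading path (model name and generation settings are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,       # bitsandbytes int8 weights
    device_map="auto",
    trust_remote_code=True,  # Falcon shipped custom modeling code at the time
)

inputs = tokenizer("Question: hi\nAnswer:", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```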
`einops` is only used by the Falcon model; it should not be a requirement for the package
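A minimal sketch of how the dependency could be made optional, assuming a hypothetical loader where the import only happens on the Falcon code path (`load_falcon` is illustrative, not an existing function):

```python
def load_falcon(*args, **kwargs):
    # Import lazily so users who never load Falcon
    # do not need einops installed.
    try:
        import einops  # noqa: F401
    except ImportError as exc:
        raise ImportError(
            "einops is required for Falcon models: pip install einops"
        ) from exc
    ...  # Falcon-specific loading would go here
```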