jgcb00
Yes, but at least it's running: 3/60 layers converted. Also, can we specify which precision? Is it `int4` by default, and can it be `int8`?
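For reference, a minimal sketch of how the precision could be chosen at conversion time, assuming the CTranslate2 Transformers converter is the tool in use here (the model name and output directory are placeholders):

```python
import ctranslate2

# Convert a Hugging Face checkpoint, explicitly requesting int8 weights
# instead of relying on the converter's default precision.
converter = ctranslate2.converters.TransformersConverter("tiiuae/falcon-7b")
converter.convert("falcon-7b-ct2-int8", quantization="int8")
```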
Hi, for anyone passing by and having this error: another way to solve it is to add the argument `async_mode=True` to your guidance and use it as...
Same error here, and when I try to build flash attention, it literally takes days...
Hi, a new implementation was released: https://tridao.me/publications/flash2/flash2.pdf with a 50% TFLOPS improvement on the forward pass compared to the old FlashAttention implementation, and a massive improvement compared to the vanilla attention mechanism...
Hi, I think V2 will be much simpler to implement as it comes with a higher-level library and broader GPU compatibility. It might also restrict the GPUs...
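As a rough illustration, a minimal sketch of the kind of call the FlashAttention-2 package exposes (`flash_attn_func` from `flash-attn`); the tensor shapes and the CUDA/fp16 requirement are assumptions based on that library's documentation:

```python
import torch
from flash_attn import flash_attn_func

# FlashAttention-2 expects (batch, seq_len, num_heads, head_dim) tensors
# in fp16/bf16 on a supported CUDA GPU.
batch, seq_len, num_heads, head_dim = 2, 1024, 16, 64
q = torch.randn(batch, seq_len, num_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal self-attention in a single fused kernel call.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seq_len, num_heads, head_dim)
```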
Hi, my thoughts on this: there are some major pros and some cons. Pros: reduced VRAM usage; Flash-decoding improves speed on long-sequence generation (don't know...
So I also have unexpected results. Here I'm testing with `num_hypotheses` of 1 and increasing the batch size, with several different GPUs and several different Llama 2 models; I can...
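For context, a minimal sketch of the kind of test being described, assuming a CTranslate2 `Generator` is what is being benchmarked (suggested by `num_hypotheses`); the model path, prompt, and batch sizes are placeholders:

```python
import time
import ctranslate2
import transformers

model_dir = "llama-2-7b-ct2"  # placeholder: a converted Llama 2 model
tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
generator = ctranslate2.Generator(model_dir, device="cuda")

prompt = "Explain the difference between int8 and float16 quantization."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

# Measure generation time while only the batch size changes.
for batch_size in (1, 2, 4, 8, 16):
    batch = [tokens] * batch_size
    start = time.time()
    generator.generate_batch(batch, max_length=128, num_hypotheses=1)
    elapsed = time.time() - start
    print(f"batch_size={batch_size}: {elapsed:.2f}s")
```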
Hi, the Falcon model is pretty bad when given very short prompts, like hi, hello, etc.; you often get exactly that kind of output. If you ask a longer question,...
Using only Hugging Face, I got the same result with `load_in_8bit=True`:
```
Question: hi
Answer: (4). 'I don't think I'll ever be able to forget you.'
```
or...
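For completeness, a minimal sketch of the Hugging Face side of that comparison, assuming the `falcon-7b-instruct` checkpoint and the `load_in_8bit` loading path (model name and generation settings are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,       # bitsandbytes int8 weights
    device_map="auto",
    trust_remote_code=True,  # Falcon shipped custom modeling code at the time
)

inputs = tokenizer("Question: hi\nAnswer:", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```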
`einops` is only used by the Falcon model; it should not be a requirement for the package
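A minimal sketch of how the dependency could be made optional, assuming a hypothetical loader where the import only happens on the Falcon code path (`load_falcon` is illustrative, not an existing function):

```python
def load_falcon(*args, **kwargs):
    # Import lazily so users who never load Falcon
    # do not need einops installed.
    try:
        import einops  # noqa: F401
    except ImportError as exc:
        raise ImportError(
            "einops is required for Falcon models: pip install einops"
        ) from exc
    ...  # Falcon-specific loading would go here
```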