Jiyuan Qian comments

Results: 22 comments by Jiyuan Qian

Falcon 40B slow inference

> Wait for this to land: #438 so you can use a better latency kernel (GPTQ)

Hi @Narsil, this is really exciting! Do you have any early numbers to share...

Falcon 40B slow inference

I see. I previously tried quantization on falcon-7b and got 58 ms per token with bitsandbytes, versus 31 ms per token without quantization. If GPTQ can be as fast as...