I have a similar problem, but it mentions **glad**. I tried twice, the second time in a freshly set up conda environment, but it still throws the same error:
Thank you very much @huseinzol05 for the work. Here's a version with HQQ 4-bit using the torchao backend. As expected, there's a good speed-up with the static cache and fullgraph...
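For anyone who wants to reproduce the rough setup, here is a minimal sketch of 4-bit HQQ with the torchao backend. The model id, group size, and the `AutoHQQHFModel` / `prepare_for_inference` helpers are my recollection of the hqq API, which has changed across versions, so treat the exact names as assumptions:

```Python
# Minimal sketch of 4-bit HQQ + the torchao int4 backend; names are assumptions
# and may differ across hqq versions.
import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import prepare_for_inference

model_id = "meta-llama/Llama-2-7b-hf"  # example model, not taken from the thread

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# 4-bit HQQ quantization; as far as I recall the torchao int4 kernel expects
# bfloat16 compute, nbits=4, group_size=64 and axis=1 (assumed settings).
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.bfloat16, device="cuda")

# Swap the quantized linear layers to the torchao int4 kernels before compiling.
prepare_for_inference(model, backend="torchao_int4")
```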
@huseinzol05 great, thanks! I think you also need to make sure the model supports initializing the static cache via `_setup_cache`:

```Python
from transformers import StaticCache

model._setup_cache(StaticCache, batch_size, max_cache_len=max_cache_length)
```
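For context, a sketch of how the static cache can then be driven with a compiled greedy decode step. The `decode_one_token` helper, the prompt, and the generation loop are illustrative assumptions, not code from this thread:

```Python
# Illustrative sketch: static cache + torch.compile(fullgraph=True) greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "meta-llama/Llama-2-7b-hf"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

batch_size, max_cache_length = 1, 4096  # example values
model._setup_cache(StaticCache, batch_size, max_cache_len=max_cache_length)

@torch.compile(mode="reduce-overhead", fullgraph=True)
def decode_one_token(cur_token, cache_position):
    # Single-token forward pass; the static cache is updated in place.
    logits = model(cur_token, cache_position=cache_position,
                   use_cache=True, return_dict=False)[0]
    return torch.argmax(logits[:, -1], dim=-1, keepdim=True)

# Prefill the prompt eagerly, then decode token by token with the compiled step.
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
cache_position = torch.arange(inputs.input_ids.shape[1], device="cuda")
logits = model(inputs.input_ids, cache_position=cache_position,
               use_cache=True, return_dict=False)[0]
next_token = torch.argmax(logits[:, -1], dim=-1, keepdim=True)
generated = [next_token]

for _ in range(32):  # generate 32 new tokens
    cache_position = cache_position[-1:] + 1
    next_token = decode_one_token(next_token, cache_position)
    generated.append(next_token)
```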
Maybe you can use `arange` instead, like here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L964-L966
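For reference, a small standalone sketch of that `arange` pattern; the example values are made up and the linked lines are the authoritative version:

```Python
import torch

# Roughly the pattern from the linked Llama code: build the cache positions with
# torch.arange on-device instead of a Python-side counter (names are approximate).
past_seen_tokens = 8   # tokens already in the cache (example value)
num_new_tokens = 4     # tokens in the current forward pass (example value)
device = "cuda" if torch.cuda.is_available() else "cpu"

cache_position = torch.arange(past_seen_tokens, past_seen_tokens + num_new_tokens,
                              device=device)
# tensor([ 8,  9, 10, 11]) -> positions written into the static cache this step
```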
Great :+1:! But that `arange` works well in Llama with fullgraph torch.compile.
@kabachuha have you tried hqq? Happy to assist if you need help to make it work.
@mgoin We had a hacky version working with an older version of vLLM just as a proof-of-concept, but we need to remove it because it's deprecated...
Nice work @Lucky-Lance!
Any progress on this, folks? Is there a timeline for general static cache support in transformers? We are very excited to see this officially supported!
Thanks for your answer @efrantar. Understood. I am trying to integrate it with our quantization method; below are the benchmarks for the forward pass on a 3090, Llama2-7B, batch-size=1, context-size=2048:...
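(The numbers themselves are elided above; for completeness, here is only a sketch of how this kind of forward-pass latency can be measured with CUDA events. The dummy batch, warm-up count, and iteration count are assumptions, not the exact harness that was used.)

```Python
# Sketch of a forward-pass timing harness with CUDA events
# (batch-size=1, context-size=2048 as in the comment; model setup is assumed).
import torch

@torch.no_grad()
def benchmark_forward(model, batch_size=1, context_size=2048,
                      n_warmup=5, n_iters=20, device="cuda"):
    input_ids = torch.randint(0, model.config.vocab_size,
                              (batch_size, context_size), device=device)

    for _ in range(n_warmup):   # warm-up: trigger compilation / kernel autotuning
        model(input_ids)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iters):
        model(input_ids)
    end.record()
    torch.cuda.synchronize()

    return start.elapsed_time(end) / n_iters   # average latency in milliseconds
```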