I have a similar problem, but it mentions **glad**. I tried twice, the second time in a freshly set up conda environment, but it still throws the same error:
Thank you very much @huseinzol05 for the work. Here's a version with HQQ 4-bit using the torchao backend. As expected, there's a good speed-up with the static cache and fullgraph...
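For anyone who wants to reproduce the rough setup, here is a minimal sketch of 4-bit HQQ with the torchao backend. The model id, group size, and the `AutoHQQHFModel` / `prepare_for_inference` helpers are my recollection of the hqq API, which has changed across versions, so treat the exact names as assumptions:

```Python
# Minimal sketch of 4-bit HQQ + the torchao int4 backend; names are assumptions
# and may differ across hqq versions.
import torch
from transformers import AutoModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import prepare_for_inference

model_id = "meta-llama/Llama-2-7b-hf"  # example model, not taken from the thread

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# 4-bit HQQ quantization; as far as I recall the torchao int4 kernel expects
# bfloat16 compute, nbits=4, group_size=64 and axis=1 (assumed settings).
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.bfloat16, device="cuda")

# Swap the quantized linear layers to the torchao int4 kernels before compiling.
prepare_for_inference(model, backend="torchao_int4")
```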
@huseinzol05 great, thanks! I think you also need to make sure the model supports initializing the static cache via `_setup_cache`:

```Python
from transformers import StaticCache

model._setup_cache(StaticCache, batch_size, max_cache_len=max_cache_length)
```
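For context, a sketch of how the static cache can then be driven with a compiled greedy decode step. The `decode_one_token` helper, the prompt, and the generation loop are illustrative assumptions, not code from this thread:

```Python
# Illustrative sketch: static cache + torch.compile(fullgraph=True) greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "meta-llama/Llama-2-7b-hf"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

batch_size, max_cache_length = 1, 4096  # example values
model._setup_cache(StaticCache, batch_size, max_cache_len=max_cache_length)

@torch.compile(mode="reduce-overhead", fullgraph=True)
def decode_one_token(cur_token, cache_position):
    # Single-token forward pass; the static cache is updated in place.
    logits = model(cur_token, cache_position=cache_position,
                   use_cache=True, return_dict=False)[0]
    return torch.argmax(logits[:, -1], dim=-1, keepdim=True)

# Prefill the prompt eagerly, then decode token by token with the compiled step.
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
cache_position = torch.arange(inputs.input_ids.shape[1], device="cuda")
logits = model(inputs.input_ids, cache_position=cache_position,
               use_cache=True, return_dict=False)[0]
next_token = torch.argmax(logits[:, -1], dim=-1, keepdim=True)
generated = [next_token]

for _ in range(32):  # generate 32 new tokens
    cache_position = cache_position[-1:] + 1
    next_token = decode_one_token(next_token, cache_position)
    generated.append(next_token)
```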
Maybe you can use `arange` instead, like here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L964-L966
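For reference, a small standalone sketch of that `arange` pattern; the example values are made up and the linked lines are the authoritative version:

```Python
import torch

# Roughly the pattern from the linked Llama code: build the cache positions with
# torch.arange on-device instead of a Python-side counter (names are approximate).
past_seen_tokens = 8   # tokens already in the cache (example value)
num_new_tokens = 4     # tokens in the current forward pass (example value)
device = "cuda" if torch.cuda.is_available() else "cpu"

cache_position = torch.arange(past_seen_tokens, past_seen_tokens + num_new_tokens,
                              device=device)
# tensor([ 8,  9, 10, 11]) -> positions written into the static cache this step
```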
Great :+1:! But that `arange` works well in Llama with fullgraph torch.compile.
@kabachuha have you tried hqq? Happy to assist if you need help to make it work.
@mgoin We had a hacky version working with an older version of vLLM just as a proof-of-concept, but we need to remove it because it's deprecated...
Nice work @Lucky-Lance!
Any progress on this, folks? Is there a timeline for general static cache support in transformers? We are very excited to see this officially supported!
Thanks for your answer @efrantar. Understood. I am trying to integrate it with our quantization method; below are the benchmarks for the forward pass on a 3090, Llama2-7B, batch-size=1, context-size=2048:...
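(The numbers themselves are elided above; for completeness, here is only a sketch of how this kind of forward-pass latency can be measured with CUDA events. The dummy batch, warm-up count, and iteration count are assumptions, not the exact harness that was used.)

```Python
# Sketch of a forward-pass timing harness with CUDA events
# (batch-size=1, context-size=2048 as in the comment; model setup is assumed).
import torch

@torch.no_grad()
def benchmark_forward(model, batch_size=1, context_size=2048,
                      n_warmup=5, n_iters=20, device="cuda"):
    input_ids = torch.randint(0, model.config.vocab_size,
                              (batch_size, context_size), device=device)

    for _ in range(n_warmup):   # warm-up: trigger compilation / kernel autotuning
        model(input_ids)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iters):
        model(input_ids)
    end.record()
    torch.cuda.synchronize()

    return start.elapsed_time(end) / n_iters   # average latency in milliseconds
```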