mobicham
Hello! Are there any resources that explain how the quantized parameters are structured in a GGUF file? We are interested in porting HQQ-quantized models into GGUF format, but in order...
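For anyone looking in the meantime, here is a minimal sketch of inspecting tensor layout with the `gguf` Python package that ships with llama.cpp (the file name is a placeholder):

```python
from gguf import GGUFReader

# Placeholder path; any quantized GGUF file works here.
reader = GGUFReader("model-q4_0.gguf")

# Each entry describes one tensor: its name, quantization type (e.g. Q4_0),
# logical shape and packed size; the raw bytes are available via tensor.data.
for tensor in reader.tensors:
    print(tensor.name, tensor.tensor_type, tensor.shape, tensor.n_bytes)
```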
### Feature request It would be great to have static cache support for Whisper to make it faster with `torch.compile`. Currently, the `generate()` function doesn't support `cache_implementation="static"` for Whisper. ### Motivation...
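For reference, a sketch of the static-cache + `torch.compile` path that already works for decoder-only models, and that this request would extend to Whisper (the model id is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder decoder-only model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Compiling the forward pass only pays off when the KV cache has a static shape.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
# This is the kwarg that Whisper's generate() currently rejects.
out = model.generate(**inputs, cache_implementation="static", max_new_tokens=32)
print(tokenizer.decode(out[0]))
```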
I have been running some benchmarks with Marlin, but the speed-up is far from what is reported. In fact, it's actually slower than fp16: GPU: A6000 Ada ``` matrix_shape: [11008,...
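For context, this is roughly the timing loop used for the fp16 baseline; the helper and the second shape dimension are hypothetical, and the Marlin kernel call would replace `torch.matmul` on the quantized side:

```python
import torch

def bench_matmul(M, K, N, iters=100, warmup=10):
    a = torch.randn(M, K, device="cuda", dtype=torch.float16)
    b = torch.randn(K, N, device="cuda", dtype=torch.float16)
    for _ in range(warmup):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

print(bench_matmul(1, 11008, 4096))  # batch-1, decode-style shape
```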
HQQ multi-GPU support is so far only available for the `quantize_model` call.
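A sketch of what that looks like in practice, assuming I'm remembering the HQQ API correctly (the model id is a placeholder, and passing a device list is only supported through this `quantize_model` entry point):

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16  # placeholder model
)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Passing a list of devices spreads the quantized layers across GPUs.
AutoHQQHFModel.quantize_model(
    model,
    quant_config=quant_config,
    compute_dtype=torch.float16,
    device=["cuda:0", "cuda:1"],
)
```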
### System Info - `transformers` version: 4.41.1 - Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35 - Python version: 3.10.13 - Huggingface_hub version: 0.23.2 - Accelerate version: 0.30.1 - PyTorch version (GPU?): 2.4.0.dev20240527+cu121 (True) ### Who...
Evaluation of GGUF models via the llama_cpp server is extremely slow. All the layers are offloaded to the GPU, so it should normally run fast, but truthfulqa takes 10 hours, it...
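For reference, this is roughly the setup being evaluated: all layers offloaded via `n_gpu_layers=-1` (the file path and prompt are placeholders):

```python
from llama_cpp import Llama

# Placeholder GGUF file; n_gpu_layers=-1 offloads every layer to the GPU,
# which is the configuration the server runs with.
llm = Llama(model_path="model-q4_0.gguf", n_gpu_layers=-1, n_ctx=4096)

out = llm("Q: Is the sky blue?\nA:", max_tokens=16)
print(out["choices"][0]["text"])
```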
Follow-up to https://github.com/huggingface/transformers/pull/32379
### 🐛 Describe the bug I noticed that the latest stable release 2.5.0 is slower than 2.4.1 when using torch.compile (reduce-overhead). I tried on different machines with a 4090 RTX...
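A minimal, hypothetical timing loop to compare the two releases (run once per torch version; the module and shapes are arbitrary):

```python
import time
import torch

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU()).cuda().half()
model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
for _ in range(10):  # warmup: triggers compilation and CUDA graph capture
    model(x)
torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(100):
    model(x)
torch.cuda.synchronize()
print(f"torch {torch.__version__}: {(time.perf_counter() - t0) / 100 * 1e3:.3f} ms/iter")
```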
### 🐛 Describe the bug torch.compile breaks with Triton built from source (as of Nov 12): How to reproduce: 1) Build Triton from the master branch 2) Run torch.compile with...
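A minimal repro sketch: any compiled op on a CUDA tensor goes through Inductor's Triton codegen, so this alone should surface the failure if the source-built Triton is the one being picked up:

```python
import torch

@torch.compile
def f(x):
    # A trivial fused op is enough to force Inductor to emit a Triton kernel.
    return torch.sin(x) + 2 * x

x = torch.randn(1024, device="cuda")
print(f(x).sum())
```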
I noticed that FP8E4 `tl.dot` is not supported on AMD, but FP8E5 works. Any plans to add FP8E4 support for AMD? ``` LLVM ERROR: No match found in MFMA database...
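For reference, a sketch of the kind of kernel that hits this. The fp8 type names vary across Triton versions; `tl.float8e5` is used here, and swapping in the E4M3 type (e.g. `tl.float8e4nv`) is what produces the MFMA error on AMD, assuming the backend accepts fp8 inputs to `tl.dot` at all:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fp8_dot_kernel(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    idx = offs[:, None] * BLOCK + offs[None, :]
    # Cast the fp16 inputs to fp8 before the dot; this is where the E4 vs E5
    # variant matters on the AMD backend.
    a = tl.load(a_ptr + idx).to(tl.float8e5)
    b = tl.load(b_ptr + idx).to(tl.float8e5)
    c = tl.dot(a, b)  # accumulates in fp32
    tl.store(c_ptr + idx, c)

BLOCK = 32
a = torch.randn(BLOCK, BLOCK, device="cuda", dtype=torch.float16)
b = torch.randn(BLOCK, BLOCK, device="cuda", dtype=torch.float16)
c = torch.empty(BLOCK, BLOCK, device="cuda", dtype=torch.float32)
fp8_dot_kernel[(1,)](a, b, c, BLOCK=BLOCK)
print(c)
```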