mobicham

Results: 11 issues by mobicham

Hello! Are there any resources that explain how the quantized parameters are structured in a GGUF file? We are interested in porting HQQ-quantized models into GGUF format, but in order...

### Feature request It would be great to have static cache support for Whisper to make it faster with `torch.compile`. Currently, the `generate()` function doesn't support `cache_implementation="static"` for Whisper. ### Motivation...

Feature request
Audio
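
For context, a minimal sketch of the static-cache + `torch.compile` path that `generate()` already exposes for decoder-only models, which is what the feature request above asks for in Whisper. The model id, dtype, and generation arguments are illustrative assumptions, not taken from the issue.

```python
# Sketch of the existing static-cache + torch.compile path for decoder-only
# models; the feature request asks for the same path in Whisper.
# Model id and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model, not from the issue
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

# Compile the forward pass; the static cache keeps tensor shapes fixed so
# torch.compile does not need to recompile at every decoding step.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32, cache_implementation="static")
print(tokenizer.decode(out[0], skip_special_tokens=True))
```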

I have been running some benchmarks with Marlin, but the speed-up is far from what is reported; in fact, it's actually slower than fp16. GPU: A6000 Ada ``` matrix_shape: [11008,...
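
As a point of reference, here is a minimal sketch of the fp16 baseline timing such a comparison rests on; the matrix shape below is a hypothetical stand-in (the actual shape from the issue is truncated above) and Marlin itself is not invoked.

```python
# Minimal fp16 GEMM timing baseline (what a quantized kernel like Marlin is
# compared against). Shapes are illustrative assumptions.
import torch

M, K, N = 1, 4096, 11008  # hypothetical decode-time shape
a = torch.randn(M, K, dtype=torch.float16, device="cuda")
b = torch.randn(K, N, dtype=torch.float16, device="cuda")

for _ in range(10):  # warm-up
    torch.matmul(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(200):
    torch.matmul(a, b)
end.record()
torch.cuda.synchronize()
print(f"fp16 matmul: {start.elapsed_time(end) / 200:.4f} ms/iter")
```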

HQQ multi-GPU support is so far only available via the `quantize_model` call.

enhancement
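
A rough sketch of the `quantize_model` path referred to above, following the usual hqq quantization pattern; the model id and config values are illustrative, and how multiple GPUs are passed to this call is an assumption, not something the note above specifies.

```python
# Rough sketch of quantizing a model through hqq's quantize_model call.
# Model id and config values are illustrative assumptions.
import torch
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model id
model = HQQModelForCausalLM.from_pretrained(model_id)

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
# Per the note above, multi-GPU quantization also goes through this call;
# the exact multi-device argument form is not shown here.
model.quantize_model(
    quant_config=quant_config,
    compute_dtype=torch.float16,
    device="cuda",
)
```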

### System Info - `transformers` version: 4.41.1 - Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35 - Python version: 3.10.13 - Huggingface_hub version: 0.23.2 - Accelerate version: 0.30.1 - PyTorch version (GPU?): 2.4.0.dev20240527+cu121 (True) ### Who...

Evaluation of GGUF models via the llama_cpp server is extremely slow. All the layers are offloaded to the GPU, so it should normally be fast, but truthfulqa takes 10 hours, it...

bug
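
For reference, a minimal sketch of loading a GGUF model with all layers offloaded through the llama-cpp-python bindings (the same engine behind the server mentioned above); the model path and parameters are illustrative assumptions.

```python
# Minimal llama-cpp-python sketch with full GPU offload; model path and
# parameters are illustrative, not taken from the issue.
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,                 # offload all layers to the GPU
    n_ctx=4096,
)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```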

Follow-up to https://github.com/huggingface/transformers/pull/32379

### 🐛 Describe the bug I noticed that the latest stable release 2.5.0 is slower than 2.4.1 when using torch.compile (reduce-overhead). I tried on different machines with a 4090 RTX...

module: performance
oncall: pt2
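
A small repro sketch of the kind of measurement involved: compile a module with `mode="reduce-overhead"` and time steady-state calls under each release. The module and tensor sizes are illustrative, not the setup from the issue.

```python
# Sketch of timing a module compiled with reduce-overhead mode, to run the
# same script under 2.4.1 and 2.5.0. Module and sizes are illustrative.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).half().cuda()
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
for _ in range(10):  # warm-up: compilation + CUDA graph capture
    compiled(x)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    compiled(x)
end.record()
torch.cuda.synchronize()
print(torch.__version__, f"{start.elapsed_time(end) / 100:.3f} ms/iter")
```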

### 🐛 Describe the bug torch.compile breaks with Triton built from source (as of Nov 12): How to reproduce: 1) Build Triton from the master branch 2) Run torch.compile with...

high priority
triage review
oncall: pt2
upstream triton
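
A minimal smoke test of the kind step 2 refers to, exercising the Inductor/Triton code path; the function and input are illustrative.

```python
# Tiny torch.compile smoke test against a from-source Triton build;
# the function and input are illustrative.
import torch

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

x = torch.randn(1024, device="cuda")
print(f(x).sum())
```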

I noticed that FP8E4 `tl.dot` is not supported on AMD, but FP8E5 works. Any plans to add FP8E4 support for AMD? ``` LLVM ERROR: No match found in MFMA database...
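
To clarify the two formats being compared, here is a short sketch using PyTorch's fp8 dtypes (e4m3 corresponds to FP8E4, e5m2 to FP8E5); it illustrates the precision/range trade-off only and does not touch Triton's `tl.dot` or the AMD MFMA path.

```python
# FP8E4 (e4m3) vs FP8E5 (e5m2): more mantissa precision vs more exponent range.
# Uses PyTorch's fp8 dtypes purely for illustration.
import torch

x = torch.tensor([0.1, 1.0, 240.0, 448.0])
print(x.to(torch.float8_e4m3fn).float())  # 4 exponent / 3 mantissa bits, max finite ~448
print(x.to(torch.float8_e5m2).float())    # 5 exponent / 2 mantissa bits, max finite ~57344
```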