mobicham
Hello! Are there any resources that explain how the quantized parameters are structured in a GGUF file? We are interested in porting HQQ-quantized models into GGUF format, but in order...
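For anyone looking in the meantime, here is a minimal sketch of inspecting tensor layout with the `gguf` Python package that ships with llama.cpp (the file name is a placeholder):

```python
from gguf import GGUFReader

# Placeholder path; any quantized GGUF file works here.
reader = GGUFReader("model-q4_0.gguf")

# Each entry describes one tensor: its name, quantization type (e.g. Q4_0),
# logical shape and packed size; the raw bytes are available via tensor.data.
for tensor in reader.tensors:
    print(tensor.name, tensor.tensor_type, tensor.shape, tensor.n_bytes)
```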
### Feature request It would be great to have static cache support for Whisper to make it faster with `torch.compile`. Currently, the `generate()` function doesn't support `cache_implementation="static"` for Whisper. ### Motivation...
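For reference, a sketch of the static-cache + `torch.compile` path that already works for decoder-only models, and that this request would extend to Whisper (the model id is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder decoder-only model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Compiling the forward pass only pays off when the KV cache has a static shape.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
# This is the kwarg that Whisper's generate() currently rejects.
out = model.generate(**inputs, cache_implementation="static", max_new_tokens=32)
print(tokenizer.decode(out[0]))
```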
I have been running some benchmarks with Marlin, but the speed-up is far from what is reported. In fact, it's actually slower than fp16: GPU: A6000 Ada ``` matrix_shape: [11008,...
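For context, this is roughly the timing loop used for the fp16 baseline; the helper and the second shape dimension are hypothetical, and the Marlin kernel call would replace `torch.matmul` on the quantized side:

```python
import torch

def bench_matmul(M, K, N, iters=100, warmup=10):
    a = torch.randn(M, K, device="cuda", dtype=torch.float16)
    b = torch.randn(K, N, device="cuda", dtype=torch.float16)
    for _ in range(warmup):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

print(bench_matmul(1, 11008, 4096))  # batch-1, decode-style shape
```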
HQQ multi-GPU support is so far only available for the `quantize_model` call.
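A sketch of what that looks like in practice, assuming I'm remembering the HQQ API correctly (the model id is a placeholder, and passing a device list is only supported through this `quantize_model` entry point):

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16  # placeholder model
)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Passing a list of devices spreads the quantized layers across GPUs.
AutoHQQHFModel.quantize_model(
    model,
    quant_config=quant_config,
    compute_dtype=torch.float16,
    device=["cuda:0", "cuda:1"],
)
```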
### System Info - `transformers` version: 4.41.1 - Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35 - Python version: 3.10.13 - Huggingface_hub version: 0.23.2 - Accelerate version: 0.30.1 - PyTorch version (GPU?): 2.4.0.dev20240527+cu121 (True) ### Who...
Evaluation of GGUF models via the llama_cpp server is extremely slow. All the layers are offloaded to the GPU, so it should normally run fast, but truthfulqa takes 10 hours, it...
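For reference, this is roughly the setup being evaluated: all layers offloaded via `n_gpu_layers=-1` (the file path and prompt are placeholders):

```python
from llama_cpp import Llama

# Placeholder GGUF file; n_gpu_layers=-1 offloads every layer to the GPU,
# which is the configuration the server runs with.
llm = Llama(model_path="model-q4_0.gguf", n_gpu_layers=-1, n_ctx=4096)

out = llm("Q: Is the sky blue?\nA:", max_tokens=16)
print(out["choices"][0]["text"])
```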
Follow-up to https://github.com/huggingface/transformers/pull/32379
### 🐛 Describe the bug I noticed that the latest stable release 2.5.0 is slower than 2.4.1 when using torch.compile (reduce-overhead). I tried on different machines with a 4090 RTX...
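A minimal, hypothetical timing loop to compare the two releases (run once per torch version; the module and shapes are arbitrary):

```python
import time
import torch

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU()).cuda().half()
model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
for _ in range(10):  # warmup: triggers compilation and CUDA graph capture
    model(x)
torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(100):
    model(x)
torch.cuda.synchronize()
print(f"torch {torch.__version__}: {(time.perf_counter() - t0) / 100 * 1e3:.3f} ms/iter")
```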
### 🐛 Describe the bug torch.compile breaks with Triton built from source (as of Nov 12): How to reproduce: 1) Build Triton from the master branch 2) Run torch.compile with...
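A minimal repro sketch: any compiled op on a CUDA tensor goes through Inductor's Triton codegen, so this alone should surface the failure if the source-built Triton is the one being picked up:

```python
import torch

@torch.compile
def f(x):
    # A trivial fused op is enough to force Inductor to emit a Triton kernel.
    return torch.sin(x) + 2 * x

x = torch.randn(1024, device="cuda")
print(f(x).sum())
```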
I noticed that FP8E4 `tl.dot` is not supported on AMD, but FP8E5 works. Any plans to add FP8E4 support for AMD? ``` LLVM ERROR: No match found in MFMA database...
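For reference, a sketch of the kind of kernel that hits this. The fp8 type names vary across Triton versions; `tl.float8e5` is used here, and swapping in the E4M3 type (e.g. `tl.float8e4nv`) is what produces the MFMA error on AMD, assuming the backend accepts fp8 inputs to `tl.dot` at all:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fp8_dot_kernel(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    idx = offs[:, None] * BLOCK + offs[None, :]
    # Cast the fp16 inputs to fp8 before the dot; this is where the E4 vs E5
    # variant matters on the AMD backend.
    a = tl.load(a_ptr + idx).to(tl.float8e5)
    b = tl.load(b_ptr + idx).to(tl.float8e5)
    c = tl.dot(a, b)  # accumulates in fp32
    tl.store(c_ptr + idx, c)

BLOCK = 32
a = torch.randn(BLOCK, BLOCK, device="cuda", dtype=torch.float16)
b = torch.randn(BLOCK, BLOCK, device="cuda", dtype=torch.float16)
c = torch.empty(BLOCK, BLOCK, device="cuda", dtype=torch.float32)
fp8_dot_kernel[(1,)](a, b, c, BLOCK=BLOCK)
print(c)
```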