Horace He
@SunMarc I think there might still be some gaps in how the kv-cache is handled during inference. Specifically, the link you sent is about vision models, not text generation. We...
@yhyu13 > https://github.com/turboderp/exllamav2 has benchmarked llama-7b at 190+ t/s on a single 3090Ti, which matches this repo on 8xA100, but a 3090Ti has only about 1/3 the FLOPS of a single A100....
It's similar to the llama architecture, so it should be easy to modify `model.py` to support it.
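For a llama-like model, most of the change is just registering the new hyperparameters; the existing attention/FFN blocks can usually be reused as-is. A minimal sketch, assuming `model.py` keeps per-model hyperparameters in a `transformer_configs`-style dict feeding a `ModelArgs` dataclass (the entry name and numbers below are placeholders, to be copied from the model's `config.json`):

```python
# model.py (sketch): register the new architecture next to the existing entries.
transformer_configs = {
    # ... existing entries ...
    "my-llama-like-7B": dict(
        n_layer=32,               # transformer blocks
        n_head=32,                # attention heads
        n_local_heads=32,         # kv heads (set lower than n_head for GQA)
        dim=4096,                 # hidden size
        intermediate_size=11008,  # FFN width
        vocab_size=32000,
        block_size=4096,          # max sequence length / rope cache size
    ),
}
```

If the architecture deviates from llama in other ways (e.g. a different norm or positional embedding), the corresponding modules would need small edits as well.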
Oh sorry, this is a note I should add to the README. This repo currently cannot efficiently support using an int8 quantized model as the verifier model. Basically, Inductor can...
@jamestwhedbee There are a couple of scripts in the `scripts` folder that should result in speedups. In particular, you should try `./scripts/speculate_tp_70B_bf16.sh`. EDIT: There seems to be some kind of issue...
It should share the same group size support and such. I’m not sure about activation order. One note is that for 4-bit support we do require the weights to be...
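To make the group-size point concrete, here is a toy sketch of per-group 4-bit weight quantization along the input dimension. This is illustrative only (the function name and packing here are my own, not the repo's actual kernel), but it shows the shape constraint that group-size support implies:

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, groupsize: int = 128):
    """Toy symmetric per-group int4 quantization along the input dimension.

    Illustrative only -- the real kernel packs the 4-bit values differently.
    """
    out_features, in_features = w.shape
    # The weight's input dimension has to line up with the chosen group size.
    assert in_features % groupsize == 0, "in_features must be divisible by groupsize"
    wg = w.reshape(out_features, in_features // groupsize, groupsize)
    # One scale per group; int4 range is [-8, 7].
    scales = (wg.abs().amax(dim=-1, keepdim=True) / 7).clamp(min=1e-8)
    q = torch.clamp(torch.round(wg / scales), -8, 7).to(torch.int8)
    return q.reshape(out_features, in_features), scales.squeeze(-1)

# Example: quantize a 4096x4096 projection with group size 128.
q, scales = quantize_int4_groupwise(torch.randn(4096, 4096), groupsize=128)
```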
It's just an example PR - not intended to be merged.
Low, maybe 5 tokens?
Nothing is generated in the model folder? Can you provide more details on what's being printed?
The performance here is a lot lower than I'd expect. What GPU are you using? As for the quantization note, perhaps the issue is that you're running out of CPU...