Horace He
@SunMarc I think there might still be some gaps in how the kv-cache is handled during inference. Specifically, the link you sent is about vision models, not text generation. We...
@yhyu13 > https://github.com/turboderp/exllamav2 has benchmarked llama-7b at 190+ t/s on a single 3090Ti, which matches this repo on 8xA100, but a 3090Ti has only about 1/3 the FLOPS of a single A100....
It's similar to the llama architecture, so it should be easy to modify `model.py` to support it.
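For a llama-like model, most of the change is just registering the new hyperparameters; the existing attention/FFN blocks can usually be reused as-is. A minimal sketch, assuming `model.py` keeps per-model hyperparameters in a `transformer_configs`-style dict feeding a `ModelArgs` dataclass (the entry name and numbers below are placeholders, to be copied from the model's `config.json`):

```python
# model.py (sketch): register the new architecture next to the existing entries.
transformer_configs = {
    # ... existing entries ...
    "my-llama-like-7B": dict(
        n_layer=32,               # transformer blocks
        n_head=32,                # attention heads
        n_local_heads=32,         # kv heads (set lower than n_head for GQA)
        dim=4096,                 # hidden size
        intermediate_size=11008,  # FFN width
        vocab_size=32000,
        block_size=4096,          # max sequence length / rope cache size
    ),
}
```

If the architecture deviates from llama in other ways (e.g. a different norm or positional embedding), the corresponding modules would need small edits as well.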
Oh sorry, this is a note I should add to the README. This repo currently cannot efficiently support using an int8 quantized model as the verifier model. Basically, Inductor can...
@jamestwhedbee There are a couple of scripts in the `scripts` folder that should result in speedups. In particular, you should try `./scripts/speculate_tp_70B_bf16.sh`. EDIT: There seems to be some kind of issue...
It should share the same group size support and such. I’m not sure about activation order. One note is that for 4-bit support we do require the weights to be...
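To make the group-size point concrete, here is a toy sketch of per-group 4-bit weight quantization along the input dimension. This is illustrative only (the function name and packing here are my own, not the repo's actual kernel), but it shows the shape constraint that group-size support implies:

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, groupsize: int = 128):
    """Toy symmetric per-group int4 quantization along the input dimension.

    Illustrative only -- the real kernel packs the 4-bit values differently.
    """
    out_features, in_features = w.shape
    # The weight's input dimension has to line up with the chosen group size.
    assert in_features % groupsize == 0, "in_features must be divisible by groupsize"
    wg = w.reshape(out_features, in_features // groupsize, groupsize)
    # One scale per group; int4 range is [-8, 7].
    scales = (wg.abs().amax(dim=-1, keepdim=True) / 7).clamp(min=1e-8)
    q = torch.clamp(torch.round(wg / scales), -8, 7).to(torch.int8)
    return q.reshape(out_features, in_features), scales.squeeze(-1)

# Example: quantize a 4096x4096 projection with group size 128.
q, scales = quantize_int4_groupwise(torch.randn(4096, 4096), groupsize=128)
```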
It's just an example PR - not intended to be merged.
Low, maybe 5 tokens?
Nothing is generated in the model folder? Can you provide more details on what's being printed?
The performance here is a lot lower than I'd expect. What GPU are you using? As for the quantization note, perhaps the issue is that you're running out of CPU...