gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
### Some context
I am using AMD MI100 GPUs and I can get ~33 tokens/second for Llama 2 70B using:
- compile
- tensor parallelism of 8
- int8 quantization...
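For reference, a launch command along these lines, going by memory of the gpt-fast README's tensor-parallel example (the env var, flags, and checkpoint path are assumptions, not confirmed by this report):

```
ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 generate.py \
    --compile --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
```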
Qwen-14B: https://github.com/QwenLM/Qwen
The scripts/download.py downloads the whole Hugging Face repo, including the *.safetensors files, even though it seems only the *.bin files are required.
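A minimal sketch of how the download could skip the unneeded files, assuming the script uses huggingface_hub's snapshot_download (the repo id and local path here are illustrative):

```python
from huggingface_hub import snapshot_download

# Skip the *.safetensors duplicates; only the *.bin weights appear to be needed.
snapshot_download(
    "meta-llama/Llama-2-7b-chat-hf",  # illustrative repo id
    local_dir="checkpoints/meta-llama/Llama-2-7b-chat-hf",
    ignore_patterns=["*.safetensors"],
)
```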
Running ``quantize.py`` with ``--mode int4-gptq`` does not seem to work:
- code tries to import ``lm-evaluation-harness``, which is not included/documented/used
- import in ``eval.py`` is incorrect, should probably be ``from...
In this case I'm guessing that for fp8 you might not need a scale parameter for the weights, since each fp8 value carries its own exponent and so effectively its own scaling factor. I haven't done any...
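A hypothetical sketch of that intuition (not code from this repo): because each fp8 value has its own exponent, a straight cast already adapts per element, unlike int8 symmetric quantization, which needs an explicit per-channel scale tensor. Assumes PyTorch >= 2.1 with float8 dtypes:

```python
import torch

w = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Straight cast: each fp8 value carries its own exponent, so no separate
# scale tensor is strictly required for a weight-only round trip.
w_fp8 = w.to(torch.float8_e4m3fn)

# Dequantize back to bf16 before the matmul (weight-only quantization).
w_deq = w_fp8.to(torch.bfloat16)
print((w - w_deq).abs().max())  # error comes from the fp8 mantissa only
```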
Trying to quantize, but no model is generated. My hardware is AMD.

```
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
Loading model ...
Quantizing model weights for int8 weight-only symmetric per-channel quantization...
```
I am getting ``NameError: name 'InputRecorder' is not defined`` while creating the int4 and int4-gptq models.
With longer input lengths, the prefill-phase latency would be higher. Could you share the model input token count used when obtaining the results in this post? https://pytorch.org/blog/accelerating-generative-ai-2/
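To illustrate why the input token count matters (a self-contained sketch, not the blog's benchmark; shapes and sizes are made up): prefill processes the whole prompt at once, and attention cost grows superlinearly in prompt length T.

```python
import time
import torch
import torch.nn.functional as F

d_model, heads = 4096, 32  # illustrative Llama-like dimensions
for T in (128, 512, 2048):
    q = k = v = torch.randn(1, heads, T, d_model // heads)
    t0 = time.perf_counter()
    F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(f"prompt len {T}: {time.perf_counter() - t0:.4f}s")
```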