
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

Results: 132 gpt-fast issues

### Some context: I am using AMD MI100 GPUs, and I can get ~33 tokens/second for Llama 2 70B using compile, tensor parallelism of 8, and int8 quantization...

The scripts/download.py script downloads the whole Hugging Face repo, including the *.safetensors files, even though only the *.bin files appear to be required.
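One way to avoid downloading unneeded files is the `allow_patterns` argument of `huggingface_hub.snapshot_download`. The sketch below mirrors that pattern-matching behavior with a small standalone filter; `needed_files` and its default pattern list are hypothetical illustrations, not part of gpt-fast, and the exact set of files a given checkpoint needs may differ.

```python
from fnmatch import fnmatch

def needed_files(names, allow=("*.bin", "*.json", "tokenizer.model")):
    """Keep only filenames matching one of the allow patterns.

    Mimics huggingface_hub's allow_patterns filtering: a file is kept
    if any shell-style pattern matches it, so *.safetensors is skipped.
    """
    return [n for n in names if any(fnmatch(n, p) for p in allow)]

# Example repo listing: the .safetensors duplicate is filtered out.
files = [
    "model-00001-of-00002.bin",
    "model.safetensors",
    "config.json",
    "tokenizer.model",
]
print(needed_files(files))
```

With the real API, the equivalent would be passing `allow_patterns=["*.bin", "*.json", "tokenizer.model"]` to `snapshot_download`, assuming those patterns cover everything the conversion script reads.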

Running ``quantize.py`` with ``--mode int4-gptq`` does not seem to work: the code tries to import ``lm-evaluation-harness``, which is not included/documented/used, and an import in ``eval.py`` is incorrect; it should probably be ``from...

In this case I'm guessing that for fp8 you might not need a scale parameter for the weights, since each weight has its own scaling factor. I haven't done any...
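The intuition in the comment above is that an fp8 value, unlike an int8 code, carries its own exponent, acting as a built-in per-value scale. As a minimal sketch (not gpt-fast code), the decoder below interprets a byte in the OCP FP8 E4M3 layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.

```python
def e4m3_to_float(b):
    """Decode one byte in the OCP FP8 E4M3 format.

    The 4-bit exponent gives each value its own power-of-two scale,
    which is why a separate weight scale may be less critical than
    in fixed-point int8 quantization.
    """
    sign = -1.0 if b & 0x80 else 1.0
    exp = (b >> 3) & 0xF
    man = b & 0x7
    if exp == 0:                       # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** -6
    if exp == 0xF and man == 0x7:      # E4M3 has no infinities; this is NaN
        return float("nan")
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

print(e4m3_to_float(0x38))  # 1.0
print(e4m3_to_float(0x7E))  # 448.0, the largest normal E4M3 value
```

Whether a per-tensor scale is still worthwhile for fp8 in practice depends on the weight distribution relative to E4M3's dynamic range; the excerpt above is only speculating.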


Trying to quantize, but no model is generated. My hardware is AMD.

```shell
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
```

Output: `Loading model ... Quantizing model weights for int8 weight-only symmetric per-channel quantization...`
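For reference, the quantization scheme named in that log line, int8 weight-only symmetric per-channel quantization, can be sketched as follows. This is a hypothetical pure-Python illustration (`quantize_per_channel` is not a gpt-fast function): each output channel (row) gets one scale, chosen so the row's largest magnitude maps to 127.

```python
def quantize_per_channel(weight):
    """Symmetric per-channel int8 quantization.

    weight: list of rows, one per output channel.
    Returns (int8 rows, per-row scales) such that w ~= q * scale.
    """
    q_rows, scales = [], []
    for row in weight:
        # Symmetric: scale maps the largest |w| in the row to 127.
        scale = max(abs(w) for w in row) / 127.0 or 1.0  # avoid 0 for all-zero rows
        q_rows.append([round(w / scale) for w in row])
        scales.append(scale)
    return q_rows, scales

q, s = quantize_per_channel([[4.0, -2.0, 1.0], [0.5, 0.25, 0.0]])
print(q, s)
```

Dequantization is just `q * scale` per row; the rounding error per weight is bounded by half a quantization step, i.e. `scale / 2`.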

Getting ``NameError: name 'InputRecorder' is not defined`` while creating int4 and int4-gptq models.

With a longer input length, the prefill-phase latency would be higher. Could you share the model input token count used when obtaining the results in this post? https://pytorch.org/blog/accelerating-generative-ai-2/?utm_content=273712248&utm_medium=social&utm_source=twitter&hss_channel=tw-776585502606721024