gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
### Some context
I am using AMD MI100 GPUs and I can get ~33 tokens/second for Llama 2 70B using:
- compile
- tensor parallelism of 8
- int8 quantization...
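For reference, a launch command along these lines, going by memory of the gpt-fast README's tensor-parallel example (the env var, flags, and checkpoint path are assumptions, not confirmed by this report):

```
ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=8 generate.py \
    --compile --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
```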
Qwen-14B: https://github.com/QwenLM/Qwen
The scripts/download.py downloads the whole Hugging Face repo, including the *.safetensors files, even though it seems only the *.bin files are required.
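A minimal sketch of how the download could skip the unneeded files, assuming the script uses huggingface_hub's snapshot_download (the repo id and local path here are illustrative):

```python
from huggingface_hub import snapshot_download

# Skip the *.safetensors duplicates; only the *.bin weights appear to be needed.
snapshot_download(
    "meta-llama/Llama-2-7b-chat-hf",  # illustrative repo id
    local_dir="checkpoints/meta-llama/Llama-2-7b-chat-hf",
    ignore_patterns=["*.safetensors"],
)
```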
Running ``quantize.py`` with ``--mode int4-gptq`` does not seem to work:
- code tries to import ``lm-evaluation-harness``, which is not included/documented/used
- import in ``eval.py`` is incorrect, should probably be ``from...
In this case I'm guessing that for fp8 you might not need a scale parameter for the weights, since each fp8 value carries its own exponent and so effectively its own scaling factor. I haven't done any...
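A hypothetical sketch of that intuition (not code from this repo): because each fp8 value has its own exponent, a straight cast already adapts per element, unlike int8 symmetric quantization, which needs an explicit per-channel scale tensor. Assumes PyTorch >= 2.1 with float8 dtypes:

```python
import torch

w = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Straight cast: each fp8 value carries its own exponent, so no separate
# scale tensor is strictly required for a weight-only round trip.
w_fp8 = w.to(torch.float8_e4m3fn)

# Dequantize back to bf16 before the matmul (weight-only quantization).
w_deq = w_fp8.to(torch.bfloat16)
print((w - w_deq).abs().max())  # error comes from the fp8 mantissa only
```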
Trying to quantize, but no model is generated. My hardware is AMD.

```
python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
Loading model ...
Quantizing model weights for int8 weight-only symmetric per-channel quantization...
```
I am getting ``NameError: name 'InputRecorder' is not defined`` while creating the int4 and int4-gptq models.
With longer input lengths, the prefill-phase latency would be higher. Could you share the model input token count used when obtaining the results in this post? https://pytorch.org/blog/accelerating-generative-ai-2/
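To illustrate why the input token count matters (a self-contained sketch, not the blog's benchmark; shapes and sizes are made up): prefill processes the whole prompt at once, and attention cost grows superlinearly in prompt length T.

```python
import time
import torch
import torch.nn.functional as F

d_model, heads = 4096, 32  # illustrative Llama-like dimensions
for T in (128, 512, 2048):
    q = k = v = torch.randn(1, heads, T, d_model // heads)
    t0 = time.perf_counter()
    F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(f"prompt len {T}: {time.perf_counter() - t0:.4f}s")
```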