gpt-fast

Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.

Results: 132 gpt-fast issues, sorted by recently updated

Int4 quantization requires a CUDA device; however, in the current implementation the `--device` param is unconditionally overridden with 'cpu'.
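A minimal sketch of the kind of fix the report points at: keep the user-supplied `--device` instead of hardcoding 'cpu', and only validate it for int4. The flag names and structure here are illustrative assumptions, not gpt-fast's actual quantize.py.

```python
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--device", type=str, default="cuda",
                    help="Device to run quantization on")
parser.add_argument("--mode", type=str, default="int8",
                    choices=["int8", "int4"])
args = parser.parse_args()

# Respect the user's choice rather than unconditionally overriding it
# with 'cpu'; int4 kernels need a CUDA device, so validate instead.
device = torch.device(args.device)
if args.mode == "int4" and device.type != "cuda":
    raise RuntimeError("int4 quantization requires a CUDA device; "
                       "pass --device cuda")
```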

CLA Signed

Hi, I'm trying to get an example working with Ray on Databricks, essentially having multiple replicas of the model. Is it possible to load a model with tensor parallelism inside...
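One common shape for this setup, as a hedged sketch only (not tied to gpt-fast's code): one Ray actor per GPU, each holding a full model replica. Running tensor parallelism *inside* each replica would additionally require initializing a `torch.distributed` process group across a group of actors, which this sketch does not attempt; the model here is a stand-in `nn.Linear`.

```python
import ray
import torch
import torch.nn as nn

ray.init()  # on Databricks this typically attaches to the existing Ray cluster

@ray.remote(num_gpus=1)
class ModelReplica:
    """One actor = one GPU = one full copy of the model."""

    def __init__(self):
        # Stand-in for loading a real checkpoint with torch.load(...).
        self.model = nn.Linear(16, 16).to("cuda")

    def generate(self, x: list[float]) -> list[float]:
        with torch.no_grad():
            t = torch.tensor(x, device="cuda")
            return self.model(t).tolist()

# Two independent replicas; Ray schedules each onto its own GPU
# (requires at least two visible GPUs).
replicas = [ModelReplica.remote() for _ in range(2)]
results = ray.get([r.generate.remote([0.0] * 16) for r in replicas])
```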

Perf numbers for the Llama3-8B implementation added by https://github.com/pytorch-labs/gpt-fast/pull/158

CLA Signed

gpt-fast uses `torch.load` with `mmap=True` to load model checkpoints, which may help speed up model load time. However, mmap ends up not being used for bf16, because in https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L247,...
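The gist of the report, as a hedged illustration rather than the repo's exact code: `torch.load(..., mmap=True)` keeps tensors backed by the checkpoint file, but any dtype conversion afterwards allocates fresh in-memory copies, so the mmap benefit is lost once weights have to be cast to bf16.

```python
import torch

# Save a small dummy checkpoint (fp32 on disk) just for demonstration.
sd = {"w": torch.randn(1024, 1024)}
torch.save(sd, "ckpt.pt")

# mmap=True: tensors are lazily backed by the file, so loading is cheap.
loaded = torch.load("ckpt.pt", mmap=True, map_location="cpu")

# But converting dtype materializes a full in-memory copy, which
# defeats the point of mmap for the converted tensor.
bf16_w = loaded["w"].to(torch.bfloat16)
```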

Improve code quality: the `empty` variable is redundant, so this change removes it.

CLA Signed

INT8 quantization works fine, but INT4 does not work. ![Capture](https://github.com/pytorch-labs/gpt-fast/assets/106262476/ac10df53-860e-4da9-b51e-1ad17e3fe3c4)

Download the tinyllamas weights from https://huggingface.co/karpathy/tinyllamas/tree/main. Download the tinyllamas `tokenizer.model` from https://github.com/karpathy/llama2.c/raw/master/tokenizer.model.
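A short sketch of those two downloads in Python; the checkpoint filename `stories15M.pt` is an assumption about which file from the tinyllamas repo you want, so adjust it to the size you need.

```python
import urllib.request
from huggingface_hub import hf_hub_download

# Checkpoint from https://huggingface.co/karpathy/tinyllamas
# (stories15M.pt is assumed here; other sizes exist in the repo).
ckpt_path = hf_hub_download(repo_id="karpathy/tinyllamas",
                            filename="stories15M.pt")

# Tokenizer from the llama2.c repo, as linked above.
urllib.request.urlretrieve(
    "https://github.com/karpathy/llama2.c/raw/master/tokenizer.model",
    "tokenizer.model",
)
```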

CLA Signed

`n_local_heads` refers to TP sharding, rather than GQA.
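A short illustration of the two concepts the naming conflates; the numbers are made up and the variable names follow the usual conventions, not necessarily gpt-fast's model.py.

```python
# Made-up example configuration.
n_head = 32      # total query heads
n_kv_head = 8    # GQA: number of key/value heads shared across query heads
tp_size = 4      # tensor-parallel world size

# "Local" in the tensor-parallel sense: the slice of heads owned by one rank.
n_local_heads = n_head // tp_size        # 8 query heads per TP rank
n_local_kv_heads = n_kv_head // tp_size  # 2 kv heads per TP rank

# GQA by itself says nothing about sharding: each kv head serves
# n_head // n_kv_head = 4 query heads, regardless of tp_size.
```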

Currently the code only supports bs=1, with `input_pos` being one-dimensional. This fixes the `input_pos` shape in the comments.
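A hedged sketch of the shape convention being documented; the shapes are illustrative and the vocabulary size is arbitrary, not copied from gpt-fast's forward signature.

```python
import torch

# Batch size 1: a prompt of T tokens plus the positions they occupy
# in the KV cache.
T = 5
tokens = torch.randint(0, 32000, (1, T))   # shape [1, T]  (bs=1)
input_pos = torch.arange(0, T)             # shape [T], one-dimensional

# During incremental decoding, a single new token gets a single position:
next_token = torch.randint(0, 32000, (1, 1))  # shape [1, 1]
next_pos = torch.tensor([T])                  # shape [1]
```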

CLA Signed

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
* __->__ #155

Summary:
hqq wikitext: {'word_perplexity,none': 12.698986130023261, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6084602387562144, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6856802729143467, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
not hqq wikitext: {'word_perplexity,none':...

CLA Signed