gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
Int4 quantization requires a CUDA device; however, in the current implementation the `--device` parameter was unconditionally overridden with 'cpu'.
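A minimal sketch of the kind of fix this implies, assuming the quantization entry point exposes a `--device` flag; the argument names below are illustrative, not necessarily gpt-fast's actual CLI:

```python
import argparse
import torch

def main() -> None:
    parser = argparse.ArgumentParser(description="Quantize a checkpoint (illustrative sketch).")
    parser.add_argument("--mode", choices=["int8", "int4"], default="int8")
    # Hypothetical flag: respect the user's choice instead of hardcoding 'cpu'.
    parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu")
    args = parser.parse_args()

    # Int4 packing kernels typically require CUDA, so fail loudly rather than
    # silently overriding the requested device with 'cpu'.
    if args.mode == "int4" and not args.device.startswith("cuda"):
        raise ValueError("int4 quantization requires a CUDA device; pass --device cuda")

    device = torch.device(args.device)
    print(f"quantizing in {args.mode} mode on {device}")

if __name__ == "__main__":
    main()
```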
Hi, I'm trying to get an example working with Ray on Databricks, essentially having multiple replicas of the model. Is it possible to load a model with tensor parallelism inside...
Performance numbers for the Llama3-8B implementation added by https://github.com/pytorch-labs/gpt-fast/pull/158
gpt-fast uses `torch.load` with `mmap=True` to load model checkpoints, which can speed up model load time. However, mmap ends up not being used for bf16, because in https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L247,...
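A rough sketch of the pattern in question: `torch.load(..., mmap=True)` keeps the checkpoint memory-mapped, but a subsequent dtype conversion (e.g. an unconditional cast to bf16) copies every tensor into regular memory, which is where the benefit is lost. The helper below only illustrates the idea and is not the repository's actual loading code:

```python
import torch
import torch.nn as nn

def load_checkpoint(model: nn.Module, checkpoint_path: str, dtype: torch.dtype | None = None) -> None:
    # mmap=True (PyTorch >= 2.1) maps the file instead of reading it eagerly,
    # so tensors are paged in lazily as they are accessed.
    state_dict = torch.load(checkpoint_path, map_location="cpu", mmap=True, weights_only=True)

    if dtype is not None:
        # Caveat: converting dtypes materializes tensors in RAM, so casting
        # unconditionally defeats mmap when the checkpoint is already stored
        # in the target dtype. Only convert tensors that actually need it.
        state_dict = {k: v.to(dtype) if v.dtype != dtype else v for k, v in state_dict.items()}

    # assign=True reuses the loaded tensors instead of copying them into
    # the module's existing parameters.
    model.load_state_dict(state_dict, assign=True)
```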
Improve code quality: the `empty` variable is redundant, so this change removes it.
INT8 quantization works fine, but INT4 does not work. 
Download the tinyllamas weights from https://huggingface.co/karpathy/tinyllamas/tree/main and the tinyllamas `tokenizer.model` from https://github.com/karpathy/llama2.c/raw/master/tokenizer.model.
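A hedged Python sketch of one way to fetch those two files, using `huggingface_hub` for the weights and `urllib` for the tokenizer; the weight filename (`stories15M.pt`) is an assumption here and just one of the checkpoints in that repo:

```python
import urllib.request
from huggingface_hub import hf_hub_download

# Weight checkpoint from karpathy/tinyllamas (filename assumed; pick the size you want).
weights_path = hf_hub_download(repo_id="karpathy/tinyllamas", filename="stories15M.pt")
print(f"weights downloaded to {weights_path}")

# Tokenizer model from the llama2.c repository.
tokenizer_url = "https://github.com/karpathy/llama2.c/raw/master/tokenizer.model"
urllib.request.urlretrieve(tokenizer_url, "tokenizer.model")
print("tokenizer.model downloaded")
```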
`n_local_heads` refers to TP sharding rather than GQA.
Currently the code only supports bs=1, with `input_pos` being one-dimensional. This fixes the `input_pos` shape in the comments.
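For reference, a small sketch of what the one-dimensional `input_pos` convention at bs=1 looks like in practice (illustrative, not taken verbatim from generate.py):

```python
import torch

prompt_len = 5

# Prefill: one position index per prompt token, shape [prompt_len].
input_pos = torch.arange(0, prompt_len)
assert input_pos.ndim == 1

# Decode: a single position for the next token, shape [1].
next_pos = torch.tensor([prompt_len])
assert next_pos.shape == (1,)
```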
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #155

Summary:

hqq wikitext: {'word_perplexity,none': 12.698986130023261, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6084602387562144, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6856802729143467, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}

not hqq wikitext: {'word_perplexity,none':...