exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
Has anyone gotten 16k context length with CodeLlama or Llama 2? I have tried multiple models, but they all start producing gibberish once the context window gets past 4096. I...
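A minimal sketch of one common cause, assuming a recent exllama where `ExLlamaConfig` exposes `max_seq_len`, `compress_pos_emb`, and (in newer versions) `alpha_value`: the context window has to be raised *and* matched with the RoPE scaling the model was fine-tuned for, otherwise output tends to degrade past 4096. Paths below are placeholders.

```python
from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig("/path/to/model/config.json")    # hypothetical path
config.model_path = "/path/to/model/model.safetensors"  # hypothetical path
config.max_seq_len = 16384         # target context window
config.compress_pos_emb = 4.0      # linear RoPE scaling: 16384 / 4096, for models trained that way
# config.alpha_value = 4.0         # or NTK-style scaling, if your version supports it

model = ExLlama(config)
cache = ExLlamaCache(model)
```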
I want to drop tokens that exceed max_seq_len. How can I achieve this?
Where is this handled in the code?
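I can't point at the exact place in the code, but as a caller-side sketch you can truncate the prompt ids to the most recent tokens before generation; the function below is hypothetical, not exllama API:

```python
import torch

def truncate_ids(input_ids: torch.Tensor, max_seq_len: int, reserve: int = 256) -> torch.Tensor:
    """Keep only the newest (max_seq_len - reserve) token ids, dropping the oldest."""
    keep = max(max_seq_len - reserve, 1)
    return input_ids[:, -keep:] if input_ids.shape[-1] > keep else input_ids
```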
I have a machine with MI25 GPUs. Would anybody like SSH access to it to develop exllama support?
The current version of CUDA lets you access the component halves of a half2 through half2.x and half2.y, but in HIP, x and y are unsigned shorts rather than half...
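A hedged sketch of a portable workaround, using the component intrinsics (`__low2half`, `__high2half`, `__halves2half2`) instead of direct member access; I'm assuming these are available on both backends via hipify:

```cuda
#include <cuda_fp16.h>   // hipify is assumed to map this to hip/hip_fp16.h

// Swap the two halves of a half2 without touching .x / .y directly,
// so the same code compiles under both CUDA and HIP.
__device__ __forceinline__ half2 swap_halves(half2 v)
{
    half lo = __low2half(v);    // extract the low component as a half
    half hi = __high2half(v);   // extract the high component as a half
    return __halves2half2(hi, lo);
}
```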
I have a GPU on which I want to load multiple models. exllama loads all weights to the GPU as soon as `ExLlama` is instantiated. Is it possible if...
Have you tried this yet? https://github.com/InternLM/lmdeploy In my initial testing with 7B and 13B models, there's a noticeable per-token latency improvement (measured as the time to generate the first 5 tokens).
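For what it's worth, here is roughly how I'd take that measurement; `generate_n_tokens` is a hypothetical wrapper around whichever backend is being compared, not an exllama or lmdeploy API:

```python
import time

def time_first_tokens(generate_n_tokens, prompt: str, n: int = 5) -> float:
    """Average wall-clock seconds per token over the first n generated tokens."""
    start = time.perf_counter()
    generate_n_tokens(prompt, n)   # generate exactly n tokens, then stop
    return (time.perf_counter() - start) / n
```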
Trying to learn more about the optimizations.
I'm new to exllama; are there any tutorials on how to use it? I'm trying it with the Llama-2 70B model.
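There's no full tutorial that I know of, but a minimal sketch along the lines of the repo's example scripts looks like this; the model path and the GPU split are assumptions (a 70B GPTQ model generally won't fit on a single consumer GPU, and `set_auto_map` is how recent versions split layers across devices, if I remember right):

```python
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/models/Llama-2-70B-GPTQ"                # hypothetical path
config = ExLlamaConfig(f"{model_dir}/config.json")
config.model_path = f"{model_dir}/model.safetensors"
config.set_auto_map("20,24")                          # assumed split across two GPUs (GB per device)

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(f"{model_dir}/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Hello, my name is", max_new_tokens=40))
```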
In `exllama/model.py`, line 45, in `__init__`: `self.pad_token_id = read_config["pad_token_id"]` raises `KeyError: 'pad_token_id'`.
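A workaround sketch, not a confirmed fix: my guess is that the model's config.json simply omits `pad_token_id`, so you can add the key before loading (or make the read in model.py tolerant with `read_config.get("pad_token_id", 0)`). The path and the default of 0 are assumptions; check your tokenizer.

```python
import json

cfg_path = "/path/to/model/config.json"   # hypothetical path
with open(cfg_path) as f:
    cfg = json.load(f)
cfg.setdefault("pad_token_id", 0)         # add the missing key only if absent
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```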