exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
https://github.com/vllm-project/vllm was just released publicly, claiming to be an inference library that accelerates HF Transformers by 24x
So my P40 is only drawing about 70 W while generating responses, and it's not limited in any way (i.e. power delivery or temperature)
Hi! While 3-bit and 2-bit quantisations are obviously less popular than 4-bit quantisations, I'm looking into the possibility of loading 13B models with 8 GB of VRAM. So far, loading...
Related issue (created by me): https://github.com/turboderp/exllama/issues/103
```
exllama_ext = load(
    name = extension_name,
    sources = [
        os.path.join(library_dir, "exllama_ext/exllama_ext.cpp"),
        os.path.join(library_dir, "exllama_ext/cuda_buffers.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/q4_matrix.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/q4_matmul.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/column_remap.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/rms_norm.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/rope.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/half_matmul.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/q4_attn.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/q4_mlp.cu"),
        os.path.join(library_dir, ...
```
LoRA support
Congrats, and thank you again for a project that changes everything. I can't use anything else, and now I even prefer your web UI to the standard text-generation-webui... In some instances...
I have noticed that while it massively increases inference speed, it massively decreases the quality of the outputs: instruct models become very obstinate and give completely irrelevant responses, words...
```
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 4
      2 config.model_path = model_path
      3 config.max_seq_len = 2048
----> 4 model = ExLlama(config)
      5 cache = ExLlamaCache(model)
      6 tokenizer...
```
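For context, here is a minimal sketch of the loading sequence that traceback implies, based on the ExLlamaConfig / ExLlama / ExLlamaCache / ExLlamaTokenizer classes from the repo's basic examples; the file paths are placeholders, not from the report.

```
# Minimal loading sketch (assumed paths; classes as in the repo's example scripts)
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer

model_dir = "/path/to/model"                            # placeholder directory

config = ExLlamaConfig(model_dir + "/config.json")      # HF model config
config.model_path = model_dir + "/model.safetensors"    # quantized weights
config.max_seq_len = 2048

model = ExLlama(config)                                 # the line raising above
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(model_dir + "/tokenizer.model")
```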
So I could just send a simple request and get a simple response in a free-form mode, without any additional context
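A sketch of what such a bare request/response call could look like, assuming the ExLlamaGenerator API from the repo (its generate_simple method) and the model, tokenizer, and cache objects built as in the loading sketch above; the prompt and token count are illustrative.

```
# Bare prompt-in, completion-out call with no surrounding context
from generator import ExLlamaGenerator

generator = ExLlamaGenerator(model, tokenizer, cache)
generator.settings.temperature = 0.7          # example sampling setting

response = generator.generate_simple("What is the capital of France?",
                                     max_new_tokens = 64)
print(response)
```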
Performance when generating with top_p = 1.0 is about 3x slower than with any other top_p value; to reproduce, compare 0.99 and 1.0. I've seen this bug with both the...
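A hedged repro sketch for the reported slowdown: time the same generation with top_p = 0.99 versus 1.0, reusing the generator from the sketch above; the prompt and token count are arbitrary.

```
# Time generation at the two top_p settings the report compares
import time

for top_p in (0.99, 1.0):
    generator.settings.top_p = top_p
    start = time.time()
    generator.generate_simple("Once upon a time", max_new_tokens = 128)
    print(f"top_p = {top_p}: {time.time() - start:.2f} s")
```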