exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
https://github.com/vllm-project/vllm was just released publicly, claiming to be an inference library that accelerates HF Transformers by 24x
So my P40 is only drawing about 70 W while generating responses, and it's not limited in any way (i.e. power delivery or temperature)
Hi! While 3-bit and 2-bit quantisations are obviously less popular than 4-bit quantisations, I'm looking into the possibility of loading 13B models with 8 GB of VRAM. So far, loading...
Related issue (created by me): https://github.com/turboderp/exllama/issues/103
```
exllama_ext = load(
    name = extension_name,
    sources = [
        os.path.join(library_dir, "exllama_ext/exllama_ext.cpp"),
        os.path.join(library_dir, "exllama_ext/cuda_buffers.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/q4_matrix.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/q4_matmul.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/column_remap.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/rms_norm.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/rope.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/half_matmul.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/q4_attn.cu"),
        os.path.join(library_dir, "exllama_ext/cuda_func/q4_mlp.cu"),
        os.path.join(library_dir, ...
```
LoRA support
Congrats, and thank you again for a project that changes everything. I can't use anything else, and now I even prefer your web UI to the standard text-generation-webui... In some instances...
I have noticed that while it massively increases inference speed, it massively decreases the quality of the outputs: instruct models become very obstinate and give completely irrelevant responses, words...
```
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 4
      2 config.model_path = model_path
      3 config.max_seq_len = 2048
----> 4 model = ExLlama(config)
      5 cache = ExLlamaCache(model)
      6 tokenizer...
```
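For context, here is a minimal sketch of the loading sequence that traceback implies, based on the ExLlamaConfig / ExLlama / ExLlamaCache / ExLlamaTokenizer classes from the repo's basic examples; the file paths are placeholders, not from the report.

```
# Minimal loading sketch (assumed paths; classes as in the repo's example scripts)
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer

model_dir = "/path/to/model"                            # placeholder directory

config = ExLlamaConfig(model_dir + "/config.json")      # HF model config
config.model_path = model_dir + "/model.safetensors"    # quantized weights
config.max_seq_len = 2048

model = ExLlama(config)                                 # the line raising above
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(model_dir + "/tokenizer.model")
```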
So I could just send a simple request and get a simple response in a free-form mode, without any additional context
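A sketch of what such a bare request/response call could look like, assuming the ExLlamaGenerator API from the repo (its generate_simple method) and the model, tokenizer, and cache objects built as in the loading sketch above; the prompt and token count are illustrative.

```
# Bare prompt-in, completion-out call with no surrounding context
from generator import ExLlamaGenerator

generator = ExLlamaGenerator(model, tokenizer, cache)
generator.settings.temperature = 0.7          # example sampling setting

response = generator.generate_simple("What is the capital of France?",
                                     max_new_tokens = 64)
print(response)
```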
Performance when generating with top_p = 1.0 is about 3x slower than with any other top_p value; to reproduce, compare 0.99 and 1.0. I've seen this bug with both the...
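A hedged repro sketch for the reported slowdown: time the same generation with top_p = 0.99 versus 1.0, reusing the generator from the sketch above; the prompt and token count are arbitrary.

```
# Time generation at the two top_p settings the report compares
import time

for top_p in (0.99, 1.0):
    generator.settings.top_p = top_p
    start = time.time()
    generator.generate_simple("Once upon a time", max_new_tokens = 128)
    print(f"top_p = {top_p}: {time.time() - start:.2f} s")
```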