exllama
Issue when attempting to run exllama (P40)
When starting with the command 'python test_benchmark_inference.py -d /home/rexommendation/Programs/KoboldAI/models/30B-Lazarus-GPTQ4bit -p -ppl' (I keep my models under other programs' directories), I get the following error:
Traceback (most recent call last):
  File "/home/rexommendation/Programs/exllama/test_benchmark_inference.py", line 1, in <module>
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
  File "/home/rexommendation/Programs/exllama/model.py", line 12, in <module>
    import cuda_ext
  File "/home/rexommendation/Programs/exllama/cuda_ext.py", line 43, in <module>
    exllama_ext = load(
  File "/home/rexommendation/Programs/exllama/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/rexommendation/Programs/exllama/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/rexommendation/Programs/exllama/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1601, in _write_ninja_file_and_build_library
    extra_ldflags = _prepare_ldflags(
  File "/home/rexommendation/Programs/exllama/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1699, in _prepare_ldflags
    extra_ldflags.append(f'-L{_join_cuda_home("lib64")}')
  File "/home/rexommendation/Programs/exllama/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2223, in _join_cuda_home
    raise EnvironmentError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
This seems like an environment issue rather than anything to do with exllama. Do you have CUDA installed correctly on your system?
yeah, via the package manager
Is your CUDA_HOME set?
@turboderp
How would I do that with the pacman version?
GPTQ (https://github.com/0cc4m/GPTQ-for-LLaMa) works just fine but I wonder what you guys do differently?
Well env | grep CUDA should tell you if the environment variable is set. If not, export CUDA_HOME=<path_to_cuda_home>.
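On an Arch-style pacman install that usually looks something like the sketch below. The /opt/cuda path is an assumption (it's the default for Arch's cuda package), so verify it against where nvcc actually lives on your system.
# Check whether anything CUDA-related is already exported
env | grep CUDA
# Point CUDA_HOME at the toolkit root (assumed /opt/cuda here)
export CUDA_HOME=/opt/cuda
export PATH="$CUDA_HOME/bin:$PATH"
# Sanity check: nvcc should now be found
nvcc --version
Putting the exports in your shell profile makes them stick across sessions.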
As for the differences between ExLlama and GPTQ-for-LLaMa, they are numerous. ExLlama doesn't install a permanent CUDA extension, for one; it JIT-compiles it at load time (that's the torch.utils.cpp_extension.load call in your traceback), which is why it needs the CUDA toolkit properly set up. What's your nvcc version?
I installed CUDA this way and now nvcc --version shows
What model are you running, and how is the performance? I am curious about getting a P40 just for an LLM.
ExLlama really doesn't like P40s: all the heavy math it does is in FP16, and P40s are very, very poor at FP16. A P100 (or three) would work better, since its FP16 performance is quite good (over 100x the P40's despite also being Pascal, for unintelligible Nvidia reasons), as would anything Volta/Turing or newer, provided there's enough VRAM. For P40s, the fastest current project is probably llama.cpp; their CUDA dev has something like three of them and has optimized specifically for them.
Interesting. I have tried llama.cpp with an RTX 4000 GPU and did not see much improvement with a Vicuna 13B model in GGML format. Perhaps I should use a different model. ExLlama is at least 10 times faster on my GPU.
If llama.cpp is that much slower, I'd double-check your --n-gpu-layers argument; I believe it should be set >= 43 for all layers of a 13B model to be offloaded to the GPU (see the sketch below). But yes, I would expect an RTX 4000 (either the Turing or the Ada version; also, what is Nvidia thinking with that model naming scheme? "NVIDIA RTX 4000 SFF Ada Generation", which is completely different from the "NVIDIA Quadro RTX 4000" from five years earlier? Really? Oops, digressing) to perform better with ExLlama than llama.cpp, but probably not by a factor of 10.
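For reference, a rough sketch of the kind of invocation I mean; the binary name, model path and file name are placeholders, and --n-gpu-layers (-ngl) is the offload flag in llama.cpp's main example.
# A 13B LLaMA model has 40 transformer layers; 43 covers them plus the non-repeating layers
./main -m ./models/vicuna-13b.ggmlv3.q4_0.bin --n-gpu-layers 43 -p "Hello" -n 128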
I think my GPU build of llama.cpp is not working properly; even with all layers enabled, for some reason it does not use enough of the available VRAM. I will rebuild it; perhaps my architecture is not set correctly. Thank you! I will have another look at it!
Maybe it wasn't compiled with cuBLAS? I've neglected to set the right compiler flags before, and llama.cpp just silently falls back to pure CPU. Or, well, it will positively tell you that you are using CUDA by printing something like
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
but if it's not set up to support CUDA, it just ignores args like --n-gpu-layers instead of complaining like it probably should.
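If cuBLAS turns out to be missing, rebuilding with the right flag is the usual fix; a sketch of the two build routes llama.cpp documented around that time (LLAMA_CUBLAS for the Makefile, -DLLAMA_CUBLAS=ON for CMake).
# Makefile build with cuBLAS enabled
make clean
LLAMA_CUBLAS=1 make -j
# or the CMake route
mkdir -p build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
If the resulting binary prints the "using CUDA for GPU acceleration" lines above at startup, the offload flags should then actually take effect.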
llama.cpp seems to build fine for me now and the GPU works, but my issue was mainly with the llama-node implementation of it. I should have just started with llama.cpp. No matter what I do, llama-node uses the CPU. The llmatic package uses llama-node to provide an OpenAI-compatible API, which is very useful, since most chat UIs are built around it. Thank you for your help; llama.cpp is indeed fast!