exllama
Issue when attempting to run exllama (P40)
When starting with the command 'python test_benchmark_inference.py -d /home/rexommendation/Programs/KoboldAI/models/30B-Lazarus-GPTQ4bit -p -ppl' (I keep my models under other programs' directories), I get the following error:
Traceback (most recent call last):
  File "/home/rexommendation/Programs/exllama/test_benchmark_inference.py", line 1, in <module>
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
  File "/home/rexommendation/Programs/exllama/model.py", line 12, in <module>
    import cuda_ext
  File "/home/rexommendation/Programs/exllama/cuda_ext.py", line 43, in <module>
    exllama_ext = load(
  File "/home/rexommendation/Programs/exllama/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/rexommendation/Programs/exllama/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/rexommendation/Programs/exllama/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1601, in _write_ninja_file_and_build_library
    extra_ldflags = _prepare_ldflags(
  File "/home/rexommendation/Programs/exllama/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1699, in _prepare_ldflags
    extra_ldflags.append(f'-L{_join_cuda_home("lib64")}')
  File "/home/rexommendation/Programs/exllama/venv/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2223, in _join_cuda_home
    raise EnvironmentError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
This seems like an environment issue rather than anything to do with exllama. Do you have CUDA installed correctly on your system?
yeah, via the package manager
Is your CUDA_HOME set?
@turboderp
How would I do that with the pacman version?
GPTQ (https://github.com/0cc4m/GPTQ-for-LLaMa) works just fine but I wonder what you guys do differently?
Well env | grep CUDA should tell you if the environment variable is set. If not, export CUDA_HOME=<path_to_cuda_home>.
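On an Arch-style pacman install that usually looks something like the sketch below. The /opt/cuda path is an assumption (it's the default for Arch's cuda package), so verify it against where nvcc actually lives on your system.
# Check whether anything CUDA-related is already exported
env | grep CUDA
# Point CUDA_HOME at the toolkit root (assumed /opt/cuda here)
export CUDA_HOME=/opt/cuda
export PATH="$CUDA_HOME/bin:$PATH"
# Sanity check: nvcc should now be found
nvcc --version
Putting the exports in your shell profile makes them stick across sessions.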
As for the differences between ExLlama and GPTQ-for-LLaMa, they are numerous. ExLlama doesn't install a permanent CUDA extension, for one; it JIT-compiles it at load time (that's the torch.utils.cpp_extension.load call in your traceback), which is why it needs the CUDA toolkit properly set up. What's your nvcc version?
I installed CUDA this way and now nvcc --version shows
What model are you running, and how is the performance? I am curious about getting a P40 just for an LLM.
ExLlama really doesn't like P40s: all the heavy math it does is in FP16, and P40s are very, very poor at FP16. A P100 (or three) would work better, since its FP16 performance is quite good (over 100x the P40's despite also being Pascal, for unintelligible Nvidia reasons), as would anything Volta/Turing or newer, provided there's enough VRAM. For P40s, the fastest current project is probably llama.cpp; their CUDA dev has something like three of them and has optimized specifically for them.
Interesting. I have tried llama.cpp with an RTX 4000 GPU and did not see much improvement with a Vicuna 13B model in GGML format. Perhaps I should use a different model. ExLlama is at least 10 times faster on my GPU.
If llama.cpp is that much slower, I'd double-check your --n-gpu-layers argument; I believe it should be set >= 43 for all layers of a 13B model to be offloaded to the GPU (see the sketch below). But yes, I would expect an RTX 4000 (either the Turing or the Ada version; also, what is Nvidia thinking with that model naming scheme? "NVIDIA RTX 4000 SFF Ada Generation", which is completely different from the "NVIDIA Quadro RTX 4000" from five years earlier? Really? Oops, digressing) to perform better with ExLlama than llama.cpp, but probably not by a factor of 10.
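For reference, a rough sketch of the kind of invocation I mean; the binary name, model path and file name are placeholders, and --n-gpu-layers (-ngl) is the offload flag in llama.cpp's main example.
# A 13B LLaMA model has 40 transformer layers; 43 covers them plus the non-repeating layers
./main -m ./models/vicuna-13b.ggmlv3.q4_0.bin --n-gpu-layers 43 -p "Hello" -n 128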
I think my GPU build of llama.cpp is not working properly; even with all layers enabled, for some reason it does not use enough of the available VRAM. I will rebuild it; perhaps my architecture is not set correctly. Thank you! I will have another look at it!
Maybe it wasn't compiled with cuBLAS? I've neglected to set the right compiler flags before, and llama.cpp just silently falls back to pure CPU. Or, well, it will positively tell you that you are using CUDA by printing something like
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
but if it's not set up to support CUDA, it just ignores args like --n-gpu-layers instead of complaining like it probably should.
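If cuBLAS turns out to be missing, rebuilding with the right flag is the usual fix; a sketch of the two build routes llama.cpp documented around that time (LLAMA_CUBLAS for the Makefile, -DLLAMA_CUBLAS=ON for CMake).
# Makefile build with cuBLAS enabled
make clean
LLAMA_CUBLAS=1 make -j
# or the CMake route
mkdir -p build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
If the resulting binary prints the "using CUDA for GPU acceleration" lines above at startup, the offload flags should then actually take effect.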
llama.cpp seems to build fine for me now and the GPU works, but my issue was mainly with the llama-node implementation of it. I should have just started with llama.cpp. No matter what I do, llama-node uses the CPU. The llmatic package uses llama-node to provide an OpenAI-compatible API, which is very useful, since most chat UIs are built around it. Thank you for your help; llama.cpp is indeed fast!