FileNotFoundError: Could not find module '...ctransformers\lib\cuda\ctransformers.dll' (or one of its dependencies).
Hi, I could run the code below without problems last week, but I have been getting the error below for a few days now (after upgrading the ctransformers lib). I am currently unable to run ctransformers with any local LLM model. Can anyone help solve this? Thanks.
My PC: Win10, python 3.10.6, ctransformers 0.2.24
My Python code:

```
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    r'H:\TheBloke_Llama-2-13B-chat-GGML\llama-2-13b-chat.ggmlv3.q4_1.bin',
    model_type='llama', stream=True, gpu_layers=50)

while True:
    print("\n--------------------------\n")
    user_input = input("Your Input:")
    for chunk in llm(user_input, stream=True):
        print(chunk, end='', flush=True)
```
Error after upgrading the ctransformers lib:

```
Traceback (most recent call last):
  File "H:\localLlama-2-13B-Chat_StreamOutput.py", line 2, in <module>
FileNotFoundError: Could not find module '...ctransformers\lib\cuda\ctransformers.dll' (or one of its dependencies).
```
I verified the file exists:
'C:\Users\me\AppData\Local\Programs\Python\Python310\Lib\site-packages\ctransformers\lib\cuda\ctransformers.dll'
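For reference, loading the DLL directly shows whether the file itself or one of its dependencies is the problem. A minimal sketch, assuming the site-packages path above (adjust for your Python install):

```
import ctypes
import os

# Path copied from above; an assumption for illustration only.
dll_path = (r'C:\Users\me\AppData\Local\Programs\Python\Python310'
            r'\Lib\site-packages\ctransformers\lib\cuda\ctransformers.dll')

print("file exists:", os.path.exists(dll_path))

try:
    # WinDLL resolves the DLL *and* its dependencies (CUDA runtime,
    # cuBLAS, ...), so it raises the same FileNotFoundError if any
    # dependency is missing from the DLL search path.
    ctypes.WinDLL(dll_path)
    print("DLL and all of its dependencies loaded fine")
except FileNotFoundError as exc:
    print("a dependency is missing:", exc)
```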
Please run the following command and post the output:
pip show ctransformers nvidia-cuda-runtime-cu12 nvidia-cublas-cu12
Make sure you have installed the CUDA libraries using:
pip install ctransformers[cuda]
@marella Thank you for your hints. After re-installing these, it works fine:

```
pip install ctransformers[cuda]
pip install nvidia-cublas-cu11
pip install nvidia-cuda-runtime-cu11
```
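After reinstalling, a quick smoke test confirms the CUDA build loads and generates. This is a sketch reusing the model path and settings from the original post; adjust both for your setup:

```
from ctransformers import AutoModelForCausalLM

# Same model and gpu_layers as in the original post.
llm = AutoModelForCausalLM.from_pretrained(
    r'H:\TheBloke_Llama-2-13B-chat-GGML\llama-2-13b-chat.ggmlv3.q4_1.bin',
    model_type='llama', gpu_layers=50)

# If the CUDA DLLs resolve, this runs without the FileNotFoundError.
print(llm('Hello', max_new_tokens=8))
```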
I was having the same issue and the above solution worked. Thanks a lot, @marella. However, the model isn't utilizing the GPU properly.
OS: Windows 11
RAM: 32 gb
CPU: Intel i7-8550U @ 1.80 GHz
GPU: Geforce MX150
I'm using nvitop to monitor GPU usage, and here is what it looks like when I run a simple query on the llama-2-7b-chat 8-bit quantized GGML model.
GPU memory reaches ~50% when I load the model, but during inference GPU MEM increases to ~85% while GPU UTL stays around ~5%, occasionally fluctuating up to 30%. That doesn't seem right, because I run this model with the same code and settings on a different PC, where GPU UTL usually stays consistent at about 55-65%.
Code:
```
from ctransformers import AutoModelForCausalLM

model_name = "llama-2-7b-chat.ggmlv3.q8_0.bin"
llm = AutoModelForCausalLM.from_pretrained(
    f'../models/{model_name}',
    model_type='llama',
    gpu_layers=4,
    temperature=0.7,
    max_new_tokens=512,
    top_k=40,
    batch_size=8,
    repetition_penalty=1.2,
    top_p=0.70,
    local_files_only=True,
    context_length=2048)
system_message = "You are a respectful and helpful assistant. Understand the Instruction and respond appropriately"
instruction = "Write an acrostic poem in the style of Robert Frost about how humans are harming the earth."
prompt_template = f"""System: {system_message}
Instruction: {instruction}
Assistant: """
tokens = llm.tokenize(prompt_template)
generated_tokens = llm.generate(tokens)
generated_text = llm.detokenize(generated_tokens)
```
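One likely explanation for the low GPU UTL: gpu_layers=4 offloads only 4 of the 32 transformer layers in a 7B Llama model, so the bulk of each forward pass still runs on the CPU and the GPU sits mostly idle. A sketch of the fix, under the assumption that more layers fit in VRAM (an MX150 typically has only 2 GB, so raise gpu_layers gradually while watching GPU MEM in nvitop):

```
from ctransformers import AutoModelForCausalLM

# Assumption: more layers fit in VRAM. Start higher than 4 and back off
# if loading fails or GPU MEM in nvitop hits the ceiling.
llm = AutoModelForCausalLM.from_pretrained(
    '../models/llama-2-7b-chat.ggmlv3.q8_0.bin',
    model_type='llama',
    gpu_layers=32,        # all 32 layers of a 7B Llama; reduce if VRAM runs out
    context_length=2048)

print(llm('Hello', max_new_tokens=16))
```

With only a few layers offloaded, low GPU utilization is expected behavior rather than a bug; the other PC showing 55-65% UTL presumably has more VRAM and therefore more layers on the GPU.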