LMQL fails to use llama-cpp-python to load models
I run into multiple issues when trying to use LMQL in Colab.
When running a query with the verbose flag unset or set to False, I get this error:
Code:
import requests
from pathlib import Path

import lmql

model_file = "/content/zephyr-7b-beta.Q4_K_M.gguf"
if Path(model_file).exists():
    print("Model file exists")
else:
    print("Model file does not exist")
    # download model weights
    model_url = "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf"
    r = requests.get(model_url)
    with open(model_file, "wb") as f:
        f.write(r.content)

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]"
"""

lmql.run_sync(
    query_string,
    model=lmql.model(f"local:llama.cpp:{model_file}",
                     tokenizer='HuggingFaceH4/zephyr-7b-beta', verbose=False))
Output:
[Loading llama.cpp model from llama.cpp:/content/zephyr-7b-beta.Q4_K_M.gguf with {'verbose': False} ]
Exception in thread scheduler-worker:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/lmql/models/lmtp/lmtp_scheduler.py", line 269, in worker
model = LMTPModel.load(self.model_identifier, **self.model_args)
File "/usr/local/lib/python3.10/dist-packages/lmql/models/lmtp/backends/lmtp_model.py", line 41, in load
return LMTPModel.registry[backend_name](model_name, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/lmql/models/lmtp/backends/llama_cpp_model.py", line 22, in __init__
self.llm = Llama(model_path=model_identifier[len("llama.cpp:"):], logits_all=True, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 841, in __init__
with suppress_stdout_stderr(disable=self.verbose):
File "/usr/local/lib/python3.10/dist-packages/llama_cpp/_utils.py", line 23, in __enter__
self.old_stdout_fileno_undup = self.sys.stdout.fileno()
io.UnsupportedOperation: fileno
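For context, this fileno failure is typical of notebook environments: sys.stdout there is a stream object without a real file descriptor, so llama-cpp-python's suppress_stdout_stderr helper (entered when verbose=False) fails as soon as it asks for one. A minimal check along those lines, assuming a Colab/Jupyter runtime:

import io
import sys

# In Colab/Jupyter, sys.stdout is a notebook stream without a real file
# descriptor, so fileno() raises io.UnsupportedOperation, the same error
# llama-cpp-python's suppress_stdout_stderr helper runs into when verbose=False.
try:
    sys.stdout.fileno()
except io.UnsupportedOperation as e:
    print("sys.stdout has no usable fileno:", e)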
Restarting the runtime (or deleting the instance and spinning up a new one) and setting verbose to True at least gets past that portion of code, but it never progresses beyond that point:
Code:
import requests
from pathlib import Path

import lmql

model_file = "/content/zephyr-7b-beta.Q4_K_M.gguf"
if Path(model_file).exists():
    print("Model file exists")
else:
    print("Model file does not exist")
    # download model weights
    model_url = "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf"
    r = requests.get(model_url)
    with open(model_file, "wb") as f:
        f.write(r.content)

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]"
"""

lmql.run_sync(
    query_string,
    model=lmql.model(f"local:llama.cpp:{model_file}",
                     tokenizer='HuggingFaceH4/zephyr-7b-beta', verbose=True))
Output:
[Loading llama.cpp model from llama.cpp:/content/zephyr-7b-beta.Q4_K_M.gguf with {'verbose': True} ]
lmtp generate: [1, 28824, 28747, 1824, 349, 272, 21790, 302, 272, 2296, 4058, 28747, 8789, 1014, 2887, 403, 1215, 1179, 28723, 13940, 28832, 28804, 13, 28741, 28747, 28705] / '<s>Q: What is the sentiment of the following review: ```The food was very good.```?\nA: ' (26 tokens, temperature=0.0, max_tokens=128)
GPU RAM never increases; system RAM usage sometimes increases, but not consistently. It appears the model never actually loads, so no inference ever runs.
I've tried following examples from both the documentation and other sites with no luck. A minimal example of the error can be found in the following Colab notebook: https://colab.research.google.com/drive/1aND43pi3v11fW_2kTYDHLaoxXV69HPWq?usp=sharing
I realize the error with verbosity is a llama-cpp-python bug that has an open issue.
I've updated my example to show the LMQL-specific issue I am encountering. I can load the LLM with llama-cpp-python directly and get completions; however, when I try to use LMQL I still get no output, it just hangs.
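For reference, loading the same GGUF file with llama-cpp-python directly looks roughly like the sketch below; the n_ctx and max_tokens values are arbitrary choices of mine, not taken from the thread:

from llama_cpp import Llama

# Load the GGUF file directly with llama-cpp-python, bypassing LMQL entirely.
llm = Llama(model_path="/content/zephyr-7b-beta.Q4_K_M.gguf", n_ctx=2048, verbose=True)

out = llm(
    "Q: What is the sentiment of the following review: ```The food was very good.```?\nA:",
    max_tokens=32,
    temperature=0.0,
)
print(out["choices"][0]["text"])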
Hi there, I just tried to reproduce this on my workstation, and it seems to all work. Can you make sure to re-install llama.cpp with the correct build flags to enable GPU support?
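For reference, a GPU-enabled rebuild of llama-cpp-python in Colab might look something like the sketch below; the exact CMake flag depends on the llama-cpp-python version (older releases use -DLLAMA_CUBLAS=on, newer ones -DGGML_CUDA=on), so treat the flags as assumptions to adapt:

import os
import subprocess
import sys

# Rebuild llama-cpp-python from source with GPU support enabled.
# The CMake flag name varies by version: -DLLAMA_CUBLAS=on (older) or -DGGML_CUDA=on (newer).
env = dict(os.environ, CMAKE_ARGS="-DLLAMA_CUBLAS=on", FORCE_CMAKE="1")
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--force-reinstall", "--no-cache-dir", "llama-cpp-python"],
    env=env,
    check=True,
)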
My test code:
import requests
from pathlib import Path
import lmql

model_file = "/home/luca/repos/lmql/zephyr-7b-beta.Q5_K_M.gguf"

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]" where len(TOKENS(SENTIMENT)) < 32
"""

result = lmql.run_sync(
    query_string,
    model=lmql.model(f"local:llama.cpp:{model_file}",
                     tokenizer='HuggingFaceH4/zephyr-7b-beta',
                     verbose=False))

print([result])
Note that I put a token constraint on SENTIMENT, just to make sure we don't have termination issues here. For Colab, please note that sometimes the root of problems with termination is asyncio-related, so maybe try running in a standalone script first.
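If a standalone script is not an option, one common notebook workaround for event-loop clashes (a general suggestion, not something verified against LMQL specifically) is to allow nested loops:

# Allow nested asyncio event loops in Colab/Jupyter, so a synchronous entry
# point that starts its own loop does not clash with the notebook's running loop.
import nest_asyncio
nest_asyncio.apply()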
After re-installing llama.cpp, do you still see the same issue?
Not sure why, but I got this error: lmql has no attribute 'run_sync'.
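If run_sync is reported as missing, one thing worth checking (an assumption on my part, not confirmed in the thread) is which LMQL version is actually installed and imported, since the top-level API has changed across releases:

from importlib.metadata import version

import lmql

# Confirm the installed release and that the import is not shadowed by a local
# file named lmql.py; run_sync may simply be absent from older versions (assumption).
print(version("lmql"))
print(lmql.__file__)
print(hasattr(lmql, "run_sync"))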