
LMQL fails to use llama-cpp-python to load models

Open klutzydrummer opened this issue 1 year ago • 3 comments

I run into multiple issues when trying to use LMQL in Colab.

When running a query with the verbose flag unset or set to False, I get this error:

Code:

import requests
from pathlib import Path
import lmql

model_file = "/content/zephyr-7b-beta.Q4_K_M.gguf"
if Path(model_file).exists():
    print("Model file exists")
else:
    print("Model file does not exist")
    # download model weights
    model_url = "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf"
    r = requests.get(model_url)
    with open(model_file, "wb") as f:
        f.write(r.content)

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]"
"""

lmql.run_sync(
    query_string, 
    model = lmql.model(f"local:llama.cpp:{model_file}", 
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta', verbose=False))

Output:

[Loading llama.cpp model from  llama.cpp:/content/zephyr-7b-beta.Q4_K_M.gguf  with {'verbose': False} ]
Exception in thread scheduler-worker:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lmql/models/lmtp/lmtp_scheduler.py", line 269, in worker
    model = LMTPModel.load(self.model_identifier, **self.model_args)
  File "/usr/local/lib/python3.10/dist-packages/lmql/models/lmtp/backends/lmtp_model.py", line 41, in load
    return LMTPModel.registry[backend_name](model_name, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lmql/models/lmtp/backends/llama_cpp_model.py", line 22, in __init__
    self.llm = Llama(model_path=model_identifier[len("llama.cpp:"):], logits_all=True, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 841, in __init__
    with suppress_stdout_stderr(disable=self.verbose):
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/_utils.py", line 23, in __enter__
    self.old_stdout_fileno_undup = self.sys.stdout.fileno()
io.UnsupportedOperation: fileno
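
For context, this fileno error seems to come from the notebook environment itself: Colab replaces sys.stdout with a stream object that has no real file descriptor, so llama-cpp-python's output-suppression path (used when verbose=False) fails on sys.stdout.fileno(). A minimal check of that assumption:

# Sketch: show that sys.stdout in Colab/Jupyter has no real file descriptor,
# which is what llama_cpp's suppress_stdout_stderr trips over when verbose=False.
import sys

try:
    print("stdout fileno:", sys.stdout.fileno())
except Exception as e:
    print("stdout has no usable fileno:", type(e).__name__, e)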

Restarting the runtime, or deleting it and spinning up a new instance, and setting verbose to True at least gets past that portion of code, but it never progresses beyond that:

Code:

import requests
from pathlib import Path
import lmql

model_file = "/content/zephyr-7b-beta.Q4_K_M.gguf"
if Path(model_file).exists():
    print("Model file exists")
else:
    print("Model file does not exist")
    # download model weights
    model_url = "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf"
    r = requests.get(model_url)
    with open(model_file, "wb") as f:
        f.write(r.content)

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]"
"""

lmql.run_sync(
    query_string, 
    model = lmql.model(f"local:llama.cpp:{model_file}", 
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta', verbose=True))

Output:

[Loading llama.cpp model from llama.cpp:/content/zephyr-7b-beta.Q4_K_M.gguf  with  {'verbose': True} ]
lmtp generate: [1, 28824, 28747, 1824, 349, 272, 21790, 302, 272, 2296, 4058, 28747, 8789, 1014, 2887, 403, 1215, 1179, 28723, 13940, 28832, 28804, 13, 28741, 28747, 28705] / '<s>Q: What is the sentiment of the following review: ```The food was very good.```?\nA: ' (26 tokens, temperature=0.0, max_tokens=128)

GPU RAM never increases; system RAM usage sometimes goes up, but not consistently. It appears the model never actually loads and therefore never does any inference.

I've tried following examples from both the documentation and other sites with no luck. A minimal example of the error can be found in the following Colab notebook: https://colab.research.google.com/drive/1aND43pi3v11fW_2kTYDHLaoxXV69HPWq?usp=sharing

klutzydrummer avatar Dec 13 '23 18:12 klutzydrummer

I realize the error with verbosity is a llama-cpp-python bug that has an open issue.

I've updated my example to show the LMQL-specific issue I am encountering. I can load the LLM with llama-cpp-python directly and get completions. However, when I try to use LMQL I still get no output; it just hangs.
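
For reference, the kind of direct check that works for me looks roughly like this (a minimal sketch; the prompt and n_gpu_layers value are illustrative, not the exact code from my notebook):

# Direct llama-cpp-python check, bypassing LMQL entirely.
# Uses the same GGUF file as above; n_gpu_layers=-1 tries to offload
# all layers to the GPU (use 0 for CPU-only).
from llama_cpp import Llama

llm = Llama(model_path="/content/zephyr-7b-beta.Q4_K_M.gguf",
            n_gpu_layers=-1, verbose=True)
out = llm("Q: What is the sentiment of the review: The food was very good.\nA:",
          max_tokens=32, temperature=0.0)
print(out["choices"][0]["text"])

This completes fine, whereas the LMQL call above never returns.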

klutzydrummer avatar Dec 15 '23 16:12 klutzydrummer

Hi there, I just tried to reproduce this on my workstation, and it all seems to work. Can you make sure to re-install llama.cpp with the correct build flags to enable GPU support?
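
For example, on a Colab GPU runtime, something along these lines (the exact flag may differ between llama-cpp-python versions, so treat this as a sketch):

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python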

My test code:

import requests
from pathlib import Path
import lmql

model_file = "/home/luca/repos/lmql/zephyr-7b-beta.Q5_K_M.gguf"

query_string = """
"Q: What is the sentiment of the following review: ```The food was very good.```?\\n"
"A: [SENTIMENT]" where len(TOKENS(SENTIMENT)) < 32
"""

result = lmql.run_sync(
    query_string, 
    model = lmql.model(f"local:llama.cpp:{model_file}", 
        tokenizer = 'HuggingFaceH4/zephyr-7b-beta', 
        verbose=False
    ))

print([result])

Note that I put a token constraint on SENTIMENT, just to make sure we don't run into termination issues here. For Colab, note that termination problems are sometimes asyncio-related, so maybe try running it as a standalone script first.
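
If the hang does turn out to be asyncio-related in the notebook, one workaround I have seen (a sketch, assuming the third-party nest_asyncio package is acceptable in your environment) is to allow nested event loops before issuing the query:

# Sketch: notebooks already run an asyncio event loop; nest_asyncio.apply()
# lets lmql.run_sync drive its own loop on top of it.
# Install first with: pip install nest_asyncio
import nest_asyncio
nest_asyncio.apply()

import lmql
# ... then run the query exactly as above ...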

After re-installing llama.cpp, do you still see the same issue?

lbeurerkellner avatar Dec 17 '23 15:12 lbeurerkellner

(Screenshot 2024-01-21 20:33:55)

Not sure why I get this error: lmql has no attribute 'run_sync'.
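
Is the following a reasonable check to see which lmql is actually being imported and whether it exposes run_sync? (An outdated version, or a local file named lmql.py shadowing the package, could both explain the missing attribute; this is just a diagnostic sketch.)

# Diagnostic sketch: report the installed lmql version, where it is
# imported from, and whether run_sync is present.
import importlib.metadata
print("lmql version:", importlib.metadata.version("lmql"))

import lmql
print("imported from:", lmql.__file__)
print("has run_sync:", hasattr(lmql, "run_sync"))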

lawyinking avatar Jan 21 '24 12:01 lawyinking