Override OpenAI API Base with llama.cpp mock server
I have a local server running an OpenAI-compatible API. I simply want all requests that normally go to api.openai.com:443 to go to localhost:8000 instead.
I did see that you should be able to override models for Azure. I am hoping to use that, but it still seems to make calls to openai.com.
import lmql

@lmql.query
async def test():
    '''lmql
    argmax "Hello [WHO]" from my_model
    '''

my_model = lmql.model(
    "openai/gpt-3.5-turbo",
    api_base="http://localhost:8000"
)

items = await test()
items[0]
I still see it getting errors from OpenAI:
Failed with Cannot connect to host api.openai.com:443 ssl:default [nodename nor servname provided, or not known]
OpenAI API: Underlying stream of OpenAI complete() call failed with error <class 'aiohttp.client_exceptions.ClientConnectorError'> Cannot connect to host api.openai.com:443 ssl:default [nodename nor servname provided, or not known] Retrying... (attempt: 0)
Is there a way to override the OpenAI URL?
Hi there :) api_base is reserved for Azure OpenAI configuration only. To change the general endpoint, you can just specify endpoint=<ENDPOINT>. This should probably be aligned so that api_base can also be used for non-Azure endpoints, so thanks for raising this.
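For the snippet above, that would look roughly like this (a sketch; the host and port are placeholders for the local mock server):

import lmql

# same model handle as above, but using endpoint= instead of api_base=
# (URL is a placeholder for the local OpenAI-compatible server)
my_model = lmql.model(
    "openai/gpt-3.5-turbo",
    endpoint="http://localhost:8000"
)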
What mock server implementation are you using? In my experience, true OpenAI API compliance is rare, so there may be other issues, as LMQL assumes e.g. working batching and logit_bias support. Let me know how it goes.
I'm using llama.cpp's server. I wasn't sure if I could use all of the params I'm using with lmql's server, and I couldn't find any docs on it. I use this command to host a version of Llama 70B locally:
export N_GQA=8 && python3 -m llama_cpp.server --model /Users/jward/Projects/llama.cpp/models/llama-2-70b-orca-200k.Q5_K_M.gguf --use_mlock True --n_gpu_layers 1
llama-cpp-python has a bug with n_gqa, so I have to set the env var for it.
In general, I'm able to use OpenAI's Python library if I override openai.api_base.
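For reference, with the pre-1.0 openai Python package that override looks roughly like this (a sketch; the URL and model name are placeholders for the local llama.cpp server):

import openai

# point the pre-1.0 openai client at the local llama.cpp server instead of api.openai.com
openai.api_base = "http://localhost:8000/v1"   # placeholder local URL
openai.api_key = "fakekey"                     # the mock server ignores the key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say 'this is a test'"}],
)
print(response["choices"][0]["message"]["content"])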
This is probably outside of the scope, but I do see some activity with this code:
import lmql
import os

os.environ['OPENAI_API_KEY'] = 'fakekey'

@lmql.query
async def test():
    '''lmql
    argmax
        "Say 'this is a test':[RESPONSE]"
    from
        lmql.model("gpt-4", endpoint="http://localhost:8000")
    '''

items = await test()
items[0]
The server log shows it trying to POST to /v1 and getting a 404, which is the response I expect:
INFO: ::1:52619 - "POST /v1 HTTP/1.1" 404 Not Found
OpenAI's Python lib doesn't try to call /v1. It just hits /v1/chat/completions and works fine:
INFO: ::1:52768 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Yes, the endpoint parameter expects the full path to the resource to hit for completions, e.g. try appending the required /v1/chat/completions. For your lmql serve-model command, you have to prepend the llama.cpp: prefix, otherwise it will try to load your model via transformers. See also https://docs.lmql.ai/en/stable/language/llama.cpp.html#model-server.
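Applied to the snippet above, that would look roughly like this (a sketch; whether the chat or plain completions path is the right one depends on the route your server exposes):

import lmql

# endpoint now carries the full path to the completions resource
# (host, port and path are placeholders following the suggestion above)
m = lmql.model("gpt-4", endpoint="http://localhost:8000/v1/chat/completions")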
I think I was able to get the model to serve using the following, though it doesn't log output:
lmql serve-model llama.cpp:/Users/jward/Projects/llama.cpp/models/llama-2-13b-chat.Q8_0.gguf --use_mlock True --n_gpu_layers 1
It looks like the tokenizer isn't correct. Is there a way to set that?
AssertionError: Cannot set dclib tokenizer to hf-huggyllama/llama-7b because it is already set to tiktoken-gpt2 (cannot use multiple tokenizers in the same process for now)
The latest main now finally supports mixing tokenizers in the same process. I am not sure, however, how this will work with the OpenAI endpoint parameter. I think we hard-code the GPT tokenizers. I have never seen an alternative implementation of the OpenAI API that actually implemented logit_bias, so this never came up before.
Did your lmql serve-model command end up working? Note that there is a --verbose option. I have seen issues like this before, when GPU support was not compiled correctly. Can you test it, e.g. with llama-cpp-python?
I appreciate your patience with me as I jumped between a few topics. This command ended up working for me:
lmql serve-model llama.cpp:/Users/jward/Projects/llama.cpp/models/llama-2-70b-orca-200k.Q5_K_M.gguf --use_mlock True --n_gpu_layers 1 --n_gqa 8 --n_ctx 4096
Verbose output is also very helpful. Thanks again.
I'm running into the same issue.
import os
import lmql

# set the API type based on whether you want to use a completion or chat endpoint
os.environ['OPENAI_API_TYPE'] = 'azure'
os.environ['OPENAI_API_BASE'] = "https://8b78-34-125-163-134.ngrok-free.app"
os.environ['OPENAI_API_KEY'] = "fake_key"

@lmql.query(model='v1')
async def chain_of_thought(question):
    '''lmql
    # Q&A prompt template
    "Q: {question}\n"
    "A: Let's think step by step.\n"
    "[REASONING]"
    "Thus, the answer is:[ANSWER]."

    # return just the ANSWER to the caller
    return ANSWER
    '''

res = await chain_of_thought('Today is the 12th of June, what day was it 1 week ago?')
print(res)
TokenizerNotAvailableError: Failed to locate a suitable tokenizer implementation for 'v1' (Make sure your current environment provides a tokenizer backend like 'transformers', 'tiktoken' or 'llama.cpp' for this model)
If I switch to model='gpt-4' then the llama_cpp server outputs this:
INFO: 35.230.48.2:0 - "POST /openai/deployments/gpt-4/completions?api-version=2023-05-15 HTTP/1.1" 404 Not Found
@tranhoangnguyen03 The first error you get here indicates that LMQL cannot automatically derive a tokenizer from the model name v1. You can fix this by using an lmql.model("v1", tokenizer=<tokenizer name>) object as the model instead.
I tried:
@lmql.query(model=lmql.model("v1",
    tokenizer='HuggingFaceH4/zephyr-7b-alpha',
    api_type="azure",
    api_base="https://932d-34-141-210-25.ngrok-free.app"
))
And got this:
RuntimeError: LMTP client encountered an error: Exception Server disconnected attempting to communicate with lmtp endpoint: http://localhost:8080/. Please check that the endpoint is correct and the server is running.
Then I tried:
@lmql.query(model=lmql.model("v1",
    tokenizer='HuggingFaceH4/zephyr-7b-alpha',
    endpoint="https://932d-34-141-210-25.ngrok-free.app"
))
And got this error:
RuntimeError: LMTP client encountered an error: Exception 403, message='Invalid response status', url=URL('https://932d-34-141-210-25.ngrok-free.app/') attempting to communicate with lmtp endpoint: https://932d-34-141-210-25.ngrok-free.app/. Please check that the endpoint is correct and the server is running.
Am I using the wrong kwarg here?
Ah yes, you have to use openai/v1, so LMQL considers your model an OpenAI model. Without specifying this, it will attempt to load v1 as a HuggingFace model.
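Putting the two suggestions together, the model object would look roughly like this (a sketch; the ngrok URL and tokenizer are the ones from this thread, and the /v1/completions path is an assumption based on the curl example below):

import lmql

# treat the mock server as an OpenAI backend, with an explicit tokenizer
m = lmql.model(
    "openai/v1",
    tokenizer='HuggingFaceH4/zephyr-7b-alpha',
    endpoint="https://932d-34-141-210-25.ngrok-free.app/v1/completions"  # path assumed
)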
In general, are you also trying to use a llama.cpp-based OpenAI mock endpoint? Which model are you trying this with? Let me know, so I can try to reproduce your setup here.
That is correct. I'm running a zephyr-7b-alpha.Q6_K.gguf model on Google Colab, which I tunnel to an ngrok public IP.
Here's an image showing the API endpoints:
Here's an example curl call:
curl -X 'POST' \
'https://932d-34-141-210-25.ngrok-free.app/v1/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
"stop": [
"\n",
"###"
]
}'
I managed to get the server connection to work, via:

argmax(verbose=True)
    "[[INST]]Say 'this is a test':[[/INST]]\n[RESPONSE]"
from
    lmql.model("openai/v1", tokenizer='gpt2', endpoint="<host>/2600/v1/completions")
where
    len(TOKENS(RESPONSE)) < 120 and STOPS_BEFORE(RESPONSE, "[INST]") and not "\n" in RESPONSE
You don't have to use the Azure API configuration; you can actually just specify the endpoint (which includes the /v1/completions suffix). However, unfortunately this does not work as intended, since LMQL uses the echo parameter available with the official OpenAI API, but not with the mock implementation llama.cpp provides. At least from the logs, I can see that llama.cpp does not respect this parameter, i.e. it does not echo the prompt tokens.
This means LMQL for now does not support the mock implementation llama.cpp provides, because it does not implement the API in a fully compliant manner. Hopefully this can be fixed on their end; as far as I skimmed the code, it does seem to implement logit_bias properly, which is typically the harder thing to get right with these kinds of mock APIs. Maybe experimenting some more with echo and then creating an issue over there could be a good way to resolve this.
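A quick way to experiment with this is to hit the completions route directly and check whether the prompt comes back; roughly something like the following (a sketch using the requests library; the URL is a placeholder, while echo and logprobs are standard OpenAI completions parameters):

import requests

# probe whether the mock /v1/completions route honours the OpenAI echo parameter
resp = requests.post(
    "http://localhost:8000/v1/completions",  # placeholder for the llama.cpp server URL
    json={
        "prompt": "Say 'this is a test':",
        "max_tokens": 1,
        "echo": True,     # a compliant server returns the prompt tokens in the completion
        "logprobs": 1,
    },
)
data = resp.json()
# if echo is respected, the returned text starts with the prompt itself
print(data["choices"][0]["text"])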
Workaround: Until then, I would encourage you to use LMQL's official llama.cpp backend, which can also be served via Colab and then accessed locally. In my experiments this works seamlessly using
lmql.model("llama.cpp:/home/luca/repos/models/zephyr-7b-alpha.Q6_K.gguf", endpoint="<HOST>:<PORT>", tokenizer="HuggingFaceH4/zephyr-7b-alpha")
and
lmql serve-model llama.cpp:/home/luca/repos/models/zephyr-7b-alpha.Q6_K.gguf --n_gpu_layers 30 --host <HOST> --port <PORT>
If you can't launch this via the command line, you can also use lmql.serve; see this snippet for details.
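From a Colab cell, that could look roughly like the following; this is only a sketch that assumes lmql.serve accepts the same options as the lmql serve-model CLI (the keyword names and the model path are assumptions, not taken from the snippet linked above):

import lmql

# sketch: serve the llama.cpp backend from Python instead of the CLI
# (assumes lmql.serve mirrors the `lmql serve-model` flags; path/host/port are placeholders)
lmql.serve(
    "llama.cpp:/path/to/zephyr-7b-alpha.Q6_K.gguf",
    host="0.0.0.0",
    port=8080,
    n_gpu_layers=30,
)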
@lbeurerkellner I've read over the Language Model Transport Protocol (LMTP) documentation, and it seems to me that the server is designed to work with a client deployed locally on the same machine? Does that mean there's no support for an external model endpoint at the moment?
LMTP is typically used with the client and server on different machines (e.g. the server being some beefy GPU machine and the client being a laptop). Note, however, that LMTP does not implement authentication mechanisms, so you will want to protect the communication with e.g. an SSH tunnel.
Did anyone manage to make it work, the llama.cpp server with LMQL? llama-cpp-python is full of bugs; using the llama.cpp server would solve a lot of problems.
Thank you