instructor
Response with self-hosted model very slow / Timeout
- [x] This is actually a bug report.
- [x] I am not getting good LLM Results
- [ ] I have tried asking for help in the community on discord or discussions and have not received a response.
- [x] I have tried searching the documentation and have not found an answer.
What Model are you using?
- [ ] gpt-3.5-turbo
- [ ] gpt-4-turbo
- [ ] gpt-4
- [x] Other (please specify)
Describe the bug
I get a gateway timeout or a response only after more than a minute. The normal response time using mistral7B alone, without instructor, is far faster (Ollama running on a 4090). Using instructor I get the following log entries on the server running ollama:
Error #01: write tcp 127.0.0.1:5000->127.0.0.1:49900: write: broken pipe
[GIN] 2024/02/19 - 11:35:05 | 200 | 1m0s | 10.212.134.177 | POST "/v1/chat/completions"
Error #01: write tcp 127.0.0.1:5000->127.0.0.1:51430: write: broken pipe
[GIN] 2024/02/19 - 11:36:06 | 200 | 1m0s | 10.212.134.177 | POST "/v1/chat/completions"
Error #01: write tcp 127.0.0.1:5000->127.0.0.1:58504: write: broken pipe
[GIN] 2024/02/19 - 11:37:08 | 200 | 1m0s | 10.212.134.177 | POST "/v1/chat/completions"
Error #01: write tcp 127.0.0.1:5000->127.0.0.1:43210: write: broken pipe
I don't get the broken-pipe messages with inference only.
To Reproduce
Host a Mistral 7B model with ollama version 1.25.0 and request the extraction as in the docs. (See screenshot.)
Expected behavior
A response within 1 to 10 seconds.
Screenshots
- Can you add a timer before and after the create call and time that specific call? (A sketch of this is below.)
- What happens when you delete response_model and just say "return json"?
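A minimal sketch of both suggestions, assuming an instructor-patched client and a placeholder UserDetail pydantic model like the ones that appear later in this thread:
import time

# Time only the create() call itself; remove response_model for the
# plain "return json" comparison.
start = time.perf_counter()
resp = client.chat.completions.create(
    model="mistral:latest",
    response_model=UserDetail,  # drop this line for the comparison run
    messages=[{"role": "user", "content": "Extract Jason is 25 years old"}],
)
print(f"create() took {time.perf_counter() - start:.2f}s")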
The return value without specifying the response_model:
ChatCompletion(id='chatcmpl-679', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' The statement "Jason is 25 years old" indicates that the name "Jason" is associated with the age of 25 years. It does not provide any additional information beyond this simple fact.', role='assistant', function_call=None, tool_calls=None))], created=1708356175, model='mistral', object='chat.completion', system_fingerprint='fp_ollama', usage=CompletionUsage(completion_tokens=44, prompt_tokens=17, total_tokens=61))
I guess the LLM's response is not what it should be? Time taken: 0.4144 seconds. (Second run; the first was 1.3 seconds.)
Another try with the response model led to a timeout.
Interesting. Will investigate. Thanks
I am experiencing long response times as well. How do I look at the prompt being delivered to the ollama model?
Confirming that patching in ollama causes intermittent hanging/running the GPU at 100% until I CTRL-C out of the app. Switching to OpenAI seems to fix the problem 100%
I got this after letting it sit spinning for about 20 minutes: ollama_instructor_ermsg.txt
Possibly related: Ollama #2709
What happens when you turn on logging? https://jxnl.github.io/instructor/concepts/logging/
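For reference, a minimal sketch of turning on logging, assuming instructor emits its debug output through the standard library logger as the linked docs describe; this also answers the earlier question about seeing the prompt delivered to the ollama model:
import logging

# Print instructor's debug output, including the prompts and raw
# responses exchanged with the model.
logging.basicConfig(level=logging.DEBUG)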
Hey y'all, this issue has been popping up recently in Ollama. The likely culprit is that you need to add a hint to the system prompt that tells the LLM to respond in JSON. Do people still run into this issue with a system prompt like this?
client.chat.completions.create(
    model="mistral:latest",
    response_model=UserDetail,
    messages=[
        {"role": "system", "content": "respond in JSON only"},
        {"role": "user", "content": "Extract Jason is 25 years old"},
    ],
)
instructor should be adding that to the prompt
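For anyone trying to reproduce the suggestion, a self-contained sketch of the snippet above; the client setup, the UserDetail model, the mode choice, and the Ollama endpoint are assumptions, not part of the original comment:
import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserDetail(BaseModel):
    name: str
    age: int

# Ollama exposes an OpenAI-compatible endpoint; the api_key value is ignored.
client = instructor.patch(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,
)

user = client.chat.completions.create(
    model="mistral:latest",
    response_model=UserDetail,
    messages=[
        {"role": "system", "content": "respond in JSON only"},
        {"role": "user", "content": "Extract Jason is 25 years old"},
    ],
)
print(user)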
Interestingly, this seems to happen with the Mistral model, but with exactly the same code and config with llama2 I'm not hitting this issue.
I ran the example with timing: on one submission it completed in 5s, but on another, captured below, it took 20 minutes.
%%time
resp = client.chat.completions.create(
    model="mistral:instruct",
    messages=[
        {
            "role": "user",
            "content": "Tell me about the Harry Potter",
        }
    ],
    response_model=Character,
)
print(resp.model_dump_json(indent=2))
{
  "name": "Harry Potter",
  "age": 34,
  "fact": [
    "The main character in J.K. Rowling's Harry Potter series",
    "Born to Jewish parents Anne and James Potter on July 31",
    "Survived the killing curse as an infant and was brought up by his Muggle aunt and uncle, the Dursleys",
    "Attended Hogwarts School of Witchcraft and Wizardry from 1991 to 1997, where he became a Gryffindor student",
    "Became a famous Quidditch player while at Hogwarts",
    "Friends with Hermione Granger and Ron Weasley",
    "Fought against Lord Voldemort, who sought to kill Harry because of his connection to Voldemort's past defeat by the Curse of Avada Kedavra",
    "Became a powerful wizard and a prominent figure in the fight against evil",
    "Eventually became Headmaster at Hogwarts after the departure of Albus Dumbledore"
  ]
}
CPU times: user 19.6 ms, sys: 5.72 ms, total: 25.3 ms
Wall time: 20min 10s
ollama version is 0.1.28
instructor==0.6.4
Some more local testing: it responds in a normal (<10s) amount of time if I add the above suggestion (https://github.com/jxnl/instructor/issues/445#issuecomment-1979240193) to the system prompt.
resp = client.chat.completions.create(
    model="mistral:instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that responds only in JSON.",
        },
        {
            "role": "user",
            "content": "Tell me about Harry Potter",
        },
    ],
    response_model=Character,
)
print(resp.model_dump_json(indent=2))
Oh interesting. @pmbaumgartner, are you able to make a PR to update the MD_JSON mode?
@jxnl I can take a look. Could you point me to the right place in the codebase to start? I've been digging through the codebase and I'm not totally clear on how this works with Ollama. For example, I'm confused about why MD_JSON mode is the right mode to investigate here, since we aren't explicitly providing that.
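For context, a sketch of how a mode is normally chosen explicitly when patching (the Mode values are instructor's; whether the unspecified default ends up in the MD_JSON code path here is exactly the open question):
import instructor
from openai import OpenAI

# MD_JSON asks the model, via prompting, for JSON inside a markdown code
# block; Mode.JSON instead relies on the provider's JSON response format.
client = instructor.patch(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.MD_JSON,
)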
I have another thought on the long response times that might help with this issue too. The Outlines docs mention that smaller models struggle with deciding what kind of whitespace to use. I wonder if there's some way to adopt that finding to improve JSON generation with local models as well.
https://github.com/jxnl/instructor/blob/4c247185d3e9ad6df786bb6a5f1157da3ea0b1a4/instructor/process_response.py#L246
@jxnl Thanks! So instructor isn't doing anything with ollama's JSON output mode? It's just done through prompting?
With ollama there's json mode but not json_schema mode.
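A small sketch of that distinction, using ollama's native API directly (sketched from memory of the ollama docs, not something instructor calls): JSON mode only guarantees syntactically valid JSON, while a json_schema mode would constrain the output to a specific shape, which is presumably why instructor still validates against the pydantic model afterwards.
import requests

# ollama's native chat API: format="json" enables JSON mode, but there is no
# way to pass a schema, so the shape of the JSON is still up to the model.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral:instruct",
        "format": "json",
        "stream": False,
        "messages": [
            {"role": "user", "content": "Extract name and age: Jason is 25 years old."},
        ],
    },
)
print(resp.json()["message"]["content"])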