
Response with self hosted model very slow / Timeout

Open thilasx opened this issue 1 year ago • 17 comments

  • [x] This is actually a bug report.
  • [x] I am not getting good LLM Results
  • [ ] I have tried asking for help in the community on discord or discussions and have not received a response.
  • [x] I have tried searching the documentation and have not found an answer.

What Model are you using?

  • [ ] gpt-3.5-turbo
  • [ ] gpt-4-turbo
  • [ ] gpt-4
  • [x] Other (please specify)

Describe the bug
I get a gateway timeout, or a response only after more than a minute. The normal response time using plain Mistral 7B without instructor is far faster (Ollama running on a 4090). Using instructor, I get the following log entries on the server running Ollama:

Error #01: write tcp 127.0.0.1:5000->127.0.0.1:49900: write: broken pipe
[GIN] 2024/02/19 - 11:35:05 | 200 | 1m0s | 10.212.134.177 | POST "/v1/chat/completions"
Error #01: write tcp 127.0.0.1:5000->127.0.0.1:51430: write: broken pipe
[GIN] 2024/02/19 - 11:36:06 | 200 | 1m0s | 10.212.134.177 | POST "/v1/chat/completions"
Error #01: write tcp 127.0.0.1:5000->127.0.0.1:58504: write: broken pipe
[GIN] 2024/02/19 - 11:37:08 | 200 | 1m0s | 10.212.134.177 | POST "/v1/chat/completions"
Error #01: write tcp 127.0.0.1:5000->127.0.0.1:43210: write: broken pipe

I don't get the broken pipe messages with inference only.

To Reproduce
Host a Mistral 7B model with Ollama version 1.25.0 and request the extraction as shown in the docs. (See screenshot.)

Expected behavior
A response within 1 to 10 seconds.

Screenshots
(screenshot attached; not reproduced here)

thilasx avatar Feb 19 '24 10:02 thilasx
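Since the screenshot with the repro code is not reproduced here, a minimal sketch of the docs-style extraction against Ollama's OpenAI-compatible endpoint might look roughly like this. The base_url, model tag, and the UserDetail fields are assumptions (the reporter's setup apparently sits behind a proxy on port 5000):

import instructor
from openai import OpenAI
from pydantic import BaseModel


class UserDetail(BaseModel):
    name: str
    age: int


# Assumed endpoint: Ollama's OpenAI-compatible API on the default port.
client = instructor.patch(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,
)

user = client.chat.completions.create(
    model="mistral:latest",
    response_model=UserDetail,
    messages=[{"role": "user", "content": "Extract Jason is 25 years old"}],
)
print(user)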

  1. Can you add a timer before and after the create call and time that specific call? (A timing sketch follows below.)
  2. What happens when you delete response_model and just say "return JSON"?

jxnl avatar Feb 19 '24 14:02 jxnl
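For the timing suggestion in point 1, a minimal sketch (reusing the client and UserDetail assumed in the reconstruction above) that times only the create call:

import time

start = time.perf_counter()
user = client.chat.completions.create(
    model="mistral:latest",
    response_model=UserDetail,
    messages=[{"role": "user", "content": "Extract Jason is 25 years old"}],
)
# Compare the elapsed time with and without response_model.
print(f"create() took {time.perf_counter() - start:.2f}s")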

The return value without specifying the response_model:

ChatCompletion(id='chatcmpl-679', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' The statement "Jason is 25 years old" indicates that the name "Jason" is associated with the age of 25 years. It does not provide any additional information beyond this simple fact.', role='assistant', function_call=None, tool_calls=None))], created=1708356175, model='mistral', object='chat.completion', system_fingerprint='fp_ollama', usage=CompletionUsage(completion_tokens=44, prompt_tokens=17, total_tokens=61))

I guess the LLM's response is not what it should be? Time taken: 0.4144 seconds (second run; the first was 1.3 seconds).

Another try with the response model led to a timeout.

thilasx avatar Feb 19 '24 15:02 thilasx

Interesting. Will investigate. Thanks

jxnl avatar Feb 19 '24 16:02 jxnl

I am experiencing long response times as well. How do I look at the prompt being delivered to the ollama model?

MeDott29 avatar Feb 28 '24 02:02 MeDott29

Confirming that patching with Ollama causes intermittent hanging and running the GPU at 100% until I Ctrl-C out of the app. Switching to OpenAI seems to fix the problem 100%.

tallestmex-marigoldlabs avatar Mar 04 '24 20:03 tallestmex-marigoldlabs

I got this after letting it sit spinning for about 20 min: ollama_instructor_ermsg.txt (attached)

Possibly related: Ollama #2709

jimmy6DOF avatar Mar 05 '24 12:03 jimmy6DOF

what happens when you turn on logging? https://jxnl.github.io/instructor/concepts/logging/

jxnl avatar Mar 05 '24 14:03 jxnl
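Per the linked logging docs, turning on debug logging with the standard library is usually enough to see the final messages (including any injected schema or JSON hints) that the patched client sends to the server, which also addresses the earlier question about inspecting the prompt delivered to the Ollama model. A minimal sketch:

import logging

# Debug-level logs surface the request payload sent by the patched client.
logging.basicConfig(level=logging.DEBUG)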

Hey y'all, this issue has been popping up recently in Ollama. The likely culprit is that you need to add a hint to the system prompt that tells the LLM to respond in JSON. Do people still run into this issue with a system prompt like this?

client.chat.completions.create(
    model="mistral:latest",
    response_model=UserDetail,
    messages=[
        {"role": "system", "content": "respond in JSON only"},
        {"role": "user", "content": "Extract Jason is 25 years old"},
    ],
)

BruceMacD avatar Mar 05 '24 17:03 BruceMacD

instructor should be adding that to the prompt

jxnl avatar Mar 05 '24 17:03 jxnl

Interestingly, this seems to happen with the Mistral model, but with exactly the same code and config with llama2 I'm not hitting this issue.

avyfain avatar Mar 06 '24 23:03 avyfain

I ran the example with timing: on one submission it completed in 5 s, but on another, captured below, it took 20 min.

%%time
resp = client.chat.completions.create(
    model="mistral:instruct",
    messages=[
        {
            "role": "user",
            "content": "Tell me about the Harry Potter",
        }
    ],
    response_model=Character,
)
print(resp.model_dump_json(indent=2))
{
  "name": "Harry Potter",
  "age": 34,
  "fact": [
    "The main character in J.K. Rowling's Harry Potter series",
    "Born to Jewish parents Anne and James Potter on July 31",
    "Survived the killing curse as an infant and was brought up by his Muggle aunt and uncle, the Dursleys",
    "Attended Hogwarts School of Witchcraft and Wizardry from 1991 to 1997, where he became a Gryffindor student",
    "Became a famous Quidditch player while at Hogwarts",
    "Friends with Hermione Granger and Ron Weasley",
    "Fought against Lord Voldemort, who sought to kill Harry because of his connection to Voldemort's past defeat by the Curse of Avada Kedavra",
    "Became a powerful wizard and a prominent figure in the fight against evil",
    "Eventually became Headmaster at Hogwarts after the departure of Albus Dumbledore"
  ]
}
CPU times: user 19.6 ms, sys: 5.72 ms, total: 25.3 ms
Wall time: 20min 10s
ollama version is 0.1.28
instructor==0.6.4

pmbaumgartner avatar Mar 11 '24 02:03 pmbaumgartner
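The Character response model isn't shown in that comment; inferring its shape from the printed JSON, it presumably looks something like this sketch (field names and types are assumptions):

from typing import List

from pydantic import BaseModel


class Character(BaseModel):
    name: str
    age: int
    fact: List[str]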

Some more local testing: it responds in a normal (<10 s) amount of time if I add the above suggestion (https://github.com/jxnl/instructor/issues/445#issuecomment-1979240193) to the system prompt.

resp = client.chat.completions.create(
    model="mistral:instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that responsds only in JSON.",
        },
        {
            "role": "user",
            "content": "Tell me about Harry Potter",
        },
    ],
    response_model=Character,
)
print(resp.model_dump_json(indent=2))

pmbaumgartner avatar Mar 11 '24 02:03 pmbaumgartner

oh interesting @pmbaumgartner are you able to make a PR to update the MD_JSON mode?

jxnl avatar Mar 11 '24 03:03 jxnl
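For context on the mode question, the mode is normally chosen explicitly when patching the client; a sketch with the two JSON-oriented values from instructor's Mode enum (the endpoint is an assumption):

import instructor
from openai import OpenAI

client = instructor.patch(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,       # ask the model for a bare JSON object
    # mode=instructor.Mode.MD_JSON,  # ask for JSON wrapped in a markdown code block
)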

@jxnl I can take a look. Could you point me to the right place in the codebase to start? I've been digging through it and I'm not totally clear on how this works with Ollama. For example, I'm confused about why MD_JSON mode is the right mode to investigate here, since we aren't explicit about providing that.

I have another thought on the long response times that might help with this issue too. The Outlines docs mention that smaller models struggle with deciding what kind of whitespace to use. I wonder if there's some way to adopt that finding to improve JSON generation with local models as well.

pmbaumgartner avatar Mar 15 '24 09:03 pmbaumgartner

https://github.com/jxnl/instructor/blob/4c247185d3e9ad6df786bb6a5f1157da3ea0b1a4/instructor/process_response.py#L246

jxnl avatar Mar 15 '24 17:03 jxnl
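For readers following that link, the prompt-based approach boils down to injecting the response model's JSON schema into the messages and then validating the reply; a conceptual sketch only, not instructor's actual code:

from pydantic import BaseModel


def build_json_messages(model: type[BaseModel], user_content: str) -> list[dict]:
    # Put the target model's JSON schema into a system message so the
    # LLM knows exactly which keys and types to emit.
    schema = model.model_json_schema()
    system = f"Return only a JSON object that matches this JSON schema:\n{schema}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_content},
    ]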

@jxnl Thanks! So instructor isn't doing anything with ollama's JSON Output mode? It's just done through prompting?

pmbaumgartner avatar Mar 15 '24 17:03 pmbaumgartner

With ollama there's json mode but not json_schema mode.

jxnl avatar Mar 15 '24 17:03 jxnl
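For comparison, Ollama's own JSON mode (which constrains output to valid JSON but not to a particular schema) is exposed on the native API via format="json"; a sketch using the /api/chat endpoint, with the URL and model tag as assumptions:

import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mistral:latest",
        "format": "json",   # valid JSON, but no schema enforcement
        "stream": False,
        "messages": [{"role": "user", "content": "Extract name and age: Jason is 25 years old."}],
    },
    timeout=120,
)
print(resp.json()["message"]["content"])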