ollama-python
504 Gateway Timeout - The server didn't respond in time
I don't know why, but I'm encountering this problem with the library. Here is my simple script:
import ollama

client = ollama.Client(host=llm_config["base_url"], timeout=600)
client.chat(model=config["ollama"]["model"], messages=[{
    "role": "user",
    "content": "Why is the sky blue?"
}])
Where llm_config["base_url"] is the Ollama server URL (it's a serverless GPU), which I can reach successfully from open-webui and even use to query the model without issues. The model I'm using is qwen2.5:32b-instruct-q4_K_M and the GPU is an RTX A6000.
The traceback (client-side) is the following:
Traceback (most recent call last):
File "/mnt/shared/devilteo911/cvr-agent/.venv/lib/python3.11/site-packages/ollama/_client.py", line 236, in chat
return self._request_stream(
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/shared/devilteo911/cvr-agent/.venv/lib/python3.11/site-packages/ollama/_client.py", line 99, in _request_stream
return self._stream(*args, **kwargs) if stream else self._request(*args, **kwargs).json()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/shared/devilteo911/cvr-agent/.venv/lib/python3.11/site-packages/ollama/_client.py", line 75, in _request
raise ResponseError(e.response.text, e.response.status_code) from None
ollama._types.ResponseError: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
and this is what I see on the server side:
[GIN] 2024/11/07 - 22:04:21 | 500 | 50.001124922s | xx.xx.xx.xx | POST "/api/chat"
It happens every time after 50 seconds, even though the timeout is set to 600 seconds. Am I missing something?
I have the same issue
Hey @devilteo911 - have you tried not setting a timeout to see whether there's an issue on the server side regardless? I'm trying to narrow down whether some information isn't passing all the way through to the server, or whether there's an error on the server side.
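Something like this, for example (the host here is a placeholder; omitting timeout means the client shouldn't be the one cancelling the request):

import ollama

# Same request as above, but with no client-side timeout, so any 504
# has to come from the server or an intermediate proxy rather than
# from the client giving up.
client = ollama.Client(host="http://my-ollama-host:11434")  # placeholder host
response = client.chat(
    model="qwen2.5:32b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response)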
Thanks!
Hey @ParthSareen,
The issue seems to occur only on the first call, which consistently results in a 504 error. Subsequent calls with the same input perform the generation without any problems.
I believe the problem is related to the time it takes to generate the first token, particularly during a cold start of my service. During a cold start, the model needs to be downloaded from Hugging Face, as my serverless GPU provider lacks permanent storage to keep the model locally.
I hope this clarifies the issue.
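For anyone hitting the same cold-start 504, a minimal sketch of a possible client-side workaround is to retry the first call while the model warms up (host and retry values here are placeholders, and this assumes the 504 really is just the cold start):

import time
import ollama

client = ollama.Client(host="http://my-ollama-host:11434", timeout=600)  # placeholder host

def chat_with_retry(messages, retries=3, backoff=30):
    # The first call after a cold start can 504 while the model is still
    # being downloaded and loaded, so retry a few times before giving up.
    for attempt in range(retries):
        try:
            return client.chat(model="qwen2.5:32b-instruct-q4_K_M", messages=messages)
        except ollama.ResponseError as e:
            if e.status_code != 504 or attempt == retries - 1:
                raise
            time.sleep(backoff)

print(chat_with_retry([{"role": "user", "content": "Why is the sky blue?"}]))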
I experienced the issue intermittently when I looped it like this:
$ while true ; do curl http://localhost:11434/api/chat -d '{ "model": "llama3.1:70b", "keep_alive": "0", "options": {"num_thread": 16}, "messages": [{"role": "user", "content": "Why is the sky blue?"}]}' ; sleep 1 ; done
...
...
<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
...
...
Is there a way to increase the timeout? I had tried starting it like this, but it still intermittently timed out 😕
# OLLAMA_LOAD_TIMEOUT=30m0s /bin/ollama serve
FYI: I tried to use "keep_alive": "0" but got an error. When I changed it to "keep_alive": 0 (note that I removed the quotes from around the 0), the immediate errors ended, but I still got the 504. I even tried increasing the value to 360000 (I assumed that these are milliseconds) and I still got a 504. Is there something we are missing here? 🤔
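For what it's worth, my reading of the Ollama docs is that keep_alive takes a duration string (like "10m"), a bare number of seconds (not milliseconds), or a negative number to keep the model loaded indefinitely. Through this library it can be passed straight to chat(); a minimal sketch (host and model are placeholders):

import ollama

client = ollama.Client(host="http://localhost:11434", timeout=600)  # placeholder host
client.chat(
    model="llama3.1:70b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    keep_alive=-1,  # negative value: keep the model loaded indefinitely
)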
🤗 For anybody using nginx as their reverse proxy: I found that the following solution from https://stackoverflow.com/questions/43832389/what-can-i-do-to-fix-a-504-gateway-timeout-error, combined with the above advice about including "keep_alive", made the 504 vanish for me. 🤗
The following is from the Stack Overflow post:
For those experiencing this error, that have access to their app / site's hosting environment, which is proxying through NGINX, this issue can be fixed by extending the timeout for API requests.
In your /etc/nginx/sites-available/default or /etc/nginx/nginx.conf, add the following variables:

proxy_connect_timeout 240;
proxy_send_timeout 240;
proxy_read_timeout 240;
send_timeout 240;

Run sudo nginx -t to check the syntax, and then sudo service nginx restart. This should effectively quadruple the time before NGINX will time out your API requests (the default being 60 seconds, our new timeout being 240 seconds).
I have arrived here after already changing nginx timeouts. Using a containerised app in k8s, I have increased the ingress timeouts, gunicorn worker timeouts, and nginx proxy timeouts, all to 600 seconds. I still see the 504 gateway timeout error, and ollama appears to be the only place left.
Is there any value controlling timeout on ollama serve API calls?
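One mitigation worth trying in a proxy chain like this is streaming the response, so tokens flow as they're generated and each proxy's read timeout sees regular traffic instead of one long silent request. A sketch (host and model are placeholders; note this won't help if the timeout fires before the first token, e.g. during a cold model load):

import ollama

client = ollama.Client(host="http://my-ollama-host:11434", timeout=600)  # placeholder host

for chunk in client.chat(
    model="llama3.1:70b",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,  # tokens arrive incrementally, keeping the connection active
):
    print(chunk["message"]["content"], end="", flush=True)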