azure-search-openai-demo
Website dies randomly when asking questions
I've deployed the project to Azure as instructed, using azd up. I did not use any previous Azure resources; everything was made by the supplied scripts in the repository.
What's going on? Inspecting the live log stream of the backend application, I don't see anything special. But the Azure website stops responding at certain points in time, specifically after asking a question in chat. I'm not sure what's happening behind the scenes.
I have the same problem today! I did not see this behavior before.
I cannot find any regularity in this. It doesn't necessarily happen when spamming the chatbot (it can also happen on the 4th question of the day), and it isn't tied to any specific large or unusual requests. I can't reproduce it on purpose, but it happened several times today.
Hmm. Have you tried opening the Network console in the browser to see if an HTTP request is going through to the backend? A successful request looks like:
After inspecting the HTTP server errors, I noticed a few 504 (Gateway Timeout) and 500 (Internal Server Error) responses, but I can't find any information as to why they were produced.
Yes, many of the requests to the /chat endpoint end up with an HTTP status code of 200 (OK). But there comes a time when either:
- The website crashes after asking a question.
- The response takes up to 3-5 minutes for a specific /chat request. While that's happening, the whole site becomes unresponsive, as if Python has a single thread serving the whole thing...
Could anyone provide more insights as to what's happening? Or any possible workarounds to this?
Hi @kikaragyozov. After I deployed the template I saw somewhat similar issues: simultaneous requests took a long time to return, and then returned in sequence as if there was only one thread serving them. I didn't see the 500 or 504 errors you mention.
I configured diagnostic settings on the App Service to send logs to a Log Analytics workspace [1]. In those logs I found that when the App Service starts up, gunicorn (the WSGI server that App Service uses to serve Python apps) is using synchronous workers and only starting one worker:
I'm still learning about App Service, but if you're seeing the same as me, then it may be possible to change the gunicorn configuration [2], and there's also an option to scale up the App Service plan.
Finally, you might want to add timeouts to API calls, e.g. adding a request_timeout parameter to the openai.Completion.create() calls within chatreadretrieveread.py [3].
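For illustration, here's a minimal sketch of that request_timeout idea with the openai Python library; the engine, prompt, and other arguments are placeholders, not the demo's actual values:

```python
import openai

# Illustrative only: the engine and prompt are placeholders, not the demo's
# real arguments. The key addition is request_timeout, so a stuck call fails
# fast instead of tying up a synchronous worker for minutes.
completion = openai.Completion.create(
    engine="my-deployment",  # hypothetical Azure OpenAI deployment name
    prompt="...",
    max_tokens=1024,
    request_timeout=30,      # seconds before the client gives up
)
```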
After quite some testing, I can't seem to reproduce the error again today. Edit: later that day I did encounter the issue again, unfortunately. Can this be investigated, please?
I also had one very slow request of >2 minutes.
In the log stream of the web app, I could see this:
Along with some other requests to assets. I don't know if these are correlated, but maybe it leads to another discovery.
Hi @kikaragyozov. After I deployed the template I saw somewhat similar issues ... Finally, you might want to add timeouts to API calls.
You just described what I've been dealing with!
Is it possible to use asynchronous workers? Basically, having multiple threads serve requests without blocking on I/O?
Hi @kikaragyozov, glad that's useful.
As this repository is just a simple demo I can understand why it's not async, but I've found you can adapt it very easily.
First, I think this article about Gunicorn is helpful for understanding the options: https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
It was pretty easy to:
- Add gevent as a requirement to app/backend/requirements.txt and redeploy the app. I'm using gevent==22.10.2 and no other code changes were required.
- Provide a custom start-up command for the App Service and then restart it. Here's what I'm using (still with the smallest, B1 instance):
gunicorn --bind=0.0.0.0 --timeout=600 --worker-class=gevent --worker-connections=1000 --workers=3 app:app
Hope that helps you too.
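If you prefer a config file over a long startup command, the same settings can live in a gunicorn.conf.py next to the app. This is a sketch of the equivalent of the command above; gunicorn picks the file up automatically from the working directory, so the startup command shrinks to gunicorn app:app:

```python
# gunicorn.conf.py -- sketch equivalent to the startup command above.
bind = "0.0.0.0"
timeout = 600              # allow slow /chat responses instead of killing the worker
worker_class = "gevent"    # cooperative async workers; requires gevent installed
worker_connections = 1000  # simultaneous connections per gevent worker
workers = 3
```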
@mbrenigjones Thank you for providing the steps to make it async. However, I would like to set that startup command in the Bicep configuration, so that we don't have to go into those settings manually every time. I cannot find where it should be added; can someone figure it out? Thanks
You can override the startup command by specifying appCommandLine in the Bicep, as either a filename (pointing at a shell script) or the actual command. For example, in another Flask app, I have this startup command:
https://github.com/pamelafox/flask-db-quiz-example/blob/main/infra/main.bicep#L65
That points at this startup.sh script: https://github.com/pamelafox/flask-db-quiz-example/blob/main/src/startup.sh
As a best practice, I have a gunicorn.conf.py file with the gunicorn configuration, which allows me to vary the workers based on CPU count:
https://github.com/pamelafox/flask-db-quiz-example/blob/main/src/gunicorn.conf.py
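For reference, that CPU-based worker count typically looks something like the following sketch (an illustration of the pattern, not the exact contents of that file):

```python
# gunicorn.conf.py -- sketch of scaling workers with the machine's CPU count.
import multiprocessing

bind = "0.0.0.0"
timeout = 600
# Common heuristic from the gunicorn docs: (2 x num_cores) + 1 workers.
workers = (multiprocessing.cpu_count() * 2) + 1
```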
Thanks @kikaragyozov. I've also had the random hanging issue; that patch will help me improve reliability too.
Although this is "just a simple demo", it's more than enough to build a working tool and iterate upon. It's certainly the most complete demo I've found that uses Azure. Thanks @pablocastro!
Hi @kikaragyozov, glad that's useful. As this repository is just a simple demo I can understand why it's not async, but I've found you can adapt it very easily. ... Hope that helps you too.
I just merged a change to the Bicep that sets PYTHON_ENABLE_GUNICORN_MULTIWORKERS to 'true', so that should be a big help.
However, I am also going to send a PR for overriding appCommandLine, as I think it might be worth experimenting with other worker classes (like gevent, mentioned here). Ideally we'd do some load testing to determine the optimal gunicorn configuration.
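For anyone who wants to try that load testing, here's a minimal sketch using Locust; the tool choice and the request payload are my assumptions for illustration, and the real /chat schema may differ:

```python
# locustfile.py -- minimal load-test sketch against the /chat endpoint.
# Locust is an assumed tool choice; the JSON body below is illustrative
# and may not match the demo's actual /chat request schema.
from locust import HttpUser, between, task


class ChatUser(HttpUser):
    wait_time = between(1, 3)  # seconds of think time between questions

    @task
    def ask_question(self):
        self.client.post(
            "/chat",
            json={"history": [{"user": "What does my health plan cover?"}]},
        )
```

Run it with locust -f locustfile.py --host=https://<your-app>.azurewebsites.net and watch how response times change as you vary the worker settings.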
Here's another PR that adds a custom startup script:
https://github.com/Azure-Samples/azure-search-openai-demo/pull/464
You should be able to modify that to change the worker class. Relevant docs here: https://docs.gunicorn.org/en/latest/design.html#choosing-a-worker-type
My change is now merged, so you can easily customize the gunicorn configuration. Please do share if you find better settings than the current ones in gunicorn.conf.py.