Isn't vllm compatible with other web frameworks?
Dear author, I used the Flask framework to build the underlying web service, but I ran into some strange phenomena (described below). Isn't vllm compatible with other web frameworks?
Phenomena:
① The first request is received and inference runs normally.
② On the second request, vllm directly calls abort_request.
③ After subsequent requests are received, vllm prints a "Received request" log, but no further inference is performed.
My environment: Flask, RTX 4090, vllm 0.2.0, Python 3.8, CUDA 11.8.
same question
Flask works with vllm 0.1.4, but not with later versions. The core code of vllm has changed significantly since 0.1.4, so later versions may not be compatible with other synchronous web frameworks.
Thank you! Do you think it's possible to use Flask to call the VLLM API?
vllm returns asynchronous iterators; even if you call vllm's API, what you receive is still an asynchronous iterator, and Flask would have to consume it. So I personally don't think that's feasible.
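For illustration, here is a minimal sketch of that point, written against vllm ~0.2.x (the exact names and the generate() signature may differ in other versions, and the model name is only a placeholder): the output of AsyncLLMEngine.generate() is an async generator, so it has to be driven by an event loop, which a plain synchronous Flask view does not provide.

```python
# Minimal sketch (assuming vllm ~0.2.x): consuming the async iterator that
# AsyncLLMEngine.generate() returns requires a running event loop.
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def main() -> None:
    # The model name is a placeholder; use the model you actually serve.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="facebook/opt-125m"))

    sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
    request_id = str(uuid.uuid4())

    final_output = None
    # Each iteration yields the partially generated RequestOutput for this
    # request; the last one holds the full completion.
    async for request_output in engine.generate("Hello, my name is",
                                                sampling_params, request_id):
        final_output = request_output

    print(final_output.outputs[0].text)


if __name__ == "__main__":
    # An event loop is required here; a synchronous Flask handler would have
    # to create and manage one itself, which is where the trouble starts.
    asyncio.run(main())
```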
Understood. Beyond just using its API, is there another way to integrate vllm? I aim to deploy it on my server and provide services via an iOS app.
Of course you can. You can refer to:
https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/api_server.py
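For example, a minimal client sketch for that demo server is below. It assumes the server is running on localhost:8000 and exposes the POST /generate endpoint from the linked api_server.py; the exact request and response fields may vary across vllm versions, so check the api_server.py you are running.

```python
# Rough client sketch for the demo api_server linked above (endpoint and
# field names are assumptions; verify them against your vllm version).
import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.8,
    },
    timeout=60,
)
response.raise_for_status()
# The demo server returns the generated text(s) as JSON.
print(response.json())
```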
As far as I know, FastChat is compatible with OpenAI API.
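As a rough illustration, calling such an OpenAI-compatible endpoint with the legacy openai (<1.0) Python client might look like this; the server address and model name are placeholders, and the server could be FastChat or vllm's own OpenAI-compatible entrypoint.

```python
# Sketch: query an OpenAI-compatible server with the legacy openai<1.0 client.
import openai

openai.api_key = "EMPTY"  # local servers typically ignore the key
openai.api_base = "http://localhost:8000/v1"  # placeholder address

completion = openai.Completion.create(
    model="facebook/opt-125m",  # placeholder; list models via /v1/models
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```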
Guys, I need some help regarding this:
- I want to integrate vLLM into my project and serve it as an API.
- Before this, I was using Flask with llama_cpp_python, but that currently doesn't support batched inference.
- So, when we make two parallel requests to my Flask API, it crashes (as expected).
- I am a little unsure about how to do this with vLLM.
I am thinking of the following:
- I think that, on my Linux server, I should run api_server.py on the local IP and port.
- Then I think I will be able to access that URL publicly from my application.
My question is: is that the recommended approach? I know it sounds obvious, but I am not very experienced with web serving, so will you guys please help me out here? This way I won't have to use Flask.
Thank you 🙏🏻
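If it helps, here is a rough sketch of that setup from the client side. The server address, port, and the /generate endpoint are assumptions based on the demo api_server.py mentioned earlier; the point is that several requests can be sent in parallel, because vllm batches concurrent requests on the server side rather than crashing the way a single llama_cpp_python worker behind Flask does.

```python
# Sketch: fire parallel requests at a running vllm api_server (address,
# port, and endpoint are placeholders based on the demo api_server.py).
from concurrent.futures import ThreadPoolExecutor

import requests

SERVER_URL = "http://<your-server-ip>:8000/generate"  # placeholder address


def ask(prompt: str) -> dict:
    resp = requests.post(
        SERVER_URL,
        json={"prompt": prompt, "max_tokens": 32},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    prompts = ["Hello, my name is", "The capital of France is"]
    # Both requests run concurrently; vllm schedules them together instead
    # of failing on the second one.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for result in pool.map(ask, prompts):
            print(result)
```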