Is the Python backend going to support asyncio?
Is your feature request related to a problem? Please describe. Is the Python backend going to support asyncio?
Describe the solution you'd like Coroutines perform better than thread-based concurrency for network I/O.
Describe alternatives you've considered Referring to grpc-python, we can start a Python thread to run an asyncio loop, and a C++ thread to watch for and forward messages. The Python thread and the C++ thread communicate through a queue.
Additional context Our business relies heavily on asyncio. We can assign an engineer to this work.
Can you elaborate on where you want asyncio support? The Python backend already supports asyncio in BLS: https://github.com/triton-inference-server/python_backend#business-logic-scripting-beta. There is also a ticket on our roadmap to support asyncio for sending responses back to the server.
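For reference, an async BLS execute looks roughly like the sketch below (the model names, tensor names, and the single-output response are illustrative); the point is that several BLS requests can be in flight at once within one execute call:

```python
import asyncio

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        responses = []
        for request in requests:
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")

            # Issue several BLS requests without waiting for each one to finish.
            infer_awaits = []
            for model_name in ["model_a", "model_b"]:  # illustrative model names
                infer_request = pb_utils.InferenceRequest(
                    model_name=model_name,
                    requested_output_names=["OUTPUT0"],
                    inputs=[input0])
                infer_awaits.append(infer_request.async_exec())

            # All BLS requests for this request are now in flight; await them together.
            infer_responses = await asyncio.gather(*infer_awaits)

            # Forward the first model's output as this model's response (illustrative).
            responses.append(pb_utils.InferenceResponse(
                output_tensors=infer_responses[0].output_tensors()))
        return responses
```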
https://github.com/triton-inference-server/python_backend/blob/main/src/pb_stub.cc#L482

In the current Python backend implementation, the coroutine is called synchronously with asyncio.run, which should perform poorly in real workloads.
We need a more efficient way to send RPC or HTTP requests from the Python backend. @Tabrizian
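To make the concern concrete: wrapping each coroutine in its own asyncio.run call blocks until that single coroutine finishes, so requests are still processed one after another. A minimal, self-contained illustration, with asyncio.sleep standing in for a network call:

```python
import asyncio
import time


async def download(i):
    await asyncio.sleep(1)  # stand-in for an HTTP/RPC call
    return i


# One asyncio.run per request: each call blocks, so 10 requests take ~10s.
start = time.time()
for i in range(10):
    asyncio.run(download(i))
print(f"asyncio.run per request: {time.time() - start:.1f}s")


# A single event loop awaiting all coroutines together: ~1s.
async def download_all():
    return await asyncio.gather(*(download(i) for i in range(10)))

start = time.time()
asyncio.run(download_all())
print(f"single loop with gather: {time.time() - start:.1f}s")
```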
Can you elaborate more on the use case you want asyncio support for? The current asyncio support in the Python backend is only for async BLS requests. The goal was to let the user have multiple in-flight BLS requests in their Python model. There is another feature on our roadmap to allow sending the responses of requests in the Python backend asynchronously. Do you think that feature is suitable for your use case, or are you describing a separate feature?
My feature request is to use async to send RPC/HTTP requests (download images, etc.) without blocking the Python thread. @Tabrizian
I see. I am wondering: if you do not want to block when the execute function returns, when do you need the results of your RPC/HTTP requests, and how are you going to use them? Is it something that you want to run in the background regardless of the requests that are being executed on your model?
For example, an ensemble backend sends 10 requests to a Python backend. Each request should download an image using async. The Python thread should download the 10 images simultaneously rather than one at a time.
If your requests are batched, you can use asyncio with the current version of the Python backend to create a coroutine for each request so that the images are downloaded asynchronously. You can also create 10 model instances of your Python model so that each request is executed independently and the image downloads happen concurrently. https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#instance-groups
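For example, something along these lines (the URL/IMAGE_BYTES tensor names are illustrative, and aiohttp is assumed to be installed in the backend's Python environment):

```python
import asyncio

import aiohttp  # assumed available in the Python backend environment
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        # One coroutine per request in the batch; the downloads overlap
        # inside this single execute() call.
        async with aiohttp.ClientSession() as session:
            images = await asyncio.gather(
                *(self._download(session, request) for request in requests))

        return [
            pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("IMAGE_BYTES",
                                np.frombuffer(image, dtype=np.uint8))])
            for image in images
        ]

    async def _download(self, session, request):
        # "URL" is assumed to be a BYTES input tensor holding the image URL.
        url = pb_utils.get_input_tensor_by_name(
            request, "URL").as_numpy().flatten()[0].decode()
        async with session.get(url) as resp:
            return await resp.read()
```

Note that execute() still returns only once every download in the batch has finished, so the downloads are concurrent only within a single batch.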
It seems like a temporary solution. We would rather use one model instance with an asyncio loop to do the job. Coroutines should perform better than thread-based concurrency for network I/O. Making the Python backend capable of running multiple requests simultaneously would be a good feature in my view. @Tabrizian
Let me explain more explicitly.
[Background] Each request is a user request. The ensemble backend can push all requests to the queue of the Python backend. Currently, the Python backend receives a request from the queue, executes it, and sends back the response. Each request is executed synchronously.
[Problem] So my problem is that the Python backend cannot process multiple requests concurrently within an instance. Asyncio should enable concurrent processing with coroutines. However, the Python backend uses asyncio.run to execute the async function, which causes each request to be executed synchronously.
[Solution]
- Merging multiple requests into a batch should solve the problem in some cases. However, if only one request arrives in the first window, we can only process that request, and the following requests must wait until it is done. If our Python function downloads a video, the latency of a user request becomes uncontrollable.
- Starting multiple instances is also a solution. However, suppose we have a concurrency of 100 QPS and each request downloads a video; starting 100 instances is not practical.
[Suggestion] My suggestion is to start a new repo, an async Python backend. We can start a Python thread to run an asyncio loop. A C++ thread will get data from the Triton queue, send it to the Python thread, and wait for the Python thread's response. (This is the architecture of grpc-python.) I'm wondering whether you are interested in this suggestion; a rough sketch of the hand-off is below.
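Roughly, the hand-off I have in mind would look like this (a pure-Python stand-in for the C++ side; the class and function names are just for illustration):

```python
import asyncio
import threading


class AsyncWorker:
    """A long-lived asyncio event loop running on a dedicated Python thread."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()

    def submit(self, coro):
        # Thread-safe hand-off into the loop; returns a concurrent.futures.Future
        # that the caller (the C++ side in the real design) can wait on.
        return asyncio.run_coroutine_threadsafe(coro, self.loop)

    def shutdown(self):
        self.loop.call_soon_threadsafe(self.loop.stop)
        self.thread.join()


async def handle_request(request_id):
    await asyncio.sleep(0.1)  # stand-in for an async download/RPC
    return f"response for request {request_id}"


worker = AsyncWorker()
# The consumer thread submits many requests; they overlap inside the loop.
futures = [worker.submit(handle_request(i)) for i in range(10)]
print([f.result() for f in futures])
worker.shutdown()
```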
@ZhuYuJin we have made a note of your request for adding such a feature.
Has this feature been started?
@manhtd98 We have implemented decoupled API support, which partially addresses the feature requested here: https://github.com/triton-inference-server/python_backend#decoupled-mode
Having said that, we have a feature on the roadmap to support a full async API in the Python backend, but it has not been scheduled.
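For anyone following along, a minimal decoupled-mode sketch based on the linked documentation (the tensor names are illustrative, and the per-request thread is just a stand-in for whatever does the slow I/O):

```python
import threading

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # Decoupled mode: responses go through response senders, so execute()
        # can return right away and the slow work can finish later.
        for request in requests:
            sender = request.get_response_sender()
            threading.Thread(target=self._respond,
                             args=(request, sender), daemon=True).start()
        return None

    def _respond(self, request, sender):
        # Stand-in for slow I/O (download, RPC, ...): echo the input back.
        input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
        sender.send(pb_utils.InferenceResponse(
            output_tensors=[pb_utils.Tensor("OUTPUT0", input0.as_numpy())]))
        # Tell Triton this request will produce no more responses.
        sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```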
I get an error with asyncio, and memory is not automatically cleaned up after predict.
@Tabrizian Any updates on this?
@Tabrizian We have a similar use case (although in our case we're waiting on BLS-type Triton server requests, not downloads). The decoupled model is a viable alternative, but only if it can be called from an ensemble. Is calling decoupled models from ensembles supported right now?
Any updates on this? I have the same problem; serial processing of requests causes insufficient performance.
@ZhuYuJin Is there any progress?
Sorry for the delayed response. This is on our roadmap but has not been scheduled yet. We'll let you know as soon as there is an update.