Can we add a timeout to the requests in the request queue?

Open xjpang opened this issue 10 months ago • 0 comments

Feature request

We run Infinity as an embedding service in scenarios with high QPS (queries per second) that are sensitive to latency. When multiple requests pile up in Infinity's request queue and the caller has already timed out, can Infinity discard those requests instead of running them? This would avoid unnecessary inference. Concretely, the request is for a configuration parameter that sets a timeout for inference requests waiting in the request queue. A request that has waited longer than this timeout should be dropped without being inferred.
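To make the requested behavior concrete, here is a minimal sketch of a queue that drops stale requests at dequeue time. It is not Infinity's actual code or API: `TimeoutQueue`, `QueuedRequest`, `timeout_s`, and `pop_batch` are hypothetical names chosen for illustration, and it assumes the timeout is measured from the moment a request enters the queue.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Optional, Tuple


@dataclass
class QueuedRequest:
    """A queued item plus the (monotonic) time at which it was enqueued."""
    payload: str
    enqueued_at: float


class TimeoutQueue:
    """Hypothetical FIFO queue that skips requests older than `timeout_s`.

    `timeout_s=None` disables the check, so every request is served.
    """

    def __init__(
        self,
        timeout_s: Optional[float],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._items: Deque[QueuedRequest] = deque()

    def put(self, payload: str) -> None:
        # Stamp the request so its queue wait time can be computed later.
        self._items.append(QueuedRequest(payload, self._clock()))

    def pop_batch(self, max_batch_size: int) -> Tuple[List[str], List[str]]:
        """Return (live, expired) payloads, oldest first.

        Expired requests are removed from the queue but never counted
        toward the batch, so they do not take up inference capacity.
        """
        now = self._clock()
        live: List[str] = []
        expired: List[str] = []
        while self._items and len(live) < max_batch_size:
            req = self._items.popleft()
            waited = now - req.enqueued_at
            if self.timeout_s is not None and waited > self.timeout_s:
                expired.append(req.payload)  # caller has presumably given up
            else:
                live.append(req.payload)
        return live, expired
```

In this sketch, only the `live` list would be passed on to the model, while the `expired` list could be answered with a timeout error. Checking at dequeue time rather than with a background timer keeps the logic simple, at the cost of expired requests staying in memory until the queue reaches them.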

Motivation

Eliminate invalid (already timed-out) requests to improve GPU usage efficiency.

Your contribution

If this is adopted, we can work together on it.

Mar 04 '25 07:03 xjpang