Can we add a timeout to the requests in the request queue?
Feature request
Infinity is used as an embedding service in scenarios with high QPS (queries per second) that are sensitive to latency. When multiple requests pile up in Infinity's request queue and the caller has already timed out, can Infinity discard those stale requests instead of running inference on them? Concretely, this would mean a configuration parameter that sets how long an inference request may wait in the request queue. A request that has waited longer than this timeout should be dropped without being inferred.
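To make the requested behavior concrete, here is a minimal sketch of a queue that drops expired requests at the moment a batch is formed. It is not Infinity's actual implementation or API: the names `TimeoutQueue`, `QueuedRequest`, `queue_timeout_s`, and `next_batch` are hypothetical and only illustrate the idea.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Optional


@dataclass
class QueuedRequest:
    """Hypothetical queue entry: the payload plus the time it was enqueued."""
    text: str
    enqueued_at: float


class TimeoutQueue:
    """Illustrative FIFO queue that skips requests older than a timeout.

    `queue_timeout_s` is an assumed configuration parameter; None disables
    the check, so every queued request is eventually inferred.
    """

    def __init__(
        self,
        queue_timeout_s: Optional[float],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.queue_timeout_s = queue_timeout_s
        self._clock = clock
        self._items: Deque[QueuedRequest] = deque()
        self.dropped = 0  # count of requests discarded without inference

    def put(self, text: str) -> None:
        self._items.append(QueuedRequest(text, self._clock()))

    def next_batch(self, max_batch_size: int) -> List[QueuedRequest]:
        """Pop up to `max_batch_size` requests that have not yet expired."""
        batch: List[QueuedRequest] = []
        now = self._clock()
        while self._items and len(batch) < max_batch_size:
            req = self._items.popleft()
            expired = (
                self.queue_timeout_s is not None
                and now - req.enqueued_at > self.queue_timeout_s
            )
            if expired:
                self.dropped += 1  # caller has likely given up; skip inference
                continue
            batch.append(req)
        return batch
```

Checking the deadline when the batch is assembled, rather than when the request arrives, means a request is only discarded if it is still waiting after the timeout, which is the case where the GPU work would be wasted.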
Motivation
Eliminate invalid requests, meaning requests whose callers have already timed out, to improve GPU usage efficiency.
Your contribution
If this is adopted, we can work on it together.