tensorrtllm_backend
Feature Request: Set maximum number of in-flight requests
When an unexpectedly large burst of requests hits my application, I would like to be able to limit the number of requests that the trtllm backend will accept. Specifically, I would like to REJECT further requests once the number of active requests for a specific backend exceeds a threshold.
I have tried the following dynamic batching configuration:
dynamic_batching {
  default_queue_policy {
    timeout_action: REJECT
    max_queue_size: 30
  }
}
However, I would like to achieve this behavior so that I can better balance my load (and not have one instance with a large backlog).
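To make the request concrete, here is a minimal sketch of the semantics I am after, written as a small admission gate that could sit in front of a backend instance. `InFlightLimiter`, `try_acquire`, and `release` are hypothetical names made up for illustration; they are not part of the trtllm backend or Triton, and the threshold of 30 simply mirrors the `max_queue_size` value above.

```python
import threading


class InFlightLimiter:
    """Hypothetical admission gate: rejects new work once the
    number of active (in-flight) requests reaches max_in_flight."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max_in_flight = max_in_flight
        self._active = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Reserve a slot. Returns False (reject) if the threshold is reached."""
        with self._lock:
            if self._active >= self._max_in_flight:
                return False
            self._active += 1
            return True

    def release(self) -> None:
        """Free a slot once a request has completed or failed."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    @property
    def active(self) -> int:
        """Current number of in-flight requests."""
        with self._lock:
            return self._active


# Simulate a burst of 31 requests against a limit of 30.
limiter = InFlightLimiter(max_in_flight=30)
accepted = [limiter.try_acquire() for _ in range(31)]
```

With something like this, a rejected request could be retried immediately against another instance by the load balancer instead of waiting in one instance's backlog. The point is that the count covers requests that are actively executing, not only those waiting in the queue.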