tensorrtllm_backend Feature Request: Set maximum number of in flight

Open TheCodeWrangler opened this issue 10 months ago • 1 comment

When an unexpectedly large burst of requests hits my application, I would like to limit the number of requests that the trtllm backend will accept. Specifically, I would like to REJECT new requests once the number of active requests for a specific backend exceeds a threshold.

I have tried the following:

dynamic_batching {
  default_queue_policy {
    timeout_action: REJECT
    max_queue_size: 30  
  }
}

But I would still like to achieve this behavior, so that I can better balance my load (and not have one instance with a large backlog).
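
To make the requested behavior concrete, here is a minimal sketch of an in-flight cap enforced on the application side, in front of the backend. It is not a tensorrtllm_backend or Triton feature: `InFlightLimiter`, `TooManyInFlight`, and `max_in_flight` are hypothetical names invented for illustration. It assumes the application wraps each backend call in the limiter and holds the slot until the response has finished.

```python
import threading


class TooManyInFlight(Exception):
    """Raised when a request arrives while the in-flight limit is reached."""


class InFlightLimiter:
    """Hypothetical application-side cap on concurrently active requests."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be >= 1")
        self._max = max_in_flight
        self._active = 0
        self._lock = threading.Lock()

    @property
    def active(self) -> int:
        """Number of requests currently holding a slot."""
        with self._lock:
            return self._active

    def __enter__(self) -> "InFlightLimiter":
        # Reject immediately instead of queueing when the cap is reached.
        with self._lock:
            if self._active >= self._max:
                raise TooManyInFlight(
                    f"{self._active} requests already in flight (limit {self._max})"
                )
            self._active += 1
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Free the slot whether the wrapped request succeeded or failed.
        with self._lock:
            self._active -= 1
        return False
```

Each request handler would wrap its backend call in `with limiter:` and translate `TooManyInFlight` into an error response (for example HTTP 429), so that a load balancer can send the request to a less busy instance.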

Apr 17 '24 14:04 TheCodeWrangler
