Kaiyu Xie
@sleepwalker2017 Thanks for reporting the issues. Is it possible to provide more commands and steps to reproduce the issue, especially the 2nd point?
Hi @Ourspolaire1 , the currently recommended way is to use the `trtllm-build` command to build the models you want to benchmark, and then use `gptManagerBenchmark` to benchmark them; please see the documents:...
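The flow described above can be sketched roughly as below. This is only an illustration: the paths are placeholders, and the exact flags and the location of the `gptManagerBenchmark` binary vary between TensorRT-LLM versions, so check the linked documents before running.

```shell
# Step 1 (sketch): build an engine from a converted checkpoint.
# --checkpoint_dir / --output_dir are real trtllm-build options;
# ./llama_ckpt and ./engine are placeholder paths.
trtllm-build --checkpoint_dir ./llama_ckpt --output_dir ./engine

# Step 2 (sketch): run the C++ benchmark against the built engine.
# The binary path and --dataset argument are assumptions; the dataset
# format is described in the benchmark's own README.
./cpp/build/benchmarks/gptManagerBenchmark \
    --engine_dir ./engine \
    --dataset ./dataset.json
```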
Hi @sleepwalker2017 , can you please help check if the issue has been fixed on the latest main branch? Thanks.
Clearer error information has been added to the latest main branch, closing. Please let us know if there are any questions, thanks.
@XiaobingSuper Thanks for the support! We will review the changes in the internal codebase and get back to you.
Hi @0xymoro , when inflight batching is enabled, that is, if you enable the GPT attention plugin, fused MHA, paged KV cache, and remove input padding (they are all enabled by...
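For reference, the prerequisites listed above map onto `trtllm-build` options roughly as follows. The flag names below reflect the documented options of the era, but they have changed across releases (several later became defaults), so treat this as a hedged sketch and verify against your installed version's `trtllm-build --help`.

```shell
# Sketch only: enable the inflight-batching prerequisites at build time.
# ./ckpt and ./engine are placeholder paths.
trtllm-build --checkpoint_dir ./ckpt \
    --output_dir ./engine \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --paged_kv_cache enable \
    --remove_input_padding enable
```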
> if 8192 is max num tokens that means at most ~4 requests with 2048 input can be run in parallel @0xymoro Note that generation requests will most likely occupy...
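The arithmetic in the quoted comment can be checked directly: with `max_num_tokens = 8192`, at most `8192 / 2048 = 4` full-length context (prefill) requests fit into one scheduling step. Generation-phase requests contribute only one token each per step under inflight batching, which is why many more of them can run alongside, as the reply notes.

```shell
# Back-of-envelope capacity check for max_num_tokens.
max_num_tokens=8192
input_len=2048
# Number of full-length context requests that fit in one step:
echo $(( max_num_tokens / input_len ))   # prints 4
```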
Hi @bprus , can you try again on the latest main branch? We've integrated several optimizations including [multiple profiles](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md#multiple-profiles), which should minimize the impacts of `max_batch_size` to the kernel selection....
@VitalyPetrov Thanks for providing the details, I'll try to reproduce the issue. Are you using a 40GB or an 80GB A100? Did you observe the actual runtime batch size? A potential reason...
> but there has been an AttributeError: 'NoneType' object has no attribute 'StatusCode'. @xxyux It's very likely that the issue is in one of the dependencies of the TensorRT-LLM backend. I...