Kaiyu Xie
@sleepwalker2017 Thanks for reporting the issues. Is it possible to provide more commands and steps to reproduce the issue, especially the 2nd point?
Hi @Ourspolaire1 , the currently recommended way is to use the `trtllm-build` command to build the models you want to benchmark, and then use `gptManagerBenchmark` to benchmark them; please see the documents:...
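The flow described above can be sketched roughly as below. This is only an illustration: the paths are placeholders, and the exact flags and the location of the `gptManagerBenchmark` binary vary between TensorRT-LLM versions, so check the linked documents before running.

```shell
# Step 1 (sketch): build an engine from a converted checkpoint.
# --checkpoint_dir / --output_dir are real trtllm-build options;
# ./llama_ckpt and ./engine are placeholder paths.
trtllm-build --checkpoint_dir ./llama_ckpt --output_dir ./engine

# Step 2 (sketch): run the C++ benchmark against the built engine.
# The binary path and --dataset argument are assumptions; the dataset
# format is described in the benchmark's own README.
./cpp/build/benchmarks/gptManagerBenchmark \
    --engine_dir ./engine \
    --dataset ./dataset.json
```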
Hi @sleepwalker2017 , can you please help check if the issue has been fixed on the latest main branch? Thanks.
Clearer error information has been added to the latest main branch, closing. Please let us know if there are any questions, thanks.
@XiaobingSuper Thanks for the support! We will review the changes in the internal codebase and get back to you.
Hi @0xymoro , when inflight batching is enabled, that is, if you enable the GPT attention plugin, fused MHA, paged KV cache, and remove input padding (they are all enabled by...
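For reference, the prerequisites listed above map onto `trtllm-build` options roughly as follows. The flag names below reflect the documented options of the era, but they have changed across releases (several later became defaults), so treat this as a hedged sketch and verify against your installed version's `trtllm-build --help`.

```shell
# Sketch only: enable the inflight-batching prerequisites at build time.
# ./ckpt and ./engine are placeholder paths.
trtllm-build --checkpoint_dir ./ckpt \
    --output_dir ./engine \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --paged_kv_cache enable \
    --remove_input_padding enable
```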
> if 8192 is max num tokens that means at most ~4 requests with 2048 input can be run in parallel @0xymoro Note that generation requests will most likely occupy...
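The arithmetic in the quoted comment can be checked directly: with `max_num_tokens = 8192`, at most `8192 / 2048 = 4` full-length context (prefill) requests fit into one scheduling step. Generation-phase requests contribute only one token each per step under inflight batching, which is why many more of them can run alongside, as the reply notes.

```shell
# Back-of-envelope capacity check for max_num_tokens.
max_num_tokens=8192
input_len=2048
# Number of full-length context requests that fit in one step:
echo $(( max_num_tokens / input_len ))   # prints 4
```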
Hi @bprus , can you try again on the latest main branch? We've integrated several optimizations including [multiple profiles](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md#multiple-profiles), which should minimize the impacts of `max_batch_size` to the kernel selection....
@VitalyPetrov Thanks for providing the details, I'll try to reproduce the issue. Are you using a 40GB or an 80GB A100? Did you observe the actual runtime batch size? A potential reason...
> but there has been an AttributeError: 'NoneType' object has no attribute 'StatusCode'. @xxyux It's very likely that the issue is in one of the dependencies of the TensorRT-LLM backend. I...