juney-nvidia
> So I have to wait until the MRs are merged and use the correct configuration

Yes.

> BTW, enable_attention_dp=false might cause GPU hangs in my case

Please create another github...
> Thanks to [@Kefeng-Duan](https://github.com/Kefeng-Duan) for the assistance. I made the changes accordingly and ran some tests using trtllm-bench. Now I get a very close result: 207 tok/sec/user when setting batch=1. If batch=10 is set,...
> Why does 253 tok/sec need MTP enabled? If MTP is allowed, A100 can easily outperform B200 ..

Hi @ghostplant, do you mean that on A100, with MTP, it can...
> [@juney-nvidia](https://github.com/juney-nvidia) Oh, we can just enlarge MTP and assume a 100% success ratio; then it would be very close to running large-batch inference, which has been above 500...
> So the point is MTP=3? It's nice to have the MTP feature in production unless MTP decreases accuracy. I cannot see the acceptance rate at runtime, so it's hard to judge...
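To make the back-of-the-envelope MTP reasoning in the exchange above concrete, here is a minimal sketch (my own illustration, not TensorRT-LLM code) of how draft length and acceptance rate translate into expected tokens per decode step. The `expected_tokens_per_step` helper and the sample acceptance rates are hypothetical, and the linear throughput scaling ignores the extra verification cost each step actually incurs:

```python
def expected_tokens_per_step(draft_len: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per decode step with MTP-style speculative
    decoding, assuming each drafted token is accepted independently with
    probability `acceptance_rate` and drafting stops at the first rejection
    (the target model always contributes one token per step)."""
    if acceptance_rate >= 1.0:
        return draft_len + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)

base = 207.0  # the batch=1 tok/sec/user figure quoted earlier, for illustration only
for alpha in (1.0, 0.8, 0.6):
    gain = expected_tokens_per_step(3, alpha)  # MTP=3
    print(f"acceptance={alpha:.1f}: x{gain:.2f} -> ~{base * gain:.0f} tok/sec/user")
```

At a 100% acceptance ratio, MTP=3 behaves like emitting four tokens per step, which is why it resembles large-batch inference as argued above; at realistic acceptance rates the gain shrinks accordingly, which is why the runtime acceptance rate matters for judging the accuracy/throughput trade-off.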
@JoJoLev Hi, can you share the concrete steps to reproduce the issue?

Thanks,
June
@lfr-0531 may provide a quick comment on this issue.

June
@Eayne Hi, since TensorRT-LLM moved to GitHub-first development last Monday, please rebase your MR on the latest main if you still want to contribute this.

Thanks,
June
@kaiyux @LinPoly please help review this MR.

Thanks,
June
Hi @khayamgondal, the throughput information should be stored in [inflight_batching_stats](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/executor/bindings.cpp#L152). Also, we are moving to [trtllm-bench](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-benchmarking.md) to consolidate the performance benchmarking process, which I would suggest you to...
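As a rough illustration of the suggestion above, a trtllm-bench throughput run generally takes a model identifier and a prepared dataset; the placeholders below are assumptions, and the exact subcommands and flags vary across TensorRT-LLM versions, so consult the linked benchmarking doc for the authoritative options:

```sh
# Hypothetical invocation; check the perf-benchmarking doc linked above
# for the flags supported by your TensorRT-LLM version.
trtllm-bench --model <hf_model_id> throughput --dataset /path/to/dataset.json
```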