juney-nvidia
> So I have to wait until the MRs are merged and use the correct configuration

Yes.

> BTW, enable_attention_dp=false might cause GPU hangs in my case

Please create another github...
> Thanks to [@Kefeng-Duan](https://github.com/Kefeng-Duan) for the assistance. I made the changes accordingly and ran some tests using trtllm-bench. Now I get a very close result: 207 tok/sec/user when setting batch=1. If batch=10 is set,...
> Why does 253 tok/sec need MTP enabled? If MTP is allowed, A100 can easily outperform B200 ..

Hi @ghostplant, do you mean that on A100, with MTP, it can...
> [@juney-nvidia](https://github.com/juney-nvidia) Oh, we can just enlarge MTP and assume a 100% success ratio; then it would be very close to running large-batch inference, which has been above 500...
> So the point is MTP=3? It's nice to have the MTP feature in production unless MTP decreases accuracy. I cannot see the acceptance rate at runtime, so it's hard to judge...
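To make the back-of-the-envelope MTP reasoning in the exchange above concrete, here is a minimal sketch (my own illustration, not TensorRT-LLM code) of how draft length and acceptance rate translate into expected tokens per decode step. The `expected_tokens_per_step` helper and the sample acceptance rates are hypothetical, and the linear throughput scaling ignores the extra verification cost each step actually incurs:

```python
def expected_tokens_per_step(draft_len: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per decode step with MTP-style speculative
    decoding, assuming each drafted token is accepted independently with
    probability `acceptance_rate` and drafting stops at the first rejection
    (the target model always contributes one token per step)."""
    if acceptance_rate >= 1.0:
        return draft_len + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)

base = 207.0  # the batch=1 tok/sec/user figure quoted earlier, for illustration only
for alpha in (1.0, 0.8, 0.6):
    gain = expected_tokens_per_step(3, alpha)  # MTP=3
    print(f"acceptance={alpha:.1f}: x{gain:.2f} -> ~{base * gain:.0f} tok/sec/user")
```

At a 100% acceptance ratio, MTP=3 behaves like emitting four tokens per step, which is why it resembles large-batch inference as argued above; at realistic acceptance rates the gain shrinks accordingly, which is why the runtime acceptance rate matters for judging the accuracy/throughput trade-off.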
@JoJoLev Hi, can you share the concrete steps to reproduce the issue?

Thanks,
June
@lfr-0531 may provide a quick comment on this issue.

June
@Eayne Hi, since TensorRT-LLM moved to GitHub-first development last Monday, please rebase your MR on the latest main if you still want to contribute this.

Thanks,
June
@kaiyux @LinPoly please help review this MR.

Thanks,
June
Hi @khayamgondal, the throughput information should be stored in [inflight_batching_stats](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/pybind/executor/bindings.cpp#L152). Also, we are moving to [trtllm-bench](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-benchmarking.md) to consolidate the performance benchmarking process, which I would suggest you to...
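As a rough illustration of the suggestion above, a trtllm-bench throughput run generally takes a model identifier and a prepared dataset; the placeholders below are assumptions, and the exact subcommands and flags vary across TensorRT-LLM versions, so consult the linked benchmarking doc for the authoritative options:

```sh
# Hypothetical invocation; check the perf-benchmarking doc linked above
# for the flags supported by your TensorRT-LLM version.
trtllm-bench --model <hf_model_id> throughput --dataset /path/to/dataset.json
```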