flankedge
> Guys, thanks for the feedback. If you are on Blackwell silicon, please try building and running the Torch-flow DeepSeek-V3 following the instructions [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/deepseek_v3); the `deepseek` branch is deprecated and no longer updated...
Testing with a 1k input sequence length is unrealistic. Chat, RAG, agent workloads, etc. -- virtually any real use case will hit the `max_num_tokens` limit. And it appears that `max_num_tokens` has a significant impact on...
emmm... actually, if you print `args`, you will find that `args["model_instance_device_id"]` or `args["model_instance_name"]` may be the value you want.
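For context, a minimal sketch of how those keys are typically read in a Triton Python backend `model.py`. The key names come from the comment above; the `TritonPythonModel.initialize(args)` shape follows the Python backend's convention, and the attribute names (`device_id`, `instance_name`) are just illustrative:

```python
import json


class TritonPythonModel:
    def initialize(self, args):
        # `args` is a dict of strings passed in by the Triton Python backend.
        # The device index and instance name arrive as strings, so convert as needed.
        self.device_id = int(args["model_instance_device_id"])
        self.instance_name = args["model_instance_name"]
        # The full model configuration is delivered as a JSON string.
        self.model_config = json.loads(args["model_config"])
```

Outside the server you can exercise it with a hand-built `args` dict, e.g. `TritonPythonModel().initialize({"model_instance_device_id": "0", "model_instance_name": "deepseek_0", "model_config": "{}"})`.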
We performed a test using Triton Server 25.06 (based on CUDA 12.9). The anomalous results persist, which suggests that the issue is likely unrelated to NVIDIA CUDA driver compatibility. Theoretically,...
The root cause has been identified. Disabling `output_copy_stream` resolves the issue. It appears that the output copy stream does not correctly synchronize with the CUDA Graph execution stream. We look...