Yukun He comments

Results: 27 comments by Yukun He

perf: Add optimizations for deepseek in min latency mode

Hi @zongfeijing. I am currently refactoring the autotuning system for fused_moe (see this pull request: https://github.com/NVIDIA/TensorRT-LLM/pull/3151). I notice that you have many changes on the...

[None][feat] Unify nvfp4 gemm backend

To simplify the nested tuning process, we want: * The inner op is not required to implement forward and get_valid_tactics (whether it is a tunable one...

[None][feat] Unify nvfp4 gemm backend

> Sure, I will try it. Thanks a lot for the effort. I have just pushed another commit to clean the code and make UT work. Because this is the...

[None][feat] Unify nvfp4 gemm backend

Hi @Wong4j. Thanks a lot for the effort! I just moved the common code changes in AutoTuner to a standalone PR #9348 because it might be required by other tunable...

[None][fix] Autotune trtllm moe with same distribution across ranks

It looks like this bug also reflects some other issues associated with the distribution across ranks: https://nvbugspro.nvidia.com/bug/5680133. Maybe you have some ideas or comments on this, @rosenrodt. Thanks a lot...

This is a draft PR to validate CI for enabling cold L2 applied in #8779

/bot run --disable-fail-fast

[https://nvbugs/5676748][fix] Cherry-pick #9336: Fix mismatched nvfp4 gemm sf shape.

/bot run --disable-fail-fast

[https://nvbugs/5676748][fix] Cherry-pick #9336: Fix mismatched nvfp4 gemm sf shape.

/bot skip --comment "Pipeline has already been cleaned and only change the pre-commit configs."

[None][fix] Fix on-disk cache and revise logger/statistics for AutoTuner.

/bot run

[None][fix] Fix on-disk cache and revise logger/statistics for AutoTuner.

/bot run --disable-fail-fast