perf: Add optimizations for deepseek in min latency mode
/bot run
PR_Github #540 [ run ] triggered by Bot
PR_Github #540 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #459 completed with status: 'FAILURE'
/bot run
PR_Github #553 [ run ] triggered by Bot
PR_Github #553 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #470 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #568 [ run ] triggered by Bot
PR_Github #568 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #482 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #596 [ run ] triggered by Bot
PR_Github #596 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #505 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #640 [ run ] triggered by Bot
/bot run --multi-gpu-test
PR_Github #648 [ run ] triggered by Bot
PR_Github #640 [ run ] completed with state ABORTED
PR_Github #648 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #547 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #701 [ run ] triggered by Bot
Hi @zongfeijing. I am currently refactoring the autotuning system for fused_moe (see this pull request: https://github.com/NVIDIA/TensorRT-LLM/pull/3151). I notice that you have made many changes to the fused MoE implementation in the moeOp file, depending on whether min latency mode is on.
It looks like you define a second profiler and select it based on the path taken (min latency or not). However, I see no difference between the two profiling paths in moeOp; the only difference seems to be that a standalone, min-latency-specific MoE kernel launcher custom op is defined. Am I missing anything?
I will then apply the autotuner to the latest version of the MoE op once your changes are merged.
PR_Github #701 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #587 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #703 [ run ] triggered by Bot
/bot run --disable-fail-fast
PR_Github #704 [ run ] triggered by Bot
PR_Github #703 [ run ] completed with state ABORTED
PR_Github #704 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #589 completed with status: 'FAILURE'
/bot run --disable-fail-fast
PR_Github #711 [ run ] triggered by Bot