
perf: Add optimizations for deepseek in min latency mode

Open zongfeijing opened this pull request 9 months ago • 18 comments

zongfeijing avatar Mar 26 '25 07:03 zongfeijing

/bot run

zongfeijing avatar Mar 26 '25 07:03 zongfeijing

PR_Github #540 [ run ] triggered by Bot

niukuo avatar Mar 26 '25 07:03 niukuo

PR_Github #540 [ run ] completed with state FAILURE /LLM/main/L0_MergeRequest_PR pipeline #459 completed with status: 'FAILURE'

niukuo avatar Mar 26 '25 07:03 niukuo

/bot run

zongfeijing avatar Mar 26 '25 08:03 zongfeijing

PR_Github #553 [ run ] triggered by Bot

niukuo avatar Mar 26 '25 08:03 niukuo

PR_Github #553 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #470 completed with status: 'FAILURE'

niukuo avatar Mar 26 '25 09:03 niukuo

/bot run --disable-fail-fast

zongfeijing avatar Mar 26 '25 10:03 zongfeijing

PR_Github #568 [ run ] triggered by Bot

niukuo avatar Mar 26 '25 10:03 niukuo

PR_Github #568 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #482 completed with status: 'FAILURE'

niukuo avatar Mar 26 '25 14:03 niukuo

/bot run --disable-fail-fast

zongfeijing avatar Mar 26 '25 14:03 zongfeijing

PR_Github #596 [ run ] triggered by Bot

niukuo avatar Mar 26 '25 14:03 niukuo

PR_Github #596 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #505 completed with status: 'FAILURE'

niukuo avatar Mar 26 '25 18:03 niukuo

/bot run --disable-fail-fast

zongfeijing avatar Mar 27 '25 05:03 zongfeijing

PR_Github #640 [ run ] triggered by Bot

tensorrt-cicd avatar Mar 27 '25 05:03 tensorrt-cicd

/bot run --multi-gpu-test

zongfeijing avatar Mar 27 '25 08:03 zongfeijing

PR_Github #648 [ run ] triggered by Bot

tensorrt-cicd avatar Mar 27 '25 08:03 tensorrt-cicd

PR_Github #640 [ run ] completed with state ABORTED

tensorrt-cicd avatar Mar 27 '25 08:03 tensorrt-cicd

PR_Github #648 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #547 completed with status: 'FAILURE'

tensorrt-cicd avatar Mar 27 '25 13:03 tensorrt-cicd

/bot run --disable-fail-fast

zongfeijing avatar Mar 30 '25 12:03 zongfeijing

PR_Github #701 [ run ] triggered by Bot

tensorrt-cicd avatar Mar 30 '25 12:03 tensorrt-cicd

Hi @zongfeijing. I am currently working on refactoring the autotuning system for fused_moe at the moment (see this pull request: https://github.com/NVIDIA/TensorRT-LLM/pull/3151). I notice that you have made many changes to the Fused MoE module's moeOp file that depend on whether min latency mode is on.

It looks like you define a second profiler and invoke it based on which path we are on (min latency or not), but the profiling logic in moeOp is identical for both. The real difference is that a standalone, min-latency-specific MoE kernel launcher custom op is defined. Is there anything I am missing?

I will then apply the autotuner to the latest version of the MoE op after your changes are merged.
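The structure being questioned above could be sketched roughly as follows. This is a minimal, hypothetical Python sketch of the dispatch pattern described (two profiler instances with identical logic, plus a separate min-latency kernel launcher); all class and method names are illustrative and do not reflect the actual TensorRT-LLM moeOp implementation:

```python
class MoeProfiler:
    """Selects a kernel config; the logic is the same for both paths."""
    def __init__(self, tag):
        self.tag = tag

    def profile(self, configs):
        # The real op would time each candidate config on-device;
        # here we simply pick the first one as a stand-in.
        return configs[0]


class FusedMoeOp:
    def __init__(self, min_latency_mode=False):
        self.min_latency_mode = min_latency_mode
        # Two profiler instances, one per path, even though the
        # profiling itself does not differ between them.
        self.default_profiler = MoeProfiler("default")
        self.min_latency_profiler = MoeProfiler("min_latency")

    def run(self, tokens, configs):
        # Dispatch on min_latency_mode: the distinguishing piece is the
        # standalone min-latency kernel launcher, not the profiler.
        if self.min_latency_mode:
            cfg = self.min_latency_profiler.profile(configs)
            return self._launch_min_latency_kernel(tokens, cfg)
        cfg = self.default_profiler.profile(configs)
        return self._launch_default_kernel(tokens, cfg)

    def _launch_default_kernel(self, tokens, cfg):
        return ("default", cfg, tokens)

    def _launch_min_latency_kernel(self, tokens, cfg):
        # Stand-in for the separate min-latency custom-op launcher.
        return ("min_latency", cfg, tokens)


op = FusedMoeOp(min_latency_mode=True)
path, cfg, _ = op.run(tokens=[1, 2, 3], configs=["cfg_a", "cfg_b"])
print(path, cfg)  # min_latency cfg_a
```

Under this reading, a unified autotuner could replace both profiler instances without touching the launcher split, which is presumably why the question matters for the refactor in PR #3151.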

hyukn avatar Mar 30 '25 13:03 hyukn

PR_Github #701 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #587 completed with status: 'FAILURE'

tensorrt-cicd avatar Mar 30 '25 15:03 tensorrt-cicd

/bot run --disable-fail-fast

zongfeijing avatar Mar 30 '25 15:03 zongfeijing

PR_Github #703 [ run ] triggered by Bot

tensorrt-cicd avatar Mar 30 '25 15:03 tensorrt-cicd

/bot run --disable-fail-fast

zongfeijing avatar Mar 30 '25 16:03 zongfeijing

PR_Github #704 [ run ] triggered by Bot

tensorrt-cicd avatar Mar 30 '25 16:03 tensorrt-cicd

PR_Github #703 [ run ] completed with state ABORTED

tensorrt-cicd avatar Mar 30 '25 16:03 tensorrt-cicd

PR_Github #704 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #589 completed with status: 'FAILURE'

tensorrt-cicd avatar Mar 30 '25 18:03 tensorrt-cicd

/bot run --disable-fail-fast

zongfeijing avatar Mar 31 '25 00:03 zongfeijing

PR_Github #711 [ run ] triggered by Bot

tensorrt-cicd avatar Mar 31 '25 01:03 tensorrt-cicd