
MoE router TP removed

Open · megha95 opened this pull request on Feb 16, 2024 · 0 comments

As the title suggests, this PR removes TP (tensor parallelism) for the MoE router. Duplicating the router across GPUs eliminates one allreduce per MoE layer. This small change yields a 4-18% decoding speedup for Mixtral-8x7B-v0.1 (4% at batch size 1, 10-18% at batch sizes 2-16), measured on 2xA100-80GB.
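A minimal sketch (not TensorRT-LLM code) of why replicating the router removes an allreduce: with the router weight sharded across TP ranks, each rank produces only a partial sum of the routing logits and the ranks must allreduce to recover them; with the (tiny) router weight replicated, each rank computes full logits locally. All names and shapes below are illustrative assumptions, and the cross-rank allreduce is emulated single-process as a plain sum over per-rank partial results.

```python
import torch

torch.manual_seed(0)
hidden_size, num_experts, tp_size, tokens = 8, 4, 2, 3

x = torch.randn(tokens, hidden_size)              # MoE input, replicated on every TP rank
w_router = torch.randn(hidden_size, num_experts)  # full router weight (hidden -> experts)

# --- Before: router weight row-sharded across TP ranks ---------------------
# Each rank holds a slice of the hidden dimension, so its matmul yields only a
# partial contribution to the routing logits; an allreduce is needed per layer.
w_shards = torch.chunk(w_router, tp_size, dim=0)
x_shards = torch.chunk(x, tp_size, dim=1)
partial_logits = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
logits_sharded = sum(partial_logits)              # <-- this sum stands in for the allreduce

# --- After: router weight replicated on every rank -------------------------
# The input is already replicated, so each rank computes full routing logits
# locally; no cross-GPU communication is needed for the router.
logits_replicated = x @ w_router

assert torch.allclose(logits_sharded, logits_replicated, atol=1e-5)
print("Same routing logits; router allreduce eliminated.")
```

The router weight is only hidden_size x num_experts, so replicating it costs negligible memory per GPU while saving one allreduce per MoE layer during decoding.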
