TensorRT-LLM
Remove TP for the MoE router
As the title suggests, this PR removes TP (tensor parallelism) for the MoE router. Duplicating the router across GPUs eliminates one allreduce per MoE layer. This small change yields a 4-18% decoding speedup for Mixtral-8x7B-v0.1 (4% at batch size 1, 10-18% at batch sizes 2-16), measured on 2x A100-80GB.
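To make the before/after concrete, here is a minimal PyTorch-style sketch; it is not TensorRT-LLM code, and the class names, sharding scheme, and shapes are illustrative assumptions. It contrasts a router whose weight is sharded along the hidden dimension (partial logits must be summed with an allreduce) with a router replicated on every rank (full logits computed locally, no communication).

```python
# Illustrative sketch only: hypothetical router modules, not the TensorRT-LLM API.
import torch
import torch.distributed as dist


class ShardedRouter(torch.nn.Module):
    """Router with its input (hidden) dimension sharded across TP ranks.

    Each rank holds a [hidden_size / tp_size, num_experts] slice of the weight,
    so the per-rank partial logits must be summed with an allreduce before
    top-k expert selection can happen.
    """

    def __init__(self, hidden_size: int, num_experts: int, tp_size: int):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.empty(hidden_size // tp_size, num_experts))

    def forward(self, hidden_shard: torch.Tensor) -> torch.Tensor:
        partial_logits = hidden_shard @ self.weight
        dist.all_reduce(partial_logits)  # one allreduce per MoE layer
        return partial_logits


class ReplicatedRouter(torch.nn.Module):
    """Router duplicated on every rank; routing needs no communication.

    The full [hidden_size, num_experts] weight is tiny compared to the expert
    weights, so duplicating it across GPUs costs almost nothing.
    """

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.empty(hidden_size, num_experts))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden @ self.weight  # full logits, computed locally
```

The trade-off is memory vs. communication: the replicated router stores `hidden_size * num_experts` extra parameters per GPU, which is negligible for Mixtral-style models, while removing one allreduce per MoE layer from the decode path, where communication latency dominates at small batch sizes.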