
How are expert parameters distributed across the cluster when using the Tutel framework?

Open luuck opened this issue 1 year ago • 1 comment

Sorry, I have some questions:

1. If I set num_local_experts = 2, does that mean every GPU holds two experts, with both experts' parameters stored on that one GPU?
2. If I set num_local_experts = -2, does that mean two GPUs share one expert? How are that expert's parameters distributed across the two GPUs?
3. When I use data parallelism with Tutel, can a training process on one GPU only use the experts placed on that GPU? Is cross-node communication possible through the MoE layer?
4. When I use pipeline parallelism with Tutel, it is best to place experts on specific GPUs to reduce communication. Can I choose which GPU each expert is placed on myself?

luuck avatar Oct 30 '24 12:10 luuck

Here are the answers:

  1. Yes.
  2. Each of the two GPUs will store 1/2 of that expert's parameters. For example, with 4 GPUs maintaining 2 experts A and B, the parameter distribution across the 4 GPUs will be: 1/2 of A, 1/2 of A, 1/2 of B, 1/2 of B.
  3. Can you explain the question more clearly? The MoE layer already performs cross-node communication.
  4. Different MoE groups can be placed on specific GPUs simply by passing a custom process group when creating moe_layer() (see the sketch after this list). However, within a single MoE group, expert placement is specially designed, and changing it would break the distributed algorithm inside it.
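
Below is a minimal sketch of how these settings map onto a moe_layer() call, loosely following the style of Tutel's public helloworld example. The parameter names (count_per_node, hidden_size_per_expert, group) and all numeric values are assumptions used to illustrate the answers above, so check them against your Tutel version.

```python
# Minimal sketch, not an official Tutel example; names and values are assumptions.
import torch
import torch.distributed as dist
import torch.nn.functional as F
from tutel import moe as tutel_moe

dist.init_process_group('nccl')  # assumes a torchrun-style launch with NCCL available

# Answer 1: count_per_node = 2 keeps two full experts on every GPU.
# Answer 2: count_per_node = -2 shards one expert across every two GPUs,
#           so each GPU stores half of that expert's parameters.
moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=1024,
    experts={
        'type': 'ffn',
        'count_per_node': 2,              # use -2 for the sharded case
        'hidden_size_per_expert': 4096,
        'activation_fn': lambda x: F.relu(x),
    },
    # Answer 4: which GPUs a MoE group spans is controlled by the process group
    # passed at creation time, e.g. restricting this layer to ranks 0..3
    # (hypothetical placement; expert layout inside the group stays fixed).
    group=dist.new_group(ranks=[0, 1, 2, 3]),
).cuda()

x = torch.randn(4, 512, 1024, device='cuda')  # (batch, tokens, model_dim)
y = moe(x)                                    # output has the same shape as x
```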

ghostplant avatar Oct 30 '24 21:10 ghostplant