
MoE kernel

Open ByronHsu opened this issue 1 year ago • 6 comments

🚀 The feature, motivation and pitch

Currently the most popular library is probably https://github.com/databricks/megablocks. It would be interesting if we could implement it in Triton and make it HF-compatible.

Alternatives

No response

Additional context

No response

ByronHsu avatar Sep 04 '24 06:09 ByronHsu

Will do some more research on this; if anyone has any insights on what could/should be implemented, or details on how, cc me.

S1ro1 avatar Sep 05 '24 08:09 S1ro1

Maybe a preliminary step would be to support, for example, mixtral/nllb_moe from Hugging Face, so the integration is ready when the layers are done?

S1ro1 avatar Sep 05 '24 08:09 S1ro1

@S1ro1 one straightforward idea is to parallelize the expert forward (just like the megablocks implementation does). Right now in the HF model code the MoE block is computed sequentially, expert-by-expert. Not sure if it's worth implementing the load-balancing loss too; I haven't seen an actual profiling trace of MoE model training.
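
For illustration, a simplified sketch of the sequential expert-by-expert pattern described above (not the actual HF Mixtral code; `hidden_states`, `router_logits`, and the `experts` module list are stand-ins for the corresponding model components):

```python
# Simplified sketch of a sequential MoE forward, loosely following the
# HF Mixtral pattern: each expert is run one at a time on the tokens
# routed to it. Names here are illustrative, not the actual HF API.
import torch

def sequential_moe_forward(hidden_states, router_logits, experts, top_k=2):
    # hidden_states: (num_tokens, hidden_dim), router_logits: (num_tokens, num_experts)
    routing_weights = torch.softmax(router_logits, dim=-1)
    topk_weights, topk_ids = torch.topk(routing_weights, top_k, dim=-1)
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

    output = torch.zeros_like(hidden_states)
    for expert_id, expert in enumerate(experts):
        # Find (token, slot) pairs routed to this expert.
        token_idx, slot_idx = torch.where(topk_ids == expert_id)
        if token_idx.numel() == 0:
            continue
        expert_out = expert(hidden_states[token_idx])
        output.index_add_(
            0, token_idx,
            expert_out * topk_weights[token_idx, slot_idx].unsqueeze(-1),
        )
    return output
```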

yundai424 avatar Sep 05 '24 23:09 yundai424

@yundai424 Haven't seen one either. I'm going to try patching either Mixtral or NLLB with our kernels and profiling it, and will decide what to do after that. Implementing dMoE (dropless MoE) could also be interesting. Will try to send the profiler benchmarks tomorrow so we can discuss more in depth. Also, I suppose Mixtral > NLLB.

Edit: to address your comment, parallelizing the experts is certainly a low-hanging fruit.

S1ro1 avatar Sep 05 '24 23:09 S1ro1

@yundai424 @S1ro1 I'd like to help with this, but wanted to pin down some of the exact steps that can be taken to make the MoE layer more efficient.

Per my understanding, the HF implementation of Mixtral calls the experts sequentially because each expert can be allocated a variable number of tokens, and they wanted to avoid dropping any tokens.

I guess we can start off by implementing ParallelMLP in this repo, but I'm not sure this will actually involve any new Triton/Liger kernels. Most of the logic there seems to deal with sharding and distributing the required tensors across ranks.
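
As a rough illustration of the "parallelize the experts" direction (ignoring megablocks' sharding and dropless machinery), one option is to sort tokens by their expert assignment and express the per-expert MLPs as a single batched matmul instead of a Python loop. The sketch below assumes top-1 routing and per-expert weights stacked as `(num_experts, H, H)`, and pads each expert's token group to a common capacity so everything fits in one `torch.bmm`; that padding is exactly what a grouped-GEMM / dropless kernel would avoid. All names and shapes here are assumptions for illustration, not megablocks' API:

```python
# Rough sketch of a batched ("parallel") expert forward, assuming top-1
# routing. Each expert's tokens are gathered, padded to the max group size,
# and all experts are computed with a single bmm. A real kernel
# (megablocks-style grouped GEMM) would skip the padding entirely.
import torch

def batched_expert_forward(hidden_states, expert_ids, expert_weights):
    # hidden_states: (num_tokens, H); expert_ids: (num_tokens,) top-1 assignment
    # expert_weights: (num_experts, H, H)
    num_experts, hidden = expert_weights.shape[0], hidden_states.shape[-1]
    order = torch.argsort(expert_ids)                       # group tokens by expert
    counts = torch.bincount(expert_ids, minlength=num_experts)
    capacity = int(counts.max())

    # Scatter the sorted tokens into a padded (num_experts, capacity, H) buffer.
    padded = hidden_states.new_zeros(num_experts, capacity, hidden)
    slot = torch.cat([torch.arange(c, device=hidden_states.device)
                      for c in counts.tolist()])
    padded[expert_ids[order], slot] = hidden_states[order]

    # One batched matmul covers all experts at once.
    out_padded = torch.bmm(padded, expert_weights)

    # Gather results back to the original token order.
    out = torch.empty_like(hidden_states)
    out[order] = out_padded[expert_ids[order], slot]
    return out
```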

pramodith avatar Oct 28 '24 17:10 pramodith

@pramodith I totally agree with starting with the MLP; however, I'm currently surprisingly swamped with school, so I won't have time to collaborate on this. Feel free to take it.

S1ro1 avatar Oct 29 '24 15:10 S1ro1