
[PP + EP][Master Thread] Enable Pipeline Parallelism (PP) and Expert Parallelism (EP)

Open lchu6 opened this issue 9 months ago • 11 comments

Background

We are starting an effort to enable PP + EP for our training experiments on Mamba MoE. Compared to other parallelisms (FSDP, CP, TP), PP is much more complicated to add to an existing codebase, for two reasons:

  1. It requires modifications on the model side. Unlike other parallelisms, which need no model changes, PP requires model-side modifications before the model can support PP. This means we need to modify the Mamba repo (details are discussed in https://github.com/foundation-model-stack/fms-fsdp/issues/134).
  2. It requires rewriting on the training side. Other parallelisms can be added to an existing training script simply by adding an extra dimension to the device_mesh and stacking the new parallelism on the existing ones; PP instead requires a revamp of a large portion of the training script (details are discussed in [TODO]). A sketch of the contrast follows.
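As a minimal sketch of that contrast (not from this repo; the sizes and dimension names are assumptions):

```python
# Sketch: non-PP parallelisms compose by just adding a device-mesh dimension.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")
world_size = dist.get_world_size()
tp_size = 4                          # hypothetical
dp_size = world_size // tp_size

# FSDP/TP/CP: one more mesh dimension; the rest of the training loop is unchanged.
mesh_2d = init_device_mesh("cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp"))

# PP is different: the model must be cut into stages and the loop rewritten
# around a pipeline schedule, so a new mesh dimension alone is not enough.
```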

Given the complexity, we are going to do this with a multi-stage plan and provide limited support.

Multi-Stage Plan (updated 03/25)

Stage 1a: EP + MoE Path

  1. modify Mamba repo to support MoE Mamba
  2. enable EP (Expert Parallel)
  3. enable EP with fast kernel

Note: we should test this on 1d EP only (i.e., only one copy of each expert, with no duplication that would require a further all-reduce). A minimal sketch of what this means is below.
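A rough sketch of "1d EP" under illustrative assumptions (ep_size, num_experts, and the equal-split dispatch are placeholders, not the planned implementation):

```python
# Every expert lives on exactly one rank of a single EP group, so there is only
# one copy of each expert and expert gradients need no extra all-reduce.
import torch
import torch.distributed as dist

ep_size = dist.get_world_size()            # one EP group spanning all ranks
num_experts = 64                           # hypothetical
experts_per_rank = num_experts // ep_size  # each expert has exactly one owner

def dispatch(tokens_per_dest: list[torch.Tensor]) -> list[torch.Tensor]:
    """Send each chunk of tokens to the rank that owns its expert (all-to-all).
    Assumes equal-sized chunks per destination for simplicity."""
    received = [torch.empty_like(t) for t in tokens_per_dest]
    dist.all_to_all(received, tokens_per_dest)
    return received
```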

Stage 1b: PP + Mamba Path

  1. modify the Mamba repo to support PP.
  2. a complete revamp of train.py to support PP (a rough sketch of such a loop follows the note below).

Note: 1a and 1b are independent efforts, so they should proceed in parallel.
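For a sense of what the train.py revamp in Stage 1b, step 2 involves, here is a rough sketch using torch.distributed.pipelining; build_my_stage, rank, pp_size, device, loss_fn, dataloader, and optimizer are hypothetical placeholders, not names from this repo:

```python
from torch.distributed.pipelining import PipelineStage, ScheduleGPipe

stage_module = build_my_stage(rank)   # this rank's contiguous slice of layers
stage = PipelineStage(stage_module, stage_index=rank, num_stages=pp_size, device=device)
schedule = ScheduleGPipe(stage, n_microbatches=8, loss_fn=loss_fn)

for batch, target in dataloader:
    optimizer.zero_grad()
    if rank == 0:                      # first stage feeds the inputs
        schedule.step(batch)
    elif rank == pp_size - 1:          # last stage computes the loss
        losses = []
        schedule.step(target=target, losses=losses)
    else:                              # middle stages only relay activations
        schedule.step()
    optimizer.step()
```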

Stage 2: [PP + EP] x [Mamba MoE]

Combine the efforts from both 1a and 1b:

  1. merge the Mamba-side modifications: MoE Mamba + PP Mamba.
  2. merge the train script: use a 2d/3d device mesh to combine PP and EP (one possible mesh shape is sketched below).
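One possible shape for that mesh (dimension names and sizes here are assumptions, not the final design):

```python
# 2d device mesh: PP on the outer dimension, EP inside each pipeline stage.
from torch.distributed.device_mesh import init_device_mesh

pp_size, ep_size = 4, 8   # hypothetical 32-GPU setup
mesh = init_device_mesh("cuda", (pp_size, ep_size), mesh_dim_names=("pp", "ep"))

pp_group = mesh["pp"].get_group()  # stage-to-stage activation sends/recvs
ep_group = mesh["ep"].get_group()  # expert all-to-all within one pipeline stage
```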

Limited Support

Any of the limitations below can be lifted as needed, but we are starting with limited support:

  1. PP + EP only. The composability of PP with other parallelisms can be very complicated and takes a lot of effort to maintain and update. We have no intention of using this effort to turn this repo into another TorchTitan/Megatron-LM that supports arbitrary combinations of all parallelisms (PP + FSDP + CP + TP + EP). For this effort, we will target a clean solution for PP + EP only.
  2. Handcrafted, hardcoded PP schedule. It is not hard to implement an automated PP split for Mamba models, but doing it manually with a hardcoded split has two benefits: (1) a hardcoded physical schedule is much safer and clearer; (2) it is easier to modify and try different schedules to tune model performance and find the best splitting points. Since we will likely have a fixed setup (a small, fixed number of GPUs with a fixed PP size), a non-automated approach suits us better, and we can always add automation as needed. A sketch of such a hardcoded split is below.
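To illustrate what a hardcoded schedule could look like (block counts and boundaries below are made up, not the real Mamba configuration):

```python
# Hypothetical hardcoded stage-to-blocks map for a 4-stage pipeline.
HARDCODED_SPLIT = {
    0: range(0, 14),    # stage 0: embedding + blocks 0-13
    1: range(14, 28),   # stage 1: blocks 14-27
    2: range(28, 42),   # stage 2: blocks 28-41
    3: range(42, 56),   # stage 3: blocks 42-55 + final norm / LM head
}

def blocks_for_stage(stage_idx: int) -> range:
    """Blocks this pipeline stage should construct; easy to edit when tuning splits."""
    return HARDCODED_SPLIT[stage_idx]
```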

lchu6 • Mar 23 '25

@garrett361 @fabianlim @daviswer @AdnanHoque cc @raghukiran1224 @dakshiagrawal

lchu6 • Mar 24 '25

@lchu6 would it make sense to start with MoE only, since MoE is the main thing that complicates the parallelism strategy? We can add Mamba2 later.

raghukiran1224 • Mar 25 '25

@AdnanHoque has been working on benchmarking the kernels he developed with Less, so it may be worth checking how the comms and compute look for the hyperparameter choices we make.

raghukiran1224 • Mar 25 '25

MoE is the main thing that complicates parallelism strategy.

I'm not sure that is true. Both Linsong and I need to get more familiar with the pipeline APIs in general, and adapting to them requires some thought.

worth checking how the comms and compute look

Yeah, this would be good to know for sure. And any code pointers would be also helpful.

garrett361 • Mar 25 '25

@raghukiran1224 MoE-only can be stage 0, no problem, and it is something we can start today. I was actually thinking about the same thing last night, and was planning to make it stage 1b so it proceeds together with the stage-1 PP-only work, since both efforts can proceed in parallel without any dependency. Though that's not because MoE complicates the parallelism strategy, as @garrett361 mentioned.

As we previously discussed, PP is necessary because the performance of EP with more than one EP group can be bad, and we need PP to help scale up. So if we test MoE at a scale that runs one and only one EP group, there is no need for, nor dependency on, PP. This MoE-only work with one EP group therefore has no dependency and can be started today. Whenever both independent pieces are ready, we can glue them together to achieve the final goal.

I will modify the plan a little bit later.

lchu6 • Mar 25 '25

@raghukiran1224 I just updated the multi-stage plan in place. Let me know what you think.

lchu6 • Mar 25 '25

I have some examples with the MoE kernel and EP.

fabianlim • Mar 25 '25

I'm not sure that is true. Both Linsong and I need to get more familiar with the pipeline APIs in general, and adapting to them requires some thought.

@garrett361 yes, I agree. However, MoE hits this point sooner than dense models do: the model parameters scale faster than the compute needed for them, so I think we hit the scaling issue at a smaller number of GPUs.

raghukiran1224 • Mar 26 '25

@raghukiran1224 I just updated the multi-stage plan in place. Let me know what you think.

The only gap I see is that there is still Mamba in stage 0 :)

raghukiran1224 • Mar 26 '25

I seem to have a working EP MoE implementation here for torchtitan. Should be easy to move this into mamba-ssm, if that's where we want to go.

garrett361 • Mar 26 '25

Re: stage 1a) I have MoE + EP in mamba-ssm implemented and tested here, though it doesn't use fast MoE kernels at the moment.

garrett361 • Apr 01 '25