ghostplant

Results 272 comments of ghostplant

Yes, 2DH is faster only when serving at large scales.

Hi, you may need to rename the save_dir so that each per-device process saves to a unique destination: https://github.com/facebookresearch/fairseq/blob/da8fb630880d529ab47e53381c30ddc8ad235216/fairseq/dataclass/configs.py#L645 You can change the default save_dir path to: `f"checkpoints-dev{os.environ.get('LOCAL_RANK', 0)}"` or `f"checkpoints-dev{os.environ.get('RANK',...
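A minimal sketch of that idea, assuming the job is launched with `torchrun` or another launcher that sets `LOCAL_RANK`/`RANK`; the directory names are just examples:

```python
import os

# Each process derives its own checkpoint directory from the launcher-provided
# rank variables, so no two devices write to the same path.
local_rank = os.environ.get('LOCAL_RANK', 0)    # rank within the node
global_rank = os.environ.get('RANK', 0)         # rank across all nodes

save_dir = f"checkpoints-dev{local_rank}"       # or f"checkpoints-dev{global_rank}"
os.makedirs(save_dir, exist_ok=True)
print(f"process rank {global_rank} saves to {save_dir}")
```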

1. Does `print(torch.cuda.get_arch_list())` include `sm_86`?
2. Can you try `export USE_NVRTC=1` before running the example?
3. Are you sure there is no other old CUDA installed so that an old...
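A quick diagnostic sketch for the first two checks, assuming a CUDA-enabled PyTorch build; note that `USE_NVRTC=1` must be in the environment before the example runs:

```python
import os
import torch

# 1. The build's supported architectures must include 'sm_86' for RTX 30xx GPUs.
print(torch.version.cuda)            # CUDA toolkit this PyTorch wheel was built against
print(torch.cuda.get_arch_list())    # e.g. ['sm_37', ..., 'sm_80', 'sm_86']

# 2. Equivalent of `export USE_NVRTC=1` when set before the example is launched/imported.
os.environ['USE_NVRTC'] = '1'
```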

This is a problem with PyTorch + CUDA, not Tutel. You need a PyTorch build with at least cu117/cu118 so that `torch.cuda.get_arch_list()` includes `sm_86`. You also need to update...

CUDA 10.2.3 is too old; it cannot support any GPU newer than V100 (sm_7x). CUDA 11 should support A100-related types, and CUDA 12 should support H100...
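A small sketch, assuming a CUDA-enabled PyTorch build, to check whether the installed toolkit actually covers the local GPU's compute capability:

```python
import torch

# Compare the GPU's compute capability with the architectures this PyTorch/CUDA
# build was compiled for (e.g. sm_80/sm_86 need CUDA 11.x, sm_90 needs CUDA 12.x).
major, minor = torch.cuda.get_device_capability(0)
arch = f"sm_{major}{minor}"
print(arch, "supported by this build:", arch in torch.cuda.get_arch_list())
```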

For Pyramid MoE, I think different MoE layers don't share the same global expert count, so it will be incompatible with many of those cases.

> > self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)
>
> Is leaving `num_global_experts` a buffer the only issue with the PR?
>
> We could remove it from this PR and open...

To keep the original design unchanged, and thus stay compatible with legacy checkpoint files as well, I suggest your modification at least follow this: if `bias=True`, which should be the...

Tutel MoE works just like it does in DDP mode for data loaders and models, so you can safely stack the Tutel MoE layer into your original forward-graph design...
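A minimal sketch of stacking the layer into an existing forward graph; the argument layout follows the usage pattern in Tutel's README, but the exact names and values (dimensions, expert counts, `scan_expert_func`) should be checked against the Tutel version you use:

```python
import torch
import torch.nn.functional as F
from tutel import moe as tutel_moe

class Block(torch.nn.Module):
    """A transformer-style block with a Tutel MoE FFN stacked in the forward graph."""
    def __init__(self, model_dim=1024, num_local_experts=2, hidden_size=4096):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(model_dim, num_heads=8, batch_first=True)
        # Drop-in MoE feed-forward layer.
        self.moe_ffn = tutel_moe.moe_layer(
            gate_type={'type': 'top', 'k': 2},
            model_dim=model_dim,
            experts={'type': 'ffn', 'count_per_node': num_local_experts,
                     'hidden_size_per_expert': hidden_size,
                     'activation_fn': lambda x: F.relu(x)},
            # Expert parameters are sharded across devices, so mark them to skip DDP's all-reduce.
            scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
        )

    def forward(self, x):                 # x: [batch, seq, model_dim]
        x = x + self.attn(x, x, x)[0]
        x = x + self.moe_ffn(x)           # the MoE layer is called like any other nn.Module
        return x
```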

Whatever type of parallelism you choose, it doesn't change how you use the MoE layer out of the box. Different parallelism types just change the internal parallelism the MoE layer uses,...