ghostplant

Results 272 comments of ghostplant

Yes, 2DH is faster only when serving at large scales.

Hi, you may need to rename the save_dir so that each per-device process saves to a unique destination: https://github.com/facebookresearch/fairseq/blob/da8fb630880d529ab47e53381c30ddc8ad235216/fairseq/dataclass/configs.py#L645 You can change the default save_dir path to: `f"checkpoints-dev{os.environ.get('LOCAL_RANK', 0)}"` or `f"checkpoints-dev{os.environ.get('RANK',...
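A minimal sketch of that idea, assuming the job is launched with `torchrun` or another launcher that sets `LOCAL_RANK`/`RANK`; the directory names are just examples:

```python
import os

# Each process derives its own checkpoint directory from the launcher-provided
# rank variables, so no two devices write to the same path.
local_rank = os.environ.get('LOCAL_RANK', 0)    # rank within the node
global_rank = os.environ.get('RANK', 0)         # rank across all nodes

save_dir = f"checkpoints-dev{local_rank}"       # or f"checkpoints-dev{global_rank}"
os.makedirs(save_dir, exist_ok=True)
print(f"process rank {global_rank} saves to {save_dir}")
```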

1. Does `print(torch.cuda.get_arch_list())` include `sm_86`?
2. Can you try `export USE_NVRTC=1` before running the example?
3. Are you sure there is no other old CUDA installed so that an old...
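A quick diagnostic sketch for the first two checks, assuming a CUDA-enabled PyTorch build; note that `USE_NVRTC=1` must be in the environment before the example runs:

```python
import os
import torch

# 1. The build's supported architectures must include 'sm_86' for RTX 30xx GPUs.
print(torch.version.cuda)            # CUDA toolkit this PyTorch wheel was built against
print(torch.cuda.get_arch_list())    # e.g. ['sm_37', ..., 'sm_80', 'sm_86']

# 2. Equivalent of `export USE_NVRTC=1` when set before the example is launched/imported.
os.environ['USE_NVRTC'] = '1'
```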

This is a problem with PyTorch + CUDA, not Tutel. You need a PyTorch build with at least cu117/cu118 so that `torch.cuda.get_arch_list()` includes `sm_86`. You also need to update...

CUDA 10.2.3 is too old; it cannot support any GPU newer than V100 (sm_7x). CUDA 11 should support A100-related types, and CUDA 12 should support H100...
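A small sketch, assuming a CUDA-enabled PyTorch build, to check whether the installed toolkit actually covers the local GPU's compute capability:

```python
import torch

# Compare the GPU's compute capability with the architectures this PyTorch/CUDA
# build was compiled for (e.g. sm_80/sm_86 need CUDA 11.x, sm_90 needs CUDA 12.x).
major, minor = torch.cuda.get_device_capability(0)
arch = f"sm_{major}{minor}"
print(arch, "supported by this build:", arch in torch.cuda.get_arch_list())
```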

For Pyramid MoE, I think different MoE layers don't share the same global expert count, so it will be incompatible with many of those cases.

> > self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)
>
> Is leaving `num_global_experts` a buffer the only issue with the PR?
>
> We could remove it from this PR and open...

To keep the original design unchanged, and thus stay compatible with legacy checkpoint files as well, I suggest your modification at least follow this: if `bias=True`, which should be the...

Tutel MoE works just like it does in DDP mode for data loaders and models, so you can safely stack the Tutel MoE layer into your original forward-graph design...
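A minimal sketch of stacking the layer into an existing forward graph; the argument layout follows the usage pattern in Tutel's README, but the exact names and values (dimensions, expert counts, `scan_expert_func`) should be checked against the Tutel version you use:

```python
import torch
import torch.nn.functional as F
from tutel import moe as tutel_moe

class Block(torch.nn.Module):
    """A transformer-style block with a Tutel MoE FFN stacked in the forward graph."""
    def __init__(self, model_dim=1024, num_local_experts=2, hidden_size=4096):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(model_dim, num_heads=8, batch_first=True)
        # Drop-in MoE feed-forward layer.
        self.moe_ffn = tutel_moe.moe_layer(
            gate_type={'type': 'top', 'k': 2},
            model_dim=model_dim,
            experts={'type': 'ffn', 'count_per_node': num_local_experts,
                     'hidden_size_per_expert': hidden_size,
                     'activation_fn': lambda x: F.relu(x)},
            # Expert parameters are sharded across devices, so mark them to skip DDP's all-reduce.
            scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
        )

    def forward(self, x):                 # x: [batch, seq, model_dim]
        x = x + self.attn(x, x, x)[0]
        x = x + self.moe_ffn(x)           # the MoE layer is called like any other nn.Module
        return x
```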

Whatever type of parallelism you choose, it doesn't change how you use the MoE layer out of the box. Different parallelism types just change the internal parallelism the MoE layer uses,...