Check out the tags [here](https://github.com/NVIDIA/Megatron-LM/tags)
The maximum memory may get larger because of a more imbalanced load during the computation. Can you check if `torch.cuda.memory_allocated()` also gets larger here?
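For reference, a minimal sketch of how the two numbers could be compared (`model` and `inp` are placeholders for your own module and input):

```python
import torch

def report_cuda_memory(tag):
    # memory currently held by tensors vs. the peak since the last reset
    alloc = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"[{tag}] allocated: {alloc:.1f} MiB, peak allocated: {peak:.1f} MiB")

torch.cuda.reset_peak_memory_stats()
out = model(inp)  # placeholder forward pass
report_cuda_memory("after forward")
```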
I am not able to reproduce this memory footprint increase using `FMoETransformerMLP`. What are your FastMoE and PyTorch versions? Do you use expert parallelism or only data parallelism? A minimum...
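Something like the following would give the version information (the installed package name `fastmoe` is an assumption here):

```python
import torch
from importlib.metadata import version

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("FastMoE:", version("fastmoe"))  # assumes the package is installed as "fastmoe"
```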
Are you using a gate from FastMoE or a customized gate?
This issue was found to be caused by using the default CUDA stream, which synchronizes with all other streams. Simply using another stream in smgr for NCCL solves the problem. Credits...
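To illustrate the idea only (the actual fix lives in FastMoE's stream manager, not in Python), work can be issued on a non-default stream so it no longer serializes against kernels on the default stream:

```python
import torch

side = torch.cuda.Stream()  # a non-default CUDA stream
x = torch.randn(1024, 1024, device="cuda")

with torch.cuda.stream(side):
    # kernels launched here go to `side`, avoiding the implicit
    # synchronization behavior of the default stream
    y = x @ x

side.synchronize()  # wait for side-stream work before using `y`
```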
The switch gate problem seems to be caused by using the old, problematic stream manager in the expert counting and balancing kernels. I put the torch stream into smgr and replace...
You are supposed to compile and install the CUDA module of FastMoE using `setup.py`.
Can we leave it configurable like `-DCUDA_ARCH=xx`?
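A sketch of one way this could be made configurable (the variable name `CUDA_ARCH`, the extension name, and the source paths are assumptions, not the current `setup.py`):

```python
import os
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

# Read the target architecture from an environment variable, e.g.
#   CUDA_ARCH=80 python setup.py install
# and fall back to a default otherwise.
cuda_arch = os.environ.get("CUDA_ARCH", "70")
nvcc_flags = [f"-gencode=arch=compute_{cuda_arch},code=sm_{cuda_arch}"]

setup(
    name="fastmoe",
    ext_modules=[
        CUDAExtension(
            name="fmoe_cuda",
            sources=["cuda/fmoe_cuda.cpp", "cuda/moe_kernels.cu"],  # placeholder paths
            extra_compile_args={"cxx": [], "nvcc": nvcc_flags},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

Alternatively, `CUDAExtension` already respects PyTorch's `TORCH_CUDA_ARCH_LIST` environment variable when no explicit arch flags are given, which may be enough for most users.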