Check out the tags [here](https://github.com/NVIDIA/Megatron-LM/tags)
The maximum memory may get larger because of a more imbalanced load during the computation. Can you check if `torch.cuda.memory_allocated()` also gets larger here?
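For reference, a minimal sketch of how the two numbers could be compared (`model` and `inp` are placeholders for your own module and input):

```python
import torch

def report_cuda_memory(tag):
    # memory currently held by tensors vs. the peak since the last reset
    alloc = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"[{tag}] allocated: {alloc:.1f} MiB, peak allocated: {peak:.1f} MiB")

torch.cuda.reset_peak_memory_stats()
out = model(inp)  # placeholder forward pass
report_cuda_memory("after forward")
```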
I am not able to reproduce this memory footprint increase using `FMoETransformerMLP`. What are your FastMoE and PyTorch versions? Do you use expert parallelism or only data parallelism? A minimum...
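Something like the following would give the version information (the installed package name `fastmoe` is an assumption here):

```python
import torch
from importlib.metadata import version

print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("FastMoE:", version("fastmoe"))  # assumes the package is installed as "fastmoe"
```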
Are you using a gate from FastMoE or a customized gate?
This issue was found to be caused by using the default CUDA stream, which synchronizes with all other streams. Simply using another stream in smgr for NCCL solves the problem. Credits...
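To illustrate the idea only (the actual fix lives in FastMoE's stream manager, not in Python), work can be issued on a non-default stream so it no longer serializes against kernels on the default stream:

```python
import torch

side = torch.cuda.Stream()  # a non-default CUDA stream
x = torch.randn(1024, 1024, device="cuda")

with torch.cuda.stream(side):
    # kernels launched here go to `side`, avoiding the implicit
    # synchronization behavior of the default stream
    y = x @ x

side.synchronize()  # wait for side-stream work before using `y`
```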
The switch gate problem seems to be caused by using the old, problematic stream manager in the expert counting and balancing kernels. I put the torch stream into smgr and replace...
You are supposed to compile and install the CUDA module of FastMoE using `setup.py`.
Can we leave it configurable like `-DCUDA_ARCH=xx`?
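A sketch of one way this could be made configurable (the variable name `CUDA_ARCH`, the extension name, and the source paths are assumptions, not the current `setup.py`):

```python
import os
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

# Read the target architecture from an environment variable, e.g.
#   CUDA_ARCH=80 python setup.py install
# and fall back to a default otherwise.
cuda_arch = os.environ.get("CUDA_ARCH", "70")
nvcc_flags = [f"-gencode=arch=compute_{cuda_arch},code=sm_{cuda_arch}"]

setup(
    name="fastmoe",
    ext_modules=[
        CUDAExtension(
            name="fmoe_cuda",
            sources=["cuda/fmoe_cuda.cpp", "cuda/moe_kernels.cu"],  # placeholder paths
            extra_compile_args={"cxx": [], "nvcc": nvcc_flags},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

Alternatively, `CUDAExtension` already respects PyTorch's `TORCH_CUDA_ARCH_LIST` environment variable when no explicit arch flags are given, which may be enough for most users.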