
A fast MoE impl for PyTorch

Results: 25 fastmoe issues, sorted by recently updated

I find that the moe is 0, but I don't know why. ![image](https://github.com/laekov/fastmoe/assets/51167745/97b92338-ef8f-494f-a73f-94530da91cf1)

No function named get_args() in megatron. The functions in `__init__` don't include get_args: r""" A set of modules to plugin into Megatron-LM with FastMoE """ from .utils import add_fmoe_args from...
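A minimal compatibility check, assuming the mismatch comes from where the installed Megatron-LM exposes get_args(): older releases export it from the package root, while newer ones moved it under megatron.training. The fallback path below is an assumption to verify against the Megatron version being patched.

```python
# Hedged sketch: locate get_args() across Megatron-LM versions.
try:
    from megatron import get_args            # classic layout
except ImportError:
    from megatron.training import get_args   # newer layout (assumption)

# Must be called after Megatron's initialization has populated the arguments.
args = get_args()
```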

The training process proceeds smoothly; however, an issue arises during inference as the **noise_stddev** becomes zero when **self.training** is False, leading to an error when computing the **load**. Should we...
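A sketch of the kind of guard being asked about, assuming a noisy-gate style load computation. The function and tensor names are illustrative rather than fastmoe's actual internals, and the top-k probability is simplified to a single threshold per token.

```python
import torch

def load_with_eval_guard(noisy_logits, noise_stddev, top_logits, k, training):
    # When training is False the noise_stddev is zero, so the smooth
    # "probability of staying in the top-k" estimate would divide by zero;
    # fall back to a hard per-expert count instead.
    threshold = top_logits[:, k - 1].unsqueeze(1)   # k-th largest logit per token
    if training and bool((noise_stddev > 0).all()):
        normal = torch.distributions.Normal(0.0, 1.0)
        load = normal.cdf((noisy_logits - threshold) / noise_stddev).sum(0)
    else:
        load = (noisy_logits >= threshold).float().sum(0)
    return load
```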

**Describe the bug** I adapted fmoe into Megatron following the tutorial and want to run a script to train GPT. But when I run ```pretrain_gpt.sh```, it raises an error called...
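For context, a rough sketch of the adaptation step from the tutorial. The keyword name fmoe_num_experts and the model construction call are assumptions to check against the installed fastmoe and Megatron versions.

```python
# Hedged sketch of the tutorial's adaptation step: after building the Megatron
# GPT model, replace its MLP blocks with fastmoe experts.
from fmoe.megatron import fmoefy

model = model_provider()                   # placeholder for Megatron's model construction
model = fmoefy(model, fmoe_num_experts=4)  # keyword name is an assumption
```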

How to apply the balance loss? Can you add it to the 'transformer-xl' example?
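A minimal sketch of one way to wire the balance loss in, assuming the gate (e.g. SwitchGate or GShardGate) stores its auxiliary loss and exposes it through BaseGate.get_loss() as in recent fastmoe versions; the 0.01 weight and the module scan are illustrative choices, not part of the transformer-xl example.

```python
def add_balance_loss(task_loss, model, balance_weight=0.01):
    """Hedged sketch: fold each MoE gate's balance loss into the task loss.

    Assumes the gate exposes its stored auxiliary loss via get_loss(clear=True);
    verify against the installed fastmoe version.
    """
    balance_loss = 0.0
    for module in model.modules():
        gate = getattr(module, "gate", None)
        if gate is not None and hasattr(gate, "get_loss"):
            gate_loss = gate.get_loss(clear=True)
            if gate_loss is not None:
                balance_loss = balance_loss + gate_loss
    return task_loss + balance_weight * balance_loss

# Usage inside the training step (sketch):
#   loss = add_balance_loss(criterion(model(data), target), model)
#   loss.backward()
```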

**Describe the bug** I am trying to create a minimal runnable example of the Smart Scheduling proposed in the FasterMoE paper. However, when I profile the example using Nsight Systems, it...
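For reference, a sketch of the kind of minimal script meant here, assuming Smart Scheduling is toggled with the FMOE_FASTER_SCHEDULE_ENABLE environment variable (set before fmoe is imported) as described in the FasterMoE docs; the layer sizes, the torchrun launch, and the nsys command line are illustrative.

```python
import os

# Assumption: FasterMoE-style smart scheduling is enabled via this environment
# variable, which must be set before fmoe is imported.
os.environ.setdefault("FMOE_FASTER_SCHEDULE_ENABLE", "1")

import torch
import torch.distributed as dist
from fmoe.transformer import FMoETransformerMLP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # One MoE FFN layer with experts distributed across all ranks, so the
    # all-to-all exchange that smart scheduling overlaps actually happens.
    layer = FMoETransformerMLP(num_expert=2, d_model=1024, d_hidden=4096,
                               world_size=dist.get_world_size()).cuda()

    x = torch.randn(8, 512, 1024, device="cuda")   # (batch, seq_len, d_model)
    for _ in range(10):                            # a few iterations for the timeline
        layer(x).sum().backward()
    torch.cuda.synchronize()

if __name__ == "__main__":
    # e.g.  nsys profile -t cuda,nvtx torchrun --nproc_per_node=2 smart_sched_min.py
    main()
```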

I notice the L2 norm for experts is reduced twice in the model parallel group, please see: https://github.com/laekov/fastmoe/blob/cd8372b3a8a5e73d46d2b463ec30995631cfc181/examples/megatron/clip-grad-v2.2.patch#L44C2-L44C2. It would be a good idea to add up the squared gradients of all...
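A hedged sketch of the single-reduction alternative being suggested: sum the squared gradients of the sharded expert parameters locally, all-reduce that sum exactly once over the model parallel group, then take the square root. Which parameters count as expert parameters and which group to reduce over are assumptions; the clip-grad patch linked above is the reference.

```python
import torch
import torch.distributed as dist

def expert_grad_norm(expert_params, mp_group):
    device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
    sq_sum = torch.zeros(1, device=device)
    for p in expert_params:
        if p.grad is not None:
            sq_sum += p.grad.detach().float().norm(2) ** 2
    # A single all-reduce; reducing this sum again in another (overlapping)
    # group would double-count the expert contributions, which is the concern.
    dist.all_reduce(sq_sum, op=dist.ReduceOp.SUM, group=mp_group)
    return sq_sum.sqrt()
```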

**Describe the bug** When running the transformer-XL example on enwik8, the log shows there are only 204 unique tokens (vocabulary size) in the enwik8 training set. **To Reproduce** Steps to reproduce...
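A quick sanity check for the reported number, assuming a character-level split; the data path below is a placeholder for the example's actual enwik8 training file.

```python
# Hedged sketch: count the distinct symbols the data loader would see.
from collections import Counter

with open("data/enwik8/train.txt", "rb") as f:   # placeholder path (assumption)
    counts = Counter(f.read())

print(f"{len(counts)} distinct byte values in the training split")
```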

Thank you for providing an end-to-end framework to train MoE systems. I would like to ask whether I can use this for vision tasks, in the case...