megablocks
To set up CI/CD for MegaBlocks:
- [x] testing #127
- [x] fix setup.py #128
- [ ] Format + lint #131
- [ ] Type checking
- [ ] Cut...
The CAPACITY_FACTOR is set improperly.
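For context, the capacity factor usually enters an MoE layer through the per-expert token budget. A minimal sketch of the standard formula (Switch Transformer convention; the exact rounding MegaBlocks uses may differ):

```python
def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    # Each expert accepts at most this many tokens per batch; overflow is dropped.
    return int(capacity_factor * num_tokens / num_experts)

print(expert_capacity(num_tokens=4096, num_experts=8, capacity_factor=1.25))  # 640
```

With a capacity factor below 1.0, tokens are dropped even under perfectly uniform routing, which is one way a mis-set value shows up.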
When I run /exp/dmoe/dmoe_46m_8gpu.sh, I encounter the following error during evaluation: the function **save_load_balancing_loss** is bypassed during evaluation.
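A plausible reading of this report is that the loss bookkeeping is gated on the module's training flag. A sketch of that pattern, assuming such a guard exists (this is not MegaBlocks' actual code):

```python
import torch

class Router(torch.nn.Module):
    def __init__(self, hidden: int, num_experts: int):
        super().__init__()
        self.gate = torch.nn.Linear(hidden, num_experts)
        self.lbl_stats = []  # buffered per-batch load-balancing statistics

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x).softmax(dim=-1)
        if self.training:
            # Stats are saved only in training mode; under model.eval()
            # this branch is skipped, matching the reported behavior.
            self.lbl_stats.append(scores.mean(dim=0))
        return scores
```

Any evaluation-time code that then tries to read the saved losses would find nothing to consume.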
Routing
Does the router implement the noisy top-k routing suggested by the [OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER](https://arxiv.org/pdf/1701.06538) paper? In the router code you seem to apply the...
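For reference, the noisy top-k gating that paper proposes looks roughly like the sketch below (its Eqs. 3-5; the weight names here are illustrative, not MegaBlocks'):

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, w_gate, w_noise, k):
    # x: [tokens, hidden]; w_gate, w_noise: [hidden, num_experts].
    clean = x @ w_gate
    noise_std = F.softplus(x @ w_noise)
    noisy = clean + torch.randn_like(clean) * noise_std
    # Keep the top-k noisy logits per token and push the rest to -inf
    # so they receive zero probability after the softmax.
    vals, idx = noisy.topk(k, dim=-1)
    masked = torch.full_like(noisy, float("-inf")).scatter(-1, idx, vals)
    return masked.softmax(dim=-1)

gates = noisy_top_k_gating(torch.randn(16, 64), torch.randn(64, 8), torch.randn(64, 8), k=2)
```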
When the input tensor is not on device 0, `histogram` causes an illegal memory access, which prevents `indices_and_bins` from being computed correctly on a model and inputs that aren't on...
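A minimal repro might look like the sketch below; the `ops.histogram` signature is an assumption inferred from the symbols named in the report:

```python
# Hypothetical repro for the device-placement bug described above.
import torch
from megablocks import ops

num_experts = 8
expert_idx = torch.randint(0, num_experts, (4096,)).int()

# Fine when the indices live on device 0 ...
print(ops.histogram(expert_idx.to("cuda:0"), num_experts))

# ... but reportedly an illegal memory access on any other device,
# which breaks the downstream indices_and_bins computation.
print(ops.histogram(expert_idx.to("cuda:1"), num_experts))
```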
I use MegaBlocks to implement a fine-grained MoE. The ffn_hidden_size is divisible by 64 but not by 128; can the constraint be changed to 64? Thanks a lot.
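For concreteness, a quick check of the constraint being asked about, assuming a default blocking of 128 (the specific size below is made up):

```python
# Illustrative only: 1344 = 21 * 64 is divisible by 64 but not by 128.
ffn_hidden_size = 1344
for blocking in (128, 64):
    print(f"blocking={blocking}: divisible={ffn_hidden_size % blocking == 0}")
```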
I am trying to set up and use MegaBlocks to train MoE models, but I see the following error: ``` Traceback (most recent call last): File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/pretrain_gpt.py", line 8, in from...
Hello, I have tried to use MegaBlocks on a V100 with PyTorch 2.4.0+cu121, but I get the error "cannot support bf16". If I use MegaBlocks in fp32, I get the error "group gemm must...
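Independent of MegaBlocks, bf16 availability can be probed at runtime. A sketch of the usual dtype fallback on pre-Ampere cards like the V100:

```python
import torch

# V100 is compute capability 7.0; bf16 tensor-core support arrived with
# Ampere (8.0), so is_bf16_supported() returns False on Volta.
if torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    # fp16 is the usual Volta fallback; the second error above suggests
    # the grouped-GEMM path also rejects fp32.
    dtype = torch.float16

print(torch.cuda.get_device_capability(), dtype)
```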
Thanks for your work. I checked the code and found that the MLP only has w1 and w2. Does the sparse MLP support a bias? Thanks a lot! I want to...
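For reference, a dense sketch of the two-weight expert MLP being described, with optional biases added purely to illustrate the feature request (this is not MegaBlocks' API):

```python
import torch
import torch.nn.functional as F

class ExpertMLP(torch.nn.Module):
    def __init__(self, hidden: int, ffn_hidden: int, bias: bool = False):
        super().__init__()
        self.w1 = torch.nn.Parameter(torch.randn(hidden, ffn_hidden) * 0.02)
        self.w2 = torch.nn.Parameter(torch.randn(ffn_hidden, hidden) * 0.02)
        # Optional biases: the feature being asked about.
        self.b1 = torch.nn.Parameter(torch.zeros(ffn_hidden)) if bias else None
        self.b2 = torch.nn.Parameter(torch.zeros(hidden)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.w1
        if self.b1 is not None:
            h = h + self.b1
        out = F.gelu(h) @ self.w2
        if self.b2 is not None:
            out = out + self.b2
        return out

y = ExpertMLP(hidden=512, ffn_hidden=2048, bias=True)(torch.randn(4, 512))
```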
Do you know what (if anything) stands in the way of using megablocks with torch.compile?
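A typical first experiment, assuming graph breaks at third-party CUDA kernels are acceptable: compile with `fullgraph=False` so Dynamo falls back to eager around ops it can't trace. The model below is a stand-in, not an actual MegaBlocks layer:

```python
import torch

# Stand-in model; a real MoE block would contain the custom kernels
# that are the likely source of graph breaks.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())

# fullgraph=False lets Dynamo fall back to eager execution around
# untraceable ops instead of erroring out.
compiled = torch.compile(model, fullgraph=False)
out = compiled(torch.randn(8, 512))
```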