megablocks
Hi, this is awesome work. I'm wondering if there is a minimal way to integrate megablocks into the transformers codebase for the Mixtral architecture. Would simply replacing the [`MixtralSparseMoeBlock`](https://github.com/huggingface/transformers/blob/aa4a0f8ef37eb5d42b4e3810f37e554585c90d41/src/transformers/models/mixtral/modeling_mixtral.py#L854) with...
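For what it's worth, a minimal sketch of the kind of swap being asked about might look like the following. It only shows construction: the `Arguments` fields come from `megablocks.layers.arguments`, but the mapping from the HF Mixtral config and the helper itself are assumptions, not a confirmed integration path. Note also that `MixtralSparseMoeBlock.forward` returns both hidden states and router logits, which any drop-in replacement would have to match.

```python
# A minimal sketch, assuming megablocks' dMoE layer and Arguments dataclass;
# the config-field mapping below is a guess, not a verified integration.
from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE


def build_dmoe_from_mixtral_config(config):
    # Hypothetical helper: translate a HF MixtralConfig into megablocks
    # Arguments. The field names on both sides exist; whether this mapping
    # alone is sufficient for a drop-in replacement is untested.
    args = Arguments(
        hidden_size=config.hidden_size,
        ffn_hidden_size=config.intermediate_size,
        moe_num_experts=config.num_local_experts,
        moe_top_k=config.num_experts_per_tok,
    )
    return dMoE(args)
```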
It's more a question than an issue. The tensor [w2](https://github.com/stanford-futuredata/megablocks/blob/main/megablocks/layers/mlp.py#L341C9-L341C50) of the `SparseMLP` class has the same shape as w1. Is this because of the DSD operation, i.e., does it require...
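To make the shape question concrete, here is a dense analogue of the sparse path, a sketch under the assumption that the forward pass computes roughly `sdd(x, w1.t(), topo)` followed by `dsd(h, w2)`. Because the intermediate activations come out with the ffn dimension on the columns, w2 multiplies as stored, so giving it the same (ffn, hidden) layout as w1 is exactly what the DSD product needs.

```python
# Dense analogue of the w1/w2 path -- a shapes-only sketch, not megablocks code.
# In the real layer, stk's block-sparse sdd/dsd kernels replace the two matmuls.
import torch

tokens, hidden, ffn = 4, 8, 32
w1 = torch.randn(ffn, hidden)  # stored as (ffn_hidden, hidden)
w2 = torch.randn(ffn, hidden)  # same layout as w1
x = torch.randn(tokens, hidden)

h = torch.relu(x @ w1.t())  # sdd analogue: w1 is transposed at use -> (tokens, ffn)
out = h @ w2                # dsd analogue: w2 multiplies as stored -> (tokens, hidden)
assert out.shape == (tokens, hidden)
```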
While working with the `load_checkpoint` function in the file `third_party/Megatron-LM/megatron/checkpointing.py`, I noticed that the condition on [line 585](https://github.com/stanford-futuredata/Megatron-LM/blob/3a9e3d8de308e6f6398b59d16a8bd7177374f121/megatron/checkpointing.py#L585), `if args.fp16 and optimizer is not None:`, should be modified...
Hi, the paper's results are very impressive, but I notice the comparison is against top-1 routing. Do you have results against top-2 routing? That would make the comparison more challenging...
Hi, I see that there is a script for training Mixtral, but not one for fine-tuning. Could you please provide one? The whole community is having a lot of issues...
# What does this PR do?
bump torch to
# Before submitting
- [ ] Have you read the [contributor guidelines](https://github.com/databricks/megablocks/blob/dev/CONTRIBUTING.md)?
- [ ] Is this change a documentation change...
# What does this PR do?
Add type checking.
# What issue(s) does this change relate to?
# Before submitting
- [ ] Have you read the [contributor guidelines](https://github.com/databricks/megablocks/blob/dev/CONTRIBUTING.md)?
- ...
```
iteration 1000/ 20000 | consumed samples: 512000 | elapsed time per iteration (ms): 336.3 | learning rate: 1.495E-04 | global batch size: 512 | load balancing loss: 9.743530E-02 | lm...
```