
Results: 40 megablocks issues

Hi, this is awesome work. I'm wondering if there is a minimal way to integrate megablocks into the transformers codebase for the Mixtral architecture. Would simply replacing the [`MixtralSparseMoeBlock`](https://github.com/huggingface/transformers/blob/aa4a0f8ef37eb5d42b4e3810f37e554585c90d41/src/transformers/models/mixtral/modeling_mixtral.py#L854) with...

question
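The question above is about swapping out a module in an existing model. The general pattern is to walk the decoder layers and replace each layer's MoE block in place. A minimal sketch of that pattern follows, using plain-Python stand-ins instead of `torch`/`transformers` (the class names `DummyMoeBlock`, `DummyMegaBlocksMoE`, and the `make_block` factory are illustrative assumptions, not the actual megablocks API):

```python
# Hedged sketch: the in-place module-swap pattern the issue asks about.
# With the real libraries, `layer.block_sparse_moe` would be replaced by a
# configured megablocks MoE layer; here simple classes stand in for modules.

class DummyMoeBlock:          # stands in for MixtralSparseMoeBlock
    name = "MixtralSparseMoeBlock"

class DummyMegaBlocksMoE:     # stands in for a megablocks MoE layer
    name = "MegaBlocksMoE"

class DummyLayer:
    def __init__(self):
        self.block_sparse_moe = DummyMoeBlock()

class DummyModel:
    def __init__(self, num_layers=2):
        self.layers = [DummyLayer() for _ in range(num_layers)]

def swap_moe_blocks(model, make_block):
    """Replace each decoder layer's MoE block in place."""
    for layer in model.layers:
        layer.block_sparse_moe = make_block()
    return model

model = swap_moe_blocks(DummyModel(), DummyMegaBlocksMoE)
print(all(l.block_sparse_moe.name == "MegaBlocksMoE" for l in model.layers))
```

Whether such a drop-in swap is numerically equivalent depends on the routing and weight layouts matching, which is exactly what the issue is asking.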

This is more of a question than an issue. The tensor [w2](https://github.com/stanford-futuredata/megablocks/blob/main/megablocks/layers/mlp.py#L341C9-L341C50) of the `SparseMLP` class has the same shape as w1; is that because of the DSD operation? As in, it requires...

question

While working with the `load_checkpoint` function in the file `third_party/Megatron-LM/megatron/checkpointing.py`, I noticed that the condition on [line 585](https://github.com/stanford-futuredata/Megatron-LM/blob/3a9e3d8de308e6f6398b59d16a8bd7177374f121/megatron/checkpointing.py#L585), `if args.fp16 and optimizer is not None:`, should be modified...

bug
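The proposed fix in the bug report above is cut off. One plausible reading (an assumption on my part, not the reporter's actual patch) is that a mixed-precision guard keyed only on `args.fp16` also needs to cover bf16 runs. The contrast can be illustrated with a plain namespace standing in for Megatron's parsed args:

```python
# Hedged sketch: contrasting an fp16-only guard with one that also covers
# bf16. The "fixed" variant is an assumed reading of the truncated report,
# not the confirmed patch.

from types import SimpleNamespace

def should_load_optimizer_state(args, optimizer):
    # guard as quoted from checkpointing.py line 585
    return args.fp16 and optimizer is not None

def should_load_optimizer_state_fixed(args, optimizer):
    # assumed fix: treat bf16 the same way as fp16
    return (args.fp16 or args.bf16) and optimizer is not None

args = SimpleNamespace(fp16=False, bf16=True)
opt = object()
print(should_load_optimizer_state(args, opt),
      should_load_optimizer_state_fixed(args, opt))
```

Under this reading, a bf16 run would silently skip the branch with the original guard, which would match the report being filed as a bug.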

Hi, the paper's results are very impressive, but I notice the comparison is against top-1 routing. Do you have results against top-2 routing? That would make the comparison more challenging...

question
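For readers unfamiliar with the distinction in the question above: top-1 routing sends each token to its single highest-scoring expert, while top-2 sends it to its two best experts with renormalized weights. A pure-Python sketch on one token's logits (toy values; real routers operate on batched tensors):

```python
# Hedged sketch of top-k expert routing: softmax over expert logits, keep
# the k largest, renormalize their probabilities to sum to 1.

import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def top_k_route(logits, k):
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

logits = [0.1, 2.0, -1.0, 1.5]    # one token's scores over 4 experts
print(top_k_route(logits, 1))     # top-1: expert 1 with weight 1.0
print(top_k_route(logits, 2))     # top-2: experts 1 and 3, weights sum to 1
```

Top-2 doubles the per-token compute relative to top-1, which is why the question calls it the more challenging baseline.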

Hi, I see that there is a script for training Mixtral, but not one for fine-tuning. Could you please provide one? The whole community is having a lot of issues...

question

# What does this PR do?

bump torch to

# Before submitting

- [ ] Have you read the [contributor guidelines](https://github.com/databricks/megablocks/blob/dev/CONTRIBUTING.md)?
- [ ] Is this change a documentation change...

# What does this PR do?

Add type checking.

# What issue(s) does this change relate to?

# Before submitting

- [ ] Have you read the [contributor guidelines](https://github.com/databricks/megablocks/blob/dev/CONTRIBUTING.md)?
- ...

iteration 1000/ 20000 | consumed samples: 512000 | elapsed time per iteration (ms): 336.3 | learning rate: 1.495E-04 | global batch size: 512 | load balancing loss: 9.743530E-02 | lm...
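Log lines like the one above are pipe-separated `key: value` fields, which makes them easy to scrape programmatically, e.g. to plot the load balancing loss over training. A small stdlib-only sketch (the helper name is mine, not part of Megatron-LM):

```python
# Hedged sketch: parsing a Megatron-style training log line into a dict.

import re

line = ("iteration 1000/ 20000 | consumed samples: 512000 | "
        "elapsed time per iteration (ms): 336.3 | learning rate: 1.495E-04 | "
        "global batch size: 512 | load balancing loss: 9.743530E-02")

def parse_megatron_line(line):
    fields = {}
    m = re.match(r"iteration\s+(\d+)/\s*(\d+)", line)
    if m:
        fields["iteration"] = int(m.group(1))
        fields["total_iterations"] = int(m.group(2))
    for part in line.split("|")[1:]:
        if ":" in part:
            key, val = part.split(":", 1)
            fields[key.strip()] = float(val)
    return fields

stats = parse_megatron_line(line)
print(stats["iteration"], stats["load balancing loss"])
```

Note that all `key: value` fields are parsed as floats for simplicity; counters like `global batch size` come back as `512.0` rather than `512`.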