megablocks
Hello, I am using EleutherAI's gpt-neox implementation with megablocks, but I get two errors related to `_LOAD_BALANCING_LOSS`. 1. `tokens_per_expert` gives me this error at this [line](https://github.com/databricks/megablocks/blob/f1a83bd55413b02b472696b719646cf22732d070/megablocks/layers/moe.py#L34): `ValueError:...
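For context, here is a minimal, untested sketch of how the load-balancing-loss helpers in `megablocks/layers/moe.py` are typically driven from a training loop. The `model`, `batch`, and `lm_loss_fn` names are hypothetical, and the `Arguments` fields shown are assumptions; the actual call pattern inside gpt-neox may differ.

```python
from megablocks.layers import moe
from megablocks.layers.arguments import Arguments

# Assumed setup: one Arguments object shared by every MoE layer in the model.
args = Arguments(
    hidden_size=1024,
    ffn_hidden_size=4096,
    moe_num_experts=8,
    moe_top_k=2,
    num_layers=12,
)

def training_step(model, batch, lm_loss_fn):
    # During the forward pass, each MoE layer stashes its
    # (tokens_per_expert, expert_scores) pair in the module-level
    # _LOAD_BALANCING_LOSS list via moe.save_load_balancing_loss(...).
    logits = model(batch)
    lm_loss = lm_loss_fn(logits, batch)

    # Sums the auxiliary loss over all saved layer entries. The ValueError
    # referenced in the issue is raised here when the number of saved
    # entries does not match the expected number of layers, e.g. if the
    # list was not cleared between steps or pipeline settings disagree.
    aux_loss = moe.batched_load_balancing_loss(args)

    loss = lm_loss + aux_loss
    loss.backward()

    # Clear the saved entries once per step so they do not accumulate.
    moe.clear_load_balancing_loss()
    return loss
```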
I'm training models with the specs below but seeing a major throughput drop when switching to GLU. Do you know why, or have ideas about what I could investigate? Thanks a lot!...
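For reference, a hedged sketch of the kind of configuration being compared, assuming GLU experts are selected through an `mlp_type` field on `megablocks.layers.arguments.Arguments` (the field name and values are assumptions; check `megablocks/layers/arguments.py` in your version).

```python
from megablocks.layers.arguments import Arguments

# Baseline: standard two-matrix MLP experts.
mlp_args = Arguments(
    hidden_size=2048,
    ffn_hidden_size=8192,
    moe_num_experts=8,
    moe_top_k=2,
    mlp_type="mlp",  # assumed default
)

# GLU experts: each expert uses a gate projection in addition to the
# up/down projections, so at the same ffn_hidden_size it runs roughly
# 1.5x the FLOPs and an extra grouped GEMM per layer, which is one
# plausible source of a throughput gap.
glu_args = Arguments(
    hidden_size=2048,
    ffn_hidden_size=8192,
    moe_num_experts=8,
    moe_top_k=2,
    mlp_type="glu",  # assumed value
)
```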
I'm finding that training a 1-expert dMoE (brown) has worse training loss than an otherwise equivalent dense model (green). Is there some reason why this difference is expected or can...
When I run `pip install megablocks` I get this: ``` clang: error: unsupported option '--ptxas-options=-v' clang: error: unsupported option '--generate-code=arch=compute_90,code=sm_90' ```
I am getting the error below on the first step of multinode training with dMoE. Meanwhile, multinode MoE training and single-node dMoE training work fine. Any ideas what the problem...
@tgale96 The JetMoE technical report describes how they used Megablocks with Megatron to train the model. The author then shared [this](https://github.com/yikangshen/megablocks) fork of megablocks used during the training....
I would like to request ScatterMoE support in Megablocks https://arxiv.org/abs/2403.08245 https://github.com/shawntan/scattermoe
I followed these steps: run `docker build . -t megablocks-dev`, then `bash docker.sh` to launch the container. When I run `moe_46m_8gpu.sh` to test, it reported the...
Given that MegaBlocks is highly optimized for sparse MoE models like Mixtral, I am requesting support for a variant recently termed MoDE by Google DeepMind. Benefits include much faster...
Is it possible to import the dMoE model itself into another training script without training via Megatron?
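A minimal, untested sketch of what such a standalone use might look like, assuming the `dMoE` layer and `Arguments` class exported from `megablocks/layers`; the argument field names are assumptions and may differ between versions.

```python
import torch
from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE

# Assumed argument names; check megablocks/layers/arguments.py for the
# fields available in your installed version.
args = Arguments(
    hidden_size=1024,
    ffn_hidden_size=4096,
    moe_num_experts=8,
    moe_top_k=2,
)

# The dropless kernels are assumed to require a GPU and half precision.
layer = dMoE(args).cuda().half()

# Use it like any other torch.nn.Module inside a custom training loop.
x = torch.randn(8, 512, 1024, device="cuda", dtype=torch.half)
out = layer(x)
# Depending on the version, the forward pass may return an (output, bias)
# tuple following the Megatron MLP convention.
```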