megablocks
Hi, I saw you mentioned that you used your fork of Megatron-LM for training - could you please provide scripts and hyperparams used for the SFT of DBRX? It would...
Hi there, great work with dMoE! I'm trying to test dMoE with regular DDP + PyTorch AMP (BF16) and I get the following error:

```bash
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/miniconda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line...
```
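A likely cause, sketched below under the assumption that the failure is in `GradScaler.unscale_`: the scaler exists to rescale fp16 gradients, and with BF16 autocast no scaler is needed at all, since BF16 has roughly the dynamic range of fp32. A minimal BF16 training step without a scaler (shown on CPU for portability; on GPU you would use `device_type="cuda"`):

```python
import torch

# Minimal sketch: BF16 autocast without GradScaler. GradScaler assumes fp16
# gradients that need unscaling; with bf16 the usual pattern is autocast alone.
model = torch.nn.Linear(8, 8)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(4, 8)

opt.zero_grad()
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()  # no scaler.scale(loss) / scaler.unscale_() step
opt.step()
```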
The base Megatron-LM repo provides unsharding scripts for its models, which can be used after training. I didn't find any such scripts in the repo. Would it be possible to...
The loss function is always `moe_loss_func`, as can be seen [here](https://github.com/stanford-futuredata/Megatron-LM/blob/3a9e3d8de308e6f6398b59d16a8bd7177374f121/pretrain_gpt.py#L128). But the loss is only calculated during training, as can be seen [here](https://github.com/stanford-futuredata/megablocks/blob/f05609ce69c1e1a7dd008c49cf435ef74df84b69/megablocks/layers/moe.py#L427-L428). We should fall back to the original...
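The mismatch described above can be reproduced with a toy module; the class and attribute names here are illustrative, not megablocks' actual API. The point is that a layer which records its auxiliary loss only under `self.training` leaves nothing for an eval-time loss function to read:

```python
import torch

# Hedged sketch of the behavior the issue describes: an MoE-like layer that
# only accumulates its load-balancing loss while in training mode.
class ToyMoE(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self.saved_lbl = None  # stand-in for the saved load-balancing loss

    def forward(self, x):
        out = self.linear(x)
        if self.training:  # mirrors the training-only guard linked above
            self.saved_lbl = out.abs().mean()  # stand-in for the real lb loss
        return out

moe = ToyMoE()
moe.eval()
moe(torch.randn(2, 4))
assert moe.saved_lbl is None  # eval: no auxiliary loss was recorded
```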
Thanks for the excellent work. Following the comment in #59, I am trying to train `dmoe_760m` using 16 GPUs (2 nodes) by changing distributed arguments to set up for two...
To my understanding -- and please correct me if I am wrong about this -- there is no mechanism to selectively compute routing logits in fp32, as is suggested in...
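For concreteness, the technique being asked about can be sketched as follows; `FP32Router` is a hypothetical class, not megablocks' router implementation. The idea is to keep the router weights in fp32 and disable autocast around the logits/softmax so the expert scores are computed in full precision even when the surrounding network runs in BF16:

```python
import torch

# Hedged sketch: compute routing logits in fp32 inside a bf16 autocast region.
class FP32Router(torch.nn.Module):
    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        # router weights stay in fp32
        self.layer = torch.nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # disable autocast and upcast activations so the softmax over
        # expert scores runs in full precision
        with torch.autocast(device_type="cpu", enabled=False):
            logits = self.layer(x.float())
            return torch.softmax(logits, dim=-1)

router = FP32Router(8, 4)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    probs = router(torch.randn(4, 8))
```

(`device_type="cpu"` keeps the sketch runnable anywhere; on GPU it would be `"cuda"`.)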
I loaded the same model trained with Megatron + megablocks, and found that the load_balancing_loss is slightly different. When I increase pipeline_parallel_size, the load_balancing_loss also increases. Is it...
When I run `pip install megablocks`, I seem to be getting this error: `RuntimeError: ('Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and...
What the title says. In `layers/dmoe.py`:

```python
class ParallelDroplessMLP(moe.ParallelMLP):
    def __init__(self, args: Arguments):
        super(ParallelDroplessMLP, self).__init__(args)
        #
```