st-moe-pytorch
Implementation of ST-MoE, the latest incarnation of mixture-of-experts after years of research at Brain, in PyTorch
Hi, thanks a lot for your great work! I tried using this code in my project, and I found that the input that goes to the MoE module (`x` in...
Do you know how this giant all-reduce works for giant architectures across hundreds of workers? I'm specifically interested in this bit of code: ``` if is_distributed: ... # gather and...
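As a rough mental model of what that `is_distributed` branch does, here is a single-process analogue: each worker holds a shard of the tokens, and the gather collective hands every worker the concatenation of all shards before expert dispatch. The function name `simulate_all_gather` is hypothetical; the real code uses `torch.distributed` collectives across processes, which this sketch only imitates with plain tensor ops.

```python
import torch

def simulate_all_gather(per_worker_tokens):
    # Single-process stand-in for dist.all_gather: every simulated
    # worker ends up with the concatenation of all workers' token
    # shards. In the real distributed code, each shard lives in a
    # separate process and the collective moves it over the network.
    gathered = torch.cat(per_worker_tokens, dim=0)
    return [gathered.clone() for _ in per_worker_tokens]

# two simulated workers with 3 and 5 tokens of dim 8 each;
# after the gather, both see all 8 tokens
workers = [torch.randn(3, 8), torch.randn(5, 8)]
views = simulate_all_gather(workers)
```

With hundreds of workers, the cost of this gather scales with the total token count, which is presumably why the question asks how it stays tractable at that scale.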
IIUC, the [topk](https://github.com/lucidrains/CoLT5-attention/blob/main/colt5_attention/topk.py) in colt5_attention uses [coor_descent](https://github.com/lucidrains/CoLT5-attention/blob/main/colt5_attention/coor_descent.py#L17), and, according to Eqs. 8-11 of the original [paper](https://openreview.net/pdf?id=IyYyKov0Aj), it seems to expect the input to be unnormalized. However, in the forward...
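For context, a minimal sketch of the Eqs. 8-11 style coordinate-descent iteration (not the library's exact `coor_descent`; the function name `soft_topk` and the hyperparameters here are assumptions). At the fixed point the relaxed scores sum to `k`, which is why the scores `s` are expected to be raw, unnormalized logits; feeding already-softmaxed values would compress their scale and distort the relaxation.

```python
import math
import torch

def soft_topk(s, k, eps=0.1, n_iters=100):
    # Alternating coordinate-descent updates on the two dual
    # variables (a, b) of a relaxed top-k selection problem.
    # s: unnormalized scores, shape (..., n)
    b = torch.zeros_like(s)
    for _ in range(n_iters):
        # a enforces that the relaxed scores sum to k
        a = eps * math.log(k) - eps * torch.logsumexp((s + b) / eps, dim=-1, keepdim=True)
        # b caps each relaxed score at 1
        b = -torch.relu(s + a)
    # one final a-update so the output sums to k exactly
    a = eps * math.log(k) - eps * torch.logsumexp((s + b) / eps, dim=-1, keepdim=True)
    return torch.exp((s + a + b) / eps)
```

As `eps` goes to zero, the output approaches a hard 0/1 indicator of the top-k entries.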
First of all, thank you for your project; it looks great! I have been trying to apply it to ViT, just like V-MoE. During training, I observed some...
Hi, I noticed there is an experiment with `top_n=1` in the `st-moe` paper. But in `st_moe_pytorch.py` there is `assert top_n >= 2, 'must be 2 or more experts'`. Can `top_n=1` work in...
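For reference, here is a generic top-n gating sketch showing what changes at `top_n=1` (the function `topn_gate` is hypothetical, not the repo's router). With `top_n=1` the renormalized gate for every token is exactly 1, i.e. Switch-Transformer-style routing; with `top_n=2` two experts share each token's weight. The assertion in `st_moe_pytorch.py` presumably exists because parts of the implementation assume at least a second expert per token.

```python
import torch

def topn_gate(logits, top_n):
    # softmax over the expert dimension, keep the top_n gates per
    # token, then renormalize the kept gates so they sum to 1
    probs = logits.softmax(dim=-1)
    gates, index = probs.topk(top_n, dim=-1)
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return gates, index

# 4 tokens routed over 8 experts
logits = torch.randn(4, 8)
g1, _ = topn_gate(logits, top_n=1)  # every gate is 1: hard top-1 routing
g2, _ = topn_gate(logits, top_n=2)  # two gates per token, summing to 1
```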