st-moe-pytorch
Implementation of ST-MoE, the latest incarnation of mixture-of-experts after years of research at Brain, in PyTorch
Hi, thanks a lot for your great work! I tried using this code in my project, and I found that the input that goes to the MoE module (`x` in...
Do you know how this giant all-reduce works for giant architectures across hundreds of workers? I'm specifically interested in this bit of code: ``` if is_distributed: ... # gather and...
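As a rough mental model of what that `is_distributed` branch does, here is a single-process analogue: each worker holds a shard of the tokens, and the gather collective hands every worker the concatenation of all shards before expert dispatch. The function name `simulate_all_gather` is hypothetical; the real code uses `torch.distributed` collectives across processes, which this sketch only imitates with plain tensor ops.

```python
import torch

def simulate_all_gather(per_worker_tokens):
    # Single-process stand-in for dist.all_gather: every simulated
    # worker ends up with the concatenation of all workers' token
    # shards. In the real distributed code, each shard lives in a
    # separate process and the collective moves it over the network.
    gathered = torch.cat(per_worker_tokens, dim=0)
    return [gathered.clone() for _ in per_worker_tokens]

# two simulated workers with 3 and 5 tokens of dim 8 each;
# after the gather, both see all 8 tokens
workers = [torch.randn(3, 8), torch.randn(5, 8)]
views = simulate_all_gather(workers)
```

With hundreds of workers, the cost of this gather scales with the total token count, which is presumably why the question asks how it stays tractable at that scale.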
IIUC, the [topk](https://github.com/lucidrains/CoLT5-attention/blob/main/colt5_attention/topk.py) in colt5_attention uses [coor_descent](https://github.com/lucidrains/CoLT5-attention/blob/main/colt5_attention/coor_descent.py#L17), and, according to Eqs. 8-11 of the original [paper](https://openreview.net/pdf?id=IyYyKov0Aj), it seems to expect the input to be unnormalized. However, in the forward...
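For context, a minimal sketch of the Eqs. 8-11 style coordinate-descent iteration (not the library's exact `coor_descent`; the function name `soft_topk` and the hyperparameters here are assumptions). At the fixed point the relaxed scores sum to `k`, which is why the scores `s` are expected to be raw, unnormalized logits; feeding already-softmaxed values would compress their scale and distort the relaxation.

```python
import math
import torch

def soft_topk(s, k, eps=0.1, n_iters=100):
    # Alternating coordinate-descent updates on the two dual
    # variables (a, b) of a relaxed top-k selection problem.
    # s: unnormalized scores, shape (..., n)
    b = torch.zeros_like(s)
    for _ in range(n_iters):
        # a enforces that the relaxed scores sum to k
        a = eps * math.log(k) - eps * torch.logsumexp((s + b) / eps, dim=-1, keepdim=True)
        # b caps each relaxed score at 1
        b = -torch.relu(s + a)
    # one final a-update so the output sums to k exactly
    a = eps * math.log(k) - eps * torch.logsumexp((s + b) / eps, dim=-1, keepdim=True)
    return torch.exp((s + a + b) / eps)
```

As `eps` goes to zero, the output approaches a hard 0/1 indicator of the top-k entries.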
First of all, thank you for your project; it looks great! I have been trying to apply it to ViT, just like V-MoE. During training, I observed some...
Hi, I noticed there is an experiment with `top_n=1` in the `st-moe` paper. But in `st_moe_pytorch.py` there is `assert top_n >= 2, 'must be 2 or more experts'`. Can `top_n=1` work in...
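For reference, here is a generic top-n gating sketch showing what changes at `top_n=1` (the function `topn_gate` is hypothetical, not the repo's router). With `top_n=1` the renormalized gate for every token is exactly 1, i.e. Switch-Transformer-style routing; with `top_n=2` two experts share each token's weight. The assertion in `st_moe_pytorch.py` presumably exists because parts of the implementation assume at least a second expert per token.

```python
import torch

def topn_gate(logits, top_n):
    # softmax over the expert dimension, keep the top_n gates per
    # token, then renormalize the kept gates so they sum to 1
    probs = logits.softmax(dim=-1)
    gates, index = probs.topk(top_n, dim=-1)
    gates = gates / gates.sum(dim=-1, keepdim=True)
    return gates, index

# 4 tokens routed over 8 experts
logits = torch.randn(4, 8)
g1, _ = topn_gate(logits, top_n=1)  # every gate is 1: hard top-1 routing
g2, _ = topn_gate(logits, top_n=2)  # two gates per token, summing to 1
```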