st-moe-pytorch differentiable top k

Open wangzizhao opened this issue 4 months ago • 0 comments

IIUC, the topk in colt5_attention is computed with coor_descent, and according to Eqs. 8-11 of the original paper, it seems to expect unnormalized scores as input.

However, in the forward of TopNGating, it seems that normalized scores are passed into the topk.

I wonder whether I misunderstood something, and whether I should pass normalized or unnormalized scores here.
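To make the question concrete, here is a minimal pure-Python sketch of a coordinate-descent soft top-k, fed once with raw logits and once with their softmax. It is not the coor_descent from colt5_attention: the function name soft_topk, the simplified update rule, eps=0.1, n_iters=50, and the toy logits are all assumptions chosen for illustration.

```python
import math


def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) over a list of floats.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def softmax(xs):
    lse = logsumexp(xs)
    return [math.exp(x - lse) for x in xs]


def soft_topk(s, k, eps=0.1, n_iters=50):
    # Hypothetical simplified coordinate descent (an assumption, not the
    # library's code): alternate between a scalar threshold `a` and
    # per-item slack terms `b`, then read out scores in (0, 1].
    a = 0.0
    b = [-x for x in s]
    for _ in range(n_iters):
        a = eps * (math.log(k) - logsumexp([(x + y) / eps for x, y in zip(s, b)]))
        b = [-max(x + a, 0.0) for x in s]
    return [math.exp((x + a + y) / eps) for x, y in zip(s, b)]


logits = [2.0, 1.0, 0.1, -1.0]  # toy unnormalized router scores (assumed)

raw_out = soft_topk(logits, k=2)            # unnormalized input
norm_out = soft_topk(softmax(logits), k=2)  # softmax-normalized input

print("raw logits ->", [round(v, 3) for v in raw_out])
print("softmax    ->", [round(v, 3) for v in norm_out])
```

With these toy values, the raw-logit call ends up close to a hard top-2 mask, while the softmax-normalized call gives a noticeably softer selection, because the normalized scores span a much smaller range relative to eps. This only shows that the input scale matters under the sketch's fixed eps; it does not settle which input TopNGating intends.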

Feb 20 '24 03:02 wangzizhao