lingvo Any advantage of GShard over Mesh Tensorflow for MoE?

Open AranKomat opened this issue 4 years ago • 0 comments

Thank you very much for open-sourcing GShard!

I'm currently using the MoE implementation from Mesh TensorFlow (MTF). IIUC, the MoE design used in MTF is equivalent to that of GShard.

According to the GShard paper, GShard has the advantage that one can incorporate MoE into an existing TF codebase without rewriting the model, which MTF requires:

"Mesh TensorFlow [23] helps the user to build large models with SPMD-style per-operator partitioning, by rewriting the computation in a Python library on top of TensorFlow; in comparison, our approach partitions the graph in the compiler based on light-weight annotations without requiring the user to rewrite the model."
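To make sure I'm reading that passage correctly, here is a toy NumPy sketch of the distinction as I understand it. It is not GShard or MTF code: `moe_ffn` and `run_sharded` are hypothetical names, and `run_sharded` only simulates what a partitioner might do by splitting along the expert axis. The point is that the model function stays unchanged and the partitioning decision lives outside it.

```python
import numpy as np


def moe_ffn(dispatched, w_in, w_out):
    """Per-expert feed-forward layer, written with no knowledge of sharding.

    dispatched: [E, C, M] tokens already routed to E experts (capacity C).
    w_in:       [E, M, H] first projection, one matrix per expert.
    w_out:      [E, H, M] second projection, one matrix per expert.
    """
    hidden = np.maximum(np.einsum("ecm,emh->ech", dispatched, w_in), 0.0)
    return np.einsum("ech,ehm->ecm", hidden, w_out)


def run_sharded(fn, args, num_shards):
    """Hypothetical stand-in for a partitioner (an assumption for illustration).

    Splits every argument along axis 0 (the expert axis), runs fn on each
    shard independently, and concatenates the per-shard outputs.
    """
    shards = [np.array_split(a, num_shards, axis=0) for a in args]
    outs = [fn(*(s[i] for s in shards)) for i in range(num_shards)]
    return np.concatenate(outs, axis=0)


rng = np.random.default_rng(0)
E, C, M, H = 8, 4, 16, 32  # experts, capacity, model dim, hidden dim
x = rng.normal(size=(E, C, M))
w_in = rng.normal(size=(E, M, H))
w_out = rng.normal(size=(E, H, M))

full = moe_ffn(x, w_in, w_out)                                   # unpartitioned
sharded = run_sharded(moe_ffn, (x, w_in, w_out), num_shards=4)   # 2 experts per shard
matches = bool(np.allclose(full, sharded))
```

Because each expert's computation is independent, splitting along the expert axis needs no communication inside the feed-forward layer itself; the all-to-all dispatch and combine steps around it are left out of this sketch.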

Is there any other advantage of GShard over MTF such as reducing the overhead of MTF (if any) for a larger number of experts?

If there is any substantial change in the design (e.g. hyperparameters) from MTF, please let me know :)

Nov 12 '20 16:11 AranKomat
