smoe
Why block the gradients of the smoe gate network?
https://github.com/spcl/smoe/blob/249ef673d1929a23e5fe7c2628e1299b8c1c2e42/smoe/models/smoe_routing.py#L116
Why is "smoe_config.block_gate_grad" set to "True", which makes the backward pass return "grad_routing_weights=None" and cuts the gradients flowing into the gating network? If those gradients are blocked, how are the routing parameters in "SpatialLatentTensorGate2d" optimized?
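
To make the question concrete, here is a toy scalar sketch of what I understand the flag to do. It is not the smoe code: `forward`, `backward`, `gate_w`, and `expert_w` are hypothetical names, and the gate is assumed to be a plain sigmoid. It only shows that when the backward pass returns `None` for the routing weights, the gate parameter receives no gradient through this path.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(gate_w, expert_w, x):
    """Toy layer: y = r * (expert_w * x), with r = sigmoid(gate_w * x)."""
    r = sigmoid(gate_w * x)      # routing weight produced by the (assumed) gate
    expert_out = expert_w * x    # output of a single toy expert
    return r * expert_out, (r, expert_out)


def backward(grad_y, cache, x, block_gate_grad):
    """Hand-written backward pass for the toy layer above."""
    r, expert_out = cache
    grad_expert_w = grad_y * r * x  # the expert always receives its gradient
    # Mirrors the behaviour asked about: no gradient for the routing weights.
    grad_r = None if block_gate_grad else grad_y * expert_out
    # Chain rule through the sigmoid, only if a gradient reached r.
    grad_gate_w = None if grad_r is None else grad_r * r * (1.0 - r) * x
    return grad_expert_w, grad_r, grad_gate_w


x, gate_w, expert_w = 2.0, 0.5, 3.0
y, cache = forward(gate_w, expert_w, x)
blocked = backward(1.0, cache, x, block_gate_grad=True)
unblocked = backward(1.0, cache, x, block_gate_grad=False)
print("blocked:  ", blocked)
print("unblocked:", unblocked)
```

With the flag on, the last two entries are `None`, so in this toy setup `gate_w` could only be updated by some other loss term or path, which is the part I cannot find in the repository.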