Martin Damgaard Nielsen

Results 3 issues of Martin Damgaard Nielsen

This dramatically reduces memory requirements, since an extra copy of the concatenated weight tensor is no longer kept for each timestep during backprop.
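The description does not say which model or framework is involved, so what follows is only a rough sketch under stated assumptions: a simple recurrent cell with hypothetical weights `w_x` and `w_h`, written in plain Python. It contrasts concatenating the weights inside the timestep loop with concatenating them once up front. In an autodiff framework the per-step concatenation would typically be saved for the backward pass, which is presumably where the extra copies came from; plain Python only shows that the two forms compute the same result.

```python
import math

def matvec(v, w):
    """Row vector v (length n) times matrix w (n rows, m columns)."""
    return [sum(v[i] * w[i][j] for i in range(len(v))) for j in range(len(w[0]))]

def run_concat_per_step(xs, h0, w_x, w_h):
    """Hypothetical 'before': the weight concat is rebuilt on every timestep."""
    h = h0
    for x in xs:
        w = w_x + w_h  # new concatenated weight matrix each step
        h = [math.tanh(a) for a in matvec(x + h, w)]
    return h

def run_concat_once(xs, h0, w_x, w_h):
    """Hypothetical 'after': one concatenated weight matrix shared by all steps."""
    w = w_x + w_h  # built once, outside the loop
    h = h0
    for x in xs:
        h = [math.tanh(a) for a in matvec(x + h, w)]
    return h

# Toy data (made up for illustration): 3 timesteps, input size 2, hidden size 3.
xs = [[0.5, -1.0], [0.25, 0.75], [1.0, 0.0]]
h0 = [0.0, 0.0, 0.0]
w_x = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
w_h = [[0.2, 0.0, -0.1], [-0.3, 0.1, 0.2], [0.05, -0.05, 0.15]]

before = run_concat_per_step(xs, h0, w_x, w_h)
after = run_concat_once(xs, h0, w_x, w_h)
```

Both functions perform the same arithmetic in the same order, so `before` and `after` are identical; only the number of concatenated weight objects created differs.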

https://github.com/tensorflow/mesh/blob/6b31c0fc9daf185aae2422976487f8db08fc7369/mesh_tensorflow/transformer/moe.py#L1694 This should not cause any issues, I guess. Is it just unnecessary computation?