Results: 3 issues of Martin Damgaard Nielsen
This dramatically reduces memory requirements, because an extra copy of the concatenated weight tensor is no longer kept for each timestep during backprop.
https://github.com/tensorflow/mesh/blob/6b31c0fc9daf185aae2422976487f8db08fc7369/mesh_tensorflow/transformer/moe.py#L1694 It should not cause any issues, I guess. Is it just unnecessary computation?