Mesh TensorFlow: Model Parallelism Made Easier
Add missing condition check of `context.train` in `attention()`, which can be `None`.
https://github.com/tensorflow/mesh/blob/fbf7b1e547e8b8cb134e81e1cd350c312c0b5a16/mesh_tensorflow/transformer/moe.py#L935 I tried the load-balancing loss in my project and found that it does not help the loss converge. Does it only balance the load across experts, without helping loss convergence,...
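For context on the question above, a minimal NumPy sketch of the standard mixture-of-experts auxiliary load-balancing loss (the Shazeer et al. / Switch Transformer formulation; not necessarily the exact code at the linked line). It rewards uniform routing rather than directly lowering the task loss, which is why it may balance load without improving convergence. All names here are illustrative:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index, num_experts):
    """Illustrative auxiliary load-balancing loss for MoE routing.

    router_probs: [num_tokens, num_experts] softmax outputs of the router.
    expert_index: [num_tokens] hard (argmax) expert assignment per token.
    """
    # Fraction of tokens dispatched to each expert.
    density = np.bincount(expert_index, minlength=num_experts) / len(expert_index)
    # Mean router probability mass assigned to each expert.
    density_proxy = router_probs.mean(axis=0)
    # Scaled dot product; equals 1.0 when routing is perfectly uniform,
    # and grows as routing concentrates on few experts.
    return num_experts * float(np.sum(density * density_proxy))
```

With 4 experts, uniform router probabilities, and tokens spread evenly, the loss is exactly 1.0, its minimum; the gradient only pushes routing toward uniformity, not toward lower task loss.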
What is the future of this project? Should we invest time in it, or is it considered internally archived in favor of auto XLA and other frontends like JAX? Last release...
Move `convert_to_tensor`, `convert_to_tensor_v1`, `convert_to_tensor_v1_with_dispatch`, `convert_to_tensor_v2_with_dispatch`, and `convert_to_tensor_v2` into `tensor_conversion_registry`. Also renaming `tensor_conversion_registry` to `tensor_conversion` to match.
File "/root/softwares/anaconda3/envs/tf115/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: failed to allocate memory
[[{{node bert/encoder/block_0/feedforward_1/dense_1/scalar_mul/parallel_4/mul}}]]
Hint: If you want to see a list of...
This occurs while running mnist.py in Google Colab. Has anyone faced this issue, or does anyone know how to resolve it?
https://github.com/tensorflow/mesh/blob/6b31c0fc9daf185aae2422976487f8db08fc7369/mesh_tensorflow/transformer/moe.py#L1694 I don't think it should cause any issues, just unnecessary computation?
Is it possible to use devices that are on different machines? For example, in Horovod I can specify the IP addresses of multiple machines and do data parallelism across them....
Adding a new Gradient Estimator for Routing using REINFORCE with a leave-one-out baseline.
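As background for the estimator named above, a small NumPy sketch of the leave-one-out baseline used with REINFORCE: draw K independent routing samples, and for each sample use the mean reward of the other K-1 samples as its baseline. This is the standard construction; the function name and shapes are illustrative, not taken from the PR:

```python
import numpy as np

def reinforce_loo_advantages(rewards):
    """Advantage weights for REINFORCE with a leave-one-out baseline.

    rewards: [K] rewards from K independent samples of the stochastic router.
    Returns [K] weights (r_k - b_k) that multiply grad log pi(a_k).
    """
    K = len(rewards)
    total = rewards.sum()
    # Leave-one-out baseline for sample k: mean reward of the other K-1 samples.
    baselines = (total - rewards) / (K - 1)
    # Centered advantages; the baseline is independent of sample k's action,
    # so the estimator stays unbiased while variance is reduced.
    return rewards - baselines
```

Note that the advantages sum exactly to zero across the K samples, so samples are pushed up or down only relative to one another.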
Hey guys, thanks so much for releasing all the T5 1.1 and mT5 weights! I'm currently working on porting all these models to Hugging Face's transformers. Is there any way to run mesh...