mesh
Mesh TensorFlow: Model Parallelism Made Easier
Hi, to speed up training on V100 GPUs, I'd like to run Mesh TensorFlow using mixed precision. While TensorFlow has an easy-to-use [automatic mixed precision](https://www.tensorflow.org/api_docs/python/tf/train/experimental/enable_mixed_precision_graph_rewrite) feature, it requires...
I would like to debug the training/fine-tuning performance of the mesh transformer on CPU/GPU. Is it possible to capture a performance profile using TensorBoard? If so, is there an example or tutorial that...
When I ran `mnist.py`, the call `os.remove(zipped_filepath)` in the `download` function of `mnist_dataset.py` failed with a PermissionError. Changing the code to the following might work: ` try: os.remove(zipped_filepath)...
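The issue text is cut off, so the proposed patch is not fully visible. As a minimal sketch of the general idea (tolerating a failed delete instead of crashing), something like the following could work. The helper name `remove_if_possible` is hypothetical and is not part of `mnist_dataset.py`.

```python
import os


def remove_if_possible(path):
    """Try to delete `path`.

    Returns True if the file is gone afterwards, False if the OS refused
    the deletion. `remove_if_possible` is a hypothetical helper, not a
    function from the Mesh TensorFlow examples.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Nothing to delete, so the goal (no file at `path`) already holds.
        return True
    except PermissionError as err:
        # Typically raised when another handle still holds the file open
        # (common on Windows). Report it and let the caller continue.
        print(f"Could not remove {path}: {err}")
        return False
```

Swallowing the error leaves the zipped file on disk, which is usually harmless for a one-off dataset download but may hide a file handle that was never closed.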
The paper [Low-Rank Bottleneck in Multi-head Attention Models](https://arxiv.org/pdf/2002.07028.pdf) suggests fixing the head size while keeping the hidden size unchanged. Could you support setting `d_k`, `d_q`, `d_v` independently instead...
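To illustrate what decoupling the head size from the hidden size means for model size, here is a rough sketch that counts the projection weights of one multi-head attention layer. It assumes bias-free Q, K, V and output projections and that queries and keys share one per-head size `d_k` (needed for the dot product). The function and parameter names are illustrative and are not Mesh TensorFlow API.

```python
def attention_param_count(d_model, num_heads, d_k, d_v):
    """Count projection weights in one multi-head attention layer.

    Assumes no biases. Queries and keys share the per-head size d_k;
    values use the per-head size d_v. Illustrative only.
    """
    q_proj = d_model * num_heads * d_k
    k_proj = d_model * num_heads * d_k
    v_proj = d_model * num_heads * d_v
    out_proj = num_heads * d_v * d_model
    return q_proj + k_proj + v_proj + out_proj


# Conventional coupling: head size = d_model / num_heads.
coupled = attention_param_count(d_model=512, num_heads=16, d_k=32, d_v=32)

# Fixed head size: keep d_k = d_v = 64 even with 16 heads.
decoupled = attention_param_count(d_model=512, num_heads=16, d_k=64, d_v=64)
```

With the head size fixed, the projection weights grow linearly with the number of heads rather than staying constant, which is the trade-off a configurable `d_k`/`d_v` would expose.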
Could you please change the default value of `ignore_comments` to `False`? https://github.com/tensorflow/mesh/blob/7de6e9bc9e362d082b0d8e4b04be321a25b6f0a6/mesh_tensorflow/transformer/utils.py#L766 I'm using T5 and it took me a while to find out why some of the lines in...
In the `toy_model_tpu.py` example, `params['context']` is used to understand device assignments and host placements. Where is its value populated? `def model_fn(features, labels, mode, params): ... if FLAGS.use_tpu: ctx = params['context']`
Hi, I am using the Google T5 library, which is based on Mesh TensorFlow, to train a non-autoregressive model like BERT. Training runs without a problem, but both the prediction...
I want to run the `mnist.py` example via `mpirun` to use devices on different nodes. Is that actually possible?
Ran training successfully on a TPU v2-8 (TPU software version: nightly). Ran this with TensorFlow 1.15.
I have made changes to `mnist.py` in the examples section, as documented on GitHub, to achieve data parallelism and model parallelism. I have...