
where is GPipe source code?

Open feiwang3311 opened this issue 5 years ago • 4 comments

Hi, I am looking for GPipe source code, but only found this: https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.py

Maybe I am wrong, but I think the source code of GPipe should include functionality such as deciding how to split the computation graph into segments and calling TensorFlow APIs to execute those segments in a pipelined manner with memory reuse/forward recomputation. The link above looks more like an interface for calling GPipe (a GPipe library) than the GPipe source code itself.

In particular, I am interested in which TensorFlow API calls GPipe uses, i.e. how GPipe realizes its design on the TensorFlow runtime. Unfortunately, I didn't find code related to that. Is GPipe truly open source, or just open to use?

feiwang3311 avatar Jun 28 '19 00:06 feiwang3311

GPipe is tightly integrated into the lingvo framework. If you look at https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.py#L422 you will see how the assignment of layers to devices is performed.

It makes sense to design around a high-level abstraction rather than releasing gpipe on its own and expecting users to integrate it into whatever model they need. I believe the authors took the high-level abstraction approach here and integrated it into lingvo.

So to answer your question - gpipe is not a standalone library. If you want to use it, you will have to build on top of lingvo.
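For intuition, the layer-to-device assignment is conceptually something like the following round-robin sketch. This is illustrative plain Python only, not lingvo's actual code or API (the names cells and devices are hypothetical); see the gpipe.py link above for the real implementation.

```python
# Hypothetical sketch: map each pipeline stage ("cell") to a device,
# wrapping around (round-robin) if there are more cells than devices.
# Not lingvo's actual implementation.
def assign_cells_to_devices(cells, devices):
    assignment = {}
    for i, cell in enumerate(cells):
        assignment[cell] = devices[i % len(devices)]
    return assignment


print(assign_cells_to_devices(
    ["stage_0", "stage_1", "stage_2", "stage_3"],
    ["/gpu:0", "/gpu:1"]))
```

Each stage's ops are then built under the corresponding device scope, which is why consecutive stages can run concurrently on different micro-batches.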

@bignamehyp and @jonathanasdf can tell us more about this.

msharmavikram avatar Jul 01 '19 21:07 msharmavikram

Thanks for clearing that up!

If you don't mind, can you address a few more questions about how the GPipe arXiv paper is realized? From StackedRecurrent here: https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.py#L443 I can understand how the pipeline is set up across multiple workers handling multiple micro-batches. The with tf.device(devices[0]): block (which you previously pointed me to) and the with tf.device(devices[-1]): block (a few lines later) handle splitting the inputs into micro-batches and joining the outputs from the micro-batches. However, I cannot find a few key pieces of the realization.

  1. How does the GPipe library control the recomputation of the forward pass? This is an important feature in the GPipe arXiv paper for saving GPU memory. I am very interested in how this is done in GPipe, i.e. which TensorFlow API functions are used to implement this functionality.

  2. The GPipe arXiv paper describes a computation-graph partitioning process and the heuristic it uses (minimizing the variance of the computation cost across partitions). Is this functionality (partitioning and the heuristic) also in lingvo somewhere?

feiwang3311 avatar Jul 02 '19 15:07 feiwang3311

Sorry for the delayed reply; I was away on vacation. Thank you very much for your interest in GPipe.

  1. Re-compute is implemented in just one line.

https://github.com/tensorflow/lingvo/blob/46324be3ac7faa12663337624326238e65a2e57c/lingvo/core/recurrent.py#L947
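Conceptually, re-materialization works like the sketch below: the forward pass keeps only each stage's input (the checkpoint), and the backward pass recomputes the activations inside each stage before applying the chain rule. This is plain Python for illustration only, with hypothetical names (stages, grads); the real implementation is the recurrent.py line above.

```python
# Conceptual sketch of forward recomputation (re-materialization).
# stages: list of per-stage forward functions f_i
# grads:  list of their derivatives f_i'
# x: pipeline input; dy: gradient w.r.t. the final output.
def pipeline_backward(stages, grads, x, dy):
    # Forward pass: store only each stage's input, not its activations.
    checkpoints = []
    for f in stages:
        checkpoints.append(x)
        x = f(x)
    # Backward pass: recompute each stage's forward from its checkpoint,
    # then apply the chain rule through that stage.
    for f, df, ckpt in zip(reversed(stages), reversed(grads),
                           reversed(checkpoints)):
        _ = f(ckpt)          # recomputed activations (discarded here)
        dy = dy * df(ckpt)   # chain rule through this stage
    return dy


# d/dx of 3 * x^2 at x = 2 is 6 * x = 12.
stages = [lambda x: x * x, lambda x: 3 * x]
grads = [lambda x: 2 * x, lambda x: 3]
print(pipeline_backward(stages, grads, 2.0, 1.0))
```

The memory saving comes from never holding all per-stage activations at once; only the stage inputs are kept live across the forward pass.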

  2. We just open-sourced an example heuristic per your request: https://github.com/tensorflow/lingvo/blob/464c4386a05d108056becb106b2b827df968b615/lingvo/core/gpipe.py#L194

You can find example usage in the unit test: https://github.com/tensorflow/lingvo/blob/464c4386a05d108056becb106b2b827df968b615/lingvo/core/gpipe_test.py#L154
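The idea behind such a heuristic can be sketched as follows: split a sequence of per-layer costs into contiguous partitions whose totals stay close to the mean, which keeps the variance across partitions small. This is a greedy plain-Python illustration only, not the heuristic at the gpipe.py link above.

```python
# Hypothetical sketch of cost-balanced contiguous partitioning.
# costs: per-layer computation costs; num_partitions: pipeline stages.
def partition_by_cost(costs, num_partitions):
    target = sum(costs) / num_partitions
    parts, current, current_cost = [], [], 0.0
    for i, c in enumerate(costs):
        # Close the current partition once adding c would exceed the
        # per-partition target, but only if enough layers remain to
        # give every later partition at least one layer.
        can_close = (current
                     and len(parts) < num_partitions - 1
                     and len(costs) - i >= num_partitions - len(parts) - 1)
        if can_close and current_cost + c > target:
            parts.append(current)
            current, current_cost = [], 0.0
        current.append(c)
        current_cost += c
    parts.append(current)
    return parts


# Five cheap-ish layers balance against three expensive ones.
print(partition_by_cost([1, 1, 1, 1, 2, 2, 2, 2], 2))
```

A real implementation would also need to account for activation sizes and communication cost at the split points, not just raw compute.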

bignamehyp avatar Jul 18 '19 04:07 bignamehyp

@bignamehyp Thanks for your reply!

I have been digging into the codebase (especially lingvo/core/recurrent.py and lingvo/core/gpipe.py) and have learned a lot about GPipe. I am also trying to re-create the ResNet-101 evaluation from the GPipe arXiv paper, starting from the small example in lingvo/core/gpipe_test.py. I have a few questions:

  1. How do I run the test in gpipe_test.py on GPUs? To run the example on GPUs, I added use_gpu=True to self.session() on this line: https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe_test.py#L143 and overrode the device assignment by forcing cell_devices/devices to be GPU strings here: https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.py#L286 and here: https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.py#L484 However, I get error messages like this (when running on 2 or 4 splits):

E tensorflow/stream_executor/stream.cc:332] Error recording event in stream: error recording CUDA event on stream 0x75bfa50: CUDA_ERROR_INVALID_HANDLE: invalid resource handle; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.

  2. How do I add BatchNormalization layers to the simple model? I tried adding either BatchNormLayerNoPadding layers or BatchNormLayer layers to the _SimpyLayer, but I run into errors like these:
  • If I add BatchNormLayer, I get:

TypeError: In op 'layer_0/bn/bn/AssignMovingAvg', input types ([tf.float32, tf.float32]) are not compatible with expected types ([tf.float32_ref, tf.float32])

  • If I add BatchNormLayerNoPadding, I get:

    File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3660, in _get_operation_by_name_unsafe
        return self._nodes_by_name[name]
    KeyError: u'layer_0/bn/add_1'

Sorry to bother you with my own bugs. It would also help greatly if some of the code for the evaluation in the arXiv paper https://arxiv.org/abs/1811.06965 could be open-sourced. I am not very familiar with Lingvo in general, and I am not sure how to wrap the small example in gpipe_test.py into a registered model.

feiwang3311 avatar Jul 18 '19 15:07 feiwang3311