
where is GPipe source code?

Open feiwang3311 opened this issue 5 years ago • 4 comments

Hi, I am looking for GPipe source code, but only found this: https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.py

Maybe I am wrong, but I think the source code of GPipe should include functionality such as deciding how to split the computation graph into segments and calling TensorFlow APIs to execute those segments in a pipelined manner with memory reuse/forward recomputation. The link above looks more like an interface for calling GPipe (a GPipe library) than the GPipe source code itself.

In particular, I am interested in which TensorFlow API calls GPipe uses, i.e. how GPipe realizes its design on the TensorFlow runtime. Unfortunately, I didn't find code related to that. Is GPipe truly open source, or just open to use?

feiwang3311 avatar Jun 28 '19 00:06 feiwang3311

GPipe is tightly integrated into the lingvo framework. If you look at https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.py#L422 you will see how the assignment of layers to devices is performed.

It makes sense to design around a high-level abstraction rather than releasing gpipe on its own and expecting users to integrate it into whatever model they need. I believe the authors took the high-level abstraction approach here and integrated it into lingvo.

So to answer your question - gpipe is not a standalone library. If you want to use it, you will have to build on top of lingvo.
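For intuition, the layer-to-device assignment is conceptually something like the following round-robin sketch. This is illustrative plain Python only, not lingvo's actual code or API (the names cells and devices are hypothetical); see the gpipe.py link above for the real implementation.

```python
# Hypothetical sketch: map each pipeline stage ("cell") to a device,
# wrapping around (round-robin) if there are more cells than devices.
# Not lingvo's actual implementation.
def assign_cells_to_devices(cells, devices):
    assignment = {}
    for i, cell in enumerate(cells):
        assignment[cell] = devices[i % len(devices)]
    return assignment


print(assign_cells_to_devices(
    ["stage_0", "stage_1", "stage_2", "stage_3"],
    ["/gpu:0", "/gpu:1"]))
```

Each stage's ops are then built under the corresponding device scope, which is why consecutive stages can run concurrently on different micro-batches.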

@bignamehyp and @jonathanasdf can tell us more about this.

msharmavikram avatar Jul 01 '19 21:07 msharmavikram

Thanks for clearing that up!

If you don't mind, can you address a few more questions about how the GPipe arXiv paper is realized? From StackedRecurrent here: https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.py#L443 I can understand how the pipeline is set up across multiple workers handling multiple micro-batches. The with tf.device(devices[0]): block (which you previously pointed me to) and the with tf.device(devices[-1]): block (a few lines later) handle splitting the inputs into micro-batches and joining the outputs from the micro-batches. However, I cannot find a few key pieces of the realization.

  1. How does the GPipe library control the recomputation of the forward pass? This is an important feature in the GPipe arXiv paper for saving GPU memory. I am very interested in how this is done in GPipe, i.e. which TensorFlow API functions are used to implement this functionality.

  2. The GPipe arXiv paper describes a computation-graph partitioning process and the heuristic it uses (minimizing the variance of the computation cost across partitions). Is this functionality (partitioning and the heuristic) also in lingvo somewhere?

feiwang3311 avatar Jul 02 '19 15:07 feiwang3311

Sorry for the delayed reply; I was away on vacation. Thank you very much for your interest in GPipe.

  1. Re-compute is implemented in just one line.

https://github.com/tensorflow/lingvo/blob/46324be3ac7faa12663337624326238e65a2e57c/lingvo/core/recurrent.py#L947
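Conceptually, re-materialization works like the sketch below: the forward pass keeps only each stage's input (the checkpoint), and the backward pass recomputes the activations inside each stage before applying the chain rule. This is plain Python for illustration only, with hypothetical names (stages, grads); the real implementation is the recurrent.py line above.

```python
# Conceptual sketch of forward recomputation (re-materialization).
# stages: list of per-stage forward functions f_i
# grads:  list of their derivatives f_i'
# x: pipeline input; dy: gradient w.r.t. the final output.
def pipeline_backward(stages, grads, x, dy):
    # Forward pass: store only each stage's input, not its activations.
    checkpoints = []
    for f in stages:
        checkpoints.append(x)
        x = f(x)
    # Backward pass: recompute each stage's forward from its checkpoint,
    # then apply the chain rule through that stage.
    for f, df, ckpt in zip(reversed(stages), reversed(grads),
                           reversed(checkpoints)):
        _ = f(ckpt)          # recomputed activations (discarded here)
        dy = dy * df(ckpt)   # chain rule through this stage
    return dy


# d/dx of 3 * x^2 at x = 2 is 6 * x = 12.
stages = [lambda x: x * x, lambda x: 3 * x]
grads = [lambda x: 2 * x, lambda x: 3]
print(pipeline_backward(stages, grads, 2.0, 1.0))
```

The memory saving comes from never holding all per-stage activations at once; only the stage inputs are kept live across the forward pass.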

  2. We just open-sourced an example heuristic per your request: https://github.com/tensorflow/lingvo/blob/464c4386a05d108056becb106b2b827df968b615/lingvo/core/gpipe.py#L194

You can find example usage in the unit test: https://github.com/tensorflow/lingvo/blob/464c4386a05d108056becb106b2b827df968b615/lingvo/core/gpipe_test.py#L154
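The idea behind such a heuristic can be sketched as follows: split a sequence of per-layer costs into contiguous partitions whose totals stay close to the mean, which keeps the variance across partitions small. This is a greedy plain-Python illustration only, not the heuristic at the gpipe.py link above.

```python
# Hypothetical sketch of cost-balanced contiguous partitioning.
# costs: per-layer computation costs; num_partitions: pipeline stages.
def partition_by_cost(costs, num_partitions):
    target = sum(costs) / num_partitions
    parts, current, current_cost = [], [], 0.0
    for i, c in enumerate(costs):
        # Close the current partition once adding c would exceed the
        # per-partition target, but only if enough layers remain to
        # give every later partition at least one layer.
        can_close = (current
                     and len(parts) < num_partitions - 1
                     and len(costs) - i >= num_partitions - len(parts) - 1)
        if can_close and current_cost + c > target:
            parts.append(current)
            current, current_cost = [], 0.0
        current.append(c)
        current_cost += c
    parts.append(current)
    return parts


# Five cheap-ish layers balance against three expensive ones.
print(partition_by_cost([1, 1, 1, 1, 2, 2, 2, 2], 2))
```

A real implementation would also need to account for activation sizes and communication cost at the split points, not just raw compute.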

bignamehyp avatar Jul 18 '19 04:07 bignamehyp

@bignamehyp Thanks for your reply!

I have been digging into the codebase (especially lingvo/core/recurrent.py and lingvo/core/gpipe.py) and have learned a lot about GPipe. I am also trying to re-create the ResNet-101 evaluation from the GPipe arXiv paper, starting from the small example in lingvo/core/gpipe_test.py. I have a few questions:

  1. How do I run the test in gpipe_test.py on GPUs? To run the example on GPUs, I added use_gpu=True to self.session() on this line: https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe_test.py#L143 and overrode the device assignment by forcing cell_devices/devices to be GPU strings here: https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.py#L286 and here: https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.py#L484 However, I get error messages like this (when running on 2 or 4 splits):

E tensorflow/stream_executor/stream.cc:332] Error recording event in stream: error recording CUDA event on stream 0x75bfa50: CUDA_ERROR_INVALID_HANDLE: invalid resource handle; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.

  2. How do I add BatchNormalization layers to the simple model? I tried adding either BatchNormLayerNoPadding layers or BatchNormLayer layers to the _SimpyLayer, but I run into errors like these:
  • If I add BatchNormLayer, I get:

TypeError: In op 'layer_0/bn/bn/AssignMovingAvg', input types ([tf.float32, tf.float32]) are not compatible with expected types ([tf.float32_ref, tf.float32])

  • If I add BatchNormLayerNoPadding, I get:

    File "/usr/local/lib/python2.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3660, in _get_operation_by_name_unsafe
        return self._nodes_by_name[name]
    KeyError: u'layer_0/bn/add_1'

Sorry to bother you with my own bugs. It would also help greatly if some of the code for the evaluation in the arXiv paper https://arxiv.org/abs/1811.06965 could be open-sourced. I am not very familiar with Lingvo in general, and I am not sure how to wrap the small example in gpipe_test.py into a registered model.

feiwang3311 avatar Jul 18 '19 15:07 feiwang3311