Query: GPipe graph node assignment
Hi @bignamehyp and other developers,
I am trying to understand GPipe's composite-layer assignment heuristic for each GPU and how the GPUs interact, and I am not sure if my understanding is correct. I would appreciate your input.
GPipe partitions a sequence of L layers into K composite layers. The partitioning strategy essentially tries to minimize the variance of each composite layer's estimated cost. Each composite layer is then distributed across the GPUs in a round-robin manner.
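To check my understanding, here is a rough sketch of the balancing idea I have in mind (my own illustration, not lingvo's actual code): close each contiguous group of layers once it reaches the average per-group cost, then assign groups to GPUs round-robin.

```python
# My own illustrative sketch (not lingvo's code): split L sequential layers
# into K contiguous composite layers of roughly equal estimated cost, then
# assign composite layer g to GPU g % num_gpus.

def partition_layers(costs, k):
    """Greedily close a group once it reaches the average per-group cost,
    while keeping enough layers behind to fill the remaining groups."""
    target = sum(costs) / k
    groups, current, acc = [], [], 0.0
    for i, c in enumerate(costs):
        current.append(i)
        acc += c
        layers_left = len(costs) - i - 1
        groups_left = k - len(groups) - 1
        if groups_left > 0 and acc >= target and layers_left >= groups_left:
            groups.append(current)
            current, acc = [], 0.0
    groups.append(current)  # last group takes the remainder
    return groups

groups = partition_layers([1.0, 1.0, 1.0, 1.0], 2)  # -> [[0, 1], [2, 3]]
assignment = {g: g % 4 for g in range(len(groups))}  # round-robin over 4 GPUs
```

Is this roughly the shape of the heuristic, assuming per-layer costs are available?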
Here are my questions:
- If no cost function is provided, how does GPipe determine the estimated cost? Is it based on how much compute a layer has to do? (I could not find this in the codebase.)
- I looked at the worker assignment done by `_LogPlacement` and I am confused why some of the layers are mapped to an accelerator device. For a Transformer language model, I have attached how the mapping happened for 1 layer on 4 GPUs (I removed unnecessary words from the log to keep it simple).
There are a few things I did not expect:
a. Why does GPipe consider the first few layers as composite layers?
b. The sorting of the composite tasks seems very odd here. Is it not?
c. I don't understand why the weights and biases got split across two GPUs.
d. What exactly is the difference between `query_proj` and `query_proj_b` (Fprop and Bprop)?
e. Why is `global_step` a composite task?
Thanks!
Composite layers | GPU Assignment |
---|---|
global_step | 0 |
input._tokenizer_default.global_step | 1 |
input.global_step | 2 |
learners[0].global_step | 3 |
learners[0].lr_schedule.global_step | 0 |
learners[0].optimizer.global_step | 1 |
lm.emb.global_step | 2 |
lm.emb.wm | 3 |
lm.global_step | 0 |
lm.input_dropout.global_step | 1 |
lm.position_emb.global_step | 2 |
lm.softmax.bias_0 | 3 |
lm.softmax.bias_1 | 0 |
lm.softmax.bias_10 | 1 |
lm.softmax.bias_11 | 2 |
lm.softmax.bias_12 | 3 |
lm.softmax.bias_13 | 0 |
lm.softmax.bias_14 | 1 |
lm.softmax.bias_15 | 2 |
lm.softmax.bias_2 | 3 |
lm.softmax.bias_3 | 0 |
lm.softmax.bias_4 | 1 |
lm.softmax.bias_5 | 2 |
lm.softmax.bias_6 | 3 |
lm.softmax.bias_7 | 0 |
lm.softmax.bias_8 | 1 |
lm.softmax.bias_9 | 2 |
lm.softmax.global_step | 3 |
lm.softmax.weight_0 | 0 |
lm.softmax.weight_1 | 1 |
lm.softmax.weight_10 | 2 |
lm.softmax.weight_11 | 3 |
lm.softmax.weight_12 | 0 |
lm.softmax.weight_13 | 1 |
lm.softmax.weight_14 | 2 |
lm.softmax.weight_15 | 3 |
lm.softmax.weight_2 | 0 |
lm.softmax.weight_3 | 1 |
lm.softmax.weight_4 | 2 |
lm.softmax.weight_5 | 3 |
lm.softmax.weight_6 | 0 |
lm.softmax.weight_7 | 1 |
lm.softmax.weight_8 | 2 |
lm.softmax.weight_9 | 3 |
lm.stack.cell_0.encoder_0.fflayer.fflayer.dropout[0].global_step | 0 |
lm.stack.cell_0.encoder_0.fflayer.fflayer.dropout[1].global_step | 1 |
lm.stack.cell_0.encoder_0.fflayer.fflayer.fc[0].b | 2 |
lm.stack.cell_0.encoder_0.fflayer.fflayer.fc[0].global_step | 3 |
lm.stack.cell_0.encoder_0.fflayer.fflayer.fc[0].w | 0 |
lm.stack.cell_0.encoder_0.fflayer.fflayer.fc[1].b | 1 |
lm.stack.cell_0.encoder_0.fflayer.fflayer.fc[1].global_step | 2 |
lm.stack.cell_0.encoder_0.fflayer.fflayer.fc[1].w | 3 |
lm.stack.cell_0.encoder_0.fflayer.fflayer.global_step | 0 |
lm.stack.cell_0.encoder_0.fflayer.global_step | 1 |
lm.stack.cell_0.encoder_0.fflayer.layer_norm.bias | 2 |
lm.stack.cell_0.encoder_0.fflayer.layer_norm.global_step | 3 |
lm.stack.cell_0.encoder_0.fflayer.layer_norm.scale | 0 |
lm.stack.cell_0.encoder_0.fflayer.residual_dropout.global_step | 1 |
lm.stack.cell_0.encoder_0.global_step | 2 |
lm.stack.cell_0.encoder_0.self_atten.atten.atten.global_step | 3 |
lm.stack.cell_0.encoder_0.self_atten.atten.atten.per_dim_scale | 0 |
lm.stack.cell_0.encoder_0.self_atten.atten.ctx_post_proj | 1 |
lm.stack.cell_0.encoder_0.self_atten.atten.ctx_post_proj_b | 2 |
lm.stack.cell_0.encoder_0.self_atten.atten.ctx_proj | 3 |
lm.stack.cell_0.encoder_0.self_atten.atten.ctx_proj_b | 0 |
lm.stack.cell_0.encoder_0.self_atten.atten.global_step | 1 |
lm.stack.cell_0.encoder_0.self_atten.atten.query_proj | 2 |
lm.stack.cell_0.encoder_0.self_atten.atten.query_proj_b | 3 |
lm.stack.cell_0.encoder_0.self_atten.atten.source_proj | 0 |
lm.stack.cell_0.encoder_0.self_atten.atten.source_proj_b | 1 |
lm.stack.cell_0.encoder_0.self_atten.global_step | 2 |
lm.stack.cell_0.encoder_0.self_atten.layer_norm.bias | 3 |
lm.stack.cell_0.encoder_0.self_atten.layer_norm.global_step | 0 |
lm.stack.cell_0.encoder_0.self_atten.layer_norm.scale | 1 |
lm.stack.cell_0.encoder_0.self_atten.residual_dropout.global_step | 2 |
lm.stack.cell_0.encoder_1.fflayer.fflayer.dropout[0].global_step | 3 |
lm.stack.cell_0.encoder_1.fflayer.fflayer.dropout[1].global_step | 0 |
lm.stack.cell_0.encoder_1.fflayer.fflayer.fc[0].b | 1 |
lm.stack.cell_0.encoder_1.fflayer.fflayer.fc[0].global_step | 2 |
lm.stack.cell_0.encoder_1.fflayer.fflayer.fc[0].w | 3 |
lm.stack.cell_0.encoder_1.fflayer.fflayer.fc[1].b | 0 |
lm.stack.cell_0.encoder_1.fflayer.fflayer.fc[1].global_step | 1 |
lm.stack.cell_0.encoder_1.fflayer.fflayer.fc[1].w | 2 |
lm.stack.cell_0.encoder_1.fflayer.fflayer.global_step | 3 |
lm.stack.cell_0.encoder_1.fflayer.global_step | 0 |
lm.stack.cell_0.encoder_1.fflayer.layer_norm.bias | 1 |
lm.stack.cell_0.encoder_1.fflayer.layer_norm.global_step | 2 |
lm.stack.cell_0.encoder_1.fflayer.layer_norm.scale | 3 |
lm.stack.cell_0.encoder_1.fflayer.residual_dropout.global_step | 0 |
lm.stack.cell_0.encoder_1.global_step | 1 |
lm.stack.cell_0.encoder_1.self_atten.atten.atten.global_step | 2 |
lm.stack.cell_0.encoder_1.self_atten.atten.atten.per_dim_scale | 3 |
lm.stack.cell_0.encoder_1.self_atten.atten.ctx_post_proj | 0 |
lm.stack.cell_0.encoder_1.self_atten.atten.ctx_post_proj_b | 1 |
lm.stack.cell_0.encoder_1.self_atten.atten.ctx_proj | 2 |
lm.stack.cell_0.encoder_1.self_atten.atten.ctx_proj_b | 3 |
lm.stack.cell_0.encoder_1.self_atten.atten.global_step | 0 |
lm.stack.cell_0.encoder_1.self_atten.atten.query_proj | 1 |
lm.stack.cell_0.encoder_1.self_atten.atten.query_proj_b | 2 |
lm.stack.cell_0.encoder_1.self_atten.atten.source_proj | 3 |
lm.stack.cell_0.encoder_1.self_atten.atten.source_proj_b | 0 |
lm.stack.cell_0.encoder_1.self_atten.global_step | 1 |
lm.stack.cell_0.encoder_1.self_atten.layer_norm.bias | 2 |
lm.stack.cell_0.encoder_1.self_atten.layer_norm.global_step | 3 |
lm.stack.cell_0.encoder_1.self_atten.layer_norm.scale | 0 |
lm.stack.cell_0.encoder_1.self_atten.residual_dropout.global_step | 1 |
lm.stack.cell_3.global_step | 3 |
lm.stack.global_step | 0 |
I am curious about this too. In fact, I am trying to reproduce the benchmarks mentioned in the GPipe arXiv paper and the GPipe blog post (https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html), but on GPUs. However, I am not sure where/how to start. This file (https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe_test.py) offers an example of running a 16-layer conv net on 4 GPUs, but it uses manual splitting.
@msharmavikram did you find the code that actually handles the heuristics and graph splitting? It seems that you already ran something using the automated graph-splitting heuristics (though the splitting seems strange); would you be willing to share the code snippet that runs this experiment?
Thanks,
Fei
Watching this.
(1) The cost of each layer is estimated by the `FPropMeta` function. See examples in core/layers.py, e.g.: https://github.com/tensorflow/lingvo/blob/464c4386a05d108056becb106b2b827df968b615/lingvo/core/layers.py#L2772
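For intuition, the per-layer estimate for something like a projection layer is dominated by the FLOP count of its matmul. A hand-rolled sketch of that kind of estimate (illustrative only; the function name and exact formula here are my own, not the lingvo code):

```python
def projection_fprop_flops(batch, seq_len, input_dim, output_dim):
    """Rough forward-pass FLOP estimate for a dense projection y = x @ w + b.

    Each output element costs one multiply-add per input element (2 FLOPs),
    plus one add for the bias. This mirrors the kind of per-layer estimate
    FPropMeta returns, but the exact formula is illustrative.
    """
    matmul_flops = 2 * batch * seq_len * input_dim * output_dim
    bias_flops = batch * seq_len * output_dim
    return matmul_flops + bias_flops

projection_fprop_flops(1, 1, 2, 3)  # -> 15
```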
(2) That was to create local copies of the variables on the GPUs in a round-robin way.
https://github.com/tensorflow/lingvo/blob/464c4386a05d108056becb106b2b827df968b615/lingvo/core/base_model.py#L461
That is why you saw the weights and biases split across two GPUs. This local copying happens only once, at init time. During the actual compute, the weights are distributed the same way as the layer partition.
Basically, please ignore the output from `_LogPlacement`.
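The interleaving you see in the log can be reproduced with a toy sketch (illustrative, not the actual base_model.py logic): variables are placed one per device in turn, in sorted-name order, which is why a weight and its bias can land on different GPUs.

```python
def assign_round_robin(var_names, num_gpus):
    # Place variables one per GPU in turn, in sorted-name order.
    return {name: i % num_gpus for i, name in enumerate(sorted(var_names))}

placement = assign_round_robin(
    ["global_step",
     "input._tokenizer_default.global_step",
     "input.global_step",
     "learners[0].global_step"],
    num_gpus=4)
# Matches the first four rows of the table above: GPUs 0, 1, 2, 3 respectively.
```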
(3) In your table, the left column contains variable names, not layers. Each layer is a subclass of `base_layer.BaseLayer`.
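A toy illustration of the layer/variable distinction (plain Python, not lingvo's actual classes): one layer object owns several named variables, and the placement log lists the variables, not the layer.

```python
class ToyLayer:
    """Illustrative stand-in for a BaseLayer subclass: a layer owns variables."""

    def __init__(self, name):
        self.name = name
        self.variables = {}

    def create_variable(self, var_name, shape):
        # Variable names are prefixed with the layer's full path.
        self.variables[self.name + "." + var_name] = shape

fc = ToyLayer("lm.stack.cell_0.encoder_0.fflayer.fflayer.fc[0]")
fc.create_variable("w", (1024, 4096))
fc.create_variable("b", (4096,))
# The placement log shows two rows (fc[0].w and fc[0].b) for this one layer.
```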
(4) `query_proj_b` is the bias of the query projection. https://github.com/tensorflow/lingvo/blob/464c4386a05d108056becb106b2b827df968b615/lingvo/core/attention.py#L1097
Did you manage to run ResNet-101 with GPipe successfully? I am having some problems with it...
Not really. I coded a ResNet-50 model and was able to split it using the given heuristics and run it, but I am not sure I was using GPipe correctly. I got a bunch of error/warning messages, and my model did not seem to be fast. I have since moved on from my last internship and no longer have GPU resources that allow me to install NVIDIA Docker. I have tried a manual install (without Docker) but without success, so I cannot even run what I had months ago.
Could you share the ResNet-50 model code so I can learn from it? Thank you very much!
Thanks for the enthusiasm, but I cannot share the code since it belongs to the company where I did my last internship. The layered model is not my forte, and even with decent TensorFlow and Keras skills, I still find GPipe hard to learn to use.
Interestingly, I found another pipeline-parallel training system called PipeDream, which is a bit easier to use. However, keep in mind that PipeDream uses PyTorch, so tensor communication is a bit slower. The link to PipeDream is here: https://github.com/msr-fiddle/pipedream
It is also possible to implement GPipe or PipeDream yourself via computation-graph transformation. However, something along those lines is being implemented by someone I know, so it cannot be shared.
Let me know if there are other ways that I can help you.