[IROptimizer] Further remove useless copy instructions
One example we can optimize is this graph:

`Concat(inp1, inp2, ..., inpN)`

In the particular case where the `Concat` node concatenates contiguous slices (e.g. the concatenation is done along the 1st dimension), the operator is translated at the IR level into copy instructions: each input slice is copied as a contiguous block into the output buffer.

To remove these copy instructions, we could instead modify the IR such that the producers of `inp1`, `inp2`, ... write directly into the slice portions of the final `Concat` output buffer.
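A minimal C++ sketch of the idea (illustrative only, not Glow code; `concatWithCopies`, `producer`, and `concatWithoutCopies` are made-up names):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Current lowering: each input lives in its own buffer and Concat copies it
// into the right offset of the output. These are the copies we want to drop.
void concatWithCopies(const std::vector<std::vector<float>> &inputs,
                      std::vector<float> &out) {
  size_t offset = 0;
  for (const auto &in : inputs) {
    std::copy(in.begin(), in.end(), out.begin() + offset); // "useless" copy
    offset += in.size();
  }
}

// Stand-in for whatever computes one concat input.
void producer(float *dst, size_t size) {
  for (size_t i = 0; i < size; ++i)
    dst[i] = 0.f; // the producer's computed values
}

// Desired lowering: each producer is handed a view (base + offset) into the
// final output buffer and writes its result there directly, so no copy
// instruction is needed at all.
void concatWithoutCopies(std::vector<float> &out,
                         const std::vector<size_t> &sliceSizes) {
  size_t offset = 0;
  for (size_t size : sliceSizes) {
    producer(out.data() + offset, size); // write straight into the slice
    offset += size;
  }
}
```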
@opti-mix Do you know how to do this easily and elegantly? I wasted two days trying to write an IR optimization pass which does the above but I got nowhere ... it just seems too complicated. I need this optimization pass for a particular benchmark which becomes unattractive because of the overhead of these useless copies.
@mciprian13 Just to check that I understand you correctly. You'd like to modify the IR so that the producers of `inpN` write into a tensorview, which is the slice of the final `Concat` buffer, correct?
IIRC, Glow already has a similar optimization in the IROptimizer, called `optimizeInserts`, no? But maybe it is not general enough for your case.
@opti-mix Yeah, I've seen that optimization, but it seems it is not generic enough and does not kick in for my case. https://github.com/pytorch/glow/blob/39a8c689f252076ff5842c1870523b420e509b72/lib/Optimizer/IROptimizer/IROptimizer.cpp#L1347-L1352
This optimization is used only if the `InsertTensor` has an `allocActivation` as a source. In my case the sources of the `InsertTensor`s are `TensorView`s because the concat inputs were Reshape nodes.
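Schematically, the restriction looks something like this (a toy paraphrase with hypothetical types, not the actual code behind the link above):

```cpp
// Toy model of the restriction (hypothetical types; NOT Glow's API). The
// pass only retargets an InsertTensor whose source is a fresh allocation,
// because a fresh buffer has no aliases and its producer can be redirected
// safely. A tensorview source aliases another buffer, so the pass
// conservatively skips it.
struct Value { virtual ~Value() = default; };
struct AllocActivation : Value {}; // fresh buffer, single owner
struct TensorView : Value {};      // aliases a region of another buffer

struct InsertTensor { Value *src; Value *dest; };

bool optimizeInsertsWouldFire(const InsertTensor &it) {
  // Mirrors the spirit of the guard: only alloc sources are handled.
  return dynamic_cast<AllocActivation *>(it.src) != nullptr;
}
```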
I guess later in the IR optimizer pipeline the `InsertTensor` instructions are transformed into `Copy` instructions. Basically what I end up with is copy instructions having `TensorView`s as both input and output.
So basically it would be nice to have some sort of optimization that removes `Copy` instructions more aggressively.
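As a rough sketch, such a pass might look like this over a toy IR (hypothetical types and names; a real pass would need Glow's liveness and aliasing checks, which are elided here):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy IR sketching a tensorview-aware copy elimination:
// copy(%dstView, %srcView) dies if the single producer of the source
// buffer can be retargeted to write through %dstView instead.
struct Buffer;
struct View { Buffer *base; size_t offset; size_t size; };

struct Instr {
  enum Kind { Produce, Copy } kind;
  View *dst = nullptr; // where the instruction writes
  View *src = nullptr; // Copy only: where it reads from
};

// Returns true (and rewrites) when elimination is safe under the toy rules:
// the copy's source buffer has exactly one writer.
bool tryEliminateCopy(std::vector<Instr *> &instrs, Instr *copy) {
  if (copy->kind != Instr::Copy) return false;

  // Find the unique producer of the source buffer.
  Instr *producer = nullptr;
  for (Instr *i : instrs) {
    if (i != copy && i->dst && i->dst->base == copy->src->base) {
      if (producer) return false; // more than one writer: give up
      producer = i;
    }
  }
  if (!producer) return false;

  // (A real pass must also prove that nothing else reads the source buffer
  // after the copy and that the destination buffer is live across the
  // producer; both checks are elided here.)

  // Retarget the producer to write directly through the copy's destination
  // view, then delete the now-dead copy.
  producer->dst = copy->dst;
  instrs.erase(std::remove(instrs.begin(), instrs.end(), copy), instrs.end());
  return true;
}
```

The hard part is presumably proving those safety conditions once views can alias each other, which is probably why `optimizeInserts` restricts itself to fresh allocations.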
@mciprian13 I see. Yes, seems like a generalized copy elimination that is aware of tensor views would be needed. Do you happen to have a small instruction IR-level unit test to reproduce the issue? It could be useful while thinking about a solution.
@opti-mix I can provide a model for which this happens: IROptModel.zip. In the archive you can find:
- A MobileNet SSD model in ONNX format (a publicly available model taken from the ONNX model zoo)
- The CLI command to compile the model using the `model-compiler` Glow tool and dump the Glow IR:
model-compiler -backend=CPU -model=mobilenet_v1_0.75_ssd.onnx -emit-bundle=bundle -dump-ir > model_ir.txt
- The IR file `model_ir.txt` dumped by the above command
In the IR file `model_ir.txt` you can find 12 `Copy` instructions which are very expensive when everything else in the graph is executed by a powerful accelerator. You will also see that all those `Copy` instructions have `TensorView`s as both input and output.
Let me know what solution you would propose for this.
Thanks!
@opti-mix Did you have time to investigate this optimization?
@mciprian13 Sorry, I was busy with some other urgent stuff. Haven't spent any reasonable time on this yet.
@opti-mix Ok, no problem. Btw, do you think it would be worth organizing some meetings with all the Glow contributors to exchange/share ideas about Glow's future, identify groups of people with common interests who could collaborate, or for other purposes? WDYT?