[CPU] Optimize tail 'Convert' nodes time cost in f16 precision mark-up transformation
Details:
- First, all tail nodes of the model (i.e., non-computationally-intensive nodes) are collected. The traversal begins at the model's outputs and walks upwards until it encounters 'Convert' nodes that match the pattern. The optimization, which includes the precision reset and the removal of the 'Convert' operations, is performed only if the following condition holds: continuing to traverse upwards, every "Path" node must be suitable (data-movement nodes such as Reshape and Concat) until the first "Begin" node is reached, where, internally, convert instructions are used to change the precision from the model output precision to the inference precision. Through this optimization, both kinds of 'Convert' (the instruction-level conversions and the model nodes) are removed from the model.
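A minimal sketch of the traversal idea (the helper names and the exact pattern checks here are illustrative assumptions, not the actual transformation code):

```cpp
#include <memory>
#include "openvino/core/model.hpp"
#include "openvino/core/type.hpp"
#include "openvino/op/concat.hpp"
#include "openvino/op/convert.hpp"
#include "openvino/op/reshape.hpp"

// Illustrative predicate: "Path" nodes are data-movement ops that only
// forward data (the real pass checks a broader, curated set).
static bool is_path_node(const std::shared_ptr<ov::Node>& node) {
    return ov::is_type<ov::op::v1::Reshape>(node) || ov::is_type<ov::op::v0::Concat>(node);
}

static void collect_tail_converts(const std::shared_ptr<ov::Model>& model) {
    for (const auto& result : model->get_results()) {
        // Walk upwards from each output (following input 0 only, for simplicity).
        auto node = result->get_input_node_shared_ptr(0);
        while (node && is_path_node(node))
            node = node->get_input_node_shared_ptr(0);
        if (auto convert = ov::as_type_ptr<ov::op::v0::Convert>(node)) {
            (void)convert;  // matching tail 'Convert' found; the upward verification
                            // of all "Path" nodes up to the first "Begin" node, the
                            // precision reset, and the removal are omitted here
        }
    }
}
```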
Tickets:
- CVS-128883
Hi @usstq, could you please review? Thanks!
I second @usstq's argument about the inability of the ngraph representation to distinguish which data movement ops are optimized out (i.e., are not actually executed) and which are not. Fortunately, however, the CPU plugin has its own graph representation, and at that level we can recognize optimized data movement ops (by checking the inPlace config). Theoretically, such an optimization may be expressed as fusing a Convert layer into a compute op, but that would require an additional precision-propagation pass. Another approach could be enforcing the output precision in the subgraph, but in general that requires ensuring the implementation supports this type, and such a check would be excessive in the node interface. Therefore, the first option (Convert fusing) is safer and more generic.
Hi @maxnick: yes, I have also synced with @usstq offline about "recognizing optimized data movement ops from the CPU plugin", but it seems to have the following two difficulties:
- The "inPlace config" is only available after the "PrimitiveDescriptors" are created, so such a 'convert fusing' pass would have to run after the "InitDescriptors" stage instead of the current "EnforceInferencePrecision" stage.
- More importantly, after the 'convert fusing' pass, the previously created "PrimitiveDescriptors" (e.g. a0, a1, a2, a3 shown in @usstq's example) would have to change their input and output precisions.
I found that the following code in the fp16->fp32 conversion causes a performance issue due to mixing SSE and AVX instructions (https://www.intel.com/content/dam/develop/external/us/en/documents/11mc12-avoiding-2bavx-sse-2btransition-2bpenalties-2brh-2bfinal-809104.pdf); changing movdqu to vmovdqu fixes it.
https://github.com/openvinotoolkit/openvino/blob/6a6221cd1dfbdcb547862d526e5a51ce9c0c05dc/src/plugins/intel_cpu/src/nodes/common/cpu_convert.cpp#L77
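For illustration, a standalone Xbyak fragment of the point being made (a hypothetical sketch, not the actual cpu_convert.cpp kernel): inside an otherwise VEX-encoded (AVX) code path, the legacy-SSE-encoded movdqu can trigger AVX/SSE transition penalties, while the VEX-encoded vmovdqu cannot:

```cpp
#include <xbyak/xbyak.h>

// Hypothetical x86-64 f16->f32 tail-load fragment; requires F16C at runtime.
struct ConvertSnippet : Xbyak::CodeGenerator {
    ConvertSnippet() {
        // movdqu(xmm0, ptr[rdi]);  // legacy SSE encoding: mixing it with the
                                    // AVX code below risks a transition penalty
        vmovdqu(xmm0, ptr[rdi]);    // VEX encoding: safe inside an AVX region
        vcvtph2ps(ymm0, xmm0);      // widen 8 x f16 to 8 x f32
        vmovups(ptr[rsi], ymm0);    // store the converted values
        ret();
    }
};
```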
But ideally this extra Convert can still be optimized away by fusing it into a parent node (like this PR does); a sketch follows the list below:
- if the parent node is a data-movement node (like a non-in-place Eltwise/Concat), the Convert should be fused into it.
- if the parent node is data-movement-free (like an in-place Concat/Reshape), the Convert should be fused into the grandparent nodes.
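A self-contained sketch of that decision (the node type and helpers are hypothetical stand-ins, not actual intel_cpu plugin interfaces):

```cpp
#include <memory>
#include <vector>

// Hypothetical node type; 'inPlace' stands in for the flag derived from the
// selected primitive descriptor in the real CPU graph.
struct Node;
using NodePtr = std::shared_ptr<Node>;
struct Node {
    bool inPlace = false;
    std::vector<NodePtr> parents;
};

// Placeholder: re-type 'target' so it directly emits the Convert's
// destination precision, then drop the Convert from the graph.
static void fuseConvertInto(const NodePtr& target, const NodePtr& convert) {}

// Recursively choose where a tail 'Convert' ends up.
static void fuseTailConvert(const NodePtr& convert, const NodePtr& parent) {
    if (!parent->inPlace) {
        // Parent actually writes data (non-in-place Eltwise/Concat):
        // fuse the Convert into it.
        fuseConvertInto(parent, convert);
    } else {
        // Parent is data-movement-free (in-place Concat/Reshape): push the
        // Convert past it into each grandparent instead.
        for (const auto& gp : parent->parents)
            fuseTailConvert(convert, gp);
    }
}
```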
Hi @usstq and @maxnick: based on the previous discussion, I have extended this tail-node optimization to cover more common cases, including: avoiding mixed SSE/AVX instructions in cpu_convert, precision up-conversion for in-place tail nodes, and fusing the Convert into non-in-place tail nodes. Could you please continue the review when you are free? Thanks.
Hi @maxnick, could you please review? Thanks!
@EgorDuplensky, could you please review?
Just FYI, I am looking at this and also trying to come up with a way to somehow simplify the idea at its core.
Hi @EgorDuplensky: sure, thanks for the help. Previously we wanted to address all related issues at once, which indeed made the newly added logic somewhat lengthy. As far as I know, some of the conditions are not observed on the current real model sets yet, so maybe we could leave those conditions as TODOs to simplify it?
@liubo-intel What about changing the approach a bit? I was just thinking that this optimization is very similar (probably almost the same) to the existing "MoveEltwiseUpThroughDataMov" and "MoveEltwiseUpThroughDataMovPerChannel" passes. The idea is the same: we move an Eltwise through data-movement types of OPs to enable a future fusion into some compute OP. We could do the same in our case:
- Move the Convert through in-place nodes.
- Fuse the Convert into an operation.
Having two passes instead of one is supposed to simplify the logic (though I am not sure this holds when we are talking about the CPU graph). Also, having the first pass separate can potentially enable other types of optimizations, such as DropDoubleConvert, DropReorderConvert, and maybe something else. What do you think?
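If it helps the discussion, a rough sketch of how the two passes could split the work (same hypothetical node type as in the earlier sketch; the real passes would operate on and re-wire the CPU graph):

```cpp
#include <memory>
#include <vector>

// Hypothetical node type, as before.
struct Node;
using NodePtr = std::shared_ptr<Node>;
struct Node {
    bool inPlace = false;
    std::vector<NodePtr> parents;
};

// Pass 1 (analogous to MoveEltwiseUpThroughDataMov): hoist the Convert past
// single-parent in-place nodes; returns the node the Convert now sits below.
// The real pass would re-wire graph edges at every step.
static NodePtr moveConvertUp(const NodePtr& convert) {
    NodePtr pos = convert->parents.at(0);
    while (pos->inPlace && pos->parents.size() == 1)
        pos = pos->parents[0];
    return pos;
}

// Pass 2: with the Convert now adjacent to a compute op, fuse it there
// (placeholder for the actual fusing logic).
static void fuseConvert(const NodePtr& computeOp, const NodePtr& convert) {}
```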
Hi @EgorDuplensky: after reconsidering this, I think using two passes may not simplify the logic much, since we cannot move the Convert through every in-place node case: if the in-place path contains many parent nodes and the final nodes cannot fuse the Convert, we would end up replacing one Convert with many (see the sketch below). Also, with two passes, some of the same path-checking logic would have to be repeated in each pass. This PR targets the fp16 tail Convert problem (in contrast to bf16, which enforces f32 tails), and its processing is limited to only a few tail nodes, so its effect on model compile time is very small. To cover more general Convert cases, maybe we could open a new ticket for analysis and implementation?
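To make the duplication concern concrete, a hypothetical tail where hoisting the Convert multiplies it (assuming A/B/C cannot absorb it):

```
before:                          after moving the Convert up:

  A   B   C                        A        B        C
   \  |  /                         |        |        |
 in-place Concat               Convert  Convert  Convert
      |                             \       |       /
   Convert (1x)                     in-place Concat
      |                                    |
    Result                               Result
```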