[CPU] Optimize tail 'Convert' nodes time cost in f16 precision mark-up transformation
Details:
- First, all tail nodes of the model (i.e., non-computationally-intensive nodes) are collected. The traversal begins at the model's outputs and walks upwards until it encounters 'Convert' nodes that match the pattern. The optimization, which includes the precision reset and the removal of the 'Convert' operations, is performed only if the following condition holds: continuing to traverse upwards, every "Path" node must be suitable (data-movement nodes such as Reshape and Concat) until the first "Begin" node is reached, where, internally, convert instructions are used to change the precision from the model output precision to the inference precision. Through this optimization, both kinds of 'Convert' (the instruction-level conversions and the model nodes) are removed from the model.
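A minimal sketch of the traversal idea (the helper names and the exact pattern checks here are illustrative assumptions, not the actual transformation code):

```cpp
#include <memory>
#include "openvino/core/model.hpp"
#include "openvino/core/type.hpp"
#include "openvino/op/concat.hpp"
#include "openvino/op/convert.hpp"
#include "openvino/op/reshape.hpp"

// Illustrative predicate: "Path" nodes are data-movement ops that only
// forward data (the real pass checks a broader, curated set).
static bool is_path_node(const std::shared_ptr<ov::Node>& node) {
    return ov::is_type<ov::op::v1::Reshape>(node) || ov::is_type<ov::op::v0::Concat>(node);
}

static void collect_tail_converts(const std::shared_ptr<ov::Model>& model) {
    for (const auto& result : model->get_results()) {
        // Walk upwards from each output (following input 0 only, for simplicity).
        auto node = result->get_input_node_shared_ptr(0);
        while (node && is_path_node(node))
            node = node->get_input_node_shared_ptr(0);
        if (auto convert = ov::as_type_ptr<ov::op::v0::Convert>(node)) {
            (void)convert;  // matching tail 'Convert' found; the upward verification
                            // of all "Path" nodes up to the first "Begin" node, the
                            // precision reset, and the removal are omitted here
        }
    }
}
```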
Tickets:
- CVS-128883
Hi @usstq, could you please review? Thanks!
I second @usstq's argument about the inability of the ngraph representation to distinguish which data movement ops are optimized out (i.e., are not actually executed) and which are not. Fortunately, however, the CPU plugin has its own graph representation, and at that level we can recognize optimized data movement ops (by checking the inPlace config). Theoretically, such an optimization may be expressed as fusing a Convert layer into a compute op, but that would require an additional precision-propagation pass. Another approach could be enforcing the output precision in the subgraph, but in general that requires ensuring the implementation supports this type, and such a check would be excessive in the node interface. Therefore, the first option (Convert fusing) is safer and more generic.
Hi @maxnick: yes, I have also synced with @usstq offline about "recognizing optimized data movement ops from the CPU plugin", but it seems to have the following two difficulties:
- The "inPlace config" is only available after the "PrimitiveDescriptors" are created, so such a 'convert fusing' pass would have to run after the "InitDescriptors" stage instead of the current "EnforceInferencePrecision" stage.
- More importantly, after the 'convert fusing' pass, the previously created "PrimitiveDescriptors" (e.g. a0, a1, a2, a3 shown in @usstq's example) would have to change their input and output precisions.
I found that the following code in the fp16->fp32 conversion causes a performance issue due to mixing SSE and AVX instructions (https://www.intel.com/content/dam/develop/external/us/en/documents/11mc12-avoiding-2bavx-sse-2btransition-2bpenalties-2brh-2bfinal-809104.pdf); changing movdqu to vmovdqu fixes it.
https://github.com/openvinotoolkit/openvino/blob/6a6221cd1dfbdcb547862d526e5a51ce9c0c05dc/src/plugins/intel_cpu/src/nodes/common/cpu_convert.cpp#L77
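For illustration, a standalone Xbyak fragment of the point being made (a hypothetical sketch, not the actual cpu_convert.cpp kernel): inside an otherwise VEX-encoded (AVX) code path, the legacy-SSE-encoded movdqu can trigger AVX/SSE transition penalties, while the VEX-encoded vmovdqu cannot:

```cpp
#include <xbyak/xbyak.h>

// Hypothetical x86-64 f16->f32 tail-load fragment; requires F16C at runtime.
struct ConvertSnippet : Xbyak::CodeGenerator {
    ConvertSnippet() {
        // movdqu(xmm0, ptr[rdi]);  // legacy SSE encoding: mixing it with the
                                    // AVX code below risks a transition penalty
        vmovdqu(xmm0, ptr[rdi]);    // VEX encoding: safe inside an AVX region
        vcvtph2ps(ymm0, xmm0);      // widen 8 x f16 to 8 x f32
        vmovups(ptr[rsi], ymm0);    // store the converted values
        ret();
    }
};
```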
But ideally this extra Convert can still be optimized away by fusing it into a parent node (like this PR does); a sketch follows the list below:
- if the parent node is a data-movement node (like a non-in-place Eltwise/Concat), the Convert should be fused into it.
- if the parent node is data-movement-free (like an in-place Concat/Reshape), the Convert should be fused into the grandparent nodes.
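A self-contained sketch of that decision (the node type and helpers are hypothetical stand-ins, not actual intel_cpu plugin interfaces):

```cpp
#include <memory>
#include <vector>

// Hypothetical node type; 'inPlace' stands in for the flag derived from the
// selected primitive descriptor in the real CPU graph.
struct Node;
using NodePtr = std::shared_ptr<Node>;
struct Node {
    bool inPlace = false;
    std::vector<NodePtr> parents;
};

// Placeholder: re-type 'target' so it directly emits the Convert's
// destination precision, then drop the Convert from the graph.
static void fuseConvertInto(const NodePtr& target, const NodePtr& convert) {}

// Recursively choose where a tail 'Convert' ends up.
static void fuseTailConvert(const NodePtr& convert, const NodePtr& parent) {
    if (!parent->inPlace) {
        // Parent actually writes data (non-in-place Eltwise/Concat):
        // fuse the Convert into it.
        fuseConvertInto(parent, convert);
    } else {
        // Parent is data-movement-free (in-place Concat/Reshape): push the
        // Convert past it into each grandparent instead.
        for (const auto& gp : parent->parents)
            fuseTailConvert(convert, gp);
    }
}
```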
Hi @usstq and @maxnick: based on the previous discussion, I have extended this tail-node optimization to cover more common cases, including: avoiding mixed SSE/AVX instructions in cpu_convert, precision up-conversion for in-place tail nodes, and fusing the Convert into non-in-place tail nodes. Could you please continue the review when you are free? Thanks.
Hi @maxnick, could you please review? Thanks!
@EgorDuplensky, could you please review?
Just FYI, I am looking at this and also trying to come up with a way to somehow simplify the idea at its core.
Hi @EgorDuplensky: sure, thanks for the help. Previously we wanted to address all related issues at once, which indeed made the newly added logic somewhat lengthy. As far as I know, some of the conditions are not observed on the current real model sets yet, so maybe we could leave those conditions as TODOs to simplify it?
@liubo-intel What about changing the approach a bit? I was just thinking that this optimization is very similar (probably almost the same) to the existing "MoveEltwiseUpThroughDataMov" and "MoveEltwiseUpThroughDataMovPerChannel" passes. The idea is the same: we move an Eltwise through data-movement types of OPs to enable a future fusion into some compute OP. We could do the same in our case:
- Move the Convert through in-place nodes.
- Fuse the Convert into an operation.
Having two passes instead of one is supposed to simplify the logic (though I am not sure this holds when we are talking about the CPU graph). Also, having the first pass separate can potentially enable other types of optimizations, such as DropDoubleConvert, DropReorderConvert, and maybe something else. What do you think?
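If it helps the discussion, a rough sketch of how the two passes could split the work (same hypothetical node type as in the earlier sketch; the real passes would operate on and re-wire the CPU graph):

```cpp
#include <memory>
#include <vector>

// Hypothetical node type, as before.
struct Node;
using NodePtr = std::shared_ptr<Node>;
struct Node {
    bool inPlace = false;
    std::vector<NodePtr> parents;
};

// Pass 1 (analogous to MoveEltwiseUpThroughDataMov): hoist the Convert past
// single-parent in-place nodes; returns the node the Convert now sits below.
// The real pass would re-wire graph edges at every step.
static NodePtr moveConvertUp(const NodePtr& convert) {
    NodePtr pos = convert->parents.at(0);
    while (pos->inPlace && pos->parents.size() == 1)
        pos = pos->parents[0];
    return pos;
}

// Pass 2: with the Convert now adjacent to a compute op, fuse it there
// (placeholder for the actual fusing logic).
static void fuseConvert(const NodePtr& computeOp, const NodePtr& convert) {}
```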
Hi @EgorDuplensky: after reconsidering this, I think using two passes may not simplify the logic much, since we cannot move the Convert through every in-place node case: if the in-place path contains many parent nodes and the final nodes cannot fuse the Convert, we would end up replacing one Convert with many (see the sketch below). Also, with two passes, some of the same path-checking logic would have to be repeated in each pass. This PR targets the fp16 tail Convert problem (in contrast to bf16, which enforces f32 tails), and its processing is limited to only a few tail nodes, so its effect on model compile time is very small. To cover more general Convert cases, maybe we could open a new ticket for analysis and implementation?
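To make the duplication concern concrete, a hypothetical tail where hoisting the Convert multiplies it (assuming A/B/C cannot absorb it):

```
before:                          after moving the Convert up:

  A   B   C                        A        B        C
   \  |  /                         |        |        |
 in-place Concat               Convert  Convert  Convert
      |                             \       |       /
   Convert (1x)                     in-place Concat
      |                                    |
    Result                               Result
```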