openvino
openvino copied to clipboard
[CPU] tails optimization for load/store emitters
Details:
- item1
Tickets:
- 87986
@a-sidorova please review the PR
In general LGTM. Would be nice if we can run internal perf validation before merge.
In general LGTM. Would be nice if we can run internal perf validation before merge.
Sure, internal perf validation is WIP, thanks
@chenhu-wang do we have perf validation results?
@chenhu-wang do we have perf validation results?
@dmitry-gorokhov , the result is updated in ticket. Thanks!
@chenhu-wang Please rebase the branch
@chenhu-wang Please rebase the branch
@dmitry-gorokhov, I have rebased on latest master, there are some failure appeared for convert and split node on CI on Windows. WIP to analysis. @a-sidorova appreciate comment if you have some insights. aux_gpr_reg maybe handled incorrectly I guess.
@chenhu-wang I was investigating problems on Win on your PR and I have found out that StoreEmitter
works differently on OSs for 1 store byte. You added changes for 1byte in 48372f7ac5e650c8c436514e6e3586eca27f105
Let's take a look on this part:
h->uni_vmovq(Reg64(aux_gpr_idxs[0]), xmm);
h->mov(addr(start_bytes), Reg8(aux_gpr_idxs[0]));
Disassembled code on Linux:
vmovq %xmm1, %rdx
mov %dl, (%rcx)
Dissembled code on Windows:
vmovq %xmm1, %rsi
mov %dh, (%rbx)
So as you can see that Snippets chooses different GPRs for memory pointers (in our example for dst memory on Linux it's RCX
, for Win it's RBX
). But it's not important. The important thing is that aux_gpr_idxs[0]
is different too. For Linux it's RDX
and for Win it's RSI
.
In 48372f7ac5e650c8c436514e6e3586eca27f105 you have added Reg8(aux_gpr_idxs[0])
. As I know RDX
register has 8Bit version DH
so on Linux we don't have problems in our example. However RSI
register doesn't have 8Bit version but we see here DL
- I think it's XBYAK behavior but I'm not sure (I think you can see take a look mov
implementation in XBYAK)
Thus we will always have problems with StoreEmitter and 1 store_byte
if aux_gpr_idxs[0]
doesn't have 8Bit version.
@a-sidorova , Thanks a lot for your investigation and input. For rsp, rbp, rsi, rdi, it's lower 8bit part reg does not have the same index with 64 bit reg. Because there are more than 16 8bit regs. e.g. rax has al and ah. The correct idx is 32+index for these 4 gpr. Correct it in 63f1063.
@chenhu-wang Please rebase the branch
@dmitry-gorokhov, I have rebased on latest master, there are some failure appeared for convert and split node on CI on Windows. WIP to analysis. @a-sidorova appreciate comment if you have some insights. aux_gpr_reg maybe handled incorrectly I guess.
@dmitry-gorokhov ,rebased and CI get green, thank you!
@a-sidorova , Thanks a lot for your investigation and input. For rsp, rbp, rsi, rdi, it's lower 8bit part reg does not have the same index with 64 bit reg. Because there are more than 16 8bit regs. e.g. rax has al and ah. The correct idx is 32+index for these 4 gpr. Correct it in 63f1063.
@a-sidorova @dmitry-gorokhov , the conversion happens here.
As we discussed offline @chenhu-wang could you please cover case when
src_prc != dst_prc
as well?It blocks my PR#12395.
load_emitter
andstore_emitter
are often used in the PR whenFakeQuantize
are insideSubgrah
body. So we have performance degradations forScalarTile
in snippets when we want to load and store I8/U8 data
This is extended, could you please taka a look? thank you.
@dmitry-gorokhov , could you please final review? Thanks.