Scalar vector performance issues with FMOD
FMOD burns a CPU core, specifically in Hollow Knight's title screen. It comes down to a small loop that spins very hot and that we don't optimize well. It's bad enough that a ton of Unity+FMOD games will stutter their audio quite hard.
This is due to the code using a loop-unrolled scalar FMA sequence to process audio blocks, and FEX converting that code badly.
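For reference, the hot block below boils down to roughly the following C++. This is a hand-wavy reconstruction from the disassembly, not FMOD's actual source; the function name and the alternating-gain layout are assumptions.
// Hypothetical reconstruction of the hot loop (names are made up).
// Two per-channel gains are applied to alternating floats of a 32-byte source
// block and accumulated into the destination block in place.
void MixBlocks(float* dst, const float* src, float gain0, float gain1, unsigned blocks) {
  for (unsigned i = 0; i < blocks; ++i) {
    // The original code hand-unrolls this inner loop into eight
    // movss/mulss/addss/movss groups.
    for (unsigned j = 0; j < 8; j += 2) {
      dst[j + 0] += src[j + 0] * gain0; // xmm0 channel
      dst[j + 1] += src[j + 1] * gain1; // xmm1 channel
    }
    dst += 8;
    src += 8;
  }
}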
Original x86-64 code:
01aaf061 89ce mov esi, ecx
01aaf063 4889ea mov rdx, rbp
01aaf066 4889d8 mov rax, rbx
01aaf069 f30f1012 movss xmm2, dword [rdx]
01aaf06d 4883c020 add rax, 0x20
01aaf071 f30f59d0 mulss xmm2, xmm0
01aaf075 4883c220 add rdx, 0x20
01aaf079 f30f5850e0 addss xmm2, dword [rax-0x20]
01aaf07e f30f1150e0 movss dword [rax-0x20], xmm2
01aaf083 f30f1052e4 movss xmm2, dword [rdx-0x1c]
01aaf088 f30f59d1 mulss xmm2, xmm1
01aaf08c f30f5850e4 addss xmm2, dword [rax-0x1c]
01aaf091 f30f1150e4 movss dword [rax-0x1c], xmm2
01aaf096 f30f1052e8 movss xmm2, dword [rdx-0x18]
01aaf09b f30f59d0 mulss xmm2, xmm0
01aaf09f f30f5850e8 addss xmm2, dword [rax-0x18]
01aaf0a4 f30f1150e8 movss dword [rax-0x18], xmm2
01aaf0a9 f30f1052ec movss xmm2, dword [rdx-0x14]
01aaf0ae f30f59d1 mulss xmm2, xmm1
01aaf0b2 f30f5850ec addss xmm2, dword [rax-0x14]
01aaf0b7 f30f1150ec movss dword [rax-0x14], xmm2
01aaf0bc f30f1052f0 movss xmm2, dword [rdx-0x10]
01aaf0c1 f30f59d0 mulss xmm2, xmm0
01aaf0c5 f30f5850f0 addss xmm2, dword [rax-0x10]
01aaf0ca f30f1150f0 movss dword [rax-0x10], xmm2
01aaf0cf f30f1052f4 movss xmm2, dword [rdx-0xc]
01aaf0d4 f30f59d1 mulss xmm2, xmm1
01aaf0d8 f30f5850f4 addss xmm2, dword [rax-0xc]
01aaf0dd f30f1150f4 movss dword [rax-0xc], xmm2
01aaf0e2 f30f1052f8 movss xmm2, dword [rdx-0x8]
01aaf0e7 f30f59d0 mulss xmm2, xmm0
01aaf0eb f30f5850f8 addss xmm2, dword [rax-0x8]
01aaf0f0 f30f1150f8 movss dword [rax-0x8], xmm2
01aaf0f5 f30f1052fc movss xmm2, dword [rdx-0x4]
01aaf0fa f30f59d1 mulss xmm2, xmm1
01aaf0fe f30f5850fc addss xmm2, dword [rax-0x4]
01aaf103 f30f1150fc movss dword [rax-0x4], xmm2
01aaf108 83ee01 sub esi, 0x1
01aaf10b 0f8558ffffff jne 0x1aaf069
Single step of the job:
00010029 f30f1052e0 movss xmm2, dword [rdx-0x20]
0001002e f30f59d0 mulss xmm2, xmm0
00010032 f30f5850e0 addss xmm2, dword [rax-0x20]
00010037 f30f1150e0 movss dword [rax-0x20], xmm2
Resulting AArch64 asm:
0x0000ffffe2d000bc: ldur s18, [x6, #-32]
0x0000ffffe2d000c0: fmul s4, s18, s16
0x0000ffffe2d000c4: mov v18.s[0], v4.s[0]
0x0000ffffe2d000c8: ldur s4, [x4, #-32]
0x0000ffffe2d000cc: fadd s4, s18, s4
0x0000ffffe2d000d0: mov v18.s[0], v4.s[0]
0x0000ffffe2d000d4: eor v0.16b, v0.16b, v0.16b
0x0000ffffe2d000d8: mov v0.s[0], v18.s[0]
0x0000ffffe2d000dc: mov v4.16b, v0.16b
0x0000ffffe2d000e0: stur s4, [x4, #-32]
Ideally the AArch64 asm should be:
ldur s18, [x6, #-32]
fmul s18, s18, s16
ldur s4, [x4, #-32]
fadd s18, s18, s4
stur s18, [x4, #-32]
IPC numbers are trash according to llvm-mca:
ryanh@ubuntu-linux-20-04-desktop:~$ llvm-mca -mcpu=cortex-a57 test.s
Iterations: 100
Instructions: 1000
Total Cycles: 3214
Total uOps: 1300
Dispatch Width: 3
uOps Per Cycle: 0.40
IPC: 0.31
Block RThroughput: 5.0
Theoretical improvements:
ryanh@ubuntu-linux-20-04-desktop:~$ llvm-mca -mcpu=cortex-a57 test_opt.s
Iterations: 100
Instructions: 500
Total Cycles: 217
Total uOps: 500
Dispatch Width: 3
uOps Per Cycle: 2.30
IPC: 2.30
Block RThroughput: 2.0
Sample FEX ASM unit test that exposes the issue when looking at the generated code:
; This is based on the hottest block of Unity+FMOD's audio code from Hollow Knight
; This is doing some sort of unrolled scalar FMA that works on a full block at a time
; xmm0 contains one channel multiplication value
; xmm1 contains the other channel?
; These come from arguments
movss xmm0, dword [rel scalar_val]
movss xmm1, dword [rel scalar_val2]
; Break the block to be closer to the original block
jmp local
local:
; RAX contains the source block of data. Each block contains 32 bytes of data
; The data is modified in-place
; Data format
; struct Data {
; float data;
; float adder;
; }
mov rax, 0xe8000000
; ESI contains the number of blocks to operate on. Only one block in this test
mov esi, 1
; RDX contains the source value to multiply by
mov rdx, 0xe8000020
top:
add rax, 0x20 ; Increment Source block
add rdx, 0x20 ; Increment multiplier stereo blocks
; Do the remainder in a repeating macro
%macro OneChannelUnroll 2
; Load chan1
movss xmm2, dword [rdx - %1]
; Multiply chan1 by channel multiplier
mulss xmm2, %2
; Add to data source and store back to memory
addss xmm2, dword [rax - %1]
movss dword [rax - %1], xmm2
%endmacro
; Channel 1
OneChannelUnroll 0x20, xmm0
; Channel 2
OneChannelUnroll 0x1c, xmm1
;OneChannelUnroll 0x18, xmm0
;OneChannelUnroll 0x14, xmm1
;OneChannelUnroll 0x10, xmm0
;OneChannelUnroll 0x0c, xmm1
;OneChannelUnroll 0x08, xmm0
;OneChannelUnroll 0x04, xmm1
sub esi, 1
jne top
hlt
scalar_val:
dd 1.0
scalar_val2:
dd 1.0
The problem is the VInsertElement instructions that are inserted for scalar-vector ops. The following diff "fixes" the issue and generates the ideal assembly:
diff --git a/External/FEXCore/Source/Interface/Core/OpcodeDispatcher/Vector.cpp b/External/FEXCore/Source/Interface/Core/OpcodeDispatcher/Vector.cpp
index 33a88a147..9f6e13fcc 100644
--- a/External/FEXCore/Source/Interface/Core/OpcodeDispatcher/Vector.cpp
+++ b/External/FEXCore/Source/Interface/Core/OpcodeDispatcher/Vector.cpp
@@ -594,10 +594,12 @@ void OpDispatchBuilder::VectorScalarALUOpImpl(OpcodeArgs, IROps IROp, size_t Ele
OrderedNode* Result = ALUOp;
+#if 0
if (DstSize != ElementSize) {
// Insert the lower bits
Result = _VInsElement(DstSize, ElementSize, 0, 0, Dest, ALUOp);
}
+#endif
StoreResult(FPRClass, Op, Result, -1);
However, this is surely wrong. I haven't wrapped my head around how any of this works yet, and it looks like SRA makes this especially "fun".
Although actually maybe that's OK, because StoreRegister is supposed to mask anyway? Or is there no masking expected to happen at all? Or... I don't really know what's happening here.
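For context on what the insert is modelling: the legacy SSE scalar ops (MULSS/ADDSS) only write element 0 of the destination register and leave elements 1-3 untouched, and the VInsertElement exists purely to reproduce that merge after the flat ALU op. A minimal C++ model of the semantic, illustrative only and not FEX IR:
#include <array>

using Xmm = std::array<float, 4>;

// MULSS dst, src: element 0 is computed, elements 1-3 of the destination pass
// through unchanged. This is the merge the dispatcher's _VInsElement models.
Xmm Mulss(Xmm dst, Xmm src) {
  dst[0] *= src[0];
  return dst;
}
So whether the explicit insert can be dropped comes down to whether the register store path already preserves those upper elements, which is exactly the masking question above.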
I believe (from a quick cursory glance) that getting rid of the insert can work if the corresponding StoreResult below it is changed into:
StoreResult_WithOpSize(FPRClass, Op, Op->Dest, Result, ElementSize, -1);
(Regular StoreResult derives the passed-in op size using GetDstSize(Op), so it'd be treated as a full-length vector store otherwise, which in this case would clobber all the elements in the destination instead of merging the first element into the destination vector.)
All that would really do, though, is move the insertion behavior elsewhere (into StoreResult_WithOpSize), so I don't know if that's desirable.
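To make the clobbering concern concrete, here is an illustrative store-width model in plain C++ (not FEX code; the function names are made up):
#include <array>

using Xmm = std::array<float, 4>;

// Full-width store of the flat ALU result: every element of the destination is
// replaced, clobbering the upper elements that MULSS/ADDSS should preserve.
void StoreFullWidth(Xmm& dst, const Xmm& aluResult) {
  dst = aluResult;
}

// ElementSize (4-byte) store: only element 0 of the destination is written, so
// the upper elements survive without an explicit VInsertElement merge.
void StoreElementSize(Xmm& dst, const Xmm& aluResult) {
  dst[0] = aluResult[0];
}
Whether that actually avoids the extra moves, or just pushes the same insertion down into StoreResult_WithOpSize, presumably depends on how the SRA store lowers a 4-byte FPR write.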