euddraft
euddraft copied to clipboard
Switch lowering optimization: Take 2
We replaced binary search with jump table for switch in https://github.com/armoha/euddraft/issues/64. Jump table is optimal when cases are dense. Some cases are not suitable for jump table.
Background
- 2015 LLVM Developers’ Meeting: Hans Wennborg "The recent switch lowering improvements"
- LLVM::SelectionDAGBuilder::visitSwitch
- LLVM::SwitchCG::sortAndRangeify
Counterexample
switch (v) {
case 0x3: ①;
case 0xC: ②;
case 0x30: ③;
case 0xC0: ④;
case 0x300: ⑤;
case 0xC00: ⑥;
}
Current eudplib implementation (jump table)
See: https://github.com/armoha/eudplib/blob/a785157e248f3cd07023b7b8eeecf5aa7d7c97c1/eudplib/ctrlstru/swblock.py#L129-L201
Compare bit 1, 2, 4, 8, ... , 2048. Compare 12 times for 7 cases (①~⑥ and default branch), and also waste lots of space (20 * 2048 bytes).
const jumptable = [C0A0Trig->dest(n) for n in range(4096)];
const cull->default = if (v & 0xFFFFF000 == 0) { cull->swstart; }
const swstart = { jump->jumptable; cull->default; }
for i, bit in enumerate(bits(4095)):
if (v & bit == bit) { jump.nextptr += 20*(1<<i); }
const jump = LastTrigger();
Potential improvement
- Enlarge minimal discriminating unit from single bit. (above example will do 6 comparisons, execute 9T 3~9A {21~28})
- For value switch, we could do direct addition,
EUDJump(jump_table + v)
.
Alternatives
1. Straight comparisons
{ Restore=SetMemoryEPD(epd, SetTo, value); }
const branch1 = if (v == 0x3) { branch1->①; Restore.epd = branch1.nextptr; Restore.value = branch2; }
const branch2 = if (v == 0xC) { branch2->②; Restore.epd = branch2.nextptr; Restore.value = branch3; }
const branch3 = if (v == 0x30) { branch3->③; Restore.epd = branch3.nextptr; Restore.value = branch4; }
const branch4 = if (v == 0xC0) { branch4->④; Restore.epd = branch4.nextptr; Restore.value = branch5; }
const branch5 = if (v == 0x300) { branch5->⑤; Restore.epd = branch5.nextptr; Restore.value = branch6; }
const branch6->default = if (v == 0xC00) { branch6->⑥; Restore.epd = branch6.nextptr; Restore.value = default; }
Let's assume cases ①~⑥ and the others having value in mask 0xFFF
or not, happening equal frequency.
(Performance median: {14.5}, mean: {13.5})
- others having value in
0xFFF
or not: execute 7T 1A {15} and go to default branch. - case
0x3
: execute 2T 4A {8} and go to branch ①. - case
0xC
: execute 3T 4A {10} and go to branch ②. - case
0x30
: execute 4T 4A {12} and go to branch ③. - case
0xC0
: execute 5T 4A {14} and go to branch ④. - case
0x300
: execute 6T 4A {16} and go to branch ⑤. - case
0xC00
: execute 7T 4A {18} and go to branch ⑥.
- Disadvantage: Slow when there are many latter cases like ⑤~⑥, or not belong to any case.
- Conclusion: Not suitable in this situation.
2. Bit tests
Not applicable since DWORD can't envelope all cases.
// bit test example
SetSwitch(v, Set);
if (MemoryX(0x58DC40, AtLeast, 1, (1<<0) + (1<<3) + (1<<7) + (1<<12) + (1<<19))) {
println("v is 0, 3, 7, 12 or 19");
}
3. Binary search
Question: Should we divide 6 cases by 4:2 or 3:3?
// Case 4:2 example
const branch1234 = if (v >= 0x300) { branch1234->branch5; }
const branch12 = if (v >= 0x30) { branch12->branch3; }
const branch1 = if (v == 0x3) { branch1->jump1; }
const branch2->default = if (v == 0xC) { branch2->jump2; }
const branch3 = if (v == 0x30) { branch3->jump3; }
const branch4->default = if (v == 0xC0) { branch4->jump4; }
const branch5 = if (v == 0x300) { branch5->jump5; }
const branch6->default = if (v == 0xC00) { branch6->jump6; }
const jump1->① = { branch1->branch2; }
const jump2->② = { branch2->default; }
const jump3->③ = { branch3->branch4; branch12->branch1; }
const jump4->④ = { branch4->default; branch12->branch1; }
const jump5->⑤ = { branch5->branch6; branch1234->branch12; }
const jump6->⑥ = { branch6->default; branch1234->branch12; }
(Performance median: {11}, mean: {10.75})
- others: execute 4T 0A {8} if
v < 0x30
, 4T 1A {9} if0x30 <= v < 0x300
, 3T 1A {7} if0x300 <= v
. - case
0x3
: execute 4T 2A {10} and go to branch ①. - case
0xC
: execute 5T 2A {12} and go to branch ②. - case
0x30
: execute 4T 4A {12} and go to branch ③. - case
0xC0
: execute 5T 4A {14} and go to branch ④. - case
0x300
: execute 3T 4A {10} and go to branch ⑤. - case
0xC00
: execute 4T 4A {12} and go to branch ⑥.
Need to combine multiple switch lowering strategies, by clustering cases and build binary search tree. We could add bit tests too.
TODO: fix switch binary search by using 2×cases triggers instead of 6×cases