BOLT
BOLT copied to clipboard
Remove duplicate conditional branches after ICF
I have run into a case where our bolt optimized executable contains code like this[1]:
│ mov $0xa00294,%ecx ; tuplelength
│ cmp %rcx,%rax
22 │ ↓ je 145
5 │ mov $0xa00294,%ecx ; tuplelength
│ cmp %rcx,%rax
│ ↓ je 145
│ mov $0xa00294,%ecx ; tuplelength
│ cmp %rcx,%rax
15 │ ↓ je 145
│ mov $0xa00294,%ecx ; tuplelength
│ cmp %rcx,%rax
6 │ ↓ je 145
│ mov $0xa00280,%ecx ; unicode_length
│ cmp %rcx,%rax
6 │ ↓ je 1ff
│ mov $0xab2700,%ecx ; threadstate_getframe
│ cmp %rcx,%rax
│ ↓ je 1e9
│ a0: mov %rbx,%rdi
│ cmp $0xa00294,%rax ; tuplelength
We generate some code which checks where a function pointer is pointing to on an object in order to use a inlined fast version.
ICF inside bolt figures out that list_length()
, set_len()
,... are all identical with tuplelength()
and merges them.
Left over are the now duplicate branches which all checked for the old sym names which are now all the same merged symbol tuplelength
.
I tried prototyping an optimization which tries to optimize this cases but because of my lack of knowledge of BOLT and LLVM MC framework could not get it working :(. Instead I did try to count how often this branches happen inside our binary (hopefully this code does what I want :-P): It triggered 248 times even though it only checks the direct predecessor if it has the same jmp condition.
// We try find duplicate conditional jumps like this:
// mov $0xa00294,%ecx ; tuplelength
// cmp %rcx,%rax
// je 145
// mov $0xa00294,%ecx ; tuplelength
// cmp %rcx,%rax
// je 145
for (BinaryBasicBlock &BB : Function) {
if (BB.succ_size() != 2)
continue;
if (BB.pred_size() != 1)
continue;
BinaryBasicBlock& PredBB = **BB.pred_begin();
// check that the predecessor BB terminates in a conditional branch
if (PredBB.succ_size() != 2)
continue;
BinaryBasicBlock *CondBB = PredBB.getConditionalSuccessor(true);
BinaryBasicBlock *UncondBB = PredBB.getConditionalSuccessor(false);
if (CondBB != &BB)
continue;
const MCSymbol *TBB = nullptr;
const MCSymbol *FBB = nullptr;
MCInst *CondBranch = nullptr;
MCInst *UncondBranch = nullptr;
bool Result = BB.analyzeBranch(TBB, FBB, CondBranch, UncondBranch);
if (!Result || !CondBranch)
continue;
const MCSymbol *PredTBB = nullptr;
const MCSymbol *PredFBB = nullptr;
MCInst *PredCondBranch = nullptr;
MCInst *PredUncondBranch = nullptr;
bool PredResult = PredBB.analyzeBranch(PredTBB, PredFBB, PredCondBranch, PredUncondBranch);
if (!PredResult || !PredCondBranch)
continue;
// I'm don't know how to find the condition the cond jmp depends on :(
// HACK: just assume it's one instruction before the cond jump.
if (BB.size() < 2 || PredBB.size() < 2)
continue;
MCInst* Cmp = &*(BB.rbegin()+1);
MCInst* PredCmp = &*(PredBB.rbegin()+1);
// check that it's a compare instruction
auto &MIB = BC.MIB;
unsigned Opcode = Cmp->getOpcode();
const MCInstrDesc &CmpDesc = BC.MII->get(Opcode);
if (!CmpDesc.isCompare())
continue;
// mostly copied from IdenticalCodeFolding.cpp AreSymbolsIdentical():
auto AreSymbolsIdentical = [&] (const MCSymbol *SymbolA,
const MCSymbol *SymbolB) {
if (SymbolA == SymbolB)
return true;
// Compare symbols as functions.
uint64_t EntryIDA = 0;
uint64_t EntryIDB = 0;
const BinaryFunction *FunctionA =
BC.getFunctionForSymbol(SymbolA, &EntryIDA);
const BinaryFunction *FunctionB =
BC.getFunctionForSymbol(SymbolB, &EntryIDB);
if (FunctionA && EntryIDA)
FunctionA = nullptr;
if (FunctionB && EntryIDB)
FunctionB = nullptr;
if (FunctionA && FunctionB) {
return FunctionA == FunctionB;
}
// One of the symbols represents a function, the other one does not.
if (FunctionA != FunctionB) {
return false;
}
return false;
};
// are both compare instructions the same?
if (!BC.MIB->equals(*Cmp, *PredCmp, AreSymbolsIdentical))
continue;
// nice we found a duplicate jmp
++NumDuplicateCondBranches;
}
Here is a simple example which when compiled shows the optimization opportunity:
typedef int (*Fp)(void);
int f1(){ return 42; }
int f2(){ return 42; }
int f(Fp fp) {
if (fp == f1)
return f1();
if (fp == f2)
return f2();
return fp() +1;
}
int main() {
return 0;
}
// gcc -O0 t.c -o t -Wl,--emit-relocs -no-pie -fno-pic
// llvm-bolt t -o t.bolt -icf --peepholes=all
Do you have any plans to add such a optimization? Could you point me into how to implement it correctly / is there already an optimization doing something similar and just needs to be extended? Is there a analysis which can find the whole chain of comparisons (right now it only checks the one before which will cause issues if the merged symbol names are not all right after each other)?
[1] https://github.com/pyston/pyston/pull/65
Hi Marius, thanks for bringing up this opportunity! I think this optimization can be applied systematically as a combination of data-flow and value numbering optimizations: Redundant moves are removed by proving the equivalence of rcx value numbers, (same with cmp), then all jmps would consume EFLAGS produced by the same cmp (data-flow graph traversal). At that point it would be possible to inspect the jumps' condition codes, conclude they're identical and remove them as redundant.
However, BOLT currently lacks the VN and DF framework to apply this optimization in this manner. However, implementing these is in my personal plans for after we upstream BOLT to LLVM, so I'll keep that in mind. Stay tuned :)
I agree with Amir. Good catch! Thanks for reporting. Your code is pretty good. I have a few comments on the code itself:
// I'm don't know how to find the condition the cond jmp depends on :(
I don't think we have that ready, so you would need to implement a new method in X86MCPlusBuilder.cpp that walks backwards starting at the conditional jump until you find the first instruction that defines FLAGS and return that.
++NumDuplicateCondBranches;
Now, here what you want to do is to count dynamic occurrences of this event, instead of static, so you will have a much clear understanding about the impact of this optimization for your workload. Suppose that you found ~200 instances where this optimization would trigger, but then we have little idea if the code is hitting these 200 instances a lot or not. For that, you will want to use an updated profile and then query how hot each basic block is, and tally that:
DynamicNumDuplicateCondBranches += BB.getKnownExecutionCount();
Is there a analysis which can find the whole chain of comparisons (right now it only checks the one before which will cause issues if the merged symbol names are not all right after each other)?
I wrote an instruction pattern matching mechanism that you can use to walk though simple, local (bb-only) data dependencies. https://github.com/facebookincubator/BOLT/blob/rebased/bolt/src/MCPlusBuilder.h#L561 But I admit it got more complicated than I intended, and it is probably too awkward to use at this stage, and sometimes it's just easier to write the code your self by looking at regs defs/uses for each instruction in the BB. Take a look at the code for the matcher (walk though it with a debugger) and I think it will be easier to get the idea, but the key here for your specific case is just to query MCInstrDesc::getImplicitDefs() to determine whether your instruction is defining EFLAGS and return that if it is the last instruction that defined EFLAGS before your conditional jump. You need to put that code in X86MCPlusBuilder.cpp because it will depend on X86-specific LLVM enums.