BOLT
Same code base compiled as one binary: some application questions about the BOLTed binary
- The same code base supports many products: it is compiled into one binary that serves different products with different configurations and models.
  Case 1: For different products, the code execution paths partially overlap (part is the same, part is different). Can a binary BOLTed with a profile from product A be used for product B, and might that degrade product B's performance?
  Case 2: For different products, the code execution path is the same, but the configurations and (machine learning) models are different. Can a binary BOLTed with a profile from product A be used for product B, and might that degrade product B's performance?
- What is BOLT's application scheme at Facebook? Does Facebook have such a scenario?
If I understand correctly, you are interested in knowing whether there are performance improvements to be had from using the profile captured for workload A to optimize the binary that will then run workload B, right?
It is hard to give a generic answer that is valid for all cases. In truth, this needs to be measured. You will likely observe some gains as long as the workloads overlap in the functions/paths they use. But when we talk about improvements from redoing the binary layout, we may get the false idea that improvements have a linear relationship with how many functions we are optimizing. In reality, we need to break these wins down into improvements on 3 different metrics: iTLB misses, i-cache misses and branch mispredictions.
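As a rough sketch of how that measurement could look (the binary names and the --config flag below are made up for illustration, and the generic perf event names may be spelled differently or be unavailable depending on the CPU and kernel):

```
# Baseline: the original binary running workload B.
perf stat -e instructions,iTLB-load-misses,L1-icache-load-misses,branch-misses \
    -- ./service.orig --config product_B.conf

# Candidate: the binary BOLTed with workload A's profile, running the same workload B.
perf stat -e instructions,iTLB-load-misses,L1-icache-load-misses,branch-misses \
    -- ./service.bolt --config product_B.conf
```

Comparing the two runs shows which of the three metrics (if any) regress when product B runs on a binary laid out for product A.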
Gains coming from reduced i-cache and branch misses are tackled by basic block reordering and are more "fine grained" because they come from packing 64B of hot code together. Because of the small size of cache lines, these gains tend to (but not always) be local to a function. Conditional branches (impacting branch misses) are mostly local too. Because of this, if you have a complete profile for a single function, you will likely reap most of the i-cache/branch benefits for this function even if profile data is missing for other functions.
iTLB gains, on the other hand, come from code locality across 4KB pages spanning multiple functions. BOLT acts on this by moving hot functions closer together and outlining the cold blocks so the hot region stays free of cold code. What will happen is that random functions that are missing from the profile will cause the working set of the program to use more pages, stressing the iTLB. If the program is large enough, your workload could be suffering from frequent page walks, which will significantly degrade performance (presenting an opportunity for improvement that BOLT won't be able to tap into without a more representative, global profile).
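For reference, a workflow along these lines could look like the sketch below. The ./service binary and product configs are hypothetical; the flags follow the BOLT documentation, though exact option spellings vary between BOLT releases. Block reordering targets the i-cache/branch gains described above, while function reordering and splitting target the iTLB/code-locality gains:

```
# 1. Collect an LBR sample profile while running workload A.
#    (Relinking the binary with --emit-relocs is recommended for maximum gains.)
perf record -e cycles:u -j any,u -o perf.data -- ./service --config product_A.conf

# 2. Convert the perf profile into BOLT's profile format.
perf2bolt -p perf.data -o perf.fdata ./service

# 3. Rewrite the binary: -reorder-blocks drives basic block layout,
#    -reorder-functions and -split-functions drive function layout and hot/cold splitting.
llvm-bolt ./service -o ./service.bolt -data=perf.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold -split-eh -dyno-stats
```

If the resulting ./service.bolt is then deployed for product B, the quality of the layout depends entirely on how well workload A's profile covers the functions product B actually executes.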