vtr-verilog-to-routing
vtr-verilog-to-routing copied to clipboard
Placement Refactoring
This is a work in progress. The full description will be written soon.
Titan QoR:
@amin1377 : let's reanimate this one ... please resolve conflicts and get updated CPU time data and post the details here. Perhaps @soheilshahrouz or Robert (@robluo) could help speed things up if there is still a cpu gap.
Adding @AlexandreSinger and @MohamedElgammal too. I think this PR has two things: A refactored (cleaner) placement structure, but it is 10% slower. An updated / enhanced version of the re-clustering API. Please let me know if that is correct or incorrect Amin and Mohamed. @soheilshahrouz is interested in reviewing and helping to land, and @AlexandreSinger should be able to build on top of this.
Adding in @robluo as he is potentially interested in helping with this one.
@robluo @AlexandreSinger : Amin tells me that he added another abstraction layer on top of Mohamed's reclustering API as part of this PR, which should make it easier to call from placement. He did some initial testing on this API layer (which it passed) but the testing isn't that thorough yet. The refactored placement code worked (passed all reg tests), but was ~10% slower. Needs to be rebased with conflicts resolved and that needs to be rerun (@amin1377). This PR also includes enhancements Mohamed made to the reclustering API. Those haven't been landed yet; they are in https://github.com/verilog-to-routing/vtr-verilog-to-routing/pull/2304 and Amin thinks it would be cleaner to land them first (@MohamedElgammal) The re-clustering during placement feature doesn't have any unit tests etc. yet in this PR (Amin's testing was manual).
Our working theory is that the cost function updates are slower due to static variables (and maybe functions?) are moved out of place.cpp into a bunch of .cpp files (cleaner, but maybe the compiler can't optimize as much). @amin1377 : please share any prior profiling etc. results you have. Also, ensure we have IPO on when compiling.
@vaughnbetz @amin1377 I would be very surprised if the problem is coming from the cost function updates. Especially with LTO and IPO turned on it would not make sense to me that moving functions into their own methods would cause a 10% degradation in run-time across the overall placement.
I would like to see if the amount of work being performed is the same (the same number of swaps are being performed) and then I would like to see some A/B profiling results. Seeing the output of GPROF before and after may tell us if a specific function is causing the problem or not. If both of those are tied, I am wondering if there may be memory affects that we are dealing with. Perhaps the elements of a data structure was made slightly bigger, causing each element to not align with cache lines; but, this is more of a guess without profiling data.
Perhaps we can do perf results on the IPO/LTO optimized code, since perf can operate at runtime. We may be able to see if it is memory instructions or something else. Just an idea.
Titan:
Titan other:
Looks promising!
- Resolve conflicts
- Run koios & VTR to get more CPU times. These ones are looking promising.
- If there's a runtime gap, we should then profile to see if there are any obvious hotspots. But we don't need to get exact runtime parity if it proves really hard.
Please attach Koios and VTR run results.
QoR on all koios circuits:
You have a warning that is making the various build types fail (we turn on -Werror): [ 81%] Building CXX object vpr/CMakeFiles/libvpr.dir/src/place/net_cost_handler.cpp.o /home/runner/work/vtr-verilog-to-routing/vtr-verilog-to-routing/vpr/src/place/net_cost_handler.cpp:243:13: error: "/" within comment [-Werror=comment] 243 | * @details / Updates the bounding box of a net by storing its coordinates in the bb_coord_new data structure and
Thanks @amin1377 ! Once you have as many suggestions resolved as you want for this PR (others can go to separate issues), you'll need to run full QoR sweeps:
- VTR, Titan, Koios on usual 2D architecture each
- 3D architecture with 3D BB
- 3D architecture with layer BB
Adding @soheilshahrouz to the conversation, as he is interested in trying parallel "population annealing", and to do that he would need non-global objects for the placement state and the placement cost.
QoR after applying comments:
Baseline: Master branch (121a16c5718322e10fa5efa2d9badd946e7)
Placement Refac Branch (d0008743040de1f739949dbccc35958a5)
Titan:
Large VTR: