vtr-verilog-to-routing icon indicating copy to clipboard operation
vtr-verilog-to-routing copied to clipboard

Placement Refactoring

Open amin1377 opened this issue 1 year ago • 12 comments

This is a work in progress. The full description will be written soon.

amin1377 avatar Sep 15 '23 16:09 amin1377

Titan QoR: Screenshot from 2023-10-20 15-27-43

amin1377 avatar Oct 20 '23 19:10 amin1377

@amin1377 : let's reanimate this one ... please resolve conflicts and get updated CPU time data and post the details here. Perhaps @soheilshahrouz or Robert (@robluo) could help speed things up if there is still a cpu gap.

vaughnbetz avatar May 23 '24 14:05 vaughnbetz

Adding @AlexandreSinger and @MohamedElgammal too. I think this PR has two things: A refactored (cleaner) placement structure, but it is 10% slower. An updated / enhanced version of the re-clustering API. Please let me know if that is correct or incorrect Amin and Mohamed. @soheilshahrouz is interested in reviewing and helping to land, and @AlexandreSinger should be able to build on top of this.

vaughnbetz avatar May 23 '24 19:05 vaughnbetz

Adding in @robluo as he is potentially interested in helping with this one.

vaughnbetz avatar May 30 '24 15:05 vaughnbetz

@robluo @AlexandreSinger : Amin tells me that he added another abstraction layer on top of Mohamed's reclustering API as part of this PR, which should make it easier to call from placement. He did some initial testing on this API layer (which it passed) but the testing isn't that thorough yet. The refactored placement code worked (passed all reg tests), but was ~10% slower. Needs to be rebased with conflicts resolved and that needs to be rerun (@amin1377). This PR also includes enhancements Mohamed made to the reclustering API. Those haven't been landed yet; they are in https://github.com/verilog-to-routing/vtr-verilog-to-routing/pull/2304 and Amin thinks it would be cleaner to land them first (@MohamedElgammal) The re-clustering during placement feature doesn't have any unit tests etc. yet in this PR (Amin's testing was manual).

vaughnbetz avatar May 30 '24 15:05 vaughnbetz

Our working theory is that the cost function updates are slower due to static variables (and maybe functions?) are moved out of place.cpp into a bunch of .cpp files (cleaner, but maybe the compiler can't optimize as much). @amin1377 : please share any prior profiling etc. results you have. Also, ensure we have IPO on when compiling.

vaughnbetz avatar May 30 '24 15:05 vaughnbetz

@vaughnbetz @amin1377 I would be very surprised if the problem is coming from the cost function updates. Especially with LTO and IPO turned on it would not make sense to me that moving functions into their own methods would cause a 10% degradation in run-time across the overall placement.

I would like to see if the amount of work being performed is the same (the same number of swaps are being performed) and then I would like to see some A/B profiling results. Seeing the output of GPROF before and after may tell us if a specific function is causing the problem or not. If both of those are tied, I am wondering if there may be memory affects that we are dealing with. Perhaps the elements of a data structure was made slightly bigger, causing each element to not align with cache lines; but, this is more of a guess without profiling data.

Perhaps we can do perf results on the IPO/LTO optimized code, since perf can operate at runtime. We may be able to see if it is memory instructions or something else. Just an idea.

AlexandreSinger avatar May 30 '24 15:05 AlexandreSinger

Titan: image

amin1377 avatar Jun 13 '24 12:06 amin1377

Titan other: image

amin1377 avatar Jun 13 '24 12:06 amin1377

Looks promising!

  1. Resolve conflicts
  2. Run koios & VTR to get more CPU times. These ones are looking promising.
  3. If there's a runtime gap, we should then profile to see if there are any obvious hotspots. But we don't need to get exact runtime parity if it proves really hard.

vaughnbetz avatar Jun 13 '24 14:06 vaughnbetz

Please attach Koios and VTR run results.

vaughnbetz avatar Jun 21 '24 17:06 vaughnbetz

QoR on all koios circuits: image

amin1377 avatar Jun 24 '24 22:06 amin1377

You have a warning that is making the various build types fail (we turn on -Werror): [ 81%] Building CXX object vpr/CMakeFiles/libvpr.dir/src/place/net_cost_handler.cpp.o /home/runner/work/vtr-verilog-to-routing/vtr-verilog-to-routing/vpr/src/place/net_cost_handler.cpp:243:13: error: "/" within comment [-Werror=comment] 243 | * @details / Updates the bounding box of a net by storing its coordinates in the bb_coord_new data structure and

vaughnbetz avatar Jul 05 '24 16:07 vaughnbetz

Thanks @amin1377 ! Once you have as many suggestions resolved as you want for this PR (others can go to separate issues), you'll need to run full QoR sweeps:

  • VTR, Titan, Koios on usual 2D architecture each
  • 3D architecture with 3D BB
  • 3D architecture with layer BB

vaughnbetz avatar Jul 05 '24 16:07 vaughnbetz

Adding @soheilshahrouz to the conversation, as he is interested in trying parallel "population annealing", and to do that he would need non-global objects for the placement state and the placement cost.

vaughnbetz avatar Jul 08 '24 21:07 vaughnbetz

QoR after applying comments: Baseline: Master branch (121a16c5718322e10fa5efa2d9badd946e7) Placement Refac Branch (d0008743040de1f739949dbccc35958a5) Titan: image

Large VTR: image

amin1377 avatar Jul 19 '24 15:07 amin1377