Improve incremental repair running times so that megaboom can be done in 24 hours
Description
repair_timing breaks the 24-hour budget for megaboom, so it has been disabled.
Note that repair_timing is not the biggest problem in megaboom (the clock period is ca. 3000 ps now, and I believe repair timing accounts for only a small fraction of that), so this is a forward-looking feature request for when macro placement has been addressed.
To reproduce, untar https://drive.google.com/file/d/14-UYL5iUD1GWlL10sYCOPIB0jOpVYDzd/view?usp=sharing and run:
./run-me-BoomTile-asap7-base.sh
[deleted]
[INFO GRT-0018] Total wirelength: 40146727 um
[INFO GRT-0014] Routed nets: 1976482
Perform buffer insertion...
[INFO RSZ-0058] Using max wire length 162um.
[after hours, still running]
[deleted]
repair_timing -verbose -repair_tns 0
[INFO RSZ-0094] Found 189600 endpoints with setup violations.
[INFO RSZ-0099] Repairing 1 out of 189600 (0.00%) violating endpoints...
Iter | Removed | Resized | Inserted | Cloned |   Pin |        WNS |           TNS |   Viol | Worst
     | Buffers |   Gates |  Buffers |  Gates | Swaps |            |               | Endpts | Endpt
---------------------------------------------------------------------------------------------------
   0 |       0 |       0 |        0 |      0 |     0 |  -4099.470 |  -410877536.0 | 189600 | core/iregister_read/exe_reg_rs1_data_0\[1\]$_DFF_P_/D
[deleted]
185* |       5 |      22 |       39 |     15 |    38 |  -3812.590 |  -359985248.0 | 189577 | lsu/ldq_31_bits_uop_debug_inst\[28\]$_DFFE_PP_/D
[still running after 10 hours or so, stopped]
Suggested Solution
Make repair_timing faster
Additional Context
No response
After 8 hours or so, the stack trace is as below.
Hmm... throw is used to abandon candidates when searching for a solution; could this be done without exceptions to improve performance?
__cxa_throw (@__cxa_throw:3)
sta::DmpPi::evalDmpEqns() (.cold) (@sta::DmpPi::evalDmpEqns() (.cold):20)
sta::DmpAlg::findDriverParams(double) (@sta::DmpAlg::findDriverParams(double):543)
sta::DmpPi::gateDelaySlew(double&, double&) (@sta::DmpPi::gateDelaySlew(double&, double&):21)
sta::DmpCeffDelayCalc::gateDelay(sta::Pin const*, sta::TimingArc const*, float const&, float, sta::Parasitic const*, std::map<sta::Pin const*, unsigned long, sta::PinIdLess, std::allocator<std::pair<sta::Pin const* const, unsigned long>>> const&, sta::DcalcAnalysisPt const*) (@sta::DmpCeffDelayCalc::gateDelay(sta::Pin const*, sta::TimingArc const*, float const&, float, sta::Parasitic const*, std::map<sta::Pin const*, unsigned long, sta::PinIdLess, std::allocator<std::pair<sta::Pin const* const, unsigned long>>> const&, sta::DcalcAnalysisPt const*):92)
sta::GraphDelayCalc::findDriverArcDelays(sta::Vertex*, sta::MultiDrvrNet const*, sta::Edge*, sta::TimingArc const*, std::map<sta::Pin const*, unsigned long, sta::PinIdLess, std::allocator<std::pair<sta::Pin const* const, unsigned long>>>&, sta::DcalcAnalysisPt const*, sta::ArcDelayCalc*) (@sta::GraphDelayCalc::findDriverArcDelays(sta::Vertex*, sta::MultiDrvrNet const*, sta::Edge*, sta::TimingArc const*, std::map<sta::Pin const*, unsigned long, sta::PinIdLess, std::allocator<std::pair<sta::Pin const* const, unsigned long>>>&, sta::DcalcAnalysisPt const*, sta::ArcDelayCalc*):111)
sta::GraphDelayCalc::findDriverEdgeDelays(sta::Vertex*, sta::MultiDrvrNet const*, sta::Edge*, sta::ArcDelayCalc*, std::array<bool, 2ul>&) (@sta::GraphDelayCalc::findDriverEdgeDelays(sta::Vertex*, sta::MultiDrvrNet const*, sta::Edge*, sta::ArcDelayCalc*, std::array<bool, 2ul>&):71)
sta::GraphDelayCalc::findDriverDelays1(sta::Vertex*, sta::MultiDrvrNet*, sta::ArcDelayCalc*) (@sta::GraphDelayCalc::findDriverDelays1(sta::Vertex*, sta::MultiDrvrNet*, sta::ArcDelayCalc*):108)
sta::GraphDelayCalc::findDriverDelays(sta::Vertex*, sta::ArcDelayCalc*) (@sta::GraphDelayCalc::findDriverDelays(sta::Vertex*, sta::ArcDelayCalc*):31)
sta::FindVertexDelays::visit(sta::Vertex*) (@sta::FindVertexDelays::visit(sta::Vertex*):72)
sta::BfsIterator::visit(int, sta::VertexVisitor*) (@sta::BfsIterator::visit(int, sta::VertexVisitor*):72)
sta::GraphDelayCalc::findDelays(int) (@sta::GraphDelayCalc::findDelays(int):48)
sta::Sta::searchPreamble() (@sta::Sta::searchPreamble():25)
sta::Sta::findRequireds() (@sta::Sta::findRequireds():7)
rsz::RepairSetup::repairSetupLastGasp(rsz::OptoParams const&, int&) (@rsz::RepairSetup::repairSetupLastGasp(rsz::OptoParams const&, int&):437)
rsz::RepairSetup::repairSetup(float, double, int, bool, bool, bool, bool, bool) (@rsz::RepairSetup::repairSetup(float, double, int, bool, bool, bool, bool, bool):739)
_wrap_repair_setup (@_wrap_repair_setup:124)
TclNRRunCallbacks (@TclNRRunCallbacks:38)
___lldb_unnamed_symbol1507 (@___lldb_unnamed_symbol1507:348)
Tcl_EvalEx (@Tcl_EvalEx:11)
@jeffng-or @precisionmoon @maliberty In addition to improving performance, can we add some logic to automatically abandon futile timing repair?
WNS is -3812.590 ps for a 3000 ps clock period. Isn't repair_timing an exercise in futility at that point?
It continues as long as there is some progress; it is not based on the clock period.
Add an option to stop repairing at a certain threshold?
TIMING_REPAIR_UNTIL=-1300
(If TIMING_REPAIR_UNTIL is blank or not set, which would be the default, behavior is unchanged compared to today.)
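If such a knob existed, its semantics might look like this sketch. To be clear, TIMING_REPAIR_UNTIL is only the proposed variable, not an existing one, and the per-iteration improvement below is invented purely for illustration:

```shell
# Sketch of the proposed early-exit semantics. TIMING_REPAIR_UNTIL is
# hypothetical (it does not exist in OpenROAD-flow-scripts today), and the
# +5000 ps per iteration is a made-up stand-in for one repair iteration's
# improvement.
TIMING_REPAIR_UNTIL=-1300
wns=-20000
iter=0
# Keep iterating while WNS is worse than the threshold, then stop.
while [ "$wns" -lt "$TIMING_REPAIR_UNTIL" ]; do
  iter=$((iter + 1))
  wns=$((wns + 5000))
done
echo "stopped after $iter iterations at WNS=${wns}ps"
```

The point is only the stopping rule: repair runs until the threshold is crossed, not until progress stalls.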
For exploring placement & global route issues in megaboom, running repair for 10 minutes to get from -20000 to -1300 might be more than good enough to work on those issues.
Running until -1000 ps might take many more hours... Eventually, for a final run when all issues are resolved, it is probably worth doing the last-gasp repair timing.
Without timing repair, WNS=-20000. Even after a few iterations, a few minutes, it is enormously improved to -1300 or so.
A "Seconds" column would be useful, unless the number of iterations is roughly proportional to time? (I don't have a sense of whether that is the case.)
Iter | Removed | Resized | Inserted | Cloned |   Pin |        WNS |           TNS |   Viol | Worst
     | Buffers |   Gates |  Buffers |  Gates | Swaps |            |               | Endpts | Endpt
---------------------------------------------------------------------------------------------------
   0 |       0 |       0 |        0 |      0 |     0 |  -2195.436 |  -114376600.0 | 165887 | core/int_issue_unit/io_dis_uops_0_ready_REG$_DFF_P_/D
  10 |       0 |       7 |        0 |      1 |     1 |  -1346.351 |   -80619768.0 | 165887 | core/int_issue_unit/io_dis_uops_0_ready_REG$_DFF_P_/D
  20 |       0 |       9 |        2 |      2 |     7 |  -1300.067 |   -77720360.0 | 165887 | core/int_issue_unit/io_dis_uops_0_ready_REG$_DFF_P_/D
  30 |       0 |      11 |        3 |      3 |    13 |  -1288.946 |   -77026352.0 | 165887 | core/iregister_read/exe_reg_rs1_data_0\[55\]$_DFF_P_/D
SETUP_SLACK_MARGIN/HOLD_SLACK_MARGIN already do this. They were intended for over-fixing but nothing prevents you from using them for under-fixing afaik.
Nice! I just need to learn how to use them then. Any examples?
microwatt, though with a positive value
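For concreteness, a minimal sketch of the under-fixing idea. SETUP_SLACK_MARGIN and HOLD_SLACK_MARGIN are real OpenROAD-flow-scripts variables (normally used with positive values for over-fixing); the -1300 value is just the threshold discussed in this thread, not a recommendation:

```shell
# Under-fixing sketch: a negative margin makes repair_timing treat slack
# better than the margin as acceptable, so it stops short of closing timing.
# Variable names are real ORFS knobs; the values here are illustrative only.
export SETUP_SLACK_MARGIN=-1300
export HOLD_SLACK_MARGIN=-1300
echo "margins: setup=${SETUP_SLACK_MARGIN} hold=${HOLD_SLACK_MARGIN}"
```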
Would setting a clock period that gives us a positive slack quickly in timing repair have the same effect?
megaboom has a 1200ps clock period and gets to negative slack -1200 relatively quickly, so I should see timing repair complete quickly if I set clock period to 1200+1200 = 2400ps?
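The arithmetic behind that guess, as a sketch (assuming WNS simply adds to the required period):

```shell
# If WNS settles near -1200 ps quickly at a 1200 ps clock, a clock relaxed
# by that amount should let repair_timing finish quickly on the same netlist.
clock_ps=1200
wns_ps=-1200
relaxed_ps=$((clock_ps - wns_ps))  # 1200 - (-1200) = 2400
echo "relaxed clock period: ${relaxed_ps} ps"
```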
Typo... https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts/pull/2497
Though... keeping ABC_CLOCK_PERIOD stable has the advantage that we don't have to redo synthesis; we could then sweep SETUP_SLACK_MARGIN/HOLD_SLACK_MARGIN more easily.
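A hypothetical sweep along those lines. The make invocation is a placeholder (echoed rather than executed, since the thread doesn't give the target), so this is only a sketch of the idea:

```shell
# Hypothetical margin sweep: synthesis (ABC_CLOCK_PERIOD) stays fixed, so
# only the post-synthesis stages would be rerun per point. The make command
# below is a placeholder, not a real invocation from this flow.
for margin in -2000 -1500 -1300; do
  echo "would run: make SETUP_SLACK_MARGIN=${margin}"
done
```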
https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts/pull/2498
Would setting a clock period that gives us a positive slack quickly in timing repair have the same effect?
Yes.
Moot; use SETUP_SLACK_MARGIN with a negative value to terminate repair_timing early.