Dynamic node fails with routing congestion
Describe the bug
Something changed between https://github.com/The-OpenROAD-Project/OpenROAD/compare/4b729ea474e9bd0a3513579b41a8c96fe20dc75e...6e29df0746191b48963a35bf45422d76e2c9922c that causes a routing congestion issue on a previously working design. The breaking change appears to have landed after 588bc37d48f2a5ffef6f2b0d93f24e1810a13038.
Expected Behavior
No routing congestion issue
Environment
OpenROAD v2.0-22186-g6e29df0746
Features included (+) or not (-): +GPU +GUI +Python
This program is licensed under the BSD-3 license. See the LICENSE file for details.
Components of this program may be licensed under more restrictive licenses which must be honored.
To Reproduce
sc_issue_dynamic_node_job0_skywater130_sky130hd_route.global0_20250615-011417.tar.gz
tar xvf sc_issue_dynamic_node_job0_skywater130_sky130hd_route.global0_20250615-011417.tar.gz
cd sc_issue_dynamic_node_job0_skywater130_sky130hd_route.global0_20250615-011417
./run.sh
Relevant log output
[INFO GRT-0101] Running extra iterations to remove overflow.
[INFO GRT-0103] Extra Run for hard benchmark.
[WARNING GRT-0230] Congestion iterations cannot increase overflow, reached the maximum number of times the total overflow can be increased.
[INFO GRT-0197] Via related to pin nodes: 52735
[INFO GRT-0198] Via related Steiner nodes: 2330
[INFO GRT-0199] Via filling finished.
[INFO GRT-0111] Final number of vias: 95509
[INFO GRT-0112] Final usage 3D: 393857
[INFO GRT-0096] Final congestion report:
Layer Resource Demand Usage (%) Max H / Max V / Total Overflow
---------------------------------------------------------------------------------------
li1 0 0 0.00% 0 / 0 / 0
met1 30211 25809 85.43% 4 / 1 / 799
met2 38127 36714 96.29% 0 / 10 / 3532
met3 32346 27842 86.08% 4 / 1 / 826
met4 14910 12881 86.39% 0 / 5 / 1358
met5 5112 4084 79.89% 2 / 0 / 140
---------------------------------------------------------------------------------------
Total 120706 107330 88.92% 10 / 17 / 6655
[INFO GRT-0018] Total wirelength: 1045488 um
[INFO GRT-0014] Routed nets: 7246
[ERROR GRT-0116] Global routing finished with congestion. Check the congestion regions in the DRC Viewer.
[ERROR FLW-0001] Global routing failed, saving database to reports/dynamic_node_top_wrap.globalroute-error.odb
Error: sc_global_route.tcl, 64 FLW-0001
Screenshots
No response
Additional Context
No response
@gadfort sometimes grt is the symptom rather than the cause. Does anything differ substantially in the good/bad results pre-grt? grt itself hasn't had a lot of big changes recently.
Would you include a working run as well for comparison?
@maliberty there are a whole host of instances named "placeXX" which I haven't noticed before (they seem to come from this PR https://github.com/The-OpenROAD-Project/OpenROAD/pull/7485). I will upload a working run in a little while.
Here is a working one from 588bc37: https://drive.google.com/file/d/1cSI4y8LkhVK1xZUtJABpgQRvhxv9pwo4/view?usp=sharing
@povik #7485 is your PR so please investigate. I've started a quick look as well.
@gadfort where can I find a gpl log for the failing run?
Thanks Peter
The first RD in the failing run spikes up the target density:
[INFO GPL-0063] New Target Density: 0.9584
vs
[INFO GPL-0063] New Target Density: 0.6136
on the passing run.
Final densities (after all TD/RD) are 0.84 vs 0.69.
The cell areas after gpl are similar (old):
Design area 106462 u^2 44% utilization.
Cell type report: Count Area
Tap cell 3202 4006.34
Timing Repair Buffer 284 4762.07
Inverter 151 844.56
Sequential cell 2257 65683.00
Multi-Input combinational cell 3303 35172.48
Total 9197 110468.45
and (new)
Design area 107538 u^2 44% utilization.
Cell type report: Count Area
Tap cell 3202 4006.34
Timing Repair Buffer 951 4827.13
Inverter 151 854.57
Sequential cell 2257 65683.00
Multi-Input combinational cell 3303 36173.44
Total 9864 111544.48
This seems like a very large jump in density. Perhaps @gudeh's work to decrease fillers instead would help. I am curious why the density jumps so much in one run vs the other when the final area is so similar.
Log from failing build: place.global.log
Log from working build: place.global.log
On the failing log I noticed this warning on the second timing-driven gpl iteration: [WARNING RSZ-2005] cannot find a viable buffering solution on pin _5181_/Y after 1 rounds of buffering (no solution meets design rules).
On the working log we have 13 routability iterations with around 1% area inflation each, which is OK. On the failing log, however, we already have a 76% increase in area on the first routability iteration, coming from really high routing congestion. I will investigate further.
On the failing log I noticed this warning on the second timing-driven gpl iteration: [WARNING RSZ-2005] cannot find a viable buffering solution on pin _5181_/Y after 1 rounds of buffering (no solution meets design rules).
That's worth looking at but it's probably not related to the excessive inflation or the congestion.
sc_issue_dynamic_node_job0_skywater130_sky130hd_place.global0_20250616-083250.tar.gz
@povik FYI
@gadfort I am getting this error with that tarball:
[ERROR ODB-0477] dbTable mask/shift mismatch 127/7 vs 4095/12
[ERROR GUI-0070] ODB-0477
This will be my change - let me take a look.
This file must have been generated in the window in which I broke backward compatibility, which I've subsequently fixed (#7589). To load this particular db you'll need the old bug back. In dbBlock.h, undo:
- dbTable<_dbDft>* _dft_tbl;
+ dbTable<_dbDft, 4096>* _dft_tbl;
I had to check out commit 6e29df0746191b48963a35bf45422d76e2c9922c to reproduce the error.
The working run is for routing; it could be useful to have a working run for global placement as well.
Here is the final placement, and RUDY at the end of placement:
I noticed an unusual initial placement, with already 70% overflow (we usually start with ~100% on ORFS designs):
It seems we might be calling rsz too early for this design, because we go from 70% back up to 95% overflow (at iteration 100) after two timing-driven iterations. The usual progression of overflow is from a high value to a low value.
We also seem to have a divergence happening during routability mode.
It seems to me the main cause of the issue is the unusual initial placement.
here is a gif showing the placement behavior
sc_issue_dynamic_node_job0_skywater130_sky130hd_place.global0_20250617-082138.tar.gz
@gudeh here is the run for the passing commit.
the failing run without initialPlace looks a lot better:
This is the failing run during initialPlacement with the CG method. I highlighted the nets of three clusters manually.
Then we run nesterov, which twists instances all the way from the left to the right, which is really awkward:
This is nesterov without initialPlace, which results in the heatmap I sent in the previous message.
The conjugate gradient (CG) method is always used to set an initial placement for nesterov (both run in stage 3-3 in ORFS).
I ran a test-CI with skip_initial_placement, removing the CG procedure and executing only nesterov. It degraded metrics quite a bit.
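For reference, a minimal sketch of that experiment when driving OpenROAD directly (the ORFS/SiliconCompiler wiring differs, and the database paths and density value below are placeholders); -skip_initial_place bypasses the CG step so only nesterov runs:

read_db results/dynamic_node_top_wrap.before_gpl.odb    ;# placeholder path, use the pre-gpl db from your flow
# Keep whatever other gpl options the flow already passes (timing-driven,
# routability-driven, etc.); only the CG initial placement is skipped here.
global_placement -skip_initial_place -density 0.60      ;# 0.60 is an example target density
write_db results/dynamic_node_top_wrap.gpl_no_initial_place.odb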
Looking at @gadfort's design I thought the CG initial placement was completely mismatched with nesterov, since clusters placed on the left by the initial placer (first gif) were being brought to the right during nesterov (second gif). However, after removing the timing-driven option, the CG initial placement and nesterov do match: nesterov keeps the clusters roughly where the initial placer put them, and we do not see them twisting to one side and then the other:
and here is final rudy without timing-driven
I think a way to avoid the issue is to not run timing-driven iterations if we start with a low overflow. For example, if we start with a 0.70 overflow (such as this design), we do not run the timing-driven iterations with thresholds higher than 0.70.
For example, if we start with a 0.70 overflow (such as this design), we do not run the timing-driven iterations with thresholds higher than 0.70.
That makes sense, but it might be good to run the same number of TD iterations, just at lower thresholds. E.g. if we start with 0.70 we cap any thresholds at 0.60 to give it a bit of margin. This might become complicated when combined with RD, since I think RD uses the assumption that no TD iterations happen in the span 0.30-0.60 or similar.
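A rough sketch of that capping rule (a hypothetical helper, not actual gpl code), using the two TD thresholds seen in this thread's logs:

proc cap_td_thresholds {start_overflow thresholds {margin 0.10}} {
    # Cap every timing-driven overflow threshold at (starting overflow - margin)
    # so TD never fires before the placer has pushed overflow below its start.
    set cap [expr {$start_overflow - $margin}]
    set capped {}
    foreach t $thresholds {
        lappend capped [expr {min($t, $cap)}]
    }
    return $capped
}

# This design starts at ~0.70 overflow; the TD calls at 0.79 and 0.64 would
# both be capped to 0.60.
puts [cap_td_thresholds 0.70 {0.79 0.64}]   ;# -> 0.6 0.6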
Yes, RD currently takes hold of the 0.60-0.30 range.
Well, trying my idea, we end up removing only the first timing-driven call (at 0.79); we still get one timing-driven call (at 0.64) before saving the RD snapshot (at 0.60), and we still get the twisting followed by high routing congestion.
I am not yet sure how we can avoid this issue and maintain the first two non-virtual timing-driven calls for this design.
Following @povik's suggestion, I tested setting the maximum TD weight to 1.0 (no effect), but the issue persisted.
I also tested using the first two TD iterations as virtual, and the problem did not occur, suggesting that buffer insertion is triggering the issue, while TD weights alone do not seem to be the cause.
I believe we could revert to using two virtual iterations at the start of placement while we investigate a more permanent solution. What do you guys think?
Additionally, I spoke with Matt, and we noticed some unexpected behavior related to the starting positions used in GPL, though I believe this is unrelated to the current issue.
I ran a test-CI using -skip_initial_place and here is the absolute difference of DPL and DRT wirelength versus master. We get a big increase in wirelength on two designs and not much change on the rest of the designs.
I was talking with @eder-matheus about this today. Here are the values in google sheets
@gudeh even with skip-initial-place the design is not routable, so that doesn't do anything useful to unstick me. Maybe:
- disable the new pass until these issues are resolved
- provide a method to disable the new pass
either would be able to unstick this.
@gadfort If a per-design workaround is of use, you can try -keep_resize_below_overflow 0.0
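For example (a sketch only; I'm assuming here that the option is accepted by global_placement in this build, and the SiliconCompiler flow may expose it differently):

# Keep none of the resizer changes made during global placement
# (overflow threshold 0.0), per the suggestion above.
global_placement -timing_driven -routability_driven \
    -keep_resize_below_overflow 0.0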
@povik @gudeh it was an assumption that the routing / density issue originated in global place, but is it possible that it's somewhere else? Since the first choice didn't solve it, I suppose I can just disable timing-driven altogether (not a great solution either, but it might unstick me for the time being).
Since the first choice didn't solve it
Not sure I follow. What first choice do you mean?