
Dynamic node fails with routing congestion

gadfort opened this issue 6 months ago · 18 comments

Describe the bug

Between https://github.com/The-OpenROAD-Project/OpenROAD/compare/4b729ea474e9bd0a3513579b41a8c96fe20dc75e...6e29df0746191b48963a35bf45422d76e2c9922c something changed that causes a routing congestion issue on a previously working design. It appears the breaking change landed after 588bc37d48f2a5ffef6f2b0d93f24e1810a13038.
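
One way to pin down the exact commit would be a bisect over that range (a sketch; check_congestion.sh is a hypothetical wrapper that rebuilds and exits non-zero when GRT-0116 fires):

git bisect start
git bisect bad 6e29df0746191b48963a35bf45422d76e2c9922c
git bisect good 588bc37d48f2a5ffef6f2b0d93f24e1810a13038
git bisect run ./check_congestion.sh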

Expected Behavior

No routing congestion issue

Environment

OpenROAD v2.0-22186-g6e29df0746 
Features included (+) or not (-): +GPU +GUI +Python
This program is licensed under the BSD-3 license. See the LICENSE file for details.
Components of this program may be licensed under more restrictive licenses which must be honored.

To Reproduce

sc_issue_dynamic_node_job0_skywater130_sky130hd_route.global0_20250615-011417.tar.gz

tar xvf sc_issue_dynamic_node_job0_skywater130_sky130hd_route.global0_20250615-011417.tar.gz
cd sc_issue_dynamic_node_job0_skywater130_sky130hd_route.global0_20250615-011417
./run.sh

Relevant log output

[INFO GRT-0101] Running extra iterations to remove overflow.
[INFO GRT-0103] Extra Run for hard benchmark.
[WARNING GRT-0230] Congestion iterations cannot increase overflow, reached the maximum number of times the total overflow can be increased.
[INFO GRT-0197] Via related to pin nodes: 52735
[INFO GRT-0198] Via related Steiner nodes: 2330
[INFO GRT-0199] Via filling finished.
[INFO GRT-0111] Final number of vias: 95509
[INFO GRT-0112] Final usage 3D: 393857

[INFO GRT-0096] Final congestion report:
Layer         Resource        Demand        Usage (%)    Max H / Max V / Total Overflow
---------------------------------------------------------------------------------------
li1                  0             0            0.00%             0 /  0 /  0
met1             30211         25809           85.43%             4 /  1 / 799
met2             38127         36714           96.29%             0 / 10 / 3532
met3             32346         27842           86.08%             4 /  1 / 826
met4             14910         12881           86.39%             0 /  5 / 1358
met5              5112          4084           79.89%             2 /  0 / 140
---------------------------------------------------------------------------------------
Total           120706        107330           88.92%            10 / 17 / 6655

[INFO GRT-0018] Total wirelength: 1045488 um
[INFO GRT-0014] Routed nets: 7246
[ERROR GRT-0116] Global routing finished with congestion. Check the congestion regions in the DRC Viewer.
[ERROR FLW-0001] Global routing failed, saving database to reports/dynamic_node_top_wrap.globalroute-error.odb
Error: sc_global_route.tcl, 64 FLW-0001

Screenshots

No response

Additional Context

No response

gadfort · Jun 15 '25 14:06

@gadfort sometimes grt is the symptom rather than the cause. Does anything differ substantially in the good/bad results pre-grt? grt itself hasn't had a lot of big changes recently.

maliberty · Jun 15 '25 15:06

Would you include a working run as well for comparison?

maliberty · Jun 15 '25 15:06

@maliberty there are a whole host of instances named "placeXX" which I haven't noticed before (they seem to come from this PR https://github.com/The-OpenROAD-Project/OpenROAD/pull/7485). I will upload a working run in a little while.

gadfort · Jun 15 '25 15:06

Here is a working one from 588bc37: https://drive.google.com/file/d/1cSI4y8LkhVK1xZUtJABpgQRvhxv9pwo4/view?usp=sharing

gadfort · Jun 15 '25 15:06

@povik #7485 is your PR so please investigate. I've started a quick look as well.

maliberty · Jun 15 '25 18:06

@gadfort where can I find a gpl log for the failing run?

povik · Jun 15 '25 18:06

Log from failing build: place.global.log

Log from working build: place.global.log

gadfort · Jun 15 '25 21:06

Thanks Peter

The first RD in the failing run spikes up the target density:

[INFO GPL-0063] New Target Density:             0.9584

vs

[INFO GPL-0063] New Target Density:             0.6136

on the passing run.

Final densities (after all TD/RD) are 0.84 vs 0.69.
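
For intuition, a rough sketch of how the GPL-0063 value comes about (invented names, not the actual gpl code): routability mode inflates cell areas in congested bins and recomputes the target density from the inflated area, so one aggressive RD pass can push it from ~0.61 to ~0.96:

#include <numeric>
#include <vector>

struct Cell {
  double area;
  double inflation;  // >= 1.0, grows with local routing congestion
};

// Hypothetical recomputation of the GPL-0063 value after RD inflation.
double newTargetDensity(const std::vector<Cell>& movable,
                        double placeable_area) {
  const double inflated = std::accumulate(
      movable.begin(), movable.end(), 0.0,
      [](double s, const Cell& c) { return s + c.area * c.inflation; });
  return inflated / placeable_area;
}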

The cell areas after gpl are similar (old):

Design area 106462 u^2 44% utilization.
Cell type report:                       Count       Area
  Tap cell                               3202    4006.34
  Timing Repair Buffer                    284    4762.07
  Inverter                                151     844.56
  Sequential cell                        2257   65683.00
  Multi-Input combinational cell         3303   35172.48
  Total                                  9197  110468.45

and (new)

Design area 107538 u^2 44% utilization.
Cell type report:                       Count       Area
  Tap cell                               3202    4006.34
  Timing Repair Buffer                    951    4827.13
  Inverter                                151     854.57
  Sequential cell                        2257   65683.00
  Multi-Input combinational cell         3303   36173.44
  Total                                  9864  111544.48

povik · Jun 16 '25 09:06

That seems like a very large jump in density. Perhaps @gudeh's work to do filler decreasing instead would help. I am curious why the density jumps so much in one run vs the other when the final area is so similar.

maliberty · Jun 16 '25 14:06

Log from failing build: place.global.log

Log from working build: place.global.log

On the failing log I noticed this warning on the second timing-driven gpl iteration: [WARNING RSZ-2005] cannot find a viable buffering solution on pin _5181_/Y after 1 rounds of buffering (no solution meets design rules).

On the working log we have 13 routability iterations with around 1% area inflation each, which is OK, while in the failing log we have a 76% increase in area already on the first routability iteration, coming from really high routing congestion. I will investigate further.

gudeh · Jun 16 '25 16:06

On the failing log I noticed this warning on the second timing-driven gpl iteration: [WARNING RSZ-2005] cannot find a viable buffering solution on pin _5181_/Y after 1 rounds of buffering (no solution meets design rules).

That's worth looking at but it's probably not related to the excessive inflation or the congestion.

povik · Jun 16 '25 16:06

sc_issue_dynamic_node_job0_skywater130_sky130hd_place.global0_20250616-083250.tar.gz

@povik FYI

@gadfort I am getting this error with that tarball:

[ERROR ODB-0477] dbTable mask/shift mismatch 127/7 vs 4095/12
[ERROR GUI-0070] ODB-0477

gudeh · Jun 16 '25 19:06

This will be my change - let me take a look.

maliberty · Jun 16 '25 20:06

This file must have been generated in the window in which I broke backward compatibility, which I've subsequently fixed (#7589). To load this particular db you'll need the old bug back; in dbBlock.h, undo:

-  dbTable<_dbDft>* _dft_tbl;
+  dbTable<_dbDft, 4096>* _dft_tbl;
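
For context on ODB-0477 itself, a hedged illustration (not the actual odb source) of why the message reads "mask/shift mismatch 127/7 vs 4095/12": dbTable pages entries in power-of-two chunks, and the page size fixes the id mask and shift, so a file and a build that disagree on the page size (128 vs 4096 here) cannot interoperate:

#include <cassert>
#include <cstdint>

// Hypothetical sketch: a table id is split into (page, offset)
// using a power-of-two page size.
constexpr uint32_t pageMask(uint32_t page_size) { return page_size - 1; }
constexpr uint32_t pageShift(uint32_t page_size) {
  uint32_t s = 0;
  while ((1u << s) < page_size) {
    ++s;
  }
  return s;
}

int main() {
  assert(pageMask(128) == 127 && pageShift(128) == 7);     // default page size
  assert(pageMask(4096) == 4095 && pageShift(4096) == 12); // 4096-entry pages
}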

maliberty · Jun 16 '25 20:06

I had to check out commit 6e29df0746191b48963a35bf45422d76e2c9922c to reproduce the error.

The working run provided is for routing; it would be useful to also have a working run for global placement.

Here is the final placement, and RUDY at the end of placement:

[Image: final placement]

[Image: RUDY at end of placement]

I noticed an unusual initial placement, with already 70% overflow (we usually start with ~100% on ORFS designs):

[Image: initial placement]

It seems we might be calling rsz too early for this design, because we go from 70% back up to 95% overflow (at iteration 100) after two timing-driven iterations. The usual progression of overflow is from a high value to a low value.

We also seem to have a divergence happening during routability mode.

It seems to me the main cause of the issue is the unusual initial placement.
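
As a point of reference for these percentages, a hedged sketch (invented names, not the gpl source) of what the overflow figure measures, i.e. the movable cell area in over-capacity bins as a fraction of all movable area:

#include <algorithm>
#include <vector>

struct Bin {
  double movable_area;  // movable cell area landing in this bin
  double capacity;      // bin area times target density
};

double overflow(const std::vector<Bin>& bins, double total_movable_area) {
  double over = 0.0;
  for (const Bin& b : bins) {
    over += std::max(0.0, b.movable_area - b.capacity);
  }
  return over / total_movable_area;  // ~1.0 early on, driven down as gpl converges
}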

gudeh · Jun 17 '25 11:06

Here is a GIF showing the placement behavior: [GIF: placement behavior]

gudeh · Jun 17 '25 11:06

The failing run without initialPlace looks a lot better:

[Image: RUDY heatmap for the failing run without initialPlace]

gudeh · Jun 20 '25 09:06

This is the failing run during initialPlacement with the CG method. I highlighted the nets of three clusters manually.

[Image: CG initial placement with the nets of three clusters highlighted]

Then we run nesterov, which twists instances all the way from the left to the right, which is really awkward:

[Image: nesterov result with instances twisted from left to right]

This is nesterov without initialPlace, which results in the heatmap I sent in the previous message.

[Image: nesterov result without initialPlace]

gudeh · Jun 20 '25 10:06

The conjugate gradient (CG) method is always used to set an initial placement for nesterov (both run in stage 3-3 in ORFS).

I ran a test-CI with skip_initial_placement, removing the CG procedure and executing only nesterov. It degraded the metrics quite a bit.
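
The experiment amounts to something like this, assuming the flag maps to gpl's global_placement option:

# Sketch: run nesterov without the CG warm start.
global_placement -skip_initial_place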

Looking at @gadfort's design, I thought the CG initial placement was mismatching completely with nesterov, since clusters placed on the left by the initial placer (first GIF) were being brought to the right during nesterov (second GIF). However, with the timing-driven option removed, CG initial placement and nesterov do match: nesterov keeps the clusters roughly where the initial placer put them, and we do not see them twisting to one side then the other:

[Image: nesterov result without timing-driven, clusters preserved]

And here is the final RUDY without timing-driven:

[Image: final RUDY without timing-driven]

I think a way to avoid the issue is to not run timing-driven iterations if we start with a low overflow. For example, if we start with a 0.70 overflow (such as this design), we do not run the timing-driven iterations with thresholds higher than 0.70.

gudeh · Jun 23 '25 13:06

For example, if we start with a 0.70 overflow (such as this design), we do not run the timing-driven iterations with thresholds higher than 0.70.

That makes sense, but it might be good to run the same number of TD iterations, just at lower thresholds. E.g. if we start with 0.70, we cap any thresholds at 0.60 to give it a bit of margin. This might become complicated when combined with RD, since I think RD assumes no TD iterations happen in the span 0.30-0.60 or similar.
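
A hedged sketch of the capping idea (invented names, not the gpl code), including the RD clash that would need handling:

#include <algorithm>
#include <vector>

// Cap TD trigger overflows below the starting overflow, then drop
// any trigger that would land inside the RD-owned span.
std::vector<float> adjustTdTriggers(std::vector<float> triggers,
                                    float initial_overflow,
                                    float margin = 0.10f) {
  const float cap = initial_overflow - margin;  // 0.70 start -> 0.60 cap
  for (float& t : triggers) {
    t = std::min(t, cap);
  }
  // RD assumes no TD iterations inside roughly 0.60-0.30.
  std::erase_if(triggers, [](float t) { return t < 0.60f && t > 0.30f; });
  return triggers;
}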

povik · Jun 23 '25 14:06

Yes, RD currently takes hold of the 0.60-0.30 range.

gudeh · Jun 23 '25 14:06

Well, trying my idea, we end up removing only the first timing-driven call (at 0.79); we still get one timing-driven call (at 0.64) before saving the RD snapshot (at 0.60), and we still get the twisting followed by high routing congestion.

I am not yet sure how we can avoid this issue and maintain the first two non-virtual timing-driven calls for this design.

gudeh · Jun 23 '25 15:06

Following @povik's suggestion, I tested setting the maximum TD weight to 1.0, but it had no effect; the issue persisted.

I also tested using the first two TD iterations as virtual, and the problem did not occur, suggesting that buffer insertion is triggering the issue, while TD weights alone do not seem to be the cause.

I believe we could revert to using two virtual iterations at the start of placement while we investigate a more permanent solution. What do you guys think?

Additionally, I spoke with Matt, and we noticed some unexpected behavior related to the starting positions used in GPL, though I believe this is unrelated to the current issue.

gudeh · Jun 24 '25 17:06

I ran a test-CI using -skip_initial_place, and here is the absolute difference of DPL and DRT wirelength versus master. We get a big increase in wirelength on two designs and not much change on the rest of the designs.

[Image: absolute DPL and DRT wirelength differences vs master]

I was talking with @eder-matheus about this today. Here are the values in Google Sheets.

gudeh · Jun 25 '25 17:06

@gudeh even with skip-initial-place the design is not routable, so that doesn't do anything useful to unstick me. Maybe:

  1. disable the new pass until these issues are resolved
  2. provide a method to disable the new pass

either would be able to unstick this.

gadfort · Jun 27 '25 15:06

@gadfort If a per-design workaround is of use, you can try -keep_resize_below_overflow 0.0
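
If that flag is an option of global_placement (an assumption on my part, the thread doesn't name the command), the invocation would look roughly like:

# Sketch: 0.0 presumably disables keeping any mid-placement resizer changes.
global_placement -timing_driven -keep_resize_below_overflow 0.0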

povik · Jun 27 '25 15:06

@povik @gudeh it was an assumption that the routing/density issue originated in global placement, but is it possible that it's somewhere else? Since the first choice didn't solve it, I suppose I can just disable timing-driven altogether (not a great solution either, but it might unstick me for the time being).

gadfort · Jun 27 '25 17:06

Since the first choice didn't solve it

Not sure I follow. What first choice do you mean?

povik · Jun 27 '25 17:06