[ERROR MPL-0040] Failed on cluster

Open oharboe opened this issue 1 year ago • 18 comments

13 minutes to reproduce:

untar https://drive.google.com/file/d/18n0z4_Bk9Gscy3RRCiU6FiNvghIb6zIG/view?usp=drive_link

$ time ./run-me-BoomTile-asap7-base.sh
OpenROAD v2.0-15340-g7ebef4425
Features included (+) or not (-): +Charts +GPU +GUI +Python
This program is licensed under the BSD-3 license. See the LICENSE file for details.
Components of this program may be licensed under more restrictive licenses which must be honored.
HierRTLMP Flow enabled...
rtl_macro_placer -halo_width 20 -halo_height 20 -report_directory .//objects/asap7/BoomTile/base/rtlmp -target_util 0.60
Floorplan Outline: (0.0, 0.0) (2160.73, 2160.73),  Core Outline: (1.026, 1.08) (2159.73, 2159.73)
        Number of std cell instances: 1743858
        Area of std cell instances: 220985.73
        Number of macros: 72
        Area of macros: 691249.12
        Halo width: 20.00
        Halo height: 20.00
        Area of macros with halos: 1292620.62
        Area of std cell instances + Area of macros: 912234.88
        Core area: 4659886.50
        Design Utilization: 0.20
        Core Utilization: 0.06
        Manufacturing Grid: 1

[ERROR MPL-0040] Failed on cluster frontend/bpd/banked_predictors_1/btb
Error: macro_place_util.tcl, 143 MPL-0040
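
As a sanity check on the numbers in the log above, the two utilization figures can be recomputed from the printed areas. This is a minimal sketch: the formulas are inferred from the printed values, not taken from the OpenROAD source, and the function name `utilization` is made up for illustration.

```python
def utilization(std_cell_area, macro_area, core_area):
    """Recompute the two utilization figures from the logged areas.

    Assumed formulas (inferred from the log, not from the OpenROAD source):
      design utilization = (std cells + macros) / core area
      core utilization   = std cells / (core area - macros)
    """
    design_util = (std_cell_area + macro_area) / core_area
    core_util = std_cell_area / (core_area - macro_area)
    return design_util, core_util


# Areas copied from the rtl_macro_placer log above.
design_util, core_util = utilization(
    std_cell_area=220985.73,
    macro_area=691249.12,
    core_area=4659886.50,
)
print(f"Design Utilization: {design_util:.2f}")
print(f"Core Utilization: {core_util:.2f}")
```

Both values round to the figures in the log (0.20 and 0.06), so the design is far from full; the failure is about fitting one cluster's children into its own outline, not about overall area.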

Originally posted by @oharboe in https://github.com/The-OpenROAD-Project/megaboom/issues/97#issuecomment-2308988697

oharboe avatar Aug 26 '24 16:08 oharboe

@tonywk What is this, wrong issue?

That's a bot. GitHub is undergoing a surge of bots hosted by certain people from the Russian LUMMA forums. Backed by the government, their goal is apparently to steal as much cryptocurrency as possible.

Weather-OS avatar Aug 26 '24 16:08 Weather-OS

Post deleted

maliberty avatar Aug 26 '24 16:08 maliberty

Should this be an OR issue or is it related to the setup of megaboom?

maliberty avatar Aug 26 '24 16:08 maliberty

Unknown. I have the reproduction case this morning, but I don't know anything about what is going on.

Please advise.

oharboe avatar Aug 26 '24 16:08 oharboe

If you want OR developers to look at it, then it is best to file it with OR. We don't track megaboom issues.

maliberty avatar Aug 26 '24 16:08 maliberty

@maliberty Can you transfer this issue to OpenROAD? I don't have the access permissions

oharboe avatar Aug 26 '24 16:08 oharboe

@AcKoucher please give this high priority (a workaround or a solution)

maliberty avatar Aug 26 '24 17:08 maliberty

@AcKoucher @maliberty Found a workaround, tweak initial conditions

oharboe avatar Aug 26 '24 18:08 oharboe

@AcKoucher @maliberty Found a workaround, tweak initial conditions

From the megaboom PR:

I would like to see an initial diagnosis from @AcKoucher first... but yes. I hope the problem is just some existing rare problem that presents itself with some unfortunate initial conditions and that it can be solved in due course but without urgency.

maliberty avatar Aug 26 '24 19:08 maliberty

It looks like there's a combination of things that make this somewhat peculiar.

  1. After clustering we end up with multiple mixed clusters, each made of a few std cells and a macro (3, 7, 32, 35, 16, 17, 29).
  2. The dead space filling that we apply to mixed clusters during hierarchical macro placement annealing has a meaningful effect in just three clusters (4 --> 7, 8 --> 11, 29 --> 32).
  3. There are way too many tiny std cell clusters.

Now, the actual problem seems to be that when we get to placing the children of cluster 4 in the first image (cluster 7 in the second image, after dead space filling), SA can't fit the clusters in the outline, even with the target util variation. Apparently this happens because the outline penalty never wins the fight against the boundary penalty.

However, there's something going on with the wire length: for all the steps, I see zero in the debug report (perhaps it's too small; I have to check).

------ Penalty ------
Area                       1.0186
Outline Penalty            0.4646
Wirelength                 0.0000
Boundary Penalty         102.0848
Normalized Cost           55.1202

My first suggestion would be to try decreasing the halos, as @oharboe already did, or to decrease the boundary penalty. @maliberty It looks like there's a lot going on; do you have some idea of what to aim for first?
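
To make the imbalance in the debug report concrete, here is a small sketch that computes each term's share of a weighted sum. The equal weights are a placeholder assumption, since the real annealer weights and normalization are not given in this thread, so it does not reproduce the reported normalized cost of 55.1202.

```python
# Penalty terms copied from the debug report above.
penalties = {
    "area": 1.0186,
    "outline": 0.4646,
    "wirelength": 0.0000,
    "boundary": 102.0848,
}

# Hypothetical equal weights; the actual mpl2 weights are not stated here.
weights = {name: 1.0 for name in penalties}


def cost_shares(penalties, weights):
    """Return each term's fraction of the weighted sum of penalties."""
    weighted = {name: weights[name] * value for name, value in penalties.items()}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


shares = cost_shares(penalties, weights)
# Print the terms from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<12}{share:8.2%}")
```

Under that assumption the boundary term is over 98% of the sum and the outline term is under 1%, which fits the observation that the outline penalty never wins against the boundary penalty.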

AcKoucher avatar Aug 26 '24 20:08 AcKoucher

Thanks! Sounds like this is in good hands and well understood. No longer urgent for my part as we have a workaround.

oharboe avatar Aug 26 '24 20:08 oharboe

@oharboe Ok :-) I'm investigating what is going on with the clustering so we can have a proper fix.

AcKoucher avatar Aug 26 '24 22:08 AcKoucher

Another workaround I'm trying out is to save a macro placement. With a saved macro placement, I should avoid rtlmp errors due to slight changes in initial conditions, like changed PLACE_DENSITY.

write_macro_placement macros.tcl

oharboe avatar Aug 27 '24 04:08 oharboe

@AcKoucher Please confirm that the fixes work on the full testcase of 1 hour

I included a faster, 13 minute, testcase here, that I produced from the full testcase with deltaDebug.py.

There is a risk that deltaDebug.py identified other bugs than the original bug...
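
A toy illustration of that risk, assuming a generic greedy reducer (this is not the algorithm in deltaDebug.py, and the names `reduce_list`, `bug_a` and `bug_b` are made up): the reducer only preserves "the run still ends with the same error", so if two independent causes produce the same error code, the reduced case may keep either one.

```python
def reduce_list(items, still_fails):
    """Greedily drop chunks of items for as long as the failure persists."""
    chunk = max(len(items) // 2, 1)
    while True:
        i, removed_any = 0, False
        while i < len(items):
            trial = items[:i] + items[i + chunk:]
            if trial and still_fails(trial):
                # The failure survives without this chunk, so drop it.
                items, removed_any = trial, True
            else:
                # This chunk is needed (or removing it leaves nothing); move on.
                i += chunk
        if chunk == 1 and not removed_any:
            return items
        chunk = max(chunk // 2, 1)


def fails_with_mpl_0040(parts):
    """Stand-in oracle: both hypothetical bugs end in the same error code."""
    return "bug_a" in parts or "bug_b" in parts


# Hypothetical design pieces; suppose the original report was about bug_a.
design = ["n1", "bug_a", "n2", "n3", "bug_b", "n4"]
reduced = reduce_list(design, fails_with_mpl_0040)
print(reduced)
```

In this run the reducer keeps only `bug_b`, so a fix verified against the reduced case would say nothing about `bug_a`. That is why the fix also needs to be checked on the full testcase.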

oharboe avatar Aug 27 '24 23:08 oharboe

Ah, the full testcase still fails...

oharboe avatar Aug 28 '24 04:08 oharboe

Can you re-delta?

maliberty avatar Aug 28 '24 04:08 maliberty

@oharboe As I said in #5666, there are other problems that need to be addressed in order to actually resolve the issue. I'm investigating.

AcKoucher avatar Aug 28 '24 12:08 AcKoucher

@maliberty New deltaDebug: this test case takes ca. 13 minutes and fails on master:

https://drive.google.com/file/d/1klYn7s2_uJBk2Wi-vfPKK_ol8Kwv02sY/view?usp=sharing

oharboe avatar Aug 28 '24 14:08 oharboe

@maliberty @jeffng-or @AcKoucher Any news on this?

oharboe avatar Sep 21 '24 07:09 oharboe

@oharboe Apologies for the delay.

Apparently the cluster engine struggles to do a better job for this testcase because mpl2 relies on the connections between cells. As deltaDebug considerably reduces the number of nets, it doesn't look like there's anything to actually improve in clustering with regard to the way it currently works.

That said, the actual problem behind this failure lies in the fact that the annealing weights are poorly set for a situation in which we need to fit macros in an outline such as this: [image]

Since we incorporated mpl2 we haven't tuned those weights at all. I'll try a change to the weights that should fix the failure.
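
To illustrate why the weights matter, here is a minimal sketch of a generic simulated annealing acceptance step. It is not mpl2's implementation: the penalty values, the weights and the function names are all hypothetical. The candidate move fixes the outline violation but worsens the boundary term, and whether it counts as an improvement depends only on the weights.

```python
import math
import random


def weighted_cost(terms, weights):
    """Weighted sum of penalty terms."""
    return sum(weights[name] * value for name, value in terms.items())


def accept(delta, temperature, rng=random):
    """Metropolis criterion: always accept improvements, and accept a worse
    move with probability exp(-delta / temperature)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


# Hypothetical penalties: the candidate fits the outline but moves macros
# away from the boundary.
current = {"outline": 0.46, "boundary": 102.0}
candidate = {"outline": 0.00, "boundary": 103.0}

# Hypothetical weight sets, before and after retuning.
equal_weights = {"outline": 1.0, "boundary": 1.0}
retuned_weights = {"outline": 10.0, "boundary": 1.0}

delta_equal = weighted_cost(candidate, equal_weights) - weighted_cost(current, equal_weights)
delta_retuned = weighted_cost(candidate, retuned_weights) - weighted_cost(current, retuned_weights)

print(f"equal weights:   delta = {delta_equal:+.2f}")
print(f"retuned weights: delta = {delta_retuned:+.2f}")
```

With equal weights the move raises the cost, so at low temperature it is almost always rejected and the placement stays outside the outline. With the heavier outline weight the same move lowers the cost and is always accepted.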

AcKoucher avatar Sep 24 '24 17:09 AcKoucher

@AcKoucher Thanks for the update! Do you think that the failure of mpl2 here is just some unfortunate initial conditions and some slight change in initial conditions would cause it to work?

Megaboom passes today if I enable rtlmp and synth_hierarchical...

HierRTLMP Flow enabled...
rtl_macro_placer -halo_width 19 -halo_height 19 -report_directory bazel-out/k8-fastbuild/bin/objects/asap7/BoomTile/base/rtlmp -target_util 0.20
Floorplan Outline: (0.0, 0.0) (2300.0, 2300.0),  Core Outline: (1.026, 1.08) (2297.97, 2297.97)
        Number of std cell instances: 1771752
        Area of std cell instances: 223780.94
        Number of macros: 70
        Area of macros: 483139.25
        Halo width: 19.00
        Halo height: 19.00
        Area of macros with halos: 996899.62
        Area of std cell instances + Area of macros: 706920.19
        Core area: 5275827.50
        Design Utilization: 0.13
        Core Utilization: 0.05
        Manufacturing Grid: 1

Elapsed time: 11:45.52[h:]min:sec. CPU time: user 9618.46 sys 84.95 (1375%). Peak memory: 9672176KB.

oharboe avatar Sep 24 '24 17:09 oharboe

My thoughts are that even if we can get this test to work with some different initial conditions, it shouldn't be too hard for SA to deal with what's going on here. As I said, the weights for SA were never tuned after its incorporation. This testcase is a good opportunity to do some tuning, especially after the many changes made to mpl2 (fixes, enhancements, refactoring...).

AcKoucher avatar Sep 24 '24 17:09 AcKoucher