[ERROR MPL-0040] Failed on cluster
13 minutes to reproduce:
untar https://drive.google.com/file/d/18n0z4_Bk9Gscy3RRCiU6FiNvghIb6zIG/view?usp=drive_link
$ time ./run-me-BoomTile-asap7-base.sh
OpenROAD v2.0-15340-g7ebef4425
Features included (+) or not (-): +Charts +GPU +GUI +Python
This program is licensed under the BSD-3 license. See the LICENSE file for details.
Components of this program may be licensed under more restrictive licenses which must be honored.
HierRTLMP Flow enabled...
rtl_macro_placer -halo_width 20 -halo_height 20 -report_directory .//objects/asap7/BoomTile/base/rtlmp -target_util 0.60
Floorplan Outline: (0.0, 0.0) (2160.73, 2160.73), Core Outline: (1.026, 1.08) (2159.73, 2159.73)
Number of std cell instances: 1743858
Area of std cell instances: 220985.73
Number of macros: 72
Area of macros: 691249.12
Halo width: 20.00
Halo height: 20.00
Area of macros with halos: 1292620.62
Area of std cell instances + Area of macros: 912234.88
Core area: 4659886.50
Design Utilization: 0.20
Core Utilization: 0.06
Manufacturing Grid: 1
[ERROR MPL-0040] Failed on cluster frontend/bpd/banked_predictors_1/btb
Error: macro_place_util.tcl, 143 MPL-0040
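As a sanity check on the two utilization figures in the log above, the sketch below recomputes them from the reported areas. The formulas are an assumption inferred from the numbers, not taken from the OpenROAD source: design utilization as (std cell area + macro area) / core area, and core utilization as std cell area / (core area - macro area, without halos).

```python
# Areas copied from the rtl_macro_placer log above.
STD_CELL_AREA = 220985.73
MACRO_AREA = 691249.12
CORE_AREA = 4659886.50


def design_utilization(std_area: float, macro_area: float, core_area: float) -> float:
    """Assumed formula: all instance area over the core area."""
    return (std_area + macro_area) / core_area


def core_utilization(std_area: float, macro_area: float, core_area: float) -> float:
    """Assumed formula: std cell area over the core area left after macros (no halos)."""
    return std_area / (core_area - macro_area)


design_util = design_utilization(STD_CELL_AREA, MACRO_AREA, CORE_AREA)
core_util = core_utilization(STD_CELL_AREA, MACRO_AREA, CORE_AREA)
print(f"Design Utilization: {design_util:.2f}")
print(f"Core Utilization: {core_util:.2f}")
```

Under these assumed formulas both values round to the figures in the log (0.20 and 0.06).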
Originally posted by @oharboe in https://github.com/The-OpenROAD-Project/megaboom/issues/97#issuecomment-2308988697
@tonywk What is this, wrong issue?
That's a bot. GitHub is seeing a surge of bots hosted by certain people from the Russian LUMMA forums. Backed by the government, their goal is apparently to steal as much cryptocurrency as possible.
Post deleted
Should this be an OR issue or is it related to the setup of megaboom?
Unknown. I have the reproduction case this morning, but I don't know anything about what is going on.
Please advise.
If you want OR developers to look at then it is best to file with OR. We don't track megaboom issues.
@maliberty Can you transfer this issue to OpenROAD? I don't have the access permissions
@AcKoucher please give this high priority (a workaround or a solution)
@AcKoucher @maliberty Found a workaround, tweak initial conditions
From the megaboom PR:
I would like to see an initial diagnosis from @AcKoucher first... but yes. I hope the problem is just some existing rare problem that presents itself with some unfortunate initial conditions and that it can be solved in due course but without urgency.
It looks like there's a combination of things that make this somewhat peculiar.
- After clustering we end up with multiple mixed clusters, each made up of a few std cells and a macro (3, 7, 32, 35, 16, 17, 29).
- The dead space filling that we apply to mixed clusters during hierarchical macro placement annealing has a meaningful effect in just three clusters (4 --> 7, 8 --> 11, 29 --> 32).
- There are way too many tiny std cell clusters.
Now, the actual problem seems to be that when we get to the point of placing the children of cluster 4 in the first image (7 in the second image, after dead space filling), even with the target util variation SA can't fit the clusters in the outline. Apparently this happens because the outline penalty never wins the fight against the boundary penalty.
However, there's something going on with the wire length, because for all the steps I see zero in the debug report (perhaps it's too small; I have to check).
------ Penalty ------
Area 1.0186
Outline Penalty 0.4646
Wirelength 0.0000
Boundary Penalty 102.0848
Normalized Cost 55.1202
My first suggestion would be to try decreasing the halos, as @oharboe already did, or to decrease the boundary penalty. @maliberty It looks like there's a lot going on; do you have some idea of what to aim for first?
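To make the imbalance in the penalty report concrete, here is a small sketch that compares the raw terms. The equal weighting and the `share` helper are hypothetical and purely illustrative; the real annealer applies its own weights and normalization, which is why the unweighted sum here does not match the reported Normalized Cost.

```python
# Raw penalty terms copied from the debug report above.
penalties = {
    "area": 1.0186,
    "outline": 0.4646,
    "wirelength": 0.0000,
    "boundary": 102.0848,
}


def share(terms: dict[str, float], name: str) -> float:
    """Fraction of the unweighted sum contributed by one term (hypothetical metric)."""
    return terms[name] / sum(terms.values())


boundary_share = share(penalties, "boundary")
boundary_to_outline = penalties["boundary"] / penalties["outline"]
print(f"boundary share of unweighted sum: {boundary_share:.1%}")
print(f"boundary / outline ratio: {boundary_to_outline:.0f}x")
```

With terms this lopsided, a move that reduces the outline violation but slightly increases the boundary term would tend to raise the total cost, which is consistent with the observation that the outline penalty never wins.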
Thanks! Sounds like this is in good hands and well understood. No longer urgent for my part as we have a workaround.
@oharboe Ok :-) I'm investigating what is going on with the clustering so we can have a proper fix.
Another workaround I'm trying out is to save a macro placement. With a saved macro placement, I should avoid rtlmp errors due to slight changes in initial conditions, like changed PLACE_DENSITY.
write_macro_placement macros.tcl
@AcKoucher Please confirm that the fixes work on the full testcase of 1 hour
I included a faster, 13-minute testcase here, which I produced from the full testcase with deltaDebug.py.
There is a risk that deltaDebug.py identified a bug other than the original one...
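For readers unfamiliar with this kind of reduction, the risk can be illustrated with a simplified greedy reducer. This is not the algorithm in deltaDebug.py; `reduce_case`, `error_code`, the toy design, and the `OTHER-0001` error code are all hypothetical. The point is that if the check only asks "does it still fail?" rather than "does it fail with the same error?", the reduction can drift to a different bug.

```python
from typing import Callable, List


def reduce_case(items: List[str], still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop one item at a time for as long as the predicate keeps failing."""
    current = list(items)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if still_fails(candidate):
                current = candidate
                changed = True
                break
    return current


def error_code(case: List[str]) -> str:
    """Toy model: two independent bugs, each triggered by one element."""
    if "a" in case:
        return "MPL-0040"    # the bug being chased
    if "x" in case:
        return "OTHER-0001"  # an unrelated, made-up error
    return ""


design = ["a", "x", "b", "c"]
# Loose check: any failure counts, so the reducer may land on the other bug.
loose = reduce_case(design, lambda c: error_code(c) != "")
# Strict check: only the original error counts.
strict = reduce_case(design, lambda c: error_code(c) == "MPL-0040")
print(loose, error_code(loose))
print(strict, error_code(strict))
```

In this toy setup the loose predicate reduces to the element that triggers the unrelated error, while the strict predicate keeps the one that triggers MPL-0040.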
ah, the full test-case still fails...
Can you re-delta?
@oharboe As I said in #5666, there are other problems that need to be addressed in order to actually resolve the issue. I'm investigating.
@maliberty New deltaDebug: this test case takes ca. 13 minutes and fails on master:
https://drive.google.com/file/d/1klYn7s2_uJBk2Wi-vfPKK_ol8Kwv02sY/view?usp=sharing
@maliberty @jeffng-or @AcKoucher Any news on this?
@oharboe Apologies for the delay.
Apparently the cluster engine struggles to do a better job for this testcase because mpl2 relies on the connections between cells. As deltaDebug removed a considerable number of nets, it doesn't look like there's anything to actually improve in clustering with regard to the way it currently works.
That said, the actual cause of these failures is that the annealing weights are poorly set for a situation in which we need to fit macros in an outline such as this:
Since we incorporated mpl2 we haven't tuned those weights at all. I'll try a change to the weights that should fix the failure.
@AcKoucher Thanks for the update! Do you think that the failure of mpl2 here is just some unfortunate initial conditions and some slight change in initial conditions would cause it to work?
Megaboom passes today if I enable rtlmp and synth_hierarchical...
HierRTLMP Flow enabled...
rtl_macro_placer -halo_width 19 -halo_height 19 -report_directory bazel-out/k8-fastbuild/bin/objects/asap7/BoomTile/base/rtlmp -target_util 0.20
Floorplan Outline: (0.0, 0.0) (2300.0, 2300.0), Core Outline: (1.026, 1.08) (2297.97, 2297.97)
Number of std cell instances: 1771752
Area of std cell instances: 223780.94
Number of macros: 70
Area of macros: 483139.25
Halo width: 19.00
Halo height: 19.00
Area of macros with halos: 996899.62
Area of std cell instances + Area of macros: 706920.19
Core area: 5275827.50
Design Utilization: 0.13
Core Utilization: 0.05
Manufacturing Grid: 1
Elapsed time: 11:45.52[h:]min:sec. CPU time: user 9618.46 sys 84.95 (1375%). Peak memory: 9672176KB.
My thoughts are that even if we can get this test to work with some different initial conditions, it shouldn't be too hard for SA to deal with what's going on here. As I said, the weights for SA were never tuned after its incorporation. This testcase is a good opportunity to do some tuning, especially after the many changes made to mpl2 (fixes, enhancements, refactoring, etc.).