OpenROAD icon indicating copy to clipboard operation
OpenROAD copied to clipboard

rsz: Resizer is non-deterministic causing tests to be flaky

Open hongted opened this issue 2 months ago • 10 comments

Describe the bug

Resizer.cc runs are non-deterministic. Running the same test multiple times results in different results.

One reason is the use of ordered sets of pointers, which need to be sorted again before any iteration where order matters.

Example:
https://github.com/The-OpenROAD-Project/OpenROAD/blob/0a9b58c36689b10d3c0db4ce38391e75acddefe9/src/rsz/include/rsz/Resizer.hh#L741 which causes the iteration order here to change https://github.com/The-OpenROAD-Project/OpenROAD/blob/0a9b58c36689b10d3c0db4ce38391e75acddefe9/src/rsz/src/Resizer.cc#L748

As a separate note, as these ordered sets can't be relied on to be ordered, changing to unordered_set can be considered.

Expected Behavior

Resizer code will give the same results when run multiple times.

Environment

N/A

To Reproduce

Run tests multiple times to observe the failure rate:

bazelisk test -c opt //src/rsz/test/... --cache_test_results=no --runs_per_test=100

Relevant log output


Screenshots

No response

Additional Context

No response

hongted avatar Oct 22 '25 01:10 hongted

unordered_set won't help as it will just hash on the pointer value. A comparison should be provided to any set of pointers that will iterated.

maliberty avatar Oct 22 '25 04:10 maliberty

That's the point. The code isn't relying on the sorted property of sets, so an unordered set is no better and can be considered for a performance/memory fragmentation gain.

Sorting or adding a proper compare function would be potential solutions.

hongted avatar Oct 22 '25 04:10 hongted

@hongted Please if you have it at hand post which tests are failing other than the fast buffer reporting tests

povik avatar Oct 22 '25 09:10 povik

Below is a list from my last run:

//src/rsz/test:pin_swap1-tcl_test                                        FAILED in 5 out of 100 in 1.0s
//src/rsz/test:repair_tie11_hier-tcl_test                                FAILED in 2 out of 100 in 0.7s
//src/rsz/test:repair_tie12_hier-tcl_test                                FAILED in 2 out of 100 in 0.7s
//src/rsz/test:resize_slack2-tcl_test                                    FAILED in 1 out of 100 in 1.2s

hongted avatar Oct 22 '25 19:10 hongted

Below command can be used to run those tests

bazelisk test -c opt //src/rsz/test:pin_swap1-tcl_test //src/rsz/test:repair_tie11_hier-tcl_test //src/rsz/test:repair_tie12_hier-tcl_test //src/rsz/test:resize_slack2-tcl_test --cache_test_results=no --runs_per_test=100

Log files for each run can be found under bazel-testlogs/src/rsz/test/<TEST_NAME>/run_<X>_of_<Y>/test.outputs

hongted avatar Oct 22 '25 19:10 hongted

2nd potential location is in the ordering of equivalent cell pins which is a set<LibertyPort*>

https://github.com/The-OpenROAD-Project/OpenROAD/blob/3daea177dd24cd2225b7dc5a11c0e66fd22634e6/src/rsz/src/SwapPinsMove.hh#L53

https://github.com/The-OpenROAD-Project/OpenROAD/blob/3daea177dd24cd2225b7dc5a11c0e66fd22634e6/src/rsz/src/SwapPinsMove.cc#L361

https://github.com/The-OpenROAD-Project/OpenROAD/blob/3daea177dd24cd2225b7dc5a11c0e66fd22634e6/src/rsz/src/SwapPinsMove.cc#L297

hongted avatar Oct 22 '25 20:10 hongted

//src/rsz/test:resize_slack2-tcl_test                                    FAILED in 1 out of 100 in 1.2s

This one looks to be due to unstable order returned from

NetSeq Resizer::resizeWorstSlackNets()
{
  // Find the nets with the worst slack.
  NetSeq nets;
  for (auto pair : net_slack_map_) {
    nets.push_back(pair.first);
  }
  sort(nets, [this](const Net* net1, const Net* net2) {
    return resizeNetSlack(net1) < resizeNetSlack(net2);
  });

  int nworst_nets = std::ceil(nets.size() * worst_slack_nets_percent_ / 100.0);
  if (nets.size() > nworst_nets) {
    nets.resize(nworst_nets);
  }
  return nets;
}

net_slack_map_ is a pointer-keyed map, and there will be nets tied on slack so the sort will not give them stable ordering.

povik avatar Oct 22 '25 20:10 povik

I got tired on this happening with odb objects and created https://github.com/The-OpenROAD-Project/OpenROAD/blob/master/src/odb/include/odb/dbCompare.inc. Perhaps we can do something similar for the relevant sta classes.

maliberty avatar Oct 22 '25 20:10 maliberty

I'll take this one.

calewis avatar Nov 19 '25 19:11 calewis

2nd potential location is in the ordering of equivalent cell pins which is a set<LibertyPort*>

OpenROAD/src/rsz/src/SwapPinsMove.hh

Line 53 in 3daea17

sta::UnorderedMap<LibertyPort*, sta::LibertyPortSet> equiv_pin_map_; OpenROAD/src/rsz/src/SwapPinsMove.cc

Line 361 in 3daea17

for (LibertyPort* port : equiv_ports) { OpenROAD/src/rsz/src/SwapPinsMove.cc

Line 297 in 3daea17

equivCellPins(cell, port, ports);

~~I don't actually see any order dependent side effects in that loop, unless I'm missing something.~~, nvm swap_port depends on the last writer.

calewis avatar Dec 02 '25 21:12 calewis

@hongted between #8942 and https://github.com/parallaxsw/OpenSTA/commit/0dd7d1bbdc3267eaf6261e411e55117575b9a4bc can you confirm if we can close this bug?

calewis avatar Dec 19 '25 08:12 calewis