rsz: Resizer is non-deterministic causing tests to be flaky
Describe the bug
Resizer.cc runs are non-deterministic. Running the same test multiple times results in different results.
One reason is the use of ordered sets of pointers, which need to be sorted again before any iteration where order matters.
Example:
https://github.com/The-OpenROAD-Project/OpenROAD/blob/0a9b58c36689b10d3c0db4ce38391e75acddefe9/src/rsz/include/rsz/Resizer.hh#L741
which causes the iteration order here to change https://github.com/The-OpenROAD-Project/OpenROAD/blob/0a9b58c36689b10d3c0db4ce38391e75acddefe9/src/rsz/src/Resizer.cc#L748
As a separate note, as these ordered sets can't be relied on to be ordered, changing to unordered_set can be considered.
Expected Behavior
Resizer code will give the same results when run multiple times.
Environment
N/A
To Reproduce
Run tests multiple times to observe the failure rate:
bazelisk test -c opt //src/rsz/test/... --cache_test_results=no --runs_per_test=100
Relevant log output
Screenshots
No response
Additional Context
No response
unordered_set won't help as it will just hash on the pointer value. A comparison should be provided to any set of pointers that will iterated.
That's the point. The code isn't relying on the sorted property of sets, so an unordered set is no better and can be considered for a performance/memory fragmentation gain.
Sorting or adding a proper compare function would be potential solutions.
@hongted Please if you have it at hand post which tests are failing other than the fast buffer reporting tests
Below is a list from my last run:
//src/rsz/test:pin_swap1-tcl_test FAILED in 5 out of 100 in 1.0s
//src/rsz/test:repair_tie11_hier-tcl_test FAILED in 2 out of 100 in 0.7s
//src/rsz/test:repair_tie12_hier-tcl_test FAILED in 2 out of 100 in 0.7s
//src/rsz/test:resize_slack2-tcl_test FAILED in 1 out of 100 in 1.2s
Below command can be used to run those tests
bazelisk test -c opt //src/rsz/test:pin_swap1-tcl_test //src/rsz/test:repair_tie11_hier-tcl_test //src/rsz/test:repair_tie12_hier-tcl_test //src/rsz/test:resize_slack2-tcl_test --cache_test_results=no --runs_per_test=100
Log files for each run can be found under bazel-testlogs/src/rsz/test/<TEST_NAME>/run_<X>_of_<Y>/test.outputs
2nd potential location is in the ordering of equivalent cell pins which is a set<LibertyPort*>
https://github.com/The-OpenROAD-Project/OpenROAD/blob/3daea177dd24cd2225b7dc5a11c0e66fd22634e6/src/rsz/src/SwapPinsMove.hh#L53
https://github.com/The-OpenROAD-Project/OpenROAD/blob/3daea177dd24cd2225b7dc5a11c0e66fd22634e6/src/rsz/src/SwapPinsMove.cc#L361
https://github.com/The-OpenROAD-Project/OpenROAD/blob/3daea177dd24cd2225b7dc5a11c0e66fd22634e6/src/rsz/src/SwapPinsMove.cc#L297
//src/rsz/test:resize_slack2-tcl_test FAILED in 1 out of 100 in 1.2s
This one looks to be due to unstable order returned from
NetSeq Resizer::resizeWorstSlackNets()
{
// Find the nets with the worst slack.
NetSeq nets;
for (auto pair : net_slack_map_) {
nets.push_back(pair.first);
}
sort(nets, [this](const Net* net1, const Net* net2) {
return resizeNetSlack(net1) < resizeNetSlack(net2);
});
int nworst_nets = std::ceil(nets.size() * worst_slack_nets_percent_ / 100.0);
if (nets.size() > nworst_nets) {
nets.resize(nworst_nets);
}
return nets;
}
net_slack_map_ is a pointer-keyed map, and there will be nets tied on slack so the sort will not give them stable ordering.
I got tired on this happening with odb objects and created https://github.com/The-OpenROAD-Project/OpenROAD/blob/master/src/odb/include/odb/dbCompare.inc. Perhaps we can do something similar for the relevant sta classes.
I'll take this one.
2nd potential location is in the ordering of equivalent cell pins which is a set<LibertyPort*>
OpenROAD/src/rsz/src/SwapPinsMove.hh
Line 53 in 3daea17
sta::UnorderedMap<LibertyPort*, sta::LibertyPortSet> equiv_pin_map_; OpenROAD/src/rsz/src/SwapPinsMove.cc
Line 361 in 3daea17
for (LibertyPort* port : equiv_ports) { OpenROAD/src/rsz/src/SwapPinsMove.cc
Line 297 in 3daea17
equivCellPins(cell, port, ports);
~~I don't actually see any order dependent side effects in that loop, unless I'm missing something.~~, nvm swap_port depends on the last writer.
@hongted between #8942 and https://github.com/parallaxsw/OpenSTA/commit/0dd7d1bbdc3267eaf6261e411e55117575b9a4bc can you confirm if we can close this bug?