redex icon indicating copy to clipboard operation
redex copied to clipboard

CommonSubexpressionElimination takes extremely long to run

Open justhecuke opened this issue 2 years ago • 4 comments

Hey,

Our team tried to enable CommonSubexpressionEliminationPass but gave up on it when it took ~50m to run by itself.

After looking at the code, the likely cause of the runtime appears to be the tight CSE -> if !chage break -> else Constant Propagation -> Local DCE loop. We haven't actually profiled the thing but, considering that CSE should take only a few seconds to run, it seems likely.

Can I ask why the tight loop was added? Does it provide some important benefit on top of CSE?

For reference, it only appears to reduce our package size by a small amount, but its odd construction has raised questions we don't know the answers to. I suspect the small size reduction may be due to Proguard already performing this optimization, but I don't have hard proof of this quite yet.

justhecuke avatar Feb 27 '23 19:02 justhecuke

The commit message on https://github.com/facebook/redex/commit/ffb4d1eeab8549fe4789e6404571eb127920a33e might be helpful.

wsanville avatar Feb 28 '23 21:02 wsanville

Somewhat. The reasoning behind the loop is fair enough, I'm just unsure of the benefit and if the runtime is actually what they expect. Considering the number of times they appear to run the pass, I can't imagine that each run takes ~50m for them. Or, if it does, then they must be getting a lot more benefit out of it than I am.

justhecuke avatar Feb 28 '23 22:02 justhecuke

We have not found to be a problem in practice. But I could imagine some pathological code patterns where it does become a performance problem. Could you to try change the code of that loop, impose some fixed limit (maybe 10) for the iteration count? It should be semantically fine to break out of the loop early. Let us know if that solves your problem, and what you find as a reasonable limit, and then we can that as an option.

NTillmann avatar Mar 08 '23 01:03 NTillmann

I've modified the code to only perform a single iteration with no Constant Propagation or LocalDCE and to replace the WeakTopologicalOrdering with a simple sort of DexMethods in Purity.cpp. Our local build time went down from ~1600s (~27 minutes) to ~200s of which ~1000s was due to the WTO. Adding a second iteration makes the time go up to 250s. Adding Constant Propagation and LocalDCE ups the time to ~700s.

EDIT: Our CI time went down from ~7550s (!) to ~80s in the minimal case (vs. ~200s locally). EDIT2: Seems like in the CI build, WTO takes ~7450s (init_method_barriers takes ~7500s) and the rest of the pass takes ~100s. Seems like our CI just really doesn't like WTO. Probably memory thrashing of some sort going on.

Strangely, it seems like replacing the WTO with a simple sort has caused a correctness issue and I'm not sure how that happened. Adding in runtime asserts also caused an issue since some of the Pure Methods, such as Long.valueOf, are not actually pure and will fail == checks between old and new locations. I'm going to see if I can't modify the assertion to Object.equals() if the type is a ref.

justhecuke avatar Mar 09 '23 20:03 justhecuke