
follow-up on #13150: improve the closure-computation algorithm again


The Flambda backend contains a transitive-closure computation that misbehaves on certain Menhir-generated programs. #13150 introduced an improved algorithm that fixes the case known to misbehave, but is asymptotically disappointing in other cases -- see the discussions there for the details. The cases where #13150 performs worse than it should have not yet been found in the wild, but they were exhibited by the simple relations in the benchmark script proposed by @fweimer-rh. I was nerd-sniped and decided to tweak the algorithm to perform better.

This PR improves the asymptotic behavior of the closure computation by ensuring that traversal work is shared between nodes instead of being duplicated.

Consider for example the relation

a -> b, b -> c, c -> a

The previous code would first compute

a -> [b, c, a]

by traversing the graph, then

b -> [c, a, b]

by traversing the graph again, then

c -> [a, b, c]

again. The new code traverses the graph until it finds a cyclic node (starting from [a] this would be [a]), then computes

a -> [a, b, c]

using the same algorithm as before, and then immediately computes the result for c and b from the result of a, without any additional traversal.
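
To make the sharing idea concrete, here is a small self-contained OCaml sketch. It is not the code of this PR: it illustrates the same observation (every node of the cycle gets the same closure) with one forward and one backward traversal from the start node, whereas the actual implementation detects the cycle during the traversal itself. All names below are made up for the example.

```ocaml
module SMap = Map.Make (String)
module SSet = Set.Make (String)

(* A relation, as a map from each node to its direct successors. *)
type graph = SSet.t SMap.t

let successors g n =
  Option.value ~default:SSet.empty (SMap.find_opt n g)

(* Forward reachability from [start] by a worklist traversal; [start]
   itself is included only if some path leads back to it. *)
let reachable g start =
  let rec loop seen = function
    | [] -> seen
    | n :: rest ->
      if SSet.mem n seen then loop seen rest
      else loop (SSet.add n seen) (SSet.elements (successors g n) @ rest)
  in
  loop SSet.empty (SSet.elements (successors g start))

(* Reverse all edges of the relation. *)
let reverse g =
  SMap.fold
    (fun src succs acc ->
      SSet.fold
        (fun dst acc ->
          SMap.update dst
            (fun prev ->
              Some (SSet.add src (Option.value ~default:SSet.empty prev)))
            acc)
        succs acc)
    g SMap.empty

(* If [start] lies on a cycle, every node that both reaches [start] and
   is reached from it (its strongly connected component) has the same
   closure, so one forward and one backward traversal give the result
   for the whole component at once. *)
let closures_of_component g start =
  let fwd = reachable g start in
  if not (SSet.mem start fwd) then SMap.singleton start fwd
  else
    let bwd = reachable (reverse g) start in
    let component = SSet.add start (SSet.inter fwd bwd) in
    SSet.fold (fun n acc -> SMap.add n fwd acc) component SMap.empty

(* The cyclic relation a -> b, b -> c, c -> a from the example above. *)
let () =
  let g =
    SMap.of_seq
      (List.to_seq
         [ ("a", SSet.singleton "b");
           ("b", SSet.singleton "c");
           ("c", SSet.singleton "a") ])
  in
  SMap.iter
    (fun n c ->
      Printf.printf "%s -> [%s]\n" n (String.concat ", " (SSet.elements c)))
    (closures_of_component g "a")
```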

Performance results using the benchmark script of Florian Weimer (https://gitlab.com/gasche-snippets/transitive-closure).

Implementations:

  • old is the code in 5.2
  • new is #13150
  • again is the algorithm in this commit

Tests:

  • isolated is a diagonal relation (1 -> 1, 2 -> 2, 3 -> 3, ...)
  • cycle is (1 -> 2, 2 -> 3, 3 -> 4, ..., n -> 1)
  • complete is (1 -> [1, 2, ..., n], 2 -> [1, 2, ..., n], ...)
isolated        Mean [ms]        Relative
  old           162.1 ± 6.3      1.66 ± 0.08
  new            97.7 ± 2.9      1.00
  again         203.1 ± 4.1      2.08 ± 0.07

cycle           Mean [s]         Relative
  old           1.952 ± 0.054    596.91 ± 154.38
  new           0.034 ± 0.002    10.29 ± 2.71
  again         0.003 ± 0.001    1.00

complete        Mean [ms]        Relative
  old            59.3 ± 5.3      8.46 ± 1.38
  new           218.5 ± 15.8     31.15 ± 4.83
  again           7.0 ± 1.0      1.00

The new code has a larger constant factor than the #13150 algorithm for isolated graphs, which shows up as a 2x slowdown. This also makes it about 25% slower than the "old" code in 5.2 in this trivial case.

It performs much better than both trunk and 5.2 in the other cases.

gasche, May 09 '24 09:05

One may reasonably wonder whether it is worth having an algorithm that is faster in the "hard" cases, if it has worse constant factors in the "easy" case (2x slowdown). Let me explain why I think it is worth it.

  1. The instances in human-written code are both "easy" and very small, so the cost of this closure computation is negligible compared to the rest of the compilation, and the constant factor does not matter.

  2. The issue with this function is thus not its computation time in the easy case, but the lack of robustness that arises from bad asymptotic behavior in harder cases. Performing asymptotically well is more important than being fast.

  3. The current algorithm in trunk is inefficient in situations that very plausibly occur in generated code: think of N mutually-recursive functions that all share an invariant first parameter: let rec f env ... = ... and g env ... = ... and .... In this case the current algorithm will perform N times the work of the algorithm in this PR.
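
As a hedged illustration of the shape described in point 3 (the record type and function names are invented for the example, not taken from any generated parser):

```ocaml
type env = { base : int }

(* [env] is passed along unchanged by every function of the nest, so it
   is an invariant parameter of the whole recursive group. *)
let rec f env x = if x <= 0 then env.base else g env (x - 1)
and g env x = if x mod 2 = 0 then f env (x / 2) else h env (x + 1)
and h env x = f env (x - 1)
```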

gasche, May 09 '24 09:05

@fweimer-rh I wonder if you would be willing to take a look at the code and give your impression of the complexity/performance tradeoff.

gasche, May 15 '24 10:05

> @fweimer-rh I wonder if you would be willing to take a look at the code and give your impression of the complexity/performance tradeoff.

@gasche I extracted the previously problematic relation graph from the Coccinelle/Menhir build, and your implementation is indeed much faster than what's in trunk today (one second versus 5.7 seconds). I'll try an array-based implementation and will report if it is faster.

fweimer-rh, May 18 '24 22:05

I had some discussions with @chambart earlier this week about this issue (the original #13150, not this new algorithm), and we suspect that for menhir-generated parsers we should simply not compute the transitive closure at all (we expect that very few parameters will be invariant or unused, and for the invariant parameters it's unlikely we will have any opportunity to actually make use of them). However, we didn't find a good way to detect this case, so we didn't make any concrete proposals. Some solutions we thought about:

  • a dedicated flag passed to the compiler to warn about big recursive nests (impractical because dune doesn't make it easy to pass per-file flags, although since it knows about menhir it might be able to do it automatically for generated parsers)
  • a cut-off in the compiler that only computes transitive closures for "small" nests (say, less than 20 functions)

I don't have any code for these suggestions at the moment, but if someone is interested in testing them I could put it on my todo list.
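
Purely as an illustration of the second option, a cut-off amounts to a guard in front of the existing analysis. The names and the threshold below are hypothetical; this is not actual compiler code:

```ocaml
(* Hypothetical threshold: beyond this many functions in the recursive
   nest, skip the closure computation entirely. *)
let max_nest_size = 20

(* Fall back to the sound-but-imprecise answer "no invariant parameters"
   for large nests; [compute_closure] stands for the existing analysis. *)
let invariant_params ~nest_size ~compute_closure relation =
  if nest_size > max_nest_size then None
  else Some (compute_closure relation)
```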

lthls, May 19 '24 10:05

A note that per-file flags are a big pain for downstream packagers. You have to remember that Dune is not obvious for people who don't work on OCaml daily. I use OCaml a lot but I have no idea how to add a per-file flag via Dune. Someone coming to an OCaml project as an outside downstream packager just thinks Dune is weird. Not to mention there are many other build systems used by OCaml projects, and a lot of half-baked hand-written Makefiles, which makes everything harder.

rwmjones, May 19 '24 10:05

The per-file flags could be made to work independently of the build system with the introduction of unit headers (https://github.com/ocaml/RFCs/pull/26). For example menhir would add at the start of the generated file an attribute disabling this optimisation, and the compiler would read this attribute and change its settings accordingly. Otherwise I agree that it's not realistic (as far as I know it isn't even possible in dune, you would have to move the files to independent libraries).
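
To make the idea concrete, the shape would be a file-level attribute at the top of the generated parser. The attribute name below is invented for illustration only; the RFC would have to define the real mechanism:

```ocaml
(* Hypothetical attribute emitted by menhir at the top of the generated
   file; only the shape is the point, the name does not exist. *)
[@@@hypothetical.disable_invariant_param_analysis]
```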

lthls, May 19 '24 11:05

I hadn't thought of this argument for unit headers: "easier than figuring out how to pass per-file flags with Dune". Genius!

@lthls hopefully this discussion motivated you to review this PR :-)

(The implementation uses persistent maps and sets just like the original code -- I thought that it would offer a fairer comparison. I am happy to start using Hashtbl as suggested by @fweimer-rh to see if it improves constant factors, now that we know that there exists a real program that is affected by those constant factors.)

gasche, May 19 '24 12:05

(I tried using hashtables for the state map and visited set, and the performance gains are very small on the synthetic benchmarks -- 30% faster for "isolated" relations, and the same speed for all others. It is not worth the small extra complexity of using different data structures to represent the same data.)

gasche, May 19 '24 12:05

This PR is sitting in a suspended state due to a lack of reviewers, despite demonstrating an improvement over the current trunk on realistic use-cases. Any volunteers to review the implementation for correctness? (This is standard algorithmic stuff; the extra function is a graph traversal that remembers which nodes it has already visited to avoid/detect cycles.) (I'm sure @lthls knows plenty of people who could do this for breakfast.)

gasche, Jun 05 '24 12:06

In my opinion this is very low priority. It's only a marginal improvement for a very marginal use-case that we hope will disappear in the not-too-distant future. I'll still mention it to people around me; some of them love theoretical questions that have little practical impact.

lthls, Jun 05 '24 12:06

I found out by chance (reading the make build logs) that there is a utils/strongly_connected_components.ml module. If no one is willing to review my lovely code, I could maybe try to understand whether this makes the problem at hand much simpler -- intuitively it should be both easy and efficient to compute the transitive closure from an SCC decomposition of the relation.
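
As a hedged sketch of that intuition (it does not use the interface of utils/strongly_connected_components.ml, whose actual API I am not assuming here; the component representation is made up): once the relation is condensed into strongly connected components listed sinks-first, the closure of a component is the union of its successors' members and closures, plus its own members if it is cyclic, and every node then shares its component's closure without any per-node traversal.

```ocaml
module SMap = Map.Make (String)
module SSet = Set.Make (String)

(* Hypothetical condensed graph: components are listed sinks-first, so
   every index in [succs] refers to a component that comes earlier. *)
type component = {
  members : string list;   (* nodes of this SCC *)
  succs : int list;        (* indices of successor components *)
  cyclic : bool;           (* true if the SCC contains at least one internal edge *)
}

let closures (components : component array) : SSet.t SMap.t =
  let comp_closure = Array.make (Array.length components) SSet.empty in
  Array.iteri
    (fun i c ->
      let from_succs =
        List.fold_left
          (fun acc j ->
            SSet.union acc
              (SSet.union (SSet.of_list components.(j).members) comp_closure.(j)))
          SSet.empty c.succs
      in
      (* members of a cyclic component all reach each other *)
      let own = if c.cyclic then SSet.of_list c.members else SSet.empty in
      comp_closure.(i) <- SSet.union own from_succs)
    components;
  (* each node shares the closure of its component, no per-node traversal *)
  let result = ref SMap.empty in
  Array.iteri
    (fun i c ->
      List.iter (fun n -> result := SMap.add n comp_closure.(i) !result) c.members)
    components;
  !result
```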

> In my opinion this is very low priority. It's only a marginal improvement for a very marginal use-case that we hope will disappear in the not-too-distant future. I'll still mention it to people around me; some of them love theoretical questions that have little practical impact.

For the record, I think that you have the wrong perspective. Coccinelle is an important tool in the Linux development community, possibly the only OCaml program that most Linux hackers will ever use. Enabling flambda (the codebase that the "people around [you]" are responsible for) makes the build time for Coccinelle go from "a few seconds" to "34 minutes". This sucks, and it was annoying enough to make Red Hat engineers implement a patch to the OCaml compiler that brought the build back into "a few seconds" territory. I did the review work myself, and this huge issue mostly went away thanks to their contribution. The PR I propose reduces the bottleneck further, "moving from 5.7 seconds to one second" according to their test. I wouldn't say this is a world-changing patch, but it is nothing to sneer at.

The alternative that you propose is unclear; either wait until flambda is gone for good, or maybe wait for you and Pierre to implement an alternative approach for flambda1, but you're not really sure what would be a good design and no one seems motivated to implement it. That does not sound like a very serious proposal to me, or maybe we don't have the same definition of "not-too-distant future".

gasche, Aug 01 '24 09:08

I'm proposing to disable this closure computation in cases with too many functions. @chambart told me that it's very unlikely we will discover anything useful here, particularly on menhir-generated parsers.

lthls, Aug 01 '24 09:08

Is someone planning to submit a PR that does this?

gasche, Aug 01 '24 09:08

To clarify a bit: the main goal of this piece of code is to discover invariant parameters. If we have invariant parameters in a set of recursive functions, then it means that occurrences of these functions can be specialised to the invariant parameter. (Typically, List.map has the function as an invariant parameter, so if we see it applied to a known function we can specialise it, which allows us to inline the mapped function if we want to.) For big sets of functions, this is much less interesting because you have to duplicate a lot more code (all of the other functions in the recursive set) when you specialise one function. Moreover, we suspect that huge sets of recursive functions are less likely to feature invariant parameters (we haven't run actual experiments though). So for these reasons, I believe that approximating the result of the analysis with a sound-but-not-complete approximation (no invariant parameters) when the set of functions grows beyond a certain size (10 functions, maybe?) is likely a very good compromise. I'm on holiday until the 12th, at which point I can look at submitting a PR.
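
To illustrate the List.map case in hand-written form (this is a conceptual sketch, not actual flambda output):

```ocaml
(* Call site where the invariant parameter [f] of List.map is a
   statically known function. *)
let doubles xs = List.map (fun x -> 2 * x) xs

(* What specialisation (followed by inlining of the mapped function)
   conceptually produces: a copy of List.map with [f] fixed. *)
let rec map_double = function
  | [] -> []
  | x :: xs -> (2 * x) :: map_double xs

let doubles_specialised xs = map_double xs
```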

lthls, Aug 01 '24 09:08

> Is someone planning to submit a PR that does this?

Done in #13379.

lthls, Aug 14 '24 15:08

@gasche I'm considering reviewing this. I have already noted that I'm not moved by the performance argument, which I think is better solved by my own PR, but I'm fine with accepting this PR on other grounds. Would you like to see this make progress even if performance is irrelevant?

lthls, Aug 30 '24 13:08

@lthls I think that your PR is better than the present PR because it is simpler -- so let's close this one. If we want to improve the closure computation (which is not worth it after your PR is merged, unless it is used in other places), I would go for trying to reuse the strongly-connected-components module instead of the current PR; I think it would be even more efficient and reduce overall complexity.

gasche, Aug 30 '24 16:08