
[pkg] Passing `-j` to `dune build` causes the compiler package to be rebuilt

Open gridbugs opened this issue 5 months ago • 12 comments

Expected Behavior

After installing a project's dependencies, rebuilding the project with -j should not cause dependencies to be rebuilt.

Actual Behavior

Every time the project is rebuilt with a new degree of parallelism, all of its dependencies are rebuilt (including the compiler which takes several minutes to rebuild).

Reproduction

In any project using dune package management:

  1. Build the project
  2. Build the project again with -j1

All dependencies will be rebuilt. This is easier to observe if you set DUNE_CONFIG__PKG_BUILD_PROGRESS=enabled, which makes dune print the name of each package as it is built.

gridbugs avatar Jul 30 '25 07:07 gridbugs

The problem happens because the targets of rules whose actions refer to variables are rebuilt when those variables change. In most cases this is desirable, because the contents of the target depend on the values of those variables. Passing -j to dune has the effect of setting the value of the %{jobs} variable expanded in package build/install commands, but when packages run make -j %{jobs} or dune build -j %{jobs}, the value of %{jobs} doesn't meaningfully affect the contents of the target.
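The mechanism described above can be sketched as follows. This is a minimal illustration of variable expansion feeding into an action digest, not dune's actual implementation; the function names are hypothetical:

```python
import hashlib

def action_digest(command_template: str, variables: dict) -> str:
    # Simplified model of dune's behaviour: expand variables into the
    # action's command, then digest the result. Any change to a variable's
    # value changes the digest, which invalidates the cached target and
    # forces a rebuild.
    expanded = command_template
    for name, value in variables.items():
        expanded = expanded.replace("%{" + name + "}", str(value))
    return hashlib.sha256(expanded.encode()).hexdigest()

cmd = "make -j %{jobs}"
d8 = action_digest(cmd, {"jobs": 8})
d1 = action_digest(cmd, {"jobs": 1})
assert d8 != d1  # a different -j value yields a different digest
```

Even though `make -j 8` and `make -j 1` produce the same target contents, the digests differ, so the target is considered stale.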

It would be difficult to change dune's behaviour such that changing the value of %{jobs} doesn't cause targets to be rebuilt, and I think allowing some variables to change without triggering rebuilds could lead to confusing behaviour, so I don't recommend this approach anyway. An alternative I prefer is to change the behaviour of -j so that it doesn't affect how packages are built, and to set the %{jobs} variable to something fixed, such as the number of cores on the current machine.
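The alternative proposed above could look like this. A minimal sketch, assuming %{jobs} is expanded to a fixed, machine-derived value instead of the -j argument; the helper name is hypothetical:

```python
import os

def expand_jobs(command: str, cli_jobs: int) -> str:
    # Expand %{jobs} to the machine's core count, deliberately ignoring
    # the -j value (cli_jobs). The expanded command is then stable across
    # -j values, so its digest (and the cached target) is stable too.
    jobs = os.cpu_count() or 1
    return command.replace("%{jobs}", str(jobs))

a = expand_jobs("make -j %{jobs}", cli_jobs=1)
b = expand_jobs("make -j %{jobs}", cli_jobs=8)
assert a == b  # same expansion regardless of -j
```

The trade-off is that -j no longer throttles package builds, only the rest of the dune build.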

gridbugs avatar Aug 05 '25 07:08 gridbugs

We need to use a different placeholder that is evaluated at runtime. That way we can throttle the number of jobs per build action.

Alizter avatar Aug 05 '25 08:08 Alizter

It would be difficult to change dune's behaviour such that changing the value of %{jobs} doesn't cause targets to be rebuilt, and I think allowing some variables to change without triggering rebuilds could lead to confusing behaviour, so I don't recommend this approach anyway

How difficult, do we suppose? E.g., on a complexity scale from 1 to 5, how complex a change do we think this would be? Could you sketch out what (roughly) would be involved?

I would guess that there are a few values of this sort that should be able to change without actually requiring a rebuild (e.g., https://github.com/ocaml/dune/issues/12131), so perhaps a general-purpose fix for this makes sense?

shonfeder avatar Aug 12 '25 23:08 shonfeder

One idea would be to adopt the GNU make jobserver protocol. If dune can understand the jobserver protocol then we can use it to dispatch jobs to both make and dune instances that are running. Many other build systems, such as cargo, bazel, and ninja, already have some interoperability with it.

The implementation is quite simple, modulo some details we have to be careful about. We essentially write some characters to a named pipe that we then pass around to make and dune. Both clients would take a character from the pipe in order to spawn a job and then give it back when the job is finished. That way all the build systems can share the job slots.
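The token-passing scheme described above can be sketched with an ordinary pipe. This is a toy model of the GNU make jobserver, not a complete implementation (the real protocol also gives each process one implicit slot and passes the pipe's file descriptors to children):

```python
import os

def make_jobserver(slots: int):
    # Pre-load a pipe with one token per job slot. The pipe is what gets
    # shared between cooperating build systems (make, dune, ...).
    r, w = os.pipe()
    os.write(w, b"+" * slots)
    return r, w

def acquire(r) -> bytes:
    # Take a token before spawning a job; blocks when no slot is free,
    # which is what throttles total parallelism across all clients.
    return os.read(r, 1)

def release(w, token: bytes) -> None:
    # Return the token when the job finishes, freeing the slot.
    os.write(w, token)

r, w = make_jobserver(2)
t1 = acquire(r)
t2 = acquire(r)
# Both slots are now taken; a third acquire would block until a release.
release(w, t1)
t3 = acquire(r)
assert t3 == b"+"
```

The deadlock caveat below falls out of this model directly: if a client takes a token and never returns it, that slot is lost for everyone.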

This is not without a few caveats, however. We have to make sure our implementation is sound and that it does not create deadlocks when a job hogs a slot and doesn't respond to a ping.

That would probably be the ideal solution in this case, however on @shonfeder's complexity scale it would probably be a 4. It is not something to be implemented lightly, since we would have to do some hard thinking about how to get it to work with dune's scheduler. I am confident that it would be possible, but it is definitely a nontrivial amount of work.

The other solution would be to pass some other value instead. This would allow us to make progress on package management as a whole without being blocked by this issue. One way to do this would be to pass a bare -j to invocations of make, and to pass nothing to invocations of dune.
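The workaround described above amounts to a small per-tool rule when constructing build commands. A hypothetical sketch (the helper name is made up, not dune's API):

```python
def build_command(tool: str, args: list[str]) -> list[str]:
    # Give make a bare -j (unbounded parallelism) and pass no jobs flag
    # to nested dune, so no %{jobs} value ever appears in the stored
    # action and the digest is identical across -j invocations.
    if tool == "make":
        return [tool, "-j", *args]
    return [tool, *args]

assert build_command("make", ["all"]) == ["make", "-j", "all"]
assert build_command("dune", ["build"]) == ["dune", "build"]
```

The cost is losing fine-grained throttling of package builds, which is why the jobserver approach remains the ideal long-term fix.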

Alizter avatar Aug 13 '25 10:08 Alizter

We have to make sure our implementation is sound and that it does not create deadlocks when a job hogs a slot and doesn't respond to a ping.

Unfortunately, that's impossible.

We should wait until we get rid of dune exec'ing into itself first before tackling this issue. It will no longer be nearly as relevant once it is addressed.

rgrinberg avatar Aug 13 '25 10:08 rgrinberg

Assuming we all agree to follow @rgrinberg's advice on sequencing, I'm unassigning @gridbugs, since this can't be worked on yet.

We should wait until we get rid of dune exec'ing into itself first before tackling this issue. It will no longer be nearly as relevant once it is addressed.

Where is this needed work being tracked?

shonfeder avatar Sep 01 '25 15:09 shonfeder

@shonfeder https://github.com/ocaml/dune/issues/10019

Alizter avatar Sep 01 '25 18:09 Alizter

Circling back here, @rgrinberg

We should wait until we get rid of dune exec'ing into itself first before tackling this issue. It will no longer be nearly as relevant once it is addressed.

could you briefly note how this will address the problem, or why it is a prerequisite for fixing it?

shonfeder avatar Sep 02 '25 14:09 shonfeder

If dune is able to evaluate the rules in downloaded packages, it will just reuse the same throttle that it uses for all jobs (the one that it initialized with) for these packages. That throttle's value is already excluded from the action digests.

rgrinberg avatar Sep 02 '25 17:09 rgrinberg

For context and to inform motivation, the reason I hit this was because I was running tests with

-j1 --no-buffer

as per https://dune.readthedocs.io/en/stable/faq.html in order to debug a hanging test.

shonfeder avatar Sep 11 '25 23:09 shonfeder

We have added this back to the MSP, since @Alizter pointed out that this can create significant problems for CI systems, which become unable to use cached dependencies when running pipelines with varying numbers of jobs.

shonfeder avatar Oct 03 '25 23:10 shonfeder

While we still think this is important, we are afraid that the fix here will be more complex than we can manage within our target for the MSP.

shonfeder avatar Nov 07 '25 04:11 shonfeder