
Implement persistent workers for haskell_module

facundominguez opened this issue · 0 comments

Persistent workers allow saving compiler startup overhead during a build. This overhead is significant when using haskell_module because the compiler is invoked once per module in the project.

This issue is about implementing a persistent worker that can be used by the haskell_module rules. We already have a persistent worker prototype for the haskell_library/binary/test rules in #954.

How to measure

One way to measure this overhead is to build a sufficiently large library with haskell_module and --spawn_strategy=standalone --jobs=1, and compare that with how long a single invocation of ghc takes to build it.

In a library with 40 modules, we have observed that nearly 25% of the total build time can be attributed to startup cost (once the sandboxing overhead is eliminated).

Work in progress

There is already a persistent worker prototype for haskell_module in https://github.com/tweag/rules_haskell/pull/1755.

Known issues

I asked about some of these in this post to the ghc-devs list, but at the time of this writing it had not received a reply.

Moving package dbs

When building libraries in different configurations, bazel might present the same package databases at different file system paths to the persistent worker. When looking up interface files, the persistent worker will then try to pick them up from the old locations in its cache, which have since been removed from the sandbox.

This problem can be solved by using a different persistent worker to build each library configuration.
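The idea above can be sketched as keying compiler sessions by the library configuration they serve. This is a minimal illustration, not the prototype's actual code; the names (`Session`, `getSession`) and the choice of the sorted package db paths as the configuration key are assumptions.

```haskell
import qualified Data.Map.Strict as Map
import Data.List (sort)

-- A stand-in for a compiler session initialised against fixed package dbs.
newtype Session = Session { sessionDbs :: [FilePath] }
  deriving (Eq, Show)

type ConfigKey = [FilePath]
type Sessions  = Map.Map ConfigKey Session

-- Normalise the package db paths of a request into a stable key.
configKey :: [FilePath] -> ConfigKey
configKey = sort

-- Reuse the session for this configuration, or create a fresh one,
-- so cached interface file locations never mix across configurations.
getSession :: [FilePath] -> Sessions -> (Session, Sessions)
getSession dbs sessions =
  let key = configKey dbs
  in case Map.lookup key sessions of
       Just s  -> (s, sessions)
       Nothing -> let s = Session key
                  in (s, Map.insert key s sessions)
```

Because the key is derived only from the package db paths, two requests that present the dbs in a different order still share one session, while any change of paths yields a fresh session with an empty cache.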

At first I tried to achieve this with singleplex workers, where each library configuration gets a separate set of persistent workers. But this brought the problem that bazel can only limit the number of workers per library configuration, not the total number of workers. As a result, many workers linger around when they are no longer necessary, consuming scarce physical memory on builds with many concurrent jobs.

The current implementation is a multiplex worker intended to address this: each build uses a single multiplexed worker, which spawns a background worker per library configuration and maintains an LRU cache to take down the background workers that have been idle the longest.
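The LRU policy described above can be sketched as a bounded map from configuration key to background worker, with a logical clock recording last use. This is an illustration under assumed names (`Lru`, `insertLru`), not the code in the prototype.

```haskell
import qualified Data.Map.Strict as Map

data Lru k v = Lru
  { capacity :: Int
  , tick     :: Int                 -- logical clock, bumped per insert
  , entries  :: Map.Map k (Int, v)  -- key -> (last use, background worker)
  } deriving Show

emptyLru :: Int -> Lru k v
emptyLru cap = Lru cap 0 Map.empty

-- Insert (or refresh) a worker; when over capacity, evict the entry
-- with the smallest last-use tick, i.e. the worker idle the longest.
insertLru :: Ord k => k -> v -> Lru k v -> Lru k v
insertLru k v (Lru cap t m) =
  let m' = Map.insert k (t, v) m
      evicted
        | Map.size m' > cap =
            let older key (u, _) (bk, bu)
                  | u < bu    = (key, u)
                  | otherwise = (bk, bu)
                victim = fst (Map.foldrWithKey older (k, t) m')
            in Map.delete victim m'
        | otherwise = m'
  in Lru cap (t + 1) evicted

lookupLru :: Ord k => k -> Lru k v -> Maybe v
lookupLru k lru = snd <$> Map.lookup k (entries lru)
```

In the real worker, eviction would also have to shut down the background worker process, not just drop the map entry.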

Multiplexed sandboxing is handled by changing the working directory of a background worker to the appropriate sandbox when dispatching a request to it. In theory bazel 6 supports multiplexed sandboxing, but in some preliminary tests inputs appeared to be missing from the sandbox, and running the find command on the sandbox during the action produced mysterious error messages about missing files (also inputs to the action).
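The working-directory switch can be sketched as follows: run the request's action with the working directory set to its sandbox, restoring the previous directory afterwards even on exceptions. The name `inSandbox` is illustrative.

```haskell
import Control.Exception (bracket)
import System.Directory
  (getCurrentDirectory, setCurrentDirectory, createDirectoryIfMissing)

-- Run an action with the working directory set to the request's
-- sandbox, restoring the old working directory afterwards.
inSandbox :: FilePath -> IO a -> IO a
inSandbox sandbox action =
  bracket getCurrentDirectory setCurrentDirectory $ \_ ->
    setCurrentDirectory sandbox >> action
```

Note that `System.Directory.withCurrentDirectory` already provides this pattern; the sketch spells it out with `bracket` to make the restore step explicit. A process-global working directory is also one reason this only works with one request in flight per background worker.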

The fix requiring the least sophistication is probably to implement, in bazel itself, a limit on the total number of workers, backed by its own LRU cache.

Dynamic membership of modules to libraries

Another difference between requests is that, even for the same package database, different interface files are present depending on what the module under compilation imports transitively. The persistent worker sometimes tried to load interface files that aren't in the transitive dependencies, and this caused a compilation failure because the module was missing from the sandbox.

There are a few uncertainties to resolve:

  • Firstly, it is uncertain whether GHC can tolerate this shifting membership of interface files in packages. Presumably, GHC can be modified to only load interface files that are in the transitive dependencies of the module being compiled.
  • Secondly, it is uncertain whether we can fix this GHC behavior from the GHC API or will need to patch the compiler.
  • Finally, we don't yet know the conditions under which GHC loads these seemingly unnecessary interface files. We need to sidestep other errors first before we can reproduce this issue reliably. So far, I have seen it appear a few times when building multiple libraries in one build command with sandboxing, and I wasn't able to reproduce it in the small.
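The restriction we would want from GHC in the first point can be stated as a pure check: an interface file may be considered only if its module is in the transitive import closure of the module being compiled. The sketch below assumes the dependency graph is given; the names (`DepGraph`, `mayLoadInterface`) are hypothetical.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type ModuleName = String
type DepGraph   = Map.Map ModuleName [ModuleName]

-- All modules reachable from the root via imports (excluding the root).
transitiveDeps :: DepGraph -> ModuleName -> Set.Set ModuleName
transitiveDeps g root = go Set.empty (Map.findWithDefault [] root g)
  where
    go seen [] = seen
    go seen (m:ms)
      | m `Set.member` seen = go seen ms
      | otherwise =
          go (Set.insert m seen) (Map.findWithDefault [] m g ++ ms)

-- An interface file is admissible only if its module is in the closure;
-- anything else may be absent from the sandbox of this request.
mayLoadInterface :: DepGraph -> ModuleName -> ModuleName -> Bool
mayLoadInterface g root m = m `Set.member` transitiveDeps g root
```

Whether such a check can be imposed from the GHC API, or needs a compiler patch, is exactly the open question above.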

One feature that could simplify this story is if bazel implemented a sort of additive sandboxing: a persistent worker gets a sandbox in which inputs are exposed but never removed. Whenever a new compilation request arrives, new inputs are added, and the inputs of older compilation requests are kept for as long as the persistent worker is alive. This, however, compromises hermeticity to work around issues in the compiler, so it may make sense to investigate the compiler issues further first.
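The bookkeeping for such an additive sandbox is simple to state: the sandbox contents are the union of all inputs ever exposed, and each request only needs the inputs not yet present to be materialized. A minimal sketch, with `addRequestInputs` as an assumed name:

```haskell
import qualified Data.Set as Set

type SandboxInputs = Set.Set FilePath

-- Merge a request's inputs into the sandbox. Returns the inputs that
-- still need to be materialized; nothing already exposed is removed.
addRequestInputs :: [FilePath] -> SandboxInputs -> ([FilePath], SandboxInputs)
addRequestInputs req sandbox =
  let fresh = filter (`Set.notMember` sandbox) req
  in (fresh, sandbox `Set.union` Set.fromList fresh)
```

The monotonically growing set is precisely what compromises hermeticity: stale inputs from old requests remain visible to later compilations.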

Loading object files and libraries multiple times

If multiple compilation requests require loading the same libraries or object files for Template Haskell, the GHC API sometimes complains that these libraries and object files have already been loaded. We need to either discover where GHC keeps its list of loaded artifacts or keep our own, to avoid loading them more than once.
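The second option, keeping our own record, can be sketched as a set of already-loaded paths consulted before each load. The loading action itself is passed in as a parameter here, standing in for whatever GHC API linker call the worker uses; the names (`Loaded`, `loadOnce`) are illustrative.

```haskell
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')
import qualified Data.Set as Set

type Loaded = IORef (Set.Set FilePath)

newLoaded :: IO Loaded
newLoaded = newIORef Set.empty

-- Load an object file or library at most once per worker lifetime.
loadOnce :: Loaded -> (FilePath -> IO ()) -> FilePath -> IO ()
loadOnce loaded loadArtifact path = do
  seen <- readIORef loaded
  if path `Set.member` seen
    then pure ()  -- already loaded; avoid the "already loaded" complaint
    else do
      loadArtifact path
      modifyIORef' loaded (Set.insert path)
```

One subtlety this glosses over is whether the same artifact can appear under different paths across requests, which is the moving-package-dbs problem again.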

facundominguez · Jun 09 '22