
Support fine grained module level build definitions

Open aherrmann opened this issue 3 years ago • 37 comments

Goal

Add a haskell_module rule to support fine-grained build definitions at the level of individual Haskell modules.

Motivation

The existing haskell_library|binary|test rules operate on the level of Haskell packages. In these rules, building consists of two main steps with corresponding Bazel build actions: compiling all modules of a package in one build action, and linking all modules together into a library in a separate action (one for static and one for dynamic linking). Additionally, there is an action to construct the package's package database, and several supporting actions.

This means that all modules of such a package need to be recompiled even if only a single module in a package changed compared to a previous cached build.

With a haskell_module rule we could instead issue one compile action per Haskell module. This would allow us to benefit from Bazel's caching and parallel builds on the level of individual modules.

Outline

Not all the details are worked out at this stage, however, the idea is that the haskell_module rule may be used like this:

haskell_module(
    name = "ModuleA",
    src = "ModuleA.hs",
    deps = ["@stackage//:base"],
)
haskell_module(
    name = "ModuleB",
    src = "ModuleB.hs",
    deps = [
        "@stackage//:base",
        ":ModuleA",
    ],
)
haskell_library(
    name = "my-library",
    modules = [
        ":ModuleA",
        ":ModuleB",
    ],
    deps = ["@stackage//:base"],
)

In this example the haskell_module rules perform the compilation of modules to object and interface files, and the haskell_library rule performs the linking of these compiled modules into a static archive or shared object and the generation of a package database for the package.

Alternatives considered

The goal is to provide what we called "sub-target incremental builds" in the context of the current haskell_library|binary|test rules, i.e. avoid recompiling modules that have not changed between builds even if another module that is part of the same library or executable target did change. An alternative approach that we considered is to use Bazel's persistent worker mechanism to provide a persistent GHC session that caches previous compilation outputs (object and interface files). Downsides of this approach are that:

  • It is a more complex solution, since it requires maintaining our own cache and maintaining state across worker invocations.
  • It introduces potential non-hermeticity, since the worker would persist state across build actions outside of Bazel's control.
  • The benefits are limited to a worker instance, i.e. modules cannot be shared across build nodes through a remote cache.

Anticipated problems

Slower uncached builds

Starting one GHC session per module may well be slower than starting a single GHC session that compiles multiple modules at once. This means there is a tradeoff between uncached builds from scratch and incremental, cached builds. For everyday development the incremental, cached build is likely the more common case, so the benefit may outweigh the cost. Furthermore, we could still use a persistent worker to save on the repeated start-up of GHC sessions, similar to how it is already done for Java or Scala in Bazel. In this case we would not need to maintain a cache of compilation outputs in the persistent worker, reducing the downside of that approach.

rules_haskell already contains an implementation of a persistent worker that could be adapted to this use-case.

More verbose build definitions

Haskell compilation has to occur in dependency order: the Haskell compiler requires the interface files of the modules that the current module depends on as inputs in order to compile the current module. This means that haskell_module rules need to spell out the Haskell module dependency graph in the BUILD files. Manually maintaining this dependency graph in both the Haskell sources and the BUILD files is tedious. BUILD file generation, e.g. a Gazelle extension for Haskell, can solve this issue.
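To illustrate the duplication, each import in a module's source has to be mirrored by a deps edge in the BUILD file (a hypothetical fragment; target and file names are illustrative):

```python
# Module/B.hs contains:
#
#   module Module.B where
#   import Module.A
#
# so the BUILD file must repeat that edge, because GHC needs
# Module/A.hi as an input when compiling Module/B.hs:
haskell_module(
    name = "Module.B",
    src = "Module/B.hs",
    deps = [
        ":Module.A",  # mirrors `import Module.A` in the source
    ],
)
```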

aherrmann avatar Jun 03 '21 10:06 aherrmann

The implementation of this feature has progressed (#1553, #1564, #1565, #1568, #1579, #1590) and we have discovered some design questions with the API that need to be resolved.

Haskell module compilation is trickier than, say, C module compilation, because we need metadata about the final library to compile the individual module, most prominently -this-unit-id, which requires the name of the package that the module is part of. So there is a reverse dependency going from the haskell_library back to the haskell_module.

Extremes

It's a large design space, but I think it can help to consider two extremes between which haskell_module could lie:

  • Either haskell_module is very explicit and acts like a rule in a Makefile. I.e. it is fully explicit about all compilation flags and produces a .o and .hi file itself.
  • Or it is more abstract and only defines the inter-module dependency graph and the actual compilation is deferred until the module is used in a haskell_library|binary|test target.

Explicit

The explicit extreme could look something like this:

haskell_module(
    name = "Module.A",
    src = "Module/A.hs",
    deps = [
        "@stackage//:base",
    ],
    # Package name and version - duplicated across modules and library target
    # Relevant for flags such as -this-unit-id, -optP-DCURRENT_PACKAGE_KEY
    package_name = "mypackage",
    package_version = "0.1.0.0",
    # common flags - duplicated across modules
    # Relevant for compilation
    ghcopts = ["-XOverloadedStrings"],
)
haskell_module(
    name = "Module.B",
    src = "Module/B.hs",
    deps = [
        "@stackage//:base",
        ":anotherpackage",
        ":Module.A",
    ],
    package_name = "mypackage",
    package_version = "0.1.0.0",
    ghcopts = ["-XOverloadedStrings"],
)
haskell_library(
    name = "mypackage",
    package_name = "mypackage",  # optional: defaults to name
    version = "0.1.0.0",
    modules = [":Module.A", ":Module.B"],
    # We need to repeat package deps for the package-db entry.
    deps = [
        "@stackage//:base",
        ":anotherpackage",
    ],
    ghcopts = [...],  # only linking options go here
)

Where bazel build //:Module.A produces bazel-bin/Module/A.o (or similar).

Note, many attributes are repeated across modules and library, e.g. package name or ghcopts.

Abstract

The abstract extreme could look something like this instead:

haskell_module(
    name = "Module.A",
    src = "src/Module/A.hs",
)
haskell_module(
    name = "Module.B",
    src = "src/Module/B.hs",
    deps = [
        ":Module.A",
    ],
)
haskell_library(
    name = "mypackage",
    package_name = "mypackage",  # optional: defaults to name
    version = "0.1.0.0",
    modules = [":Module.A", ":Module.B"],
    deps = [
        "@stackage//:base",
        ":anotherpackage",
    ],
    ghcopts = ["-XOverloadedStrings"],
)

Where bazel build //:Module.A produces no output. Instead the module is only built when //:mypackage is built. (There are several ways to achieve this: transitions, aspects, compile inside haskell_library. But, let's focus on the API side first.)

In this case haskell_library collects attributes like package name or ghcopts for all modules. (This approach could still allow for per-module ghcopts as well, if deemed necessary.)

Pros & Cons

Explicit

A clear advantage of the explicit approach is that it is very simple to understand.

A clear downside is that it requires a lot of very mechanical duplication. This is perhaps less of a concern when using a Gazelle extension to generate the haskell_module targets. But, it's still very verbose.

Another downside is that the package_name|version attributes, used to generate -this-unit-id, mean that such a target can only be used by one haskell_library. If, for whatever reason, one wants to include the module in two separate haskell_library targets, then one needs two separate haskell_module targets as well.

It's also easy to make mistakes, such as using a haskell_module in a haskell_library with a mismatching package-name (ideally haskell_library should raise an error in this case).

Abstract

The abstract approach on the other hand is much less verbose and the haskell_module targets are more high-level: They declare the module dependencies but hide details like -this-unit-id.

The downside is of course that it is less explicit: It's not the haskell_module alone that produces a .o|.hi, but only the combination of haskell_module and haskell_library.

Targets that don't themselves produce anything seem a bit strange. This is not without precedent though, proto_library targets are similar. They themselves only declare the interdependency between proto files, but don't generate any bindings. Only when paired with one of the <lang>_proto_library rules will Bazel generate any output. In that case typically through aspects.
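For comparison, a sketch of the proto_library pattern (target and file names are made up for illustration):

```python
# proto_library targets only describe the .proto dependency graph;
# building them alone generates no language bindings.
proto_library(
    name = "address_proto",
    srcs = ["address.proto"],
)

proto_library(
    name = "person_proto",
    srcs = ["person.proto"],
    deps = [":address_proto"],
)

# Only a language-specific rule triggers code generation, typically
# by walking the proto_library graph with an aspect.
java_proto_library(
    name = "person_java_proto",
    deps = [":person_proto"],
)
```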

Current Status

The current implementation lies somewhere between the two approaches. It largely follows the explicit approach, but the package name is implicitly forwarded from the haskell_library to the haskell_module via a transition. In its current form this has the issue that bazel build //... will build the haskell_module twice: once without -this-unit-id when building the haskell_module target, and once more with -this-unit-id when building the haskell_library.

Going Forward

I'd argue that we should commit more closely to one of those approaches. Either take the explicit approach and incur the duplication of the package name, but have haskell_module produce the .o and .hi files by itself.

Or make haskell_module more abstract, but then also avoid compiling the code twice, i.e. haskell_module alone does not produce a .o or .hi file. If we go the abstract road then we may as well make full use of the benefits and reduce duplication by letting users declare shared attributes on the haskell_library|binary|test target.

@Radvendii @facundominguez thoughts?

aherrmann avatar Oct 12 '21 15:10 aherrmann

Or make haskell_module more abstract, but then also avoid compiling the code twice.

How do you avoid compiling twice if the module needs to be built with different -this-unit-id flags?

Also, I have some recollections of cc_library not committing to build a library until seeing how it would be linked with cc_binary.

facundominguez avatar Oct 12 '21 16:10 facundominguez

Or make haskell_module more abstract, but then also avoid compiling the code twice.

How do you avoid compiling twice if the module needs to be built with different -this-unit-id flags?

Well, if it needs to be built with different unit-ids then you can't avoid building it multiple times. But, the point is that, right now, it is being built twice even if it is only needed with one unit-id.

Also, I have some recollections of cc_library not committing to build a library until seeing how it would be linked with cc_binary.

If you build a cc_library target, either directly or bazel build //..., then it will build both the .a and .so (unless you set linkstatic=True). But, the actions that generate the .a and .so are separate actions. So, if you bazel build //:some-target that links the cc_library statically, then only the .a will be requested and built.

aherrmann avatar Oct 12 '21 17:10 aherrmann

Another option is to modify what we currently have slightly by having the haskell_module() rule detect when it is being built without a package name supplied (typically from being used by a haskell_library()) and not build anything in that case (since it can't really be used by anything).

Pros:

  • bazel build //... doesn't build twice
  • the haskell_module() rule is the one doing the building

Cons:

  • The rule inconsistently builds or doesn't build depending on how it's used (this could maybe be mitigated by emitting a message / warning when it's built on its own)

There is another option which combines both approaches, with the advantage of enabling either approach depending on need and the disadvantage of being more complex. This would be that haskell_module() would build if supplied a package_name attribute, and only supply the information necessary to build if not. haskell_library() would know how to deal with both types of outputs.

Now that I write it out that seems not worth the complexity

Radvendii avatar Oct 12 '21 18:10 Radvendii

Another option is to modify what we currently have slightly by having the haskell_module() rule detect when it is being built without a package name supplied (typically from being used by a haskell_library()) and not build anything in that case

Do haskell_binary rules supply a package name to haskell_modules?

facundominguez avatar Oct 12 '21 18:10 facundominguez

That's a good point. We could make them do that, but that might be even more confusing.

I'm not sure I quite understand the relationship between binaries and modules. Why not bundle modules up in a library and have the binary depend on that?

Radvendii avatar Oct 12 '21 18:10 Radvendii

I'm not sure I quite understand the relationship between binaries and modules. Why not bundle modules up in a library and have the binary depend on that?

It looks feasible, but more bureaucratic than defining a binary with a single rule.

facundominguez avatar Oct 12 '21 21:10 facundominguez

Another option is to modify what we currently have slightly by having the haskell_module() rule detect when it is being built without a package name supplied (typically from being used by a haskell_library()) and not build anything in that case (since it can't really be used by anything).

Pros:

  • bazel build //... doesn't build twice
  • the haskell_module() rule is the one doing the building

As I understand it, this falls into the abstract side. Building a haskell_module alone won't do anything, because no package_name is supplied. Only building a haskell_library|binary|test that depends on it will cause it to be built, because it supplies a package name. Note, the pro "the haskell_module() rule is the one doing the building" can also be achieved without this kind of detection mechanism. If haskell_module doesn't return the object in DefaultInfo, but only in HaskellModuleInfo, then building the target directly won't do anything, only building a haskell_library|binary|test that depends on it will request the object and therefore trigger its build. The build action can still be emitted by haskell_module.

Or phrased as a question: Is there an observable difference between this approach and just never returning an object in DefaultInfo so that it is only ever requested through a haskell_library|binary|test dependency via HaskellModuleInfo?
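A minimal sketch of that idea (provider fields and rule shape are illustrative, not the actual rules_haskell implementation):

```python
def _haskell_module_impl(ctx):
    hi = ctx.actions.declare_file(ctx.label.name + ".hi")
    o = ctx.actions.declare_file(ctx.label.name + ".o")
    # ... emit the GHC compile action producing `hi` and `o` here ...
    return [
        # No files in DefaultInfo: `bazel build //:Module.A` requests
        # nothing, so the compile action only runs when a dependent
        # haskell_library|binary|test asks for the outputs below.
        DefaultInfo(),
        HaskellModuleInfo(interface = hi, object = o),
    ]
```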

Another option is to modify what we currently have slightly by having the haskell_module() rule detect when it is being built without a package name supplied (typically from being used by a haskell_library()) and not build anything in that case

Do haskell_binary rules supply a package name to haskell_modules?

haskell_binary does not supply a package name. In that case GHC defaults to main (Side note: The language around the name main is surprisingly strong given that it can be changed with -main-is).

I'm not sure I quite understand the relationship between binaries and modules. Why not bundle modules up in a library and have the binary depend on that?

It looks feasible, but more bureaucratic than defining a binary with a single rule.

Agreed, it seems unnecessary to enforce an intermediate haskell_library.

There is another option which combines both approaches, with the advantage of enabling either approach depending on need and the disadvantage of being more complex. This would be that haskell_module() would build if supplied a package_name attribute, and only supply the information necessary to build if not. haskell_library() would know how to deal with both types of outputs.

Now that I write it out that seems not worth the complexity

Yeah, that sounds like adding a lot of complexity without clear benefit.

aherrmann avatar Oct 13 '21 10:10 aherrmann

If haskell_module doesn't return the object in DefaultInfo, but only in HaskellModuleInfo, then building the target directly won't do anything

Ohh I didn't realize it was this simple. This would be my vote, then, because it's the way I think I'd prefer to use rules_haskell. I tend to get very annoyed about needless duplication.

On the other hand, if the way people are expected to use haskell_module is almost always via gazelle, then I think a simpler underlying mechanism is better.

Do we know how it's going to be used in practice? I don't have a sense for how ubiquitous gazelle is.

Radvendii avatar Oct 13 '21 12:10 Radvendii

At first, I expected the abstract approach to help avoid setting ghcopts for every module of a library. But now I realize that perhaps this can be done with the explicit approach via transitions. If that is the case, then this becomes pretty much an implementation choice. I don't think the difference would be observable to the user, unless they go peering at the actions that implement the rules.

I would expect the abstract approach to be simpler, because all the information is gathered in haskell_module to later build in the library or binary rules without using transitions. Transitions, in my opinion, spread the task among more concepts than we would otherwise need.

facundominguez avatar Oct 13 '21 22:10 facundominguez

At first, I expected the abstract approach to help avoid setting ghcopts for every module of a library. But now I realize that perhaps this can be done with the explicit approach via transitions. If that is the case, then this becomes pretty much an implementation choice. I don't think the difference would be observable to the user, unless they go peering at the actions that implement the rules.

The way I intended that distinction, this would fall closer to the "abstract" side. With the "explicit" vs. "abstract" distinction I was referring to the exposed API, not to the implementation. Whether we use transitions to achieve this or not is an implementation choice. But, as far as the user is concerned, if ghcopts (and others) are not fully declared on the haskell_module, then the module rule is not fully explicit, but is implicitly requiring parameters from the haskell_library|binary|test that depends on it in some way. So, haskell_module cannot be built in isolation.

That's what I meant above by

Where bazel build //:Module.A produces no output. Instead the module is only built when //:mypackage is built. (There are several ways to achieve this: transitions, aspects, compile inside haskell_library. But, let's focus on the API side first.)


I would expect the abstract approach to be simpler, because all the information is gathered in haskell_module to later build in the library or binary rules without using transitions. Transitions, in my opinion, spread the task among more concepts than we would otherwise need.

As stated above, I think this now discusses the implementation more than the API design. I assume that you mean an implementation where the build actions are not emitted by haskell_module but only later by haskell_library. Yes, I agree that this is simpler. Transitions introduce a fair bit of complexity. I'm also concerned that https://github.com/bazelbuild/bazel/issues/13281 might come back to bite us in this case.

aherrmann avatar Oct 14 '21 09:10 aherrmann

I don't see much benefit in the explicit API. If a module needs to be compiled with specific ghcopts, why not use the OPTIONS_GHC pragma instead? Are there any other attributes that would be desirable to specify per module?

facundominguez avatar Oct 14 '21 14:10 facundominguez

Are there any other attributes that would be desirable to specify per module?

Plugins come to mind, as discussed in https://github.com/tweag/rules_haskell/pull/1566#discussion_r673414169 in the past. That said, the "abstract" approach does not preclude that option.

aherrmann avatar Oct 14 '21 16:10 aherrmann

IIUC, the advantage of the explicit approach is that it is simpler to understand and implement, meaning (potentially) fewer bugs. Is that right? Is there anything else to it?

Radvendii avatar Oct 14 '21 21:10 Radvendii

IIUC, the advantage of the explicit approach is that it is simpler to understand and implement, meaning (potentially) fewer bugs. Is that right? Is there anything else to it?

At this point I'm totally confused about whether we are discussing the implementation or the interface. :)

If we are discussing the interface, then the "abstract" approach is to allow haskell_library and haskell_binary rules to influence what compilation options are used to build modules. The "explicit" approach does not allow this (i.e. all the compilation options need to be included in the haskell_module rule).

Both interfaces can be implemented by deferring or not deferring compilation actions to the haskell_library and haskell_binary rules. Of the combinations of interface and implementation:

  1. explicit interface - deferred actions: simple implementation
  2. explicit interface - non-deferred actions: simplest implementation
  3. abstract interface - deferred actions: simple implementation
  4. abstract interface - non-deferred actions: requires transitions

I think the easiest to use is probably (3), and the simplest to implement is (2).

facundominguez avatar Oct 14 '21 21:10 facundominguez

@Radvendii

IIUC, the advantage of the explicit approach is that it is simpler to understand and implement, meaning (potentially) fewer bugs. Is that right? Is there anything else to it?

Yes, I agree. It is the only option that allows the haskell_module to be built in isolation in a meaningful way, i.e. bazel build //:My.Module produces a .o and .hi file, such that those are the same files that a haskell_library or similar depending on the module will use.


@facundominguez

At this point I'm totally confused about whether we are discussing the implementation or the interface. :) If we are discussing the interface [...]

Yes, the interface.

Thanks for the good summary, I agree. I'd say option (1) is not something we'd want since it incurs the verbosity of the explicit interface, but doesn't provide the benefit of a target that can be built in isolation as described above.

aherrmann avatar Oct 15 '21 10:10 aherrmann

Yeah, good summary @facundominguez that helped clear things up in my head.

For reference, (4) is what we currently have implemented. I agree that (2) and (3) have significant advantages though.

I don't really have a good grasp on using rules_haskell, so I don't have strong opinions on which is better.

Radvendii avatar Oct 18 '21 14:10 Radvendii

I think the abstract interface is preferable. From a user's perspective it provides a higher level and more declarative interface: The user defines the module dependency graph, not the mechanics behind building individual modules. The explicit approach not only introduces more tedious duplication, it also increases the potential for user errors, e.g. by trying to bundle modules with different package names into the same library target - something that rules_haskell would then have to detect to produce a meaningful error message.

Regarding implementation, i.e. the choice between (3) or (4), I think (3) is preferable. I did a test to see if we are affected by https://github.com/bazelbuild/bazel/issues/13281, and we are indeed. I've pushed an example that triggers the issue in https://github.com/tweag/rules_haskell/commit/acb030b0144ce21012fbc94595b0209b2a67f223. This constructs a diamond shaped dependency graph like this

         --> Module.A --> lib-a -->
lib-root                            test
         --> Module.B --> lib-b -->

Meaning test directly depends on lib-a, and lib-b, and transitively on the modules and lib-root.

In this case we'd want lib-root to only be built once, and the same objects and interfaces to be used for all dependents. But, that is not what we observe:

$ bazel clean; bazel test //tests/haskell_module/diamond:test --execution_log_json_file=execlog.json
$ jq 'select(.mnemonic == "HaskellBuildLibrary")|{progressMessage, outputPaths: [.actualOutputs[].path]}' <execlog.json
{
  "progressMessage": "HaskellBuildLibrary //tests/haskell_module/diamond:lib-root",
  "outputPaths": [
    "bazel-out/k8-fastbuild-ST-2e826c114ca7/bin/tests/haskell_module/diamond/_obj/lib-root/Module/Root.o",
    ...
  ]
}
{
  "progressMessage": "HaskellBuildLibrary //tests/haskell_module/diamond:lib-root",
  "outputPaths": [
    "bazel-out/k8-fastbuild-ST-3ce0d30d28dc/bin/tests/haskell_module/diamond/_obj/lib-root/Module/Root.o",
    ...
  ]
}
{
  "progressMessage": "HaskellBuildLibrary //tests/haskell_module/diamond:lib-root",
  "outputPaths": [
    "bazel-out/k8-fastbuild/bin/tests/haskell_module/diamond/_obj/lib-root/Module/Root.o",
    ...
  ]
}

Meaning the library is actually built three times. That's precisely the issue described in https://github.com/bazelbuild/bazel/issues/13281. So, transitions are not a good approach in this case.

EDIT I noticed that I linked to the wrong commit. The fixed example is available at cb567e4286317b85615705670af1157433a4b501. The corresponding commands are

$ bazel clean; bazel test //tests/haskell_module/generated:test --execution_log_json_file=execlog.json
$ jq 'select(.mnemonic == "HaskellBuildLibrary")|{progressMessage, outputPaths: [.actualOutputs[].path]}' <execlog.json

(Tested with Nix and direnv installed and direnv enabled in the repository.)

aherrmann avatar Oct 20 '21 10:10 aherrmann

I'm currently trying to implement (3); however, it feels like going against the current so far.

The task is creating the compile actions for the modules, given the descriptions/targets of the haskell_module rules that have been deferred.

My first try was a recursive function: for every haskell_module target, the function would first create the actions to compile its dependencies, and then create the action for the haskell_module target itself.

The above is very simple, but it doesn't work because Bazel's Starlark doesn't allow recursive functions!

My next attempt was to encode the recursion with an explicit stack. This seems to work, but since Starlark doesn't offer while loops, I resorted to a for loop over a very long range that is terminated early when the stack-encoded recursion finishes.

Is this the right way to go?

Here's the function: https://github.com/tweag/rules_haskell/blob/a7d412ea0cd5ff051bbdacb86595e42d4f8bc004/haskell/experimental/private/module.bzl#L236-L298
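As a runnable model of the pattern (plain Python here, but using only constructs Starlark also allows: no while, no recursion, a for loop over a bounded range; deps is a hypothetical dict mapping module names to their direct dependencies):

```python
def compile_order(deps):
    # deps: module name -> list of direct dependency names (acyclic).
    # Returns modules in post-order, i.e. every module after all of its
    # dependencies: the order in which compile actions could be created.
    order = []
    done = {}
    # Stack entries are (module, expanded?) pairs.
    stack = [(mod, False) for mod in sorted(deps.keys())]
    num_edges = sum([len(ds) for ds in deps.values()])
    # Starlark has no `while`, so bound the loop: at most one iteration
    # per stack push, and pushes are bounded by 2 * (nodes + edges).
    for _ in range(2 * (len(deps) + num_edges) + 1):
        if not stack:
            break
        mod, expanded = stack.pop()
        if mod in done:
            continue
        if expanded:
            done[mod] = True
            order.append(mod)  # all deps were appended in earlier iterations
        else:
            stack.append((mod, True))  # revisit after the deps are done
            stack.extend([(dep, False) for dep in deps[mod]])
    return order
```

For example, `compile_order({"A": [], "B": ["A"], "C": ["A", "B"]})` lists A before B before C.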

facundominguez avatar Nov 02 '21 19:11 facundominguez

And here's another question: if a module is used in two haskell_binary rules, and each binary rule creates its own action to compile the module, we should expect the module to be compiled twice and there is little we can do to avoid this, right?

This is something that affects erasing library boundaries when defining haskell_module rules. Before, if modules coming originally from libraries ended up used in multiple binaries, they wouldn't be rebuilt. But this doesn't appear to be the case anymore with deferred actions.

facundominguez avatar Nov 02 '21 20:11 facundominguez

My first try was making a recursive function, for every haskell_module target, the function would create first the actions to compile its dependencies, and then it would create the action for the haskell_module target. The above is very simple, but it doesn't work because bazel starlark doesn't allow recursive functions!

Correct, Starlark does not allow recursion, or unbounded loops in general. That said, I don't think this is required in this case.

One difficulty seems to be calculating the set of transitive module dependencies in the haskell_library|... rule implementation. Fortunately, you don't have to do this in that place. Instead you can let Bazel do this for you on the level of the haskell_module rule by returning a field for transitive dependencies in the HaskellModuleInfo provider.

Another important feature of Bazel is that the build actions don't need to be emitted in any particular order. So, you could do something along the lines of this (just a sketch):

# Declare module outputs
module_outputs = {
    mod.label: struct(
        interface = ctx.actions.declare_file(...),
        object = ctx.actions.declare_file(...),
    )
    for mod in modules
}

# Declare build actions
for mod in modules:
    outputs = module_outputs[mod.label]
    inputs = []
    for dep in mod.transitive_module_dependencies:
        inputs.append(module_outputs[dep.label].interface)
        inputs.append(module_outputs[dep.label].object)
    _run_ghc(
        outputs = outputs,
        inputs = inputs,
        ...
    )

And here's another question: if a module is used in two haskell_binary rules, and each binary rule creates its own action to compile the module, we should expect the module to be compiled twice and there is little we can do to avoid this, right?

Yes, that is true. Given the odir, hidir constraint, i.e. that all modules of a library or binary need to have their interfaces and objects in the same directory, I don't see much that we can do about this.

aherrmann avatar Nov 03 '21 09:11 aherrmann

When trying to decide what data structure to use to collect transitive dependencies of haskell_module, I'm not finding an obvious choice.

This is mod.transitive_module_dependencies in Andreas's sketch. Semantically it would be a set of Targets. Bazel recommends using depset for collecting values transitively, but I imagine that checking for Target equality is overkill, since comparing only the labels would be necessary.

But the other alternatives also seem to impose some merging overhead as explained in the documentation. https://docs.bazel.build/versions/main/skylark/depsets.html

Looks like the most efficient data structure would be a depmap, something like a depset that dedups by looking at a key instead of at the values.

Any thoughts on how to proceed here?

facundominguez avatar Nov 03 '21 12:11 facundominguez

A depset seems like the best data structure and I'm not aware of a depset of Targets being a particular performance concern. As far as I know depset doesn't use value equality, but compares hashes instead. I'm not sure how exactly this works in the case of depsets of Targets off the top of my head, but, there are cases in the Bazel code base where the hash code of a target is calculated based only on its label and configuration key and also avoids frequent re-computation of the hash by caching it. In other words, I'd expect it to behave similarly to the depmap you describe.

The more concerning part, performance-wise, is that we need to flatten the depset inside haskell_library|... to generate all the build actions. However, I don't really see a way around this, and so long as there is no overlap of modules between haskell_library|... targets it also wouldn't be quadratic.

aherrmann avatar Nov 03 '21 13:11 aherrmann

The more concerning part, performance-wise, is that we need to flatten the depset inside haskell_library|... to generate all the build actions. However, I don't really see a way around this, and so long as there is no overlap of modules between haskell_library|... targets it also wouldn't be quadratic.

We have to flatten the transitive dependencies once per compile action for a module in order to produce the list of input interface files. This does look quadratic in the number of modules in a library. Aren't we better off in this respect with the stack-encoded recursion?

facundominguez avatar Nov 03 '21 16:11 facundominguez

We have to flatten the transitive dependencies once per compile action for a module in order to produce the list of input interface files. This looks quadratic in the number of modules in a library.

Ah, yes, you're right. I was looking at it the wrong way.

This reminds me that we have a somewhat similar situation with the Haskell toolchain libraries, implemented here. In that case we exploit the fact that a depset can give us the dependencies in a well-defined order.

I think something like this could be used here as well. If we can iterate over the modules in the right order, such that A appears before B if B depends on A, then we can avoid iterating over each module's transitive dependencies. Another sketch illustrates what I have in mind:

# Declare module outputs
module_outputs = {
    mod: [
        ctx.actions.declare_file(...),  # interface
        ctx.actions.declare_file(...),  # object
    ]
    for mod in modules
}

# List of modules in post-order
# Meaning, if module B depends on module A, then A will appear before B.
ordered_modules = depset(
    transitive = [
        mod[HaskellModuleInfo].transitive_module_deps  # Created with order = "postorder"
        for mod in modules
    ],
    order = "postorder",  # keep the post-order guarantee when merging
)

# Collect module build inputs and declare build actions
# This is the only place where we flatten the depset
module_inputs = {}
for mod in ordered_modules.to_list():
    outputs = module_outputs[mod]
    inputs = depset(
        direct = [
            out
            for dep in mod[HaskellModuleInfo].direct_module_deps
            for out in module_outputs[dep]
        ],
        transitive = [
            module_inputs[dep]  # Will be set by a previous iteration, since all deps were visited before.
            for dep in mod[HaskellModuleInfo].direct_module_deps
        ],
    )
    module_inputs[mod] = inputs
    _run_ghc(
        outputs = outputs,
        inputs = inputs,
        ...
    )

This flattens the depset of all modules only once, and for each module only iterates over its direct dependencies.
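The same traversal can be simulated in plain Python (sets stand in for depsets here, so the structural sharing of depset nodes is not modeled, only the visit order and the resulting per-module input sets; the module graph is made up):

```python
# Each module's input set is built from its direct deps' outputs plus the
# already-computed input sets of those deps, visiting each module once.
direct_deps = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],
}

def postorder(roots):
    # Depth-first post-order: deps appear before their dependents.
    order, seen = [], set()
    def visit(m):
        if m in seen:
            return
        seen.add(m)
        for d in direct_deps[m]:
            visit(d)
        order.append(m)
    for m in roots:
        visit(m)
    return order

module_outputs = {m: {m + ".hi", m + ".o"} for m in direct_deps}
module_inputs = {}
for mod in postorder(["D"]):
    inputs = set()
    for dep in direct_deps[mod]:
        inputs |= module_outputs[dep]  # direct deps' compiled outputs
        inputs |= module_inputs[dep]   # transitive inputs, computed earlier
    module_inputs[mod] = inputs
print(sorted(module_inputs["D"]))
# ['A.hi', 'A.o', 'B.hi', 'B.o', 'C.hi', 'C.o']
```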

aherrmann avatar Nov 04 '21 11:11 aherrmann

I'm impressed. That's a clever way to copy the structure of a depset.

However, I'm failing to see the essential difference from the previous sketch. Your last sketch still flattens the transitive inputs once per action, only that it is deferred until the very moment the action is executed. Isn't it still quadratic?

This also brings to the fore the fact that once we require transitive inputs for all actions, there is no way to implement it that isn't potentially quadratic, not even running the actions in the haskell_module rules as before.

facundominguez avatar Nov 04 '21 12:11 facundominguez

However, I'm failing to see the essential difference from the previous sketch. Your last sketch still flattens the transitive inputs once per action, only that it is deferred until the very moment the action is executed. Isn't it still quadratic?

The difference is that it benefits from the sharing of depset nodes and avoids unnecessary flattening in the analysis phase. The performance documentation raises the concern about calling .to_list() in rule implementations, not about using depsets in general. See also here and here.
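As a toy model of that sharing (not Bazel's actual implementation): a depset-like node keeps references to its children instead of copying their contents, so constructing a new one costs only its direct elements, and only `to_list()` walks the whole graph:

```python
# Toy model of depset node sharing: a node holds direct items plus
# references to child nodes, so building a new node is O(direct),
# not O(total transitive elements).
class Node:
    def __init__(self, direct=(), transitive=()):
        self.direct = list(direct)
        self.children = list(transitive)  # shared, never copied

    def to_list(self):
        # Flattening walks every reachable node exactly once (post-order),
        # deduplicating items; this is the only expensive operation.
        out, seen = [], set()
        def walk(node):
            for child in node.children:
                walk(child)
            for item in node.direct:
                if item not in seen:
                    seen.add(item)
                    out.append(item)
        walk(self)
        return out

a = Node(direct=["A.hi"])
b = Node(direct=["B.hi"], transitive=[a])
c = Node(direct=["C.hi"], transitive=[a])
d = Node(direct=["D.hi"], transitive=[b, c])  # reuses a, b, c by reference
print(d.to_list())  # ['A.hi', 'B.hi', 'C.hi', 'D.hi']
```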

Which also brings to the front the fact that once we require transitive inputs for all actions, there is no approach to implement it that isn't potentially quadratic, not even running the actions in the haskell_module rules as before.

Yes, if we need those inputs, then we can't avoid that. But, by using depset efficiently in the rule implementation we can make best use of Bazel's mechanisms to optimize these kinds of situations.

aherrmann avatar Nov 04 '21 13:11 aherrmann

Should dependencies be inherited from libraries and binaries? gazelle_haskell_modules copies those into the dependencies of haskell_module, but maybe we should leave room for the user to tweak that manually (?).

facundominguez avatar Nov 04 '21 14:11 facundominguez

Should dependencies be inherited from libraries and binaries? gazelle_haskell_modules copies those into the dependencies of haskell_module, but maybe we should leave room for the user to tweak that manually (?).

Good question. I can see two ideals there:

  1. A haskell_module has no library deps, only module deps. At compilation we just forward all library deps defined on the haskell_library|binary|test to the module's compile action.
  2. Each haskell_module has minimal library deps, i.e. only those haskell(_cabal)_library targets that the module actually depends on. At compilation we forward only the module's library deps (and I assume we also need the transitive library deps of libraries and modules for Template Haskell).

Option 1. is not optimal for incremental builds and parallelism, because a module may be rebuilt when an unused library dependency changes or its build may need to wait for an unused library to finish building. The unused_inputs_file feature could potentially help with this, but it's additional complexity to get this to work.

Option 2. is better for incremental builds and parallelism. But, it's harder to get right with gazelle_haskell_modules, because we would need a module index for all library targets, including external ones (@stackage) to get it right, which Gazelle doesn't have right now.

I don't see an obvious winner between the two. I would err on the side of option 2 and consider option 1 if we know that we can make unused_inputs_file work reasonably in this case.
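For illustration, the two options would roughly correspond to BUILD definitions like these (the module and library names are hypothetical):

```python
# Option 1: the module lists only module deps; library deps stay on
# haskell_library|binary|test and are forwarded to every compile action.
haskell_module(
    name = "Parser",
    src = "Parser.hs",
    deps = [":Lexer"],
)

# Option 2: the same module additionally lists the minimal library deps
# it actually imports.
# haskell_module(
#     name = "Parser",
#     src = "Parser.hs",
#     deps = [
#         ":Lexer",
#         "@stackage//:megaparsec",
#     ],
# )
```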

aherrmann avatar Nov 04 '21 17:11 aherrmann

How about: 3. merging deps from libraries and the module, and offering a boolean attribute to disable the merging?

I think option (1) could be preferable if we can make unused_inputs_file work. So maybe we can just wait until we gain clarity on that before committing to a choice.

facundominguez avatar Nov 04 '21 21:11 facundominguez