rules_haskell icon indicating copy to clipboard operation
rules_haskell copied to clipboard

Implement unused_inputs_list for haskell_module

Open facundominguez opened this issue 3 years ago • 10 comments

rules_haskell provides the interface file of a module M as input to the action that builds every module that imports it transitively. This is because the compiler might need to read the interface file for M, and the build system can't predict reliably whether it will be needed or not.

If the interface file of M changes, then Bazel will arrange to rerun all those actions whether the interface files of the modules along the path in the dependency graph are needed or not. A remedy for this could be to use unused_input_list to report to bazel the interface files that don't affect the output. This might be possible to do if we can get the compiler to report which interface files have been actually read during compilation.

It needs to be investigated whether this is possible to implement with GHC as it is today, or perhaps with the GHC API, or perhaps by using strace.

Related https://github.com/tweag/rules_haskell/issues/1552

facundominguez avatar Jun 13 '22 20:06 facundominguez

The strategy adopted in #1771 is to generate not only the interface file, but also a file containing only the ABI hash stored in the interface. This hash is the one used by GHC to know if recompilation is necessary.

There are situation were this ABI hash is not modified, whereas the interface file is. It is in particular the case when the export list of a dependency is modified, or the type of a function exported by a function is modified. For instance, if a file A.hs contains:

module A (f) where

a :: Bool
a = True

f :: a -> b -> a
f x y = x

and a file B.hs imports A, then adding a to the export list of A would modify the interface file of A, and since B imports A the list of imported modules stored in the interface file of Bhas changed. However the ABI hash of B is not changed, since this change does not affect the modules importing B.

Up to this point, the situation described is the following, whenever one compiles a module, it produces an interface file, which contains a lot of information, is modified often and is required to compile the files which depends of this module and an ABI hash which contains the minimal information required to know if recompilation is required and is modified less often.

So the goal is to set the inputs of a module in order to know if a recompilation is required, and in case it is, to retrigger it. To do so, the run_ghc action takes as input not only the transitive closure of the interface files required, but also the ABI files of the direct dependencies of the module. Then, we use the unused_inputs_list to pretend that interface files are "unused", hence the caching mechanism of Bazel will not detect changes in the interface files that did not affect the ABI hashes, saving us some useless compilation. On the other hand, when the ABI hash has changed, since the interface files are provided in the inputs, Bazel won't blame us for lying to it and will gladly use the interface files to rebuild the target.

This deflect the documented use of unused_inputs_list, but seems to be working quite effectively on the tested examples.

GuillaumeGen avatar Jul 08 '22 15:07 GuillaumeGen

#1771 also introduces small example: basic_modules. It contains 3 files:

A.hs:

module A (f, g, T, a) where

data T = T0 | T1

a :: T
a = T1

f :: a -> b -> a
f x y = x

g :: T -> T
g T0 = a
g T1 = T1

B.hs:

module B where

import qualified A (f, g, T)

f :: A.T -> A.T
f x = A.f (A.g x) x

g :: a -> a -> a -> a
g x y = A.f (A.f x) (A.f x)

and C.hs:

module C where

import qualified B

data N = Z | S N

f :: a -> a -> a
f x = B.g x x

There are then quite a lot of changes one can make to A.hs or B.hs which changes the interface file B.hi but do not affect the ABI hash of B avoiding to trigger the recompilation of C.hs.

Changing the export list of A:

- module A (f, g, T, a) where 
+ module A (f, g, T) where 

Change the type of a function of A:

- f :: a -> b -> a
- f x y = x
+ f :: a -> b -> b
+ f x y = y

Inline a function of A:

+ {-# INLINE f #-}
f :: a -> b -> a

Modify the import list of B:

- import qualified A (f, g, T)
+ import qualified A (f, g, T, a)

All those modifications affects the part of the interface file of B regarding imports, hence it changes B.hi but not B.abi, so without #1771 it triggers a recompilation of C, it is not the case anymore.

GuillaumeGen avatar Jul 08 '22 15:07 GuillaumeGen

That approach in #1771 is really clever. IIUC it's essentially projecting out the minimal data that we want to take into account for cache invalidation and telling Bazel to ignore the rest.

That said, I don't have a good understanding of what the ABI hash actually represents. @GuillaumeGen can you point to some documentation, or add a note to your PR to explain it?

Inline a function of A:

+ {-# INLINE f #-}
f :: a -> b -> a

This is an interesting example. Does the addition of an INLINE pragma change A.abi? If not, that could be an issue, because it may mean that B does not get rebuild when it should take new inlinings into account.

aherrmann avatar Jul 12 '22 09:07 aherrmann

Yes, adding an {-# INLINE #-} pragma modifies the ABI file. Furthermore, if both A and B are compiled with optimizations (the -O option), then a modification of the implementation of f in A modifies both the ABI hash of A and B.

GuillaumeGen avatar Jul 12 '22 10:07 GuillaumeGen

Regrading the description of the ABI hash and how it works, all this is explained in https://gitlab.haskell.org/ghc/ghc/-/wikis/commentary/compiler/recompilation-avoidance#how-does-it-work You are right that I should write a comment about it in the code.

GuillaumeGen avatar Jul 12 '22 10:07 GuillaumeGen

@GuillaumeGen Thanks for sharing this info!

Reading the docs I see

In order to determine the second point, we look at the dependencies and usages fields of the old interface file. The dependencies contains:

  • dep_direct_mods: Direct dependencies of home-package modules that are imported by this module.
  • dep_direct_pkgs: Direct packages depended on by this module.
  • other less important stuff.

Could dep_direct_mods and potentially dep_direct_pks be used to identify unused direct module dependencies? In the context of this issue this could avoid unnecessary recompilation when declared but unused dependencies change. Alternatively, could it be used for unused dependency checking, similar to what rules_scala offers.

aherrmann avatar Jul 18 '22 13:07 aherrmann

Could dep_direct_mods and potentially dep_direct_pks be used to identify unused direct module dependencies?

How would you determine then which interface files are unused?

facundominguez avatar Jul 19 '22 11:07 facundominguez

Could dep_direct_mods and potentially dep_direct_pks be used to identify unused direct module dependencies?

How would you determine then which interface files are unused?

I was thinking of something along the lines of "Set of unused modules = Set of module deps - dep_direct_mods".

aherrmann avatar Jul 19 '22 11:07 aherrmann

Set of unused modules = Set of module deps - dep_direct_mods

It ultimately depends on how you come up with Set of module deps, but I'm guessing that Set of unused modules would end up including all of the transitively imported modules which aren't directly imported, and which might be used.

facundominguez avatar Jul 19 '22 11:07 facundominguez

Yes, as I understand this could only be used to determine unused direct module dependencies.

aherrmann avatar Jul 19 '22 12:07 aherrmann

Is here something left to do or can we close this with #1771 having been merged?

avdv avatar Dec 07 '22 08:12 avdv

I think we can close.

GuillaumeGen avatar Dec 07 '22 09:12 GuillaumeGen