Make Zinc compilations reproducible
It'd be useful to be able to support 'reproducible' builds from sbt/zinc (in the https://reproducible-builds.org/ sense).
One current source of nondeterminism in generated artifacts is the fact that JarOutputStream and ZipOutputStream (from java.util.jar and java.util.zip) are used in IO.jar(), which includes timestamps in the generated jar file. Also, I'm not sure the ordering of the files in the archive is deterministic.
There are generally two ways to make builds reproducible: generating the artifacts deterministically, or post-processing them. I'd say generating the jars deterministically would be the nicest approach and the easiest to integrate into an sbt project.
What would be a good place for extension points so this behavior can be overridden to be deterministic? Or would it even be acceptable to make this the default behavior?
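For illustration, here's a minimal sketch of what deterministic jar generation could look like. This is not sbt's actual IO.jar implementation; the object name and the choice of a fixed 2010-01-01 timestamp are my own assumptions. The idea is to write entries in sorted order and give every entry the same fixed timestamp:

```scala
import java.io.FileOutputStream
import java.nio.file.{ Files, Path }
import java.util.jar.{ JarEntry, JarOutputStream }
import scala.collection.JavaConverters._

object DeterministicJar {
  // Fixed timestamp (2010-01-01 UTC) so entries don't embed the build time.
  val fixedTime: Long = 1262304000000L

  def create(output: Path, root: Path): Unit = {
    val out = new JarOutputStream(new FileOutputStream(output.toFile))
    try {
      // Sort entries by relative path so the archive order doesn't depend
      // on filesystem iteration order.
      val files = Files.walk(root).iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .toSeq
        .sortBy(p => root.relativize(p).toString)
      files.foreach { p =>
        val entry = new JarEntry(root.relativize(p).toString.replace('\\', '/'))
        entry.setTime(fixedTime) // neutralize the per-entry timestamp
        out.putNextEntry(entry)
        out.write(Files.readAllBytes(p))
        out.closeEntry()
      }
    } finally out.close()
  }
}
```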
I am generally in favor of making an effort towards reproducible or referentially transparent builds. If we are going to do it, it would be useful to have agreement across the build tools, so I thought this would be a good forum to discuss it.
The jar files in sbt are made using the module called Package: https://github.com/sbt/sbt/blob/1.0.x/main-actions/src/main/scala/sbt/Package.scala
(Maybe we should move it to Zinc?)
IO.jar is implemented here - https://github.com/sbt/io/blob/21ed6ec2a2e33fd88a5f013f643d04fda26de4de/io/src/main/scala/sbt/io/IO.scala#L446
Unless there is a significant performance cost, I think it would make sense to move towards generating deterministic jars.
Hey @raboof, thanks for opening a ticket, this is something I've devoted some thought to lately.
I believe that the fixes to make compilation more reproducible need to be merged in sbt rather than Zinc. A good example that backs this statement up is Pants, which is already providing reproducible Scala compiles with the current Zinc (as of Zinc 1.0.x). If you want to pursue this further, I suggest you have a look at their source code.
That being said, if we see that there is some core logic that is shared across reproducible build tools and can be merged upstream, I'm up for merging it. But in my opinion the best way to move forward is to make changes in the build tools first, and then abstract over the requirements to get reproducible compiles.
Thanks for reaffirming that this is a useful goal. I'll definitely check out how Pants solves this (though this is a spare-time thing for me so I can't really promise when I'll find significant time to spend on this ;) ).
@raboof Any help is appreciated, no matter when it comes. :+1:
I had a look at Pants - AFAICS they actually don't produce reproducible builds: I created a small test project with a Scala jvm_binary, and when I rebuild after clearing my .pants.d it creates a jar with new timestamps.
Correct: pants builds are not currently bitwise identical. zinc is one piece of that problem, but there are plenty of others. We're moving toward an execution model that will allow for bitwise reproducibility, but it's not clear yet how we will make that work in the presence of zinc, given that the zinc analysis is a sort of mutable state that needs to cycle back into each re-execution of the zinc process.
Moving zinc analysis on disk to being entirely hash based rather than timestamp based would be one important piece here... also stabilizing the order of all collections in the analysis.
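As a sketch of what hash-based stamping means (the names here are hypothetical, not Zinc's actual Stamp API): identify a file by a digest of its contents instead of its lastModified time, so that byte-identical files re-created with new timestamps yield the same analysis.

```scala
import java.nio.file.{ Files, Path }
import java.security.MessageDigest

object ContentStamp {
  // A content hash changes only when the bytes change, unlike a timestamp,
  // which changes on every rebuild even for identical output.
  def hash(file: Path): String =
    MessageDigest.getInstance("SHA-1")
      .digest(Files.readAllBytes(file))
      .map(b => f"$b%02x")
      .mkString
}
```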
@stuhood Agreed that removing any part that relies on timestamps in Zinc is important, and that's why I opened #371 which will allow us to use file fingerprints for everything: class files, binaries, sources. I also agree that stabilizing the order of all collections is important too.
In my opinion, there's room for compromise when it comes to reproducible builds. I think we should be optimizing for the use case where you clone a project and checkout a branch. In those cases, you do want your build to be reproducible. But when it comes to your developer environment, it seems unproductive to ditch incremental compilation as a whole only to get reproducibility for errors that rarely happen. I think there's a good tradeoff here, and I don't mind giving up some of my reproducibility to get faster compile times.
An idea that @smarter gave me last weekend is that we could test how good incremental compilation is now by creating a script that compiles from scratch every commit in a repository and then tries to detect where the incremental compilation is producing different class files than the batch compilation. I think that would be a good experiment to convince users how rare inconsistencies due to incremental compilations are. And if they are not rare, we can always work to make them disappear.
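A rough sketch of the comparison step such a script could use (the object name and directory arguments are placeholders): hash every .class file under the two output directories and print the relative paths whose contents differ.

```scala
import java.nio.file.{ Files, Path, Paths }
import java.security.MessageDigest
import scala.collection.JavaConverters._

object CompareOutputs {
  // Map each .class file (by path relative to dir) to a digest of its bytes.
  def classHashes(dir: Path): Map[String, String] =
    Files.walk(dir).iterator().asScala
      .filter(p => Files.isRegularFile(p) && p.toString.endsWith(".class"))
      .map { p =>
        val sha = MessageDigest.getInstance("SHA-1").digest(Files.readAllBytes(p))
        dir.relativize(p).toString -> sha.map(b => f"$b%02x").mkString
      }
      .toMap

  // Usage: CompareOutputs <incremental-output-dir> <batch-output-dir>
  def main(args: Array[String]): Unit = {
    val Array(incremental, batch) = args.map(Paths.get(_))
    val (a, b) = (classHashes(incremental), classHashes(batch))
    (a.keySet ++ b.keySet).filter(k => a.get(k) != b.get(k)).toSeq.sorted
      .foreach(println)
  }
}
```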
I agree completely that ditching incremental compilation as a whole only to get reproducibility would not be good.
I assume you mean "how rare inconsistencies due to incremental compilations are"?
I'd really like to have fully-reproducible "published artifacts", but during development I don't care so much (and would prefer to have fast builds over reproducible ones :) )
@raboof Thanks for the catch! :smile: @stuhood What's your view on this?
> An idea that @smarter gave me last weekend is that we could test how good incremental compilation is now by creating a script that compiles from scratch every commit in a repository and then tries to detect where the incremental compilation is producing different class files than the batch compilation.
That's a great idea.
@stuhood What's your view on this?
In the absence of incremental compilation, the primary artifacts that would be important to make reproducible are the output class files, which are mostly scalac's responsibility. If Zinc were to start jarring outputs on its own, either it or scalac would probably want to support an option to disable/zero file timestamps on written files.
I've set up something similar and we got tons of differences (I compared CRCs of jar entries; a sketch of that comparison is below). After further examination it turned out that we had a lot of differences even in full builds of the same commit on different machines. Comparing classes using javap shows that:
- scalac sometimes generates bytecode in a different order (in terms of the order of methods). This is a small problem and pretty easy to overcome.
- scalac sometimes generates different names for e.g. inner classes, which makes the whole idea of comparing jars pretty unrealistic. We are working on a small reproduction so we can talk with the Scala team about potential fixes.
cc @mkeskells @rorygraves
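For reference, a sketch of the CRC comparison described above (the names are my own, not the actual tooling used): list each jar entry's name and CRC-32 so two jars can be diffed without unpacking them.

```scala
import java.util.zip.ZipFile
import scala.collection.JavaConverters._

object JarCrcs {
  // Read the stored CRC-32 of every entry in a jar/zip.
  def crcs(path: String): Map[String, Long] = {
    val zip = new ZipFile(path)
    try zip.entries().asScala.map(e => e.getName -> e.getCrc).toMap
    finally zip.close()
  }

  // Entry names whose CRC differs between the two archives
  // (or that exist in only one of them).
  def diff(a: String, b: String): Set[String] = {
    val (ca, cb) = (crcs(a), crcs(b))
    (ca.keySet ++ cb.keySet).filter(k => ca.get(k) != cb.get(k))
  }
}
```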
Tree traversal is supposed to be deterministic. What we should ensure, though, is that all the compiler inputs are deterministic too. One way to achieve that is to sort the inputs.
@romanowski Let me know if something comes up related to incremental compilation.
@jvican this was on a full compile, not incremental. I don't believe that we currently sort the input files, but maybe this could be a -Y option, or the default. Specifically, these were 2 compilations from the same Git commit, but from different CI runs (so different actual directories) and on Linux (I think). The differences were noticed with the 2.11 compiler, but as I understand it, 2.12 generates methods by the same naming scheme.
Some of the differences were in classes generated by macros, and others were from anonymous blocks.
@romanowski also saw differences in the names of the methods generated for default argument access. I have not seen that, and we do have some special code for these default parameters in our plugin, so this may or may not be our issue.
Anecdotally, the numbers in the names of classes generated by macros were very high, so I think that generation is using a name generator keyed on the macro name, not the full scope.
> this was on a full compile, not incremental. I don't believe that we currently sort the input files, but maybe this could be a -Y option, or the default. Specifically, these were 2 compilations from the same Git commit, but from different CI runs (so different actual directories) and on Linux (I think).
Yes, that was my impression from reading @romanowski's previous comment. Let me know if anything changes with incremental compilation; I would be interested in having a look.
I don't think that sorting should go into the compiler. It's something the build tools should be doing, at least for now. In my opinion, we should encourage such a change in sbt and other "popular" build tools like Gradle / Maven.
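As a sketch of what that could look like on the build-tool side (this uses sbt's standard `sources` key; whether sorting alone is sufficient is an open question):

```scala
// Sort the source files handed to scalac so the input order doesn't depend
// on filesystem iteration order.
sources in Compile := (sources in Compile).value.sortBy(_.getAbsolutePath)
```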
> Anecdotally, the numbers in the names of classes generated by macros were very high, so I think that generation is using a name generator keyed on the macro name, not the full scope.
I think this is how it's implemented in paradise, but I don't remember well; it's been a while since I read the sources. However, if this is happening, it means that the typechecker does not have deterministic traversals. Otherwise it wouldn't matter where the synthetic counter comes from.
@jvican for the macros - my point was that since we are getting numbers related to the use of the macros, they cannot be deterministic for incremental compilation WRT full compilation, so this will need to be addressed separately. We want to verify eventually that incremental compilation generates the same output as a full compile, but currently 2 full compiles are not generating the same output, so we want to solve that first.
We are seeing numbers for the macro generation up to 100K, and these are macros that will only run once at a given location.
WRT ordering and external tools, I do think that it is part of the compiler's role to generate the same output independent of the input ordering. We may have different environments that run compilation (CI, command line, IDE(s)), and they should generate the same output without trying to follow some artificial and arbitrary rules duplicated in N build environments. IMO the compiler is compiling a set of files in a project, and for incremental compilation a subset of these. Let's see what the actual cause of the indeterminacy is first, though, as the file ordering is just one potential cause.
As I read the compiler, I don't think that the files to be compiled are compiled in the order in which they are specified. Global.Run.compileUnits adds all of the files to unitbuf, a mutable.HashSet[String], and then the units are traversed in the order given by def units (below), which has a warning dating back to 2011.
```scala
/* An iterator returning all the units being compiled in this run */
/* !!! Note: changing this to unitbuf.toList.iterator breaks a bunch
   of tests in tests/res. This is bad, it means the resident compiler
   relies on an iterator of a mutable data structure reflecting changes
   made to the underlying structure.
 */
def units: Iterator[CompilationUnit] = unitbuf.iterator
```
Not sure if the warnings are related, but making this a SortedSet would make the order deterministic. @paulp do you recall what the issue was? It seems from Git that you wrote the comment.
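For concreteness, a hypothetical sketch of that kind of change, against the structure described above (note the comment in the source warns that materializing the iterator breaks resident-compiler tests, so that issue would need to be resolved first):

```scala
// Hypothetical: iterate units in source-path order instead of HashSet order.
def units: Iterator[CompilationUnit] =
  unitbuf.toList.sortBy(_.source.path).iterator
```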
I wrote the comment, but I don't know what I can add. The issue was that somewhere in the resident compiler an iterator was created but not immediately used, and tests depended on the iteration order reflecting the state of unitbuf at the time of iteration instead of at the time of creation. Whether it still is so, I could not guess.