ChezScheme icon indicating copy to clipboard operation
ChezScheme copied to clipboard

reproducible builds

Open attila-lendvai opened this issue 2 years ago • 7 comments

compilation doesn't seem to produce the same output when invoked multiple times on the same source file.

even the size of the result is changing:

(parameterize ([optimize-level 3] [fasl-compressed #f]) (compile-program "idris2.ss"))
27224054 Oct  8 12:39 idris2.so
27224026 Oct  8 12:36 idris2.so.1
27224054 Oct  8 12:37 idris2.so.2
27224054 Oct  8 12:38 idris2.so.3

for matching sizes the difference is not enormous:

$ cmp --verbose idris2.so idris2.so.2 | wc -l
700

for someone who knows the codebase it's probably something simple. typically unsorted iteration of a hashtable, or some other unordered collection.

or if it's not that simple, then this effort can even reveal bugs where something from the environment inadvertently leaks into the output.

potential candidates

  • multiple hashtable-entries calls in fasl.ss (unsorted iteration).

attila-lendvai avatar Oct 07 '21 23:10 attila-lendvai

Chez Scheme compiles are reproducible modulo the globally unique names of generated symbols (gensyms), assuming macros don't do things like introduce time stamps or random expressions into their output. Because of differences in gensym names, the files produced by multiple runs are not byte-by-byte identical but should compare equal via the system primitive $fasl-file-equal? See, for example, the use of $fasl-file-equal? in the s/Mf-base checkboot target. Making the files byte-by-byte equivalent could be done by compiling to a temporary file and overwriting the original object file (if any) only if $fasl-file-equal? returns #f. This could be useful to prevent unnecessary downstream recompiles, but it's a bit of a challenge to get this right when multiple compiles are done in the same session. One would probably have to "visit" the original object file if the new and original compare equal to get the right gensym bindings in place for subsequent compiles.

dybvig avatar Oct 28 '21 18:10 dybvig

@attila-lendvai This may not be the only thing preventing reproducible builds, but the usage of define-record is partially to blame. Just copying the first one I saw:

(define-record thread-handle (semaphore))

This will create a generative record type call "thread-handle" One of the distinctions between records is "generative" and "nongenerative". "generative" record definitions will create a new gensym to identify the record type every time the define-record form is encountered and "non-generative" record definitions will not. The way to not create a new gensym is to provide one explicitly when defining the record type:

(define-record-type #{thread-handle jl2ccp44uektivftbyzu9oubo-1}
  (semaphore))

This will be stable across re-compilations.

gwatt avatar Oct 28 '21 18:10 gwatt

Making the files byte-by-byte equivalent could be done by compiling to a temporary file and overwriting the original object file (if any) only if $fasl-file-equal? returns #f.

the kind of reproducibility that i'm pursuing here is the kind that is needed for trust: where any developer can download the sources, inspect them, and use the same version of the compiler to produce the same binary output. a typical scenario is when e.g. a firmware update is only accepted by the device when it is signed by the key of the central authority, but this central authority also wants to be transparent... so anyone can fetch the source and the signed binary, recompile the sources locally, compare the two binaries and check the signature.

it sound like it would be enough to simply specify the initial value of the gensym counter prior to compilation, but reading @gwatt 's comment, this is probably a more involved task than i initially thought.

background: the context where i'm experimenting with this is https://github.com/idris-lang/Idris2. it's a functional language that uses ChezScheme as a code generating backend.

the even larger context is implementing a DSL using Idris so that it compiles a proof carrying code to a simpler language, e.g. to scheme. this lower level language can then be executed by a simpler/smaller binary runtime. meanwhile, anyone can check the proof that the scheme code running is equivalent to the Idris compiler's output when compiling the corresponding DSL script.

attila-lendvai avatar Oct 28 '21 20:10 attila-lendvai

If developers are willing to use $fasl-file-equal? (which should probably be surfaced in any case as a user-level primitive) rather than 'cmp' to compare binary output, that should work. In effect, $fasl-file-equal? attempts to unify the two files with respect to generated names, so it is the same as straight binary identity with each gensym in one file replaced by the same gensym in the other file.

dybvig avatar Oct 30 '21 16:10 dybvig

I have some interest in this because I've contributed to the Chez Scheme package for Guix (a user-level package manager and full system distribution), which benefits greatly from byte-for-byte reproducible build to enable its "purely functional" features.

If I understand correctly, the issue (or at least part of it) is that Chez Scheme uses UUIDs for gensyms, including in generative record types. In particular, at least since https://github.com/cisco/ChezScheme/commit/2fd3db68230d094a0d396348a8140a4d3693b120/https://github.com/cisco/ChezScheme/issues/338, these are fully random "type 4" UUIDs.

I wonder if "type 5" UUIDs (RFC 4122 §4.3), which are deterministically generated based on a name in some conceptual "namespace", could be useful for making byte-for-byte equivalent object files. I'm not sure I fully understand all the nuances of Chez Scheme's use of gensyms, but I could imagine, for example, a tool that would rewrite object files to give the gensym's predictable names (perhaps incorporating a hash of the source file or something to help prevent collisions).

LiberalArtist avatar Nov 02 '21 05:11 LiberalArtist

I've looked at this a bit more, and I think I can see a strategy for gensyms to work with reproducible builds.

As I understand this code:

https://github.com/cisco/ChezScheme/blob/8c5f5cc27cf8f2b432aafc4f2fc640435294b74a/s/5_7.ss#L87-L129

Chez only calls unique_id (the C function that calls libuuid or whatever) once per process, to initialize $session-key. Unique names for gensyms are formed by concatenating the session key with a counter.

We can take inspiration from the way pretty/suffix printing mode recognizes gensyms from the current session:

https://github.com/cisco/ChezScheme/blob/8c5f5cc27cf8f2b432aafc4f2fc640435294b74a/s/print.ss#L634-L648

Before starting to compile some compilation unit, we could use set-up code like this to recognize gensyms forced after starting that compilation unit:

(define (gensym->counter g)
  (let ((u (gensym->unique-string g)))
    (string->number
     (substring u (string-length #%$session-key) (string-length u)))))
(define canary
  (gensym "canary"))
(define canary-counter
  (gensym->counter canary))
(define (gensym-after-canary? g)
  (and (equal? #%$session-key
               (substring (gensym->unique-string g)
                          0 (string-length #%$session-key)))
       (< canary-counter (gensym->counter g))))

My thinking is that we could compile the compilation unit to a fasl file as usual. Next, we'd rewrite the fasl file (perhaps reusing strip-fasl-file) to replace the unique names from the session with a well-known session key, say "chez0strip0in0progress123-", and deterministic counter values. Then we'd hash this intermediate fasl file to derive a reproducible session key. Finally, we'd rewrite the file again to use the reproducible session key.

Maybe all of the compilation procedures should embed the "canary" gensym (and maybe also a canary for the end of compilation?) in the object files they generate, so strip-object-file would have the information available to use later.

One thing I've left a bit vague in the above is what a "compilation unit" is. Also, I mostly use Chez Scheme through Racket, so I'm less sure about nuances that might arise from other use patterns.

LiberalArtist avatar Jul 01 '23 02:07 LiberalArtist

For reproducible builds required by nix-family of OS, that is relatable to @attila-lendvai use case: adding a parameter to set the random seed of unique_id so that the binaries produced by Chez's compile-qux are byte-by-byte identical (except for timestamps embedded by macros...). The problem with that is the binary; hence the program will weaker security wise.

For online trust [0], having a procedure such as (fasl-file-hash... nonce [fasl-object]) where NONCE is a globally unique bytevector provided by the user, and FASL-OBJECT an optional argument with the current running program as default, that replaces any gensym output in FASL-OBJECT with one and only one value. fasl-file-hash will produce a hash that can be compared by the user to the hash they produced locally with the program they compiled themselves.

[0] the client wants a proof that the output is the result of the agreed upon, and peer reviewed algorithm, without necessarily having to go through the woops of making EVERYTHING public, or heat ASIC at great cost.

amirouche avatar Sep 03 '23 14:09 amirouche