gopherjs icon indicating copy to clipboard operation
gopherjs copied to clipboard

compiler: use hash calculation for determining archive staleness

Open myitcv opened this issue 6 years ago • 13 comments

Work towards #804 (will require a follow up to zero modification times)

Fixes #440 and likely other stale cache related issues.

myitcv avatar Apr 23 '18 07:04 myitcv

@shurcooL @hajimehoshi this is ready for discussion and review.

myitcv avatar Apr 23 '18 08:04 myitcv

Basically what this PR does is to remove SrcModTime?

hajimehoshi avatar Apr 23 '18 15:04 hajimehoshi

Basically what this PR does is to remove SrcModTime?

Currently staleness of an archive is determined using mod time of the transitive set of input files, the gopherjs binary and the archive itself.

This PR switches from using modtimes (which can give false positives and incorrect answers) to using a hashing strategy similar to the build cache in Go 1.10 (although nowhere near as advanced - we still maintain just a single archive).

It also has the advantage of allowing us to introduce other aspects into the staleness calculation, such as build tags which I have also done and this fixes #440.

This PR also likely fixes a number of other staleness/caching issues that randomly get seen/reported.

It also paves the way for fixing #804 by allowing us to zero the modification times.

myitcv avatar Apr 23 '18 15:04 myitcv

Thanks for taking a look @hajimehoshi - addressed your feedback. PTAL.

myitcv avatar Apr 23 '18 16:04 myitcv

@hajimehoshi - I've tweaked a couple of things in my latest commit which also addresses the points you made. I've also moved from gopherjs binary hash to compiler binary hash in case something other than gopherjs is importing this build package. PTAL.

myitcv avatar Apr 24 '18 06:04 myitcv

@hajimehoshi thanks for taking a look. I've responded to your questions and removed the --bv flag for gopherjs test.

myitcv avatar Apr 24 '18 12:04 myitcv

@shurcooL, please take a look.

Sorry, but I don't have the bandwidth to properly review this change soon (maybe the coming weekend, but maybe not until the one after).

I have the following concern and question: how can we gain enough confidence that this change will not introduce regressions in the build staleness calculations? It would be quite bad if the compiler starts to use old code and not rebuild in some situations, and there are many edge cases.

Also, this PR changes code, but I would first like to see a high level description of the algorithm used, and review that. Afterwards, it would be easier to review the code and make sure it implements the algorithm.

Thanks for working on this. I agree in general that using hash of content can be better than the modify time, as it avoids unnecessary rebuilds when the content hasn't changed but modtime has, but it needs to work in all cases that the previous approach handled. I see you also took on fixing #440 which is great, but that issue needs to be re-verified whether it still exists in latest master version. I hope we can get this PR to a state where we can merge it with high confidence.

dmitshur avatar Apr 24 '18 18:04 dmitshur

Hi @shurcooL - very happy to add more detail. I pushed up the code to help with our discussion.

What follows is based on my understanding of how archive staleness is currently, i.e. in master, calculated. To state the obvious, this understanding might be wrong/incomplete.

Below, ModTime is always a reference to os.FileInfo.ModTime()

how can we gain enough confidence that this change will not introduce regressions in the build staleness calculations?

Currently, an archive for a package P is deemed stale if its ModTime (i.e. the ModTime of the archive on disk) is before either of:

  • ModTime of the gopherjs binary; more specifically, the running program that has imported github.com/gopherjs/gopher/build
  • The latest (i.e. most recent) ModTime of the transitive dependencies of P, including standard library packages
  • The latest (i.e. most recent) ModTime of the Go and JS files of P

Test programs and main packages are not archived.

Here ModTime is acting as a proxy for the contents of the files in question; whether that be the gopherjs binary, or the archives of the dependencies of P. It is therefore possible, and does happen, that the ModTime falls out of "sync" with the contents; i.e. the contents change without the ModTime changing or vice versa. It is therefore an imperfect proxy.

Far better is to follow an approach that uses the actual contents of those files to determine staleness. A hash-based approach does exactly this. The approach I've adopted here is similar to the build cache introduced in Go 1.10, but differs in one important respect: with the go tool, the computed hash effectively acts as a pointer to an area of the cache. With the approach adopted here, the computed hash is simply used to determine whether the archive we have on disk is stale or not. The approach taken by the go tool is far more advanced and robust, but requires a cache eviction policy (else your disk runs out of space); hence as a first cut I have gone for the "hash as a means of determining staleness" approach. A more complex approach could be adopted but doesn't seem particularly worthwhile at this stage.

I would first like to see a high level description of the algorithm used, and review that

The algorithm is therefore quite simple: we take all inputs that go into determining/building a package archive, hash them in a well defined order, compare that hash with the hash in the current archive. If the hashes are different then the archive on disk is stale, at which point we recompile it and rewrite it to disk.

This approach allows us to include arbitrary data in the input hash, including build tags.

So the inputs to the hash calculation are now:

We use crypto/sha256 just as the go tool does for all our hash calculations.

... but it needs to work in all cases that the previous approach handled

Because ModTime is an imperfect proxy for the file contents, there are many cases that don't get handled with the ModTime approach. And it's hard to know when we are going to be bitten by those edge cases. Using the hash-based approach of file contents we side step this entirely and solve all cases.

The one case that is neither handled by the hash-based nor the current ModTime approaches is that of race conditions on the disk on archive. That is, if two gopherjs processes are trying to determine staleness/write an archive at the same time; they effectively race. But given this is not currently handled we can simply note the exception and fix in a future PR if required.

I see you also took on fixing #440 which is great, but that issue needs to be re-verified whether it still exists in latest master version

Based on my understand this will still be broken on master because build tags are not currently part of the calculation for staleness, hence the bug. With this PR they are, hence staleness is correctly calculated.

I hope we can get this PR to a state where we can merge it with high confidence.

So to my mind we can have a high degree of confidence with this PR if we can be confident that we have the correct list of inputs; and I believe we do.

The best way to see this in action (in the absence of specific tests at this point) is to:

cd $(go list -f '{{.Dir}}' github.com/gopherjs/gopherjs) && \
git fetch origin pull/805/head && \
git checkout FETCH_HEAD && \
go install
gopherjs serve -v

Then try out something like the playground via http://localhost:8080/github.com/gopherjs/gopherjs.github.io/playground/. On the first load you should see the following output:

Cache miss for errors
Cache miss for internal/race
Cache miss for runtime
Cache miss for runtime/internal/sys
Cache miss for github.com/gopherjs/gopherjs/js
Cache miss for sync/atomic
Cache miss for sync
Cache miss for io
Cache miss for unicode
Cache miss for unicode/utf8
Cache miss for bytes
Cache miss for bufio
Cache miss for math
Cache miss for syscall
Cache miss for internal/poll
Cache miss for time
Cache miss for github.com/gopherjs/gopherjs/nosync
Cache miss for internal/testlog
Cache miss for os
Cache miss for strconv
Cache miss for reflect
Cache miss for fmt
Cache miss for sort
Cache miss for go/token
Cache miss for strings
Cache miss for path/filepath
Cache miss for go/scanner
Cache miss for go/ast
Cache miss for io/ioutil
Cache miss for go/parser
Cache miss for text/tabwriter
Cache miss for go/printer
Cache miss for go/format
Cache miss for golang.org/x/tools/go/ast/astutil
Cache miss for path
Cache miss for regexp/syntax
Cache miss for regexp
Cache miss for github.com/gopherjs/gopherjs.github.io/playground/internal/imports
Cache miss for encoding/binary
Cache miss for encoding
Cache miss for math/bits
Cache miss for encoding/gob
Cache miss for encoding/base64
Cache miss for unicode/utf16
Cache miss for encoding/json
Cache miss for container/heap
Cache miss for math/rand
Cache miss for math/big
Cache miss for go/constant
Cache miss for go/types
Cache miss for github.com/gopherjs/gopherjs/compiler/astutil
Cache miss for github.com/gopherjs/gopherjs/compiler/typesutil
Cache miss for github.com/gopherjs/gopherjs/compiler/analysis
Cache miss for github.com/gopherjs/gopherjs/compiler/filter
Cache miss for github.com/gopherjs/gopherjs/compiler/prelude
Cache miss for github.com/gopherjs/gopherjs/compiler/vendor/github.com/neelance/astrewrite
Cache miss for net/url
Cache miss for text/template/parse
Cache miss for text/template
Cache miss for go/doc
Cache miss for log
Cache miss for go/build
Cache miss for text/scanner
Cache miss for golang.org/x/tools/go/types/typeutil
Cache miss for github.com/gopherjs/gopherjs/compiler
Cache miss for github.com/neelance/go-angularjs
Cache miss for honnef.co/go/js/dom
Cache miss for honnef.co/go/js/util
Cache miss for honnef.co/go/js/xhr
Cache miss for github.com/gopherjs/gopherjs.github.io/playground

If you empty cache and hard reload the page you should see:

Cache miss for github.com/gopherjs/gopherjs.github.io/playground

And that is because main packages and test programs are not cached.

Indeed if you kill and restart gopherjs serve -v and force reload the page you will see the same output; because none of the inputs have changed with respect to the hash calculation.

If you then kill gopherjs serve -v and instead gopherjs serve --tags blah -v, force reload the page you will see all the cache misses again.

Would very much welcome feedback/questions on the above and this PR.

Thanks.

myitcv avatar Apr 28 '18 10:04 myitcv

I've also just added a basic test to show this in action and working.

myitcv avatar Apr 28 '18 12:04 myitcv

Got hit by another staleness bug again this afternoon. Still unclear how the situation with mod times getting out of sync comes about, but the hash-based approach has been 100% good since I started using it.

myitcv avatar May 05 '18 15:05 myitcv

Fixes #827 .

antong avatar May 29 '18 20:05 antong

I've been using this for a while now without problems. Would be really cool to have this merged as the current mod staleness calculation can't be trusted (#827).

antong avatar Sep 03 '18 17:09 antong

The codebase have moved on too far to make this pull request easily mergeable, but I think the approach it takes is a good one.

Recently merged https://github.com/gopherjs/gopherjs/pull/1105 set up some build caching infrastructure that could even allow us to use hash as a cache key (something https://github.com/gopherjs/gopherjs/pull/805#issuecomment-385162781 mentions as a possible improvement). This might also be the solution for the excessive cache busting https://github.com/gopherjs/gopherjs/pull/1110 pointed out.

However, there is a tricky issue that neither the current time-based invalidation nor hash-based invalidation in this PR handles: changes to standard library overlays that parseAndAugment function applies. It can both change the set of imported packages (by adding github.com/gopherjs/gopherjs/js or removing some other packages) and the content of the packages. Thus, computed hashes would be mostly correct, but not quite.

Arguably, this is relatively unlikely to affect normal gopherjs users who are not in a habit of patching standard library sources or editing gopherjs overlays. But this does expose a bigger flaw in the GopherJS's build system, and we need to have an idea for how it could be solved with or without this change.

nevkontakte avatar Apr 17 '22 18:04 nevkontakte