Allow repositories to depend on the contents of files/directories on the local file system
Description of the feature request:
An important use case for Starlark repositories is configuring the build for the local machine.
This sometimes requires reading files, calling `readdir()`, i.e. interacting with the local file system. In some cases, it's desirable that the repository is not re-fetched when the local state changes (because re-fetching is expensive), but in other cases it is, and we currently don't provide a good mechanism for that:
- `rctx.read()` and friends can only declare dependencies on files in other repositories
- `local_repository()` and `new_local_repository()` are only invalidated when the top-level directory of the repository changes, and they can't pull in individual files (this matters for large directories, e.g. `/etc`).

The rationale behind the first limitation is that one would `glob()` in the repository and declare dependencies on the files below the first level that way, but that doesn't work for other repository rules.
The only workaround at the moment is to declare a dependency on an environment variable that always changes, which has other undesirable effects (see #20951).
There is a separate, ancient feature request to always re-run a repository rule (#3041) but I think that's dangerous because it results in sloppiness with declaring dependencies.
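For illustration, a rough sketch of that workaround (the rule, the file path, and the `CACHE_BUST_DATE` variable are invented for the example; the variable only has the desired effect if something like a `tools/bazel` wrapper script sets it to a fresh value on every invocation):

```starlark
# A hypothetical repository rule that reads a file on the local machine.
def _local_probe_impl(rctx):
    # Not watched: a change to this file alone does not re-trigger the rule.
    config = rctx.read("/etc/some-config")
    rctx.file("config", config)
    rctx.file("BUILD.bazel", "exports_files(['config'])")

local_probe = repository_rule(
    implementation = _local_probe_impl,
    # The workaround: declare a dependency on an environment variable that a
    # wrapper script sets to a fresh value (e.g. the current date) on every
    # invocation, so the repository is re-fetched on every build.
    environ = ["CACHE_BUST_DATE"],
)
```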
Which category does this issue belong to?
No response
What underlying problem are you trying to solve with this feature?
No response
Which operating system are you running Bazel on?
No response
What is the output of `bazel info release`?
No response
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD`?
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response
This is a blocker for https://github.com/bazelbuild/bazel/issues/18285
Linking the design doc here again: https://docs.google.com/document/d/17RZKMuMjAIgNfFdrsLhqNsMTQDSS-BBA-S-fCSBPV7M/edit# I'll start impl work now.
@ismell brought up an interesting point today: if your level of abstraction is system calls (`readdir()` and `stat()`), you need to issue a Starlark function call for each file / directory you want to watch, which in his case is about 10K, all on one thread (because you can't do multithreading in Starlark, at least not without jumping through an excessive number of hoops, e.g. by sharding it over multiple repositories).
If your abstraction was `glob()` (i.e. something that's semantically "recursive"), Bazel could in theory parallelize "behind the scenes" to its heart's content.
Something I realized while knee-deep in implementation is that I might have misunderstood what @ismell's FR was exactly. The original design includes an `rctx.watch_dir()`-ish API that simply watches the directory listing (i.e. filenames under a directory). This was already useful; for example, it allows us to implement `new_local_repository` in Starlark.
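As a rough sketch of what that could look like (the `watch_dir()` call and `readdir()` on path objects are the proposed API being discussed here, not something that exists yet; the attribute names are made up for the example):

```starlark
def _new_local_repository_impl(rctx):
    root = rctx.path(rctx.attr.path)
    # Proposed API: depend only on the directory listing, i.e. re-fetch when
    # entries appear or disappear, but not when file contents change.
    rctx.watch_dir(root)
    for entry in root.readdir():
        rctx.symlink(entry, entry.basename)
    rctx.file("BUILD.bazel", rctx.read(rctx.attr.build_file))

new_local_repository = repository_rule(
    implementation = _new_local_repository_impl,
    attrs = {
        "path": attr.string(),
        "build_file": attr.label(allow_single_file = True),
    },
)
```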
I thought we'd generalize that to `rctx.watch_dir(recursive=True)` and thereby watch all entries transitively under the directory. But what @ismell really wants is more than just the directory listings; it's also any file content changes. And potentially to exclude some paths (I guess that's a P2). This seems much harder to do performantly; I also don't know what digest value I should store in the marker file (if it's just a checksum of everything, then we'll need to always finish reading all 10K files before we can determine whether anything has changed; on the flip side, I could store every path and its checksum, but then the marker file gets blown up in size).
@lberki @ismell any more insight here? As it is, I think I'll send a PR for the non-recursive `watch_dir` API first. The recursive one seems to need some more design work... (which means it's at risk for 7.1.0)
> really wants is more than just the directory listings; it's also any file content changes.
I think we can simplify that to: if `mtime` or `ctime` were changed, then consider it a cache bust. Not sure we really need to hash all the files all the time.
Ah, then it's a classical case of us all three thinking that this feature request must obviously mean something, except that we didn't agree upon what "something" meant :)
I assumed that `watch_dir()` would mean "directory and the files in it recursively", i.e. (conceptually) `find . -type f | sort | xargs sha256sum | sha256sum`.
Taking the aforementioned 4K directories + 10K files as the example and assuming that you'd want to store:
- a path (say, 100 bytes on average)
- a SHA-256 checksum (64 bytes as hex digits)
- an mtime, a ctime, and an inode number as 64-bit integers, also encoded as hex digits (3*16 bytes)

It comes out to 14 * 1000 * (100+64+3*16) = ~3 megabytes, which is relatively large for a marker file, but as a file, it's perfectly reasonable, and this is already an example that pushes the limits of Bazel in other ways.
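Spelling out that arithmetic (a throwaway snippet, nothing Bazel-specific):

```starlark
# 4K directories + 10K files, one record each; sizes in bytes as estimated above.
entries = 14 * 1000
record = 100 + 64 + 3 * 16  # path + SHA-256 as hex + mtime/ctime/inode as hex
print(entries * record)  # 2968000, i.e. roughly 3 megabytes
```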
If it proves to be too much, I can imagine storing some sort of Merkle tree hash, but why be complicated if simple does it?
I don't understand your "finish reading all files before determining that anything has changed" comment. In order to verify a cache hit, you have to verify everything anyway, don't you? It's true that using anything other than the primitive "list of individual files" approach would require more work before determining that it's a cache miss, though; maybe that's what you mean?
Yeah, I also think we'll have to have one record for each file under the directory, and with https://github.com/bazelbuild/bazel/issues/21044, hopefully the performance won't be that bad?
@ismell If the performance does turn out to be an issue, do you think ChromeOS can work around this by only watching a particular directory, to reduce the number of watched files?
FWIW, this is our current digest generation code.
It essentially boils down to:
```shell
$ find -L . -iname .git -prune -o ! -type d -exec sha256sum '{}' + | sort -k 2 | sha256sum
```
We have a todo to switch over to just running `stat` instead of `sha256sum` on each file:
```shell
$ find -L . -iname .git -prune -o -print0 | sort -z | xargs -0 stat | sha256sum
```
So, for our use case, I don't think you necessarily need to store all the individual files in the MARKER file, but just an easily computed hash.
> @ismell If the performance does turn out to be an issue, do you think ChromeOS can work around this by only watching a particular directory, to reduce the number of watched files?
Unfortunately we need to watch all the directories. A file addition, deletion, or modification to any of the portage overlay directories needs to retrigger our repository rule. The workaround we implemented (which this issue is trying to remove) is to have a global `CACHE_BUST_DATE` env variable that our `tools/bazel` script sets to `date`. This forces the `@portage_digest` code to always run. We use the generated `digest` file as an input into the `@portage` repository so it gets re-generated when any of the "watched" files change.
> And potentially to exclude some paths (I guess that's a P2)
We might be able to work around that. We know where the `.git` directories are located, so we could manually traverse the first layer:

```starlark
for overlay in ["src/private-overlays", ...]:
    for d in rctx.readdir(overlay):
        if d.name == ".git":
            continue
        if d.is_dir:
            rctx.watch_dir(d, recursive=True)
        else:
            rctx.watch(d)
```
> I don't understand your "finish reading all files before determining that anything has changed" comment. In order to verify a cache hit, you have to verify everything anyway, don't you? It's true that using anything other than the primitive "list of individual files" approach would require more work before determining that it's a cache miss, though; maybe that's what you mean?
Indeed, I was thinking about the cache miss case. But you're right that cache hits will always need digesting everything anyway, so I imagine just storing one hash over everything should be fine.
> I assumed that `watch_dir()` would mean "directory and the files in it recursively", i.e. (conceptually) `find . -type f | sort | xargs sha256sum | sha256sum`.
Yeah, so I'm now leaning towards introducing a distinct API for this, instead of tacking it on to `rctx.watch_dir(recursive=True)`. While we're there, I'm also considering renaming `rctx.watch_dir()` to something that makes it clear that it only watches the dirents, and nothing more. Or maybe I'll just tack the watching onto `path.readdir()` (maybe with the yes/no/auto thing again), which seems natural.
This recursive shasum thing you described would be a separate API, let's say `rctx.watch_tree()` or some such. That still seems good to me.
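To make the distinction concrete, a rough sketch of how the two shapes might look from a repository rule (none of these names are final; `readdir(watch = ...)` and `watch_tree()` are the ideas under discussion here, not shipped API):

```starlark
def _impl(rctx):
    overlay = rctx.path("src/private-overlays")

    # Listing-only watching: re-fetch when names under the directory change,
    # but not when file contents change (the "dirents" semantics).
    entries = overlay.readdir(watch = "yes")

    # Tree watching: re-fetch on any change to names or contents anywhere
    # below the directory -- conceptually the recursive sha256sum pipeline above.
    rctx.watch_tree(overlay)
```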
> We have a todo to switch over to just running `stat` instead of `sha256sum` on each file:
At this point, my concern is starting to become "this is looking like a very ChromeOS-specific API". I'm a bit worried about overfitting. Do we think "watching the `stat()` of all transitive children under a tree" is a generally useful API that will get a second user? I'm a bit hesitant to create something that only ChromeOS will ever use (but not staunchly against it, full disclosure).
Is running `stat` instead of `sha256sum` a goal, or is it just a performance optimization with the tradeoff that sometimes incorrectness is accepted?
If @ismell's desire was to not re-fetch the repo if the contents of a file changed but its mtime didn't, that would indeed be highly specific to Chrome OS and not very useful anywhere else, but it'd be quite a strange thing to want.
My guess is that it's just an optimization (false cache misses are acceptable). Which is honestly generally true for repo rules -- we never said "you're guaranteed to not be refetched unless X happens"; the user could always run `bazel clean --expunge`, which results in the ultimate cache miss, if you will. IOW, we could treat all repos as `local=True` and no contract will have been violated (although lots of people will be very unhappy).
So that likely means we can still say "`rctx.watch_tree()` checksums everything" but secretly only checksum the mtimes. That sounds good to me. @lberki, I'd love to hear your thoughts on my previous comments about renaming `watch_dir` and splitting off `watch_tree`.
(caveat: my last comment assumes "file content changes imply mtime changes" which I realize isn't technically true. But let's assume that's true for that specific argument...)
> Is running `stat` instead of `sha256sum` a goal, or is it just a performance optimization with the tradeoff that sometimes incorrectness is accepted?
It was just a performance optimization, at the cost of re-running the repository rule if a file is `touch`ed.
Wouldn't expanded archives with fixed mtimes break the optimization, similar to what used to be broken with https://github.com/bazelbuild/bazel/issues/14723?
They would (in fact, @tjgq and myself just ran into this yesterday). This is why I listed mtime, ctime and inode number in my comment above when calculating the space requirements for the lockfile.
> Do we think "watching the `stat()` of all transitive children under a tree" is a generally useful API that will get a second user?
I think maybe we can use it to watch the cc toolchain directory, so that `local_config_cc` is automatically fetched when the system compiler is updated?
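For what it's worth, from the `local_config_cc` side that would roughly look like the sketch below; the paths are examples only and `watch()`/`watch_tree()` are the proposed API, not existing behavior:

```starlark
def _local_config_cc_impl(rctx):
    # Sketch: re-detect the toolchain whenever anything under the compiler
    # installation changes. The paths are examples, not what Bazel looks at.
    rctx.watch("/usr/bin/cc")
    rctx.watch_tree("/usr/lib/gcc")
    # ... then run the usual autodetection and write out the BUILD/.bzl files.
```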
For the cc toolchain use case, isn't it the contents that we are ultimately interested in? (it looks like @ismell says that's the case for them, too, so I think the point is moot)
A fix for this issue has been included in Bazel 7.1.0 RC1. Please test out the release candidate and report any issues as soon as possible. Thanks!