
Allow repository rules to depend on the contents of files/directories on the local file system

Open lberki opened this issue 1 year ago • 18 comments

Description of the feature request:

An important use case for Starlark repositories is configuring the build for the local machine.

This sometimes requires reading files, calling readdir(), i.e. interacting with the local file system. In some cases, it's desirable that the repository is not re-fetched when the local state changes (e.g. when re-fetching is expensive), but in other cases it is desirable, and we currently don't provide a good mechanism for it:

  • rctx.read() and friends can only declare dependencies on files in other repositories (see the sketch after this list)
  • local_repository() and new_local_repository() are only invalidated when the top-level directory of the repository changes, and they can't pull in individual files (this matters for large directories, e.g. /etc). The rationale behind the first limitation is that one would glob() in the repository and thereby declare dependencies on the files below the first level, but that doesn't work for other repository rules.
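
A minimal sketch of the problem (file path and rule name hypothetical): the repository rule below reads a local file by absolute path, so Bazel records no dependency on it and later edits never trigger a re-fetch.

def _local_cfg_impl(rctx):
    # Reading by absolute path registers no dependency, so editing
    # /etc/myproject.conf later does not re-trigger this rule.
    conf = rctx.read("/etc/myproject.conf")
    rctx.file("BUILD.bazel", 'exports_files(["config"])')
    rctx.file("config", conf)

local_cfg = repository_rule(implementation = _local_cfg_impl)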

The only workaround at the moment is to declare a dependency on an environment variable that always changes, which has other undesirable effects (see #20951).

There is a separate, ancient feature request to always re-run a repository rule (#3041) but I think that's dangerous because it results in sloppiness with declaring dependencies.


lberki avatar Jan 19 '24 09:01 lberki

This is a blocker for https://github.com/bazelbuild/bazel/issues/18285

meteorcloudy avatar Jan 24 '24 13:01 meteorcloudy

Linking the design doc here again: https://docs.google.com/document/d/17RZKMuMjAIgNfFdrsLhqNsMTQDSS-BBA-S-fCSBPV7M/edit# I'll start impl work now.

Wyverald avatar Jan 30 '24 20:01 Wyverald

@ismell brought up an interesting point today: if your level of abstraction is system calls (readdir() and stat()), you need to issue a Starlark function call for each file/directory you want to watch, which in his case is about 10K, all on one thread (because you can't do multithreading in Starlark, at least not without jumping through an excessive number of hoops, e.g. sharding the work over multiple repositories).

If your abstraction was glob() (i.e. something that's semantically "recursive"), Bazel could in theory parallelize "behind the scenes" to its heart's content.

lberki avatar Feb 01 '24 20:02 lberki

something I realized while knee-deep in implementation is that I might have misunderstood what @ismell's FR was exactly. The original design includes an rctx.watch_dir()-ish API that simply watches the directory listing (i.e. filenames under a directory). This was already useful; for example, it allows us to implement new_local_repository in Starlark.

I thought we'd generalize that to rctx.watch_dir(recursive=True) and thereby watch all entries transitively under the directory. But what @ismell really wants is more than just the directory listings; it's also any file content changes. And potentially to exclude some paths (I guess that's a P2). This seems much harder to do performantly; I also don't know what digest value I should store in the marker file (if it's just a checksum of everything, then we'll need to always finish reading all 10K files before we can determine whether anything has changed. On the flip side, I could store every path and its checksum, but then the marker file gets blown up in size).

@lberki @ismell any more insight here? As it is, I think I'll send a PR for the non-recursive watch_dir API first. The recursive one seems to need some more design work... (which means it's at risk for 7.1.0)
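
For illustration, a rough sketch of what the non-recursive API enables (using the watch_dir name from the design doc; attribute plumbing simplified, and naively assuming the directory contains no BUILD file of its own):

def _new_local_repo_impl(rctx):
    p = rctx.path(rctx.attr.path)
    # Proposed API: re-fetch only when the set of names directly under
    # the directory changes.
    rctx.watch_dir(p)
    # Mirror the top-level entries and add a BUILD file.
    for child in p.readdir():
        rctx.symlink(child, child.basename)
    rctx.file("BUILD.bazel", rctx.read(rctx.attr.build_file))

new_local_repo = repository_rule(
    implementation = _new_local_repo_impl,
    attrs = {
        "path": attr.string(),
        "build_file": attr.label(),
    },
)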

Wyverald avatar Feb 14 '24 01:02 Wyverald

really wants is more than just the directory listings; it's also any file content changes.

I think we can simplify that: if the mtime or ctime changed, consider it a cache bust. Not sure we really need to hash all the files all the time.

ismell avatar Feb 14 '24 03:02 ismell

Ah, then it's a classical case of us all three thinking that this feature request must obviously mean something, except that we didn't agree upon what "something" meant :)

I assumed that watch_dir() would mean "directory and the files in it recursively", i.e. (conceptually) find . -type f | sort | xargs sha256sum | sha256sum.

Taking the aforementioned 4K directories + 10K files as an example, and assuming that you'd want to store:

  • a path (say, 100 bytes on average)
  • a SHA-256 checksum (64 bytes as hex digits)
  • an mtime, a ctime, and an inode number as 64-bit integers, also encoded as hex digits (3*16 bytes)

That comes out to 14 * 1000 * (100 + 64 + 3*16) bytes = ~3 megabytes, which is relatively large for a marker file, but perfectly reasonable as a file, and this example already pushes the limits of Bazel in other ways.

If it proves to be too much, I can imagine storing some sort of Merkle tree hash, but why be complicated if simple does it?

I don't understand your "finish reading all files before determining that anything has changed" comment. In order to verify a cache hit, you have to verify everything anyway, don't you? It's true that using anything other than the primitive "list of individual files" approach would require more work before determining that it's a cache miss, though; maybe that's what you mean?

lberki avatar Feb 14 '24 09:02 lberki

Yeah, I also think we'll have to have one record for each file under the directory, and with https://github.com/bazelbuild/bazel/issues/21044, hopefully the performance won't be that bad?

meteorcloudy avatar Feb 14 '24 14:02 meteorcloudy

@ismell If the performance does turn out to be an issue, do you think ChromeOS can work around this by only watching a particular directory, to reduce the number of watched files?

meteorcloudy avatar Feb 14 '24 14:02 meteorcloudy

FWIW, this is our current digest generation code.

It essentially boils down to:

$ find -L . -iname .git -prune -o ! -type d -exec sha256sum '{}' + | sort -k 2 | sha256sum

We have a todo to switch over to just running stat instead of sha256sum on each file:

$ find -L . -iname .git -prune -o -print0 | sort -z | xargs -0 stat | sha256sum 

So, for our use case, I don't think you necessarily need to store all the individual files in the MARKER file, but just an easily computed hash.

@ismell If the performance does turn out to be an issue, do you think ChromeOS can work around this by only watching a particular directory, to reduce the number of watched files?

Unfortunately, we need to watch all the directories. A file addition, deletion, or modification in any of the portage overlay directories needs to re-trigger our repository rule. The workaround we implemented (which this issue is trying to remove) is to have a global CACHE_BUST_DATE env variable that our tools/bazel script sets to the current date. This forces the @portage_digest code to always run. We use the generated digest file as an input into the @portage repository, so it gets re-generated when any of the "watched" files change.
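
Concretely, the cache-busting side of that workaround boils down to something like this sketch (names approximate; the dependency on the variable is presumably declared via repository_rule's environ parameter):

def _portage_digest_impl(rctx):
    # ...compute the digest of the overlay trees here (elided)...
    pass

portage_digest = repository_rule(
    implementation = _portage_digest_impl,
    # Bazel re-runs the rule whenever a declared variable's value changes;
    # since tools/bazel sets CACHE_BUST_DATE to the current date on every
    # invocation, the rule effectively always re-runs.
    environ = ["CACHE_BUST_DATE"],
)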

And potentially to exclude some paths (I guess that's a P2)

We might be able to work around that. We know where the .git directories are located, so we could manually traverse the first layer:

for overlay in ["src/private-overlays", ...]:
    for d in rctx.path(overlay).readdir():
        if d.basename == ".git":
            # We know where the .git directories live, so skip them here.
            continue
        if d.is_dir:
            # Proposed recursive API from this thread.
            rctx.watch_dir(d, recursive = True)
        else:
            rctx.watch(d)

ismell avatar Feb 14 '24 17:02 ismell

I don't understand your "finish reading all files before determining that anything has changed" comment. In order to verify a cache hit, you have to verify everything anyway, don't you? It's true that using anything other than the primitive "list of individual files" approach would require more work before determining that it's a cache miss, though; maybe that's what you mean?

Indeed, I was thinking about the cache miss case. But you're right that verifying a cache hit will always require digesting everything anyway, so I imagine just storing a single hash of everything should be fine.

I assumed that watch_dir() would mean "directory and the files in it recursively", i.e. (conceptually) find . -type f | sort | xargs sha256sum | sha256sum.

Yeah, so I'm now leaning towards introducing a distinct API for this, instead of tacking it on to rctx.watch_dir(recursive=True). While we're there, I'm also considering renaming rctx.watch_dir() to something that makes it clear that it only watches the dirents, and nothing more. Or maybe I'll just tack the watching onto path.readdir() (maybe with the yes/no/auto thing again), which seems natural.

This recursive shasum thing you described would be a separate API, let's say rctx.watch_tree() or some such. That still seems good to me.
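
To make the split concrete, here's how I imagine the pieces composing (paths hypothetical; all of these APIs are still proposals at this point):

def _impl(rctx):
    # Watch a single file's contents.
    rctx.watch("/etc/myproject.conf")
    # Watch only the dirent names of one directory, by tacking the
    # watching onto readdir (with the yes/no/auto parameter).
    children = rctx.path("/etc/myproject.d").readdir(watch = "yes")
    # Watch dirents and file contents transitively under a directory.
    rctx.watch_tree("/mnt/overlays")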

We have a todo to switch over to just running stat instead of sha256sum on each file:

At this point, my concern is starting to become "this is looking like a very ChromeOS-specific API". I'm a bit worried about overfitting. Do we think "watching the stat() of all transitive children under a tree" is a generally useful API for which we'll get a second user? I'm a bit hesitant to create something that only ChromeOS will ever use (but not staunchly against it, full disclosure).

Wyverald avatar Feb 14 '24 18:02 Wyverald

Is running stat instead of sha256sum a goal or is it just a performance optimization with the tradeoff that sometimes incorrectness is accepted?

If @ismell's desire was to not re-fetch the repo when the contents of a file changed but its mtime didn't, that would indeed be highly specific to ChromeOS and not very useful anywhere else; more to the point, it'd be quite a strange thing to want.

lberki avatar Feb 14 '24 18:02 lberki

My guess is that it's just an optimization (false cache misses are acceptable), which is honestly generally true for repo rules -- we never said "you're guaranteed not to be re-fetched unless X happens"; the user could always bazel clean --expunge, which results in the ultimate cache miss, if you will. IOW, we could treat all repos as local=True and no contract would have been violated (although lots of people would be very unhappy).

So that likely means we can still say "rctx.watch_tree() checksums everything" but secretly only checksum the mtimes. That sounds good to me. @lberki, I'd love to hear your thoughts on my previous comments about renaming watch_dir and splitting off watch_tree.

Wyverald avatar Feb 14 '24 19:02 Wyverald

(caveat: my last comment assumes "file content changes imply mtime changes" which I realize isn't technically true. But let's assume that's true for that specific argument...)

Wyverald avatar Feb 14 '24 19:02 Wyverald

Is running stat instead of sha256sum a goal or is it just a performance optimization with the tradeoff that sometimes incorrectness is accepted?

It was just a performance optimization at the cost of re-running the repository rule if a file is touched.

ismell avatar Feb 14 '24 22:02 ismell

Wouldn't expanded archives with fixed mtimes break the optimization, similar to what used to be broken with https://github.com/bazelbuild/bazel/issues/14723?

brentleyjones avatar Feb 15 '24 04:02 brentleyjones

They would (in fact, @tjgq and myself just ran into this yesterday). This is why I listed mtime, ctime and inode number in my comment above when calculating the space requirements for the lockfile.

lberki avatar Feb 15 '24 08:02 lberki

Do we think "watching the stat() of all transitive children under a tree" is a generally useful API for which we'll get a second user?

I think maybe we can use it to watch the cc toolchain directory, so that local_config_cc is automatically fetched when the system compiler is updated?
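
As a rough sketch (path hypothetical, using the proposed rctx.watch_tree()):

def _local_config_cc_impl(rctx):
    # Re-run compiler autodetection whenever anything under the system
    # compiler's directory changes.
    rctx.watch_tree("/usr/lib/gcc")
    # ...then autodetect the compiler and generate the toolchain files...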

meteorcloudy avatar Feb 15 '24 10:02 meteorcloudy

For the cc toolchain use case, isn't what we are ultimately interested in the contents? (It looks like @ismell says that's the case for them, too, so I think the point is moot.)

lberki avatar Feb 15 '24 11:02 lberki

A fix for this issue has been included in Bazel 7.1.0 RC1. Please test out the release candidate and report any issues as soon as possible. Thanks!

iancha1992 avatar Feb 22 '24 21:02 iancha1992