bazel icon indicating copy to clipboard operation
bazel copied to clipboard

Support internal dependencies that are resolved on-disk via another tool

Open vdye opened this issue 3 years ago • 49 comments

Description of the feature request:

Add "filesystem adapter" API(s) so that when a source file isn't found on-disk during the loading phase, Bazel can fall back on a user-specified function to vivify the file (if it exists) rather than throwing an error.

This request comes from a desire to have Bazel work in a Git sparse-checkout environment without pre-specifying patterns (see below), but the feature probably shouldn't be scoped to just Git; for example, Mercurial has a similar sparse checkout feature. I'd be happy to see a feature like this be even more general purpose if there are other cases it could support!

What underlying problem are you trying to solve with this feature?

Internal dependencies in Bazel are implicitly required to be on-disk when performing Bazel operations (query, build, etc.) that involve loading the associated source files. However, in a Git sparse-checkout environment, source files are not present in the worktree (i.e., where Bazel is looking for them) unless vivified with git sparse-checkout add. Even when not on-disk, though, these source files are more consistent with internal dependencies (fixed to the same "version" as the rest of the repo, not a separate pre-built package, are checked out to source directories) than external ones.

Right now, the only way to work in a sparse-checkout with Bazel is to specify the files that a particular build will need before running bazel build. A user has to maintain two independent sources of truth for which files are needed in a build: the Bazel dependency graph, and the sparse-checkout patterns. If the not-on-disk sparse-checkout files can reasonably be interpreted as "internal dependencies" without running contrary to Bazel's underlying principles, it would be extremely helpful to users if they could resolve the files "just-in-time" while Bazel builds its internal dependency graph.

Which operating system are you running Bazel on?

No response

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

Within Bazel itself, it looks like the only other reference to sparse-checkout is a feature request to include a static pattern list in the specification of a git_repository(). That request doesn't quite match this one, since 1) that deals with an external repository, rather than the local one, and 2) it still requires pre-determined patterns, rather than resolving files just-in-time.

Outside of Bazel, third-party Bazel/sparse-checkout interoperability tools are pretty common, especially among large monorepo projects (1, 2). However, these tools still rely on a fully on-disk copy of a Git repository to generate the Bazel dependency graph, severely inhibiting the potential performance gains of git sparse-checkout. And, because these integrations use a tool separate from either git or bazel, they are potentially more difficult to adopt than an integration contained in a Bazel rule library.

Any other information, logs, or outputs that you want to share?

I know this is a pretty large feature request (with lots of details to consider: UX, performance, thread safety, etc.), so if this request doesn't seem completely infeasible, I'm happy to write & submit a more detailed design doc!

vdye avatar Oct 03 '22 20:10 vdye

To manage expectations here, it would take very good arguments for this bug to be fixed (and the associated design doc to be approved)

The current Bazel way is to put the source tree on a FUSE file system and implement laziness that way; this makes it possible for Bazel to be completely agnostic as to how the source tree is put together and this would be both a pretty big change from that approach and a pretty large chunk of functionality for us to support indefinitely that is, in light of that approach, redundant.

@meteorcloudy WDYT? I could imagine considering merging this if there was a widespread need, but for one use case, I'd be reluctant to take on the increased mental burden and support load.

lberki avatar Oct 18 '22 13:10 lberki

I agree with your judgment, but I think this is more of a decision to make for the core team.

meteorcloudy avatar Oct 18 '22 13:10 meteorcloudy

Then @haxorz WDYT? I thought @meteorcloudy is appropriate because he has all the state about external dependencies in hi head, which is (partially) relevant.

lberki avatar Oct 19 '22 06:10 lberki

Hello! I'd like to add a bit of context around why I submitted this, what I'm looking for, etc. w.r.t your comments.

For some background: while I'm new to Bazel, I contribute pretty actively on Git (especially git sparse-checkout and the associated sparse index feature) and am generally interested in Git's scalability. Sparse-checkout is great for large repo performance, but since Git isn't a dependency manager (relevant Git mailing list discussion), file inclusion patterns need to be specified manually. Using Bazel with sparse-checkout is pretty common in large monorepos - they both lend themselves to the performance needs of such repos - so using the Bazel dependency graph to construct sparse-checkout patterns is somewhat common.

I spent some time over the past year interviewing monorepo maintainers to learn what their biggest pain points were in Git, and Bazel came up in nearly every conversation. I learned that 1) everyone seems to build their own custom, internal tool to integrate Bazel and sparse-checkout with no standardized approach, and 2) even with those tools, sparse-checkout performance in those repos was substantially worse[^1] than it should be because of the need to have files on disk before running bazel build. So, my goal with this proposal is to find a way to address both of those issues.

The current Bazel way is to put the source tree on a FUSE file system and implement laziness that way; this makes it possible for Bazel to be completely agnostic as to how the source tree is put together

I did see some projects related to this approach (namely sandboxfs) but ultimately leaned away from that sort of integration. Historically, FUSE hasn't been a feasible option for interacting with Git on the scale of repos that need the performance gains of sparse-checkout due to speed, scalability, and/or ease-of-use. Microsoft rolled their own Git-aware VFS with VFS for Git with a greater focus on perf, but that had platform compatibility issues and is now in maintenance mode as a result.

this would be both a pretty big change from that approach and a pretty large chunk of functionality for us to support indefinitely that is, in light of that approach, redundant.

That's an understandable concern, and it's frankly the main reason I spent months evaluating alternative approaches before submitting this issue. But the overhead (both computational and mental) from forcing the use of a VFS layer makes it impractical as a solution in my eyes.

I could imagine considering merging this if there was a widespread need, but for one use case, I'd be reluctant to take on the increased mental burden and support load.

Multiple large projects that use Bazel have built entire engineering systems around coordinating it with Git (there were three talks at Git Merge this year about them! 1, 2, 3). While it might seem like a limited use-case, it's an increasingly common - and critical - one.

Ideally, solving this problem by extending Bazel would make it easier for projects to standardize on a common open-source approach to integrating with Git. A well-designed implementation should minimize the support load added Bazel, while also removing the distributed mental/support burden for all of those other projects' custom tools. Plus, it would lower the barrier to entry for using a performance-optimized Git + Bazel setup for the projects that don't have a dedicated engineering tooling team.

Anyway, thanks for the feedback and I hope this context helps!

[^1]: The two main causes for the performance degradation are the tracking of files on-disk and the use of a full, rather than sparse, index. For example: testing on my local machine with a repo of ~250k files, the timing of git reset --hard was: 3.45s with all 250k files on disk, 0.77s with a full index but only ~17k files on disk, and 0.34s for ~17k files + sparse index enabled. As repositories get larger, the performance impact increases; this blog post does a good job of explaining what's going on under the hood.

vdye avatar Oct 19 '22 19:10 vdye

Wow, that's an impressive amount of research!

Does the sentence "Bazel came up in nearly every conversation" imply that a majority of the folks who have huge git repositories use Bazel?

My heart is saying that we should have some sort of functionality like this in Bazel, but my head is very reluctant for the reasons already elaborated above. I was about to ask who exactly this would benefit, but you YouTube links more than adequately answer this.

What I will do now is to ask around at Google:

  1. If there are any folks who would benefit from this functionality (then it's a bit easier sell, although the same questions about possible alternatives apply)
  2. If anyone knows of any alternatives

The two alternatives that I know if is a FUSE-based approach we discussed above and a recursive bazel query like bgsc does and what the Twitter folks have.

For the latter, I am somewhat skeptical of the claim that bazel query is slow; at Google, we are using it on pretty large code bases and as long as one can keep the Bazel server alive, it's quite snappy. Do you have a way of proving or disproving these claims?

lberki avatar Oct 21 '22 12:10 lberki

[@lberki ] Wow, that's an impressive amount of research!

+1. @vdye, thank you for the incredibly thorough and thoughful FR!

@michaeledgar has a lot of prior experience with sparse checkout in both Git and Mercurial and he's coincidentally on team-Core, so I'd like him to take a look and give his technical opinion.

[@vdye ] ... A well-designed implementation should minimize the support load added Bazel...

+1 to this. Ideally we can come up with something that doesn't involve intrusive changes to Skyframe (Bazel's core execution and incrementality engine). I fear this might be challenging though. But my fear is partially based on ignorance of the details of spare checkouts, so maybe Mike can think of a concrete ~API that is both tasteful and feasible to support.

I'd also like to emphasize the inverse statement: If this FR is to be reified as something very tightly coupled to Git, I don't think team-Core is the right owner (as the lead of team-Core, I wouldn't feel comfortable maintaining it). Also [, pretending I were Bazel's Product Manager, ] I don't think we should add such tight coupling to specific VCS.

haxorz avatar Oct 21 '22 18:10 haxorz

Without taking sides, @vdye 's design doc is not coupled to git, so on that side, we're good.

I thought a bit about how this could look like in Skyframe and I think I could come up with a way to make this as unintrusive as possible to Skyframe (essentially: make PackageFunction and BzlLoadFunction fetch the files they care about, or maybe PackageLookupFunction), if git's sparse checkout API is such that it can be done efficiently.

The issues I immediately see are:

  • Symlinks in packages (their targets need to be fetched, too)
  • Fetching a package but not its subpackages (which are delineated by BUILD files, but then how does know whether a BUILD file exists without fetching it?)
  • Fetching the transitive load() dependencies of a package (this may be simple; I didn't think about it that much)

lberki avatar Oct 21 '22 18:10 lberki

cc @maxious @arxanas @dbernadett @linzhp

lberki avatar Oct 25 '22 07:10 lberki

👋 One of those monorepo maintainers that wrote their own sparse checkout on bazel chiming in.

I am vaguely aware of the google virtual file system (srcfs?) and the way you can implement extended file attributes for hashes. Unfortunately the place where we would want to use this most (developer laptops on macos) has been very shaky in terms of FUSE support/UX in recent years (eg. https://github.com/bazelbuild/sandboxfs/issues/46 ) and even cloud storage providers struggle with the lack of lazy loading with native MacOS remote file system support Recently there has emerged the approach of a local NFS server in place of FUSE which I note Buildbarn has adopted so maybe there's a path forward soon?

All that being said, you still need some service (Content Addressable Storage?) network accessible somewhere and this is easier said than done. Securing and scaling these services takes resources and we already have to consider remote caching and remote execution with bazel in our ecosystem - adding a FUSE file system backend to maintain would be another challenge at less-than-google scale. Hence why I would be supportive of this proposal - in fact we have contributed support for a similar concept to bazel-diff "Allow user to provide their own content hash to minimise IO operation" In our case git would probably make sense as the source of truth - bonus that the hashes are stored locally! Git doesn't have to be the only option by far - the interface between bazel and the filesystem adapter API can be much smaller than having to implement a whole FUSE/NFS implementation that has to be a general purpose filesystem. This would make it easier for the community and vendors to provide practical solutions

maxious avatar Oct 25 '22 23:10 maxious

The current Bazel way is to put the source tree on a FUSE file system and implement laziness that way; this makes it possible for Bazel to be completely agnostic as to how the source tree is put together

I did see some projects related to this approach (namely sandboxfs) but ultimately leaned away from that sort of integration. Historically, FUSE hasn't been a feasible option for interacting with Git on the scale of repos that need the performance gains of sparse-checkout due to speed, scalability, and/or ease-of-use. Microsoft rolled their own Git-aware VFS with VFS for Git with a greater focus on perf, but that had platform compatibility issues and is now in maintenance mode as a result.

Hi @vdye,

Big fan of your work over the Git Dev mailing list!

I think https://github.com/bazelbuild/proposals/pull/277 is quite solid and I am in favor of it personally.

However, I would like to note that FUSE usage with Bazel does provide a bit more than just "lazy file loading", which I believe to be the core thesis of git-sparse-checkout and partial-clone(filtered clone) today Perhaps this video from 2sigma folks should help provide some additional insight.

https://www.youtube.com/watch?v=rQd9Zd1ONOw

In short, Bazel relies heavily on the hash of input files in the source tree. Using these hash, Bazel builds up a Merkle tree of dependency graph and relies on it to schedule different build actions. Traditionally, to obtain the hash of files in the source tree, you would need to download the blob to ... hash the content. However, using FUSE, the server operator could provide the pre-compute hash of the file blob at check-in time (or push-received time). This is then provided to Bazel client side via getxattr() call over the FUSE mount. This allows Bazel to obtain the hash of input files without having to download the entire file, saving a lot of network traffic. See #11662 and build-in-cloud-distributing-build-steps (2011) for more information on this.

There are 2 problems trying to re-construct this using git-sparse-checkout today:

  1. In --cone mode, which is now the default mode, all files in an included directory are downloaded. Ideally, we should be able to specify git to only download the BUILD.bazel files and create some customization in Bazel to look into the current commit's tree objects to find the remaining files and their hashes. This could be done with a lot of customization, but definitely is not easy and require Bazel to know a lot more about git and sparse-checkout.
  2. Current default hash algorithm in Bazel is SHA256 with configs to move up to SHA512 in the future. Today, git is still using SHA1 as the default hash algorithm. The initial support for SHA256 does exist in git, but lack of migration/transition support for existing repositories and support from popular hosting software/vendors such as Github, Gitlab, Bitbucket etc... Having the version control system and the build system using 2 different hash algorithms make the integration between the 2 systems a lot harder than it should.

Edit: I wrote this last night and forgot to hit send. I see that @maxious has raised similar concerns in the post above mine 🤝

sluongng avatar Oct 26 '22 09:10 sluongng

I'm not (currently) a bazel user, and am coming more from the Git side (I'm one of the other Git developers working on sparse-checkout capabilities), but I'd like to throw out another idea for folks to consider:

Add "filesystem adapter" API(s) so that when a source file isn't found on-disk during the loading phase, Bazel can fall back on a user-specified function to vivify the file (if it exists) rather than throwing an error.

Idea:

Would there be an opportunity to just provide Bazel with the contents of the files it needs for computing dependencies instead of vivifying the files?

Reasons:

One thing I was disappointed in with the Git Merge monorepo talks this year was that folks were vivifying these files (or using an additional checkout) rather than allowing the build system to be provided the file contents directly. Granted, vivifying files is what I did for our first cut at using sparse checkouts as well, but we found it rather suboptimal. Providing contents directly to the build system has multiple advantages:

(1) The working directory is kept nice and tidy both for the user and for IDEs -- either of which can get confused by the extra files and directories appearing (perhaps transiently). (2) Providing dependency content directly to the build systems works much more nicely with sparse checkouts: cone mode or the sparse index often mean that vivifying one file means vivifying all files within or underneath a directory, recursively. That's wasted effort when we just want a small control file with dependency information. (3) Providing dependency info directly to the build system reduces I/O (no need to round-trip the dependency information through the filesystem, in addition to no need to write all the extra files in the last item).

I will note that you can work around at least the first issue by e.g. keeping a parallel copy of the checkout maintained; this requires a little cleverness to avoid the performance issues of a full second copy (which is doable), but much more importantly you're introducing the risk of the parallel copy being wrong due to not having local changes made by the user in their working copy (or users going and messing with that copy). Providing the appropriate file contents directly to the build system avoids this problem as well as the above.

I've implemented the contents-directly-to-build-system idea I suggested above. (To get the contents of files just-in-time, for SKIP_WORKTREE files I make use of git cat-file --batch and feed the results to the build system). It has worked great for us, but we are not using bazel so there's not much I can share beyond the high-level experience. Luckily for us, the code changes needed were fairly localized, but it's not clear whether the same would be true for bazel. But, if the API to allow folks to vivify build control files in a just-in-time manner is not much more work than allowing folks to just provide bazel with build control file contents in a just-in-time-manner, then it'd be nice to support the latter too; it makes things much nicer.

newren avatar Nov 01 '22 01:11 newren

Thanks everyone for the commentary/discussion!

@newren: I noted in the full proposal that most Bazel rules (like cc_binary()) rely on having source files on disk, and it would be a massive breaking change to force all of them to accept in-memory source files (some likely couldn't do that even if they wanted to). But what you're suggesting makes a distinction between the "build files" (BUILD, WORKSPACE, *.bzl) needed to construct the dependency graph and "source files" (*.cpp, etc.) used to build the outputs of the project. Separating the needs of those two groups presents some really interesting possibilities!

As far as I can tell, source files aren't needed on-disk until the "Execution" phase of the build, and external rules don't directly access the build files on disk. With those two assumptions in mind, only core Bazel (not individual custom rules) would need to be able to to access build file content from somewhere other than the filesystem (e.g. using git cat-file). With that capability, the complete dependency graph could be built, cached, and queried without either 1) vivifying more files than are needed, or 2) encountering a "package not found" error. The results of bazel query should be all a user needs to create the sparse patterns; a sparse-checkout could then be set up with either an external script or custom Bazel rules/macros (using genquery() to get the dependency graph, then a custom rule to apply the sparse-checkout patterns(?)).

It's probably still more appropriate to add this functionality via an extension API vs. putting git cat-file invocations into core Bazel, but overall I think I prefer it over my original approach. Outside of the infrastructure for a new API & contexts, it's much narrower in scope; these changes should be (mostly) isolated to PackageLookupFunction/PackageFunction/PathPackageLocator(???), and dealing with only build files (rather than the entire source tree) alleviates a lot of my concerns with locking in git sparse-checkout & thread safety. That said, there'd be some trickiness to how the build files are retrieved from Git (to avoid expanding a sparse index more than necessary), and the potential issues @lberki raised earlier also still apply. I need to ponder those a bit more, but my gut feeling is that they can be handled without any invasive architectural changes.

@lberki @michaeledgar (or anyone, really) - what do you think? I'd like to start working on an implementation, but don't want to get too far with it in case there's a much easier approach that hasn't come up yet.

vdye avatar Nov 01 '22 17:11 vdye

As far as I can tell, source files aren't needed on-disk until the "Execution" phase of the build, and external rules don't directly access the build files on disk.

I think there's a problem when globbing source files, since then the rules depend on the presence of source files on disk (even if not their contents)?

arxanas avatar Nov 01 '22 17:11 arxanas

But what you're suggesting makes a distinction between the "build files" (BUILD, WORKSPACE, *.bzl) needed to construct the dependency graph and "source files" (*.cpp, etc.) used to build the outputs of the project. Separating the needs of those two groups presents some really interesting possibilities!

Yes, absolutely. It only made sense to handle the "build files" that special way; the "source files" were absolutely required to be on disk for the other build system too.

One thing to note, though, is that our build system was okay with just providing {module/package}-level dependencies, which was what we needed to get the directories to specify for the sparse-checkout configuration. If we weren't using cone mode and were attempting to exclude all unnecessary files, or if the system had been hardcoded to only determine file-level dependencies (which is often determined by globbing all source files under a directory), then our changes to the build system would have needed to be more invasive. I think it's a bad idea to try to force sparse-checkouts into the minimum list of needed files and much prefer the idea of the minimum set of packages, but the question of package-level vs file-level dependencies might come up so I thought I'd note it.

It's probably still more appropriate to add this functionality via an extension API vs. putting git cat-file invocations into core Bazel

:+1:

newren avatar Nov 01 '22 18:11 newren

As far as I can tell, source files aren't needed on-disk until the "Execution" phase of the build, and external rules don't directly access the build files on disk.

I think there's a problem when globbing source files, since then the rules depend on the presence of source files on disk (even if not their contents)?

I think this goes back to the question I also noted above: is it possible to determine package-level dependencies without determining all file-level dependencies? We only need the former for sparse checkouts.

newren avatar Nov 01 '22 18:11 newren

Update: we talked with @linzhp in person at BazelCon and I offered him the option of building a bazel query operator which could be used to do the following:

  1. One identifies the top-level targets one wants to build
  2. One calls bazel query magicdeps(:target, 1) to identify the packages that need to be on disk to be able to load the direct dependencies of :target
  3. One checks out these packages using git sparse checkout
  4. One calls bazel query magicdeps(:target, 2)to identify the packages that need to be on disk to be able to load the second-level dependencies of:target`
  5. And so on, until a fixed point is reached.

I don't think this is doable with bazel query now, but it's certainly possible to build (maybe not with bazel query)

This system could be used to do sparse checkouts with Bazel without deep integration like what @vdye proposed. The user experience would not be as nice, but, it would widen the interface of Bazel much less than that proposal and it would be a much smaller ongoing maintenance burden.

Would this be an option?

lberki avatar Nov 16 '22 19:11 lberki

Would this be an option?

This approach is almost identical to one of the ones I considered before opening this issue ("create a mode/option allowing bazel query to print out dependencies that are declared by a resolved target, but cannot be resolved themselves"). I decided against continuing down that path because it loses one of the core tenets of this proposal, which is that a user shouldn't need external tooling (in this case, a script to call bazel query magicdeps(...) & sparse-checkout iteratively) to manage the coordination between Bazel and Git. What you're suggesting may be a valuable feature, but it doesn't solve the developer workflow/ease-of-use problem.

In the meantime, I've been working on a draft implementation based on the alternative approach suggested by Elijah earlier. With respect to maintenance burden, I haven't had to do much in the way of refactoring or changing core infrastructure; the virtual file resolution is limited to a new SkyFunction + some conditionals in PackageLookupFunction and PackageFunction, plus some boilerplate-y Starlark builtin setup for dependency_adapter. I've been gradually whittling down the scope in the process (not using an adapter when loading packages in external repositories, the implementation is expected to resolve virtual symlinks if there are any, no setup/teardown functions, etc.). I'd like to at least get to a basic end-to-end working demo/draft PR so that it can be evaluated concretely, probably in the next couple weeks.

vdye avatar Nov 16 '22 20:11 vdye

@vdye : yeah, it's not a great idea I had, it's a preference of mine between two of your ideas :) I do realize that it would not be as nice a user experience, but it would be much easier to support indefinitely. (Thus spake the Master Programmer: "Though a program be but three lines long, someday it will have to be maintained.")

I'll resist the temptation to pass judgement on the approach you have taken because it's much easier to have an informed opinion when I see the code. From what I can glean from your above comment, it seems to work by teaching PackageFunction / PackageLookupFunction that the root of some packages is different than what is in --package_path, but then how is it not "vivifying the files"? How does it ask git for the data it needs?

Bazel has a pretty well-developed virtual file system layer, which sounds like another possible abstraction layer to plug this in. I don't know if it's easier, but it's certainly a possibility and we have a number of alternative FileSystem implementations at Google, so it's an approach that certainly works.

On the git side, what is the future of non-cone mode? Will it be supported? From a brief perusal of the documentation, it looks like that cone mode is now preferred, which is unfortunate because then AFAIU if one wants to checkout e.g. package a, packages a/b, a/b/c and every package under a/ will come with it, which is potentially quite wasteful.

Only checking out BUILD / BUILD.bazel files for the packages one needs is not enough for the following reasons:

  1. BUILD files can load Starlark files
  2. Globs need to do readdir() and readlink()
  3. The meaning of package a can depend on the existence of BUILD files under it. E.g. the label //a:b/c is valid only if a/b/BUILD does not exist (if it does, the same file is called //a/b:c)

lberki avatar Nov 17 '22 12:11 lberki

To add to my previous comment: I can see how piggybacking on PackageLookupFunction and shelling out to an external binary (in some form) would be a solution. It has a number of issues, though, that are non-trivial to solve:

  1. How do you cache the non-existence of packages? When Bazel sees (for example) //a:b/c, it has to verify that a/b/BUILD does not exist or if one says bazel build //nonexistent:package, Bazel must learn somehow that nonexistent/BUILD and nonexistent/BUILD.bazel do not exist and a file system where that does not exist (but Bazel tried) looks the exact same as a file system where Bazel hasn't even tried to access that directory.
  2. What happens on a Bazel server restart? The issues in (1) are technically not a problem as long as the Bazel server is alive because Bazel caches the existence (or non-existence) of a package in RAM, but if you to bazel shutdown (or the server dies for any other reason), that information is lost.
  3. Symbolic links also need to be thought about, but it appears that you have a plan for them?
  4. What happens if you have two Bazel instances on the same output tree (if you pass --output_base to Bazel, you can do that and people do)
  5. How do you control parallelism? If you make PackageLookupFunction slow, it's more likely that there will be many of those in flight at any given moment, so this needs to be thought about even if the solution is "document that the invoked script must be able to run in multiple instances at the same time"
  6. What happens if one says filegroup(srcs=[<directory>]) and that directory happens to contain a BUILD file somewhere under it? Bazel then currently just puts everything under the directory (crossing BUILD files) on the inputs of actions in this case. Granted, this use case is one that Bazel does not support very well at the moment and in fact leads to incorrect behavior but we want to support it eventually in some form and it would be nice if this feature didn't add to the cost of that one. Like in the parallelism case, I'd be fine with any quick solution that doesn't make the existing problem any worse. Hell, I could be even convinced to ignore this one because we've been ignoring it for almost a decade.

lberki avatar Nov 17 '22 18:11 lberki

@lberki

yeah, it's not a great idea I had, it's a preference of mine between two of your ideas

I wasn't looking to take credit, just clarify that I'd deeply considered the pros/cons of what you suggested and note the conclusion I had reached.

it would be much easier to support indefinitely.

I detailed earlier why third-party tools are insufficient from a usability perspective. Tools that do what you’re suggesting already exist (most similarly, bgsc), but this proposal is explicitly meant to move beyond their limitations. While Google's workflow may not have a use for it, there are plenty of other Bazel users (including some of the people in this thread!) with completely different infrastructure & workflows that would.

You’ve mentioned maintainability a number of times now, but it would be extremely helpful for me to know what you would quantify as “too complex” or “too invasive.” Are there specific things I need to avoid doing in my implementation that have burned the Bazel team in the past? I’d be much more comfortable working within clear guidelines than leaving the “go/no go” call up to (from my perspective) chance.

it seems to work by teaching PackageFunction / PackageLookupFunction that the root of some packages is different than what is in --package_path, but then how is it not "vivifying the files"? How does it ask git for the data it needs?

Per Elijah's comment:

Would there be an opportunity to just provide Bazel with the contents of the files it needs for computing dependencies instead of vivifying the files?

So it's not vivifying the files, it's reading the files directly from the Git index into memory for Bazel to compile (via git cat-file). As I have it now (which, keep in mind, is a work-in-progress), PackageFunction sends a request to PackageLookupFunction, but denotes that the BUILD file may be virtual (via a new, default-false field in the SkyKey). If it is, and PackageLookupFunction does not find a particular BUILD file on-disk, it sends a request to the new VirtualBuildFileFunction, which first loads & checks for the existence of the dependency_adapter(), then invokes the implementation function, then compiles the returned BUILD file contents and stores the CompiledBuildFile in a SkyValue. PackageLookupFunction uses those results to say "hey this package does exist (but it's virtual)!", then PackageFunction uses the CompiledBuildFile from the result in place of reading from disk & compiling.

Note that this approach has two limitations: it can only run successfully in commands that don't touch the non-BUILD source files (so, bazel query), and it doesn't perform the on-disk source vivification. The user workflow would either be to pipe the results of of the appropriate bazel query to git sparse-checkout set, or write a parallel chain of rules (genquery() to get the dependencies, then genrule() to invoke sparse-checkout). Importantly, both avoid the need for an external script.

Like I said, though, all of this is still WIP, so nothing is set in stone. For example, one thing I'd like to try is to skip the request to VirtualBuildFileFunction altogether by default, and only enable dependency_adapter usage at all with a feature flag.

I currently have a functioning end-to-end MVP, but it almost certainly breaks outside of the “happy path” example I’m working with. I’m planning on spending a week (not next, but the one after) cleaning that up, then will open a draft pull request so people can look it over.

Bazel has a pretty well-developed virtual file system layer, which sounds like another possible abstraction layer to plug this in. I don't know if it's easier, but it's certainly a possibility and we have a number of alternative FileSystem implementations at Google, so it's an approach that certainly works.

This approach would likely require hardcoding Git logic into Bazel (which I don't think anyone here wants) and the FileSystem layer is responsible for all filesystem accesses, so any integration would need to distinguish between in-repo file reads & everything else. I think that'd be far more invasive and complicated than anything I'm proposing here.

On the git side, what is the future of non-cone mode? Will it be supported? From a brief perusal of the documentation, it looks like that cone mode is now preferred, which is unfortunate because then AFAIU if one wants to checkout e.g. package a, packages a/b, a/b/c and every package under a/ will come with it, which is potentially quite wasteful.

Repositories that need the scalability of cone mode are going to tend towards rearchitecting their package structure to best take advantage of it. Using your example, a cone mode repository like that would either intend for a, a/b, and a/b/c to always be needed together (and therefore checked out together), or would rework the directory structure to avoid what’s not needed.

Note that the dependency adapter proposal doesn’t mandate one of cone mode or non-cone mode (although it is designed with cone mode use cases in mind). A Git adapter would pull BUILD file information directly from the index without touching sparse-checkout patterns. There are some minor details about how to manage expanding a sparse index, but they’re very easily overcomeable.

Only checking out BUILD / BUILD.bazel files for the packages one needs is not enough for the following reasons:

  1. BUILD files can load Starlark files

  2. Globs need to do readdir() and readlink()

  3. The meaning of package a can depend on the existence of BUILD files under it. E.g. the label //a:b/c is valid only if a/b/BUILD does not exist (if it does, the same file is called //a/b:c)

Some of these (EDIT: and a few of the additional ones you mentioned in your most recent comment) aren't supported by my prototype in its current state, but 1) I intend to at least evaluate how much work they’d be and adjust my design accordingly, and 2) if they’re non-trivial to support, document why. Ultimately, I don't think it's unreasonable to say "the dependency_adapter() won't (yet) work if you're also using xyz feature," as long as that's balanced against supporting enough features to be generally useful.

I personally find it most beneficial to introduce and incrementally expand a capability over time, rather than to wait just to make sure it's compatible with every possible workflow. That's especially true here, as the feature is opt-in and, if a user tries something and it doesn’t work, that can inform where incremental improvements could be made. In the meantime, users could either adjust their build infrastructure to fit the restrictions, or continue working without the feature at all.

vdye avatar Nov 17 '22 21:11 vdye

Repositories that need the scalability of cone mode are going to tend towards rearchitecting their package structure to best take advantage of it.

I'd echo this sentiment. We already enforce conventions against patterns in Bazel (and other tools) that are incompatible with sparse checkouts.

maxious avatar Nov 17 '22 21:11 maxious

@lberki

yeah, it's not a great idea I had, it's a preference of mine between two of your ideas

I wasn't looking to take credit, just clarify that I'd deeply considered the pros/cons of what you suggested and note the conclusion I had reached.

Neither did I so I thought it's better to clarify. It's all good!

it would be much easier to support indefinitely.

I detailed earlier why third-party tools are insufficient from a usability perspective. Tools that do what you’re suggesting already exist (most similarly, bgsc), but this proposal is explicitly meant to move beyond their limitations. While Google's workflow may not have a use for it, there are plenty of other Bazel users (including some of the people in this thread!) with completely different infrastructure & workflows that would.

You’ve mentioned maintainability a number of times now, but it would be extremely helpful for me to know what you would quantify as “too complex” or “too invasive.” Are there specific things I need to avoid doing in my implementation that have burned the Bazel team in the past? I’d be much more comfortable working within clear guidelines than leaving the “go/no go” call up to (from my perspective) chance.

I can't think of an objective measure, unfortunately. If I had, I would have already informed you about it; what I'm looking for is some combination of:

  • Simple interface (so that we have less user support load and users can understand what's going on easier)
  • Simple implementation (so that there is less place for bugs to lurk in)
  • No surprising interactions with other features

Does this help?

it seems to work by teaching PackageFunction / PackageLookupFunction that the root of some packages is different than what is in --package_path, but then how is it not "vivifying the files"? How does it ask git for the data it needs?

Per Elijah's comment:

Would there be an opportunity to just provide Bazel with the contents of the files it needs for computing dependencies instead of vivifying the files?

So it's not vivifying the files, it's reading the files directly from the Git index into memory for Bazel to compile (via git cat-file). As I have it now (which, keep in mind, is a work-in-progress), PackageFunction sends a request to PackageLookupFunction, but denotes that the BUILD file may be virtual (via a new, default-false field in the SkyKey). If it is, and PackageLookupFunction does not find a particular BUILD file on-disk, it sends a request to the new VirtualBuildFileFunction, which first loads & checks for the existence of the dependency_adapter(), then invokes the implementation function, then compiles the returned BUILD file contents and stores the CompiledBuildFile in a SkyValue. PackageLookupFunction uses those results to say "hey this package does exist (but it's virtual)!", then PackageFunction uses the CompiledBuildFile from the result in place of reading from disk & compiling.

Note that this approach has two limitations: it can only run successfully in commands that don't touch the non-BUILD source files (so, bazel query), and it doesn't perform the on-disk source vivification. The user workflow would either be to pipe the results of of the appropriate bazel query to git sparse-checkout set, or write a parallel chain of rules (genquery() to get the dependencies, then genrule() to invoke sparse-checkout). Importantly, both avoid the need for an external script.

Like I said, though, all of this is still WIP, so nothing is set in stone. For example, one thing I'd like to try is to skip the request to VirtualBuildFileFunction altogether by default, and only enable dependency_adapter usage at all with a feature flag.

I currently have a functioning end-to-end MVP, but it almost certainly breaks outside of the “happy path” example I’m working with. I’m planning on spending a week (not next, but the one after) cleaning that up, then will open a draft pull request so people can look it over.

Ah, got it; I misunderstood that, but now I am confused: you do have a point that the magicdeps() approach I offered is not a very great user experience because then running bazel build is not enough to actually build software because then the wrapper using magicdeps() also need to be called, but AFAIU this approach is only usable for bazel query, so it makes the arguably most important user workflow (bazel build) more complicated, too, by requiring a wrapper to check out the actual source files anyway.

Even limiting ourselves to bazel query, can this approach ever support globs? Globs need to call readdir() during package loading.

Bazel has a pretty well-developed virtual file system layer, which sounds like another possible abstraction layer to plug this in. I don't know if it's easier, but it's certainly a possibility and we have a number of alternative FileSystem implementations at Google, so it's an approach that certainly works.

This approach would likely require hardcoding Git logic into Bazel (which I don't think anyone here wants) and the FileSystem layer is responsible for all filesystem accesses, so any integration would need to distinguish between in-repo file reads & everything else. I think that'd be far more invasive and complicated than anything I'm proposing here.

I agree, but if we don't want the vivifying approach, I'm not sure if the "plumb the bits directly from git to Bazel" approach is workable at all (globs + the source files are eventually necessary for the build) without dropping down to the FileSystem layer.

On the git side, what is the future of non-cone mode? Will it be supported? From a brief perusal of the documentation, it looks like that cone mode is now preferred, which is unfortunate because then AFAIU if one wants to checkout e.g. package a, packages a/b, a/b/c and every package under a/ will come with it, which is potentially quite wasteful.

Repositories that need the scalability of cone mode are going to tend towards rearchitecting their package structure to best take advantage of it. Using your example, a cone mode repository like that would either intend for a, a/b, and a/b/c to always be needed together (and therefore checked out together), or would rework the directory structure to avoid what’s not needed.

Note that the dependency adapter proposal doesn’t mandate one of cone mode or non-cone mode (although it is designed with cone mode use cases in mind). A Git adapter would pull BUILD file information directly from the index without touching sparse-checkout patterns. There are some minor details about how to manage expanding a sparse index, but they’re very easily overcomeable.

I understand that; what I wanted to know is whether I can expect non-cone mode to be supported in git in the long term because if not, we'd have to just accept that a/b will always be checked out if a is. From the documentation, it seemed that sparse checkouts are still experimental and non-cone mode is doubly experimental and it looked like non-cone mode is necessary to support the above a but not a/b use case. At least @newren 's comment implies that not everyone arranges their source trees in such a way that if a is needed, a/b also is.

Only checking out BUILD / BUILD.bazel files for the packages one needs is not enough for the following reasons:

  1. BUILD files can load Starlark files
  2. Globs need to do readdir() and readlink()
  3. The meaning of package a can depend on the existence of BUILD files under it. E.g. the label //a:b/c is valid only if a/b/BUILD does not exist (if it does, the same file is called //a/b:c)

Some of these (EDIT: and a few of the additional ones you mentioned in your most recent comment) aren't supported by my prototype in its current state, but 1) I intend to at least evaluate how much work they’d be and adjust my design accordingly, and 2) if they’re non-trivial to support, document why. Ultimately, I don't think it's unreasonable to say "the dependency_adapter() won't (yet) work if you're also using xyz feature," as long as that's balanced against supporting enough features to be generally useful.

I personally find it most beneficial to introduce and incrementally expand a capability over time, rather than to wait just to make sure it's compatible with every possible workflow. That's especially true here, as the feature is opt-in and, if a user tries something and it doesn’t work, that can inform where incremental improvements could be made. In the meantime, users could either adjust their build infrastructure to fit the restrictions, or continue working without the feature at all.

I promise I won't hold you to impossible standards :) What I'm looking for here is not a perfect implementation, only an implementation that has a reasonable chance of eventually supporting every use case. As long as that is true, it can always be put behind a --experimental_* flag.

Comparing various alternatives against the above simple interface / simple implementation / no surprising interactions standard:

  • I expect that the git cat-file approach would to one of these ways: it is either a bad user experience because it breaks some Bazel features (globs, bazel build, etc.) or its implementation would be quite unwieldy
  • The vivifying approach would be simple to implement, but would change the source tree as a side effect of Bazel commands
  • The FileSystem approach would either hard-code knowledge about git into Bazel or expose a quite complicated interface (if there is even a reasonable way to expose it without dropping down to Java)
  • magicdeps() is nice and orthogonal from the Bazel perspective, but changes every user interaction with Bazel by requiring a wrapper around Bazel.

lberki avatar Nov 17 '22 22:11 lberki

@vdye this thread is now long enough that I think a video call would be more productive. Which time zone do you live in?

I'm at BazelCon this week and my calendar is completely full, but we could arrange a call next week, which hopefully let us come to an agreement earlier. Until then, I guess a centithread it is?

lberki avatar Nov 17 '22 22:11 lberki

With all the talks in BazelCon around wrappers built on top of bazelisk and the plugin model they provide, couldn't we just use the magic method approach and do all the git dancing in there? That way we can all benefit from it while keeping the internals clean. Last year when we looked into partial cloning at Booking we found similar issues, having something like this available could really benefit us tbh

manuelnaranjo avatar Nov 17 '22 23:11 manuelnaranjo

With all the talks in BazelCon around wrappers built on top of bazelisk and the plugin model they provide, couldn't we just use the magic method approach and do all the git dancing in there? That way we can all benefit from it while keeping the internals clean. Last year when we looked into partial cloning at Booking we found similar issues, having something like this available could really benefit us tbh

Given that bazelisk is an "official" wrapper for Bazel, this might address some of my concerns about third party tooling (namely, ease of adoption & standardization). I haven't looked into the plugin model there, but it doesn't sound too complicated? I can try to quickly put something together to prove it out. Thanks @manuelnaranjo!

EDIT: actually, I have no idea where to start with bazelisk (I'm not seeing any official documentation on a plugin model). Does someone have a link to either an example or docs?

@lberki I'm not around next week, but I am the week after (time zone UTC−08:00). That said, I'd like to try this bazelisk idea out first, so we might be able to wait on setting up a call?

vdye avatar Nov 17 '22 23:11 vdye

@vdye they're built on top, they mentioned 2 in the conference https://github.com/aspect-build/aspect-cli and https://github.com/buddy-works/buddy-cli

manuelnaranjo avatar Nov 17 '22 23:11 manuelnaranjo

@vdye they're built on top, they mentioned 2 in the conference https://github.com/aspect-build/aspect-cli and https://github.com/buddy-works/buddy-cli

Hmmm, I must have misinterpreted then. I assumed the "plugin model they support" was referring to bazelisk supporting plugins, rather than the wrappers on top of it. That brings us back to "you need to manually install some third-party tool (in this case, the wrapper) from outside the Bazel or Git ecosystems," and all the drawbacks associated with that.

One thing I wanted to mention, since the bazelisk idea reminded me of it: if there was a way to use Bazel's existing extensibility APIs (custom rules or macros or some other kind of plugin) to define a magicdeps() iteration, I'd be all for doing this that way. But AFAICT that isn't possible (someone please correct me if I'm wrong), which is how I ended up with a proposal to add a new API.

vdye avatar Nov 18 '22 01:11 vdye

@manuelnaranjo From user experience's perspective, the magicdeps() approach is not better than the current approach we have at Uber (see my Git Merge talk). After each rebase/merge from the main branch, users still have to run a third tool to refresh their sparse work tree so all deps are checked out. The differences are:

  1. magicdeps() avoids maintaining a full checkout, which would save some disk space
  2. Because we don't need a full checkout, we don't need to replicate state from the sparse work tree, which can save some time
  3. Every time users want to refresh, the tool would need to run bazel query and git sparse-checkout set many times, depending on the number of layers in the deps.

3 could quickly eat up the time saved from 2. It's still nice to have, though.

Currently, there is another issue with magicdeps() approach: if any bzl file loaded by WORKSPACE is missing, bazel query would fail, even the bzl file is not needed by the target pattern given to bazel query. Not sure how hard it is to fix this.

linzhp avatar Nov 18 '22 03:11 linzhp

Yeah, I am aware of the limitations here and that it currently cannot be done with bazel query. It also has the problem that currently bazel query can't tell without erroring out which .bzl files are required by a BUILD file. But a magicdeps() version that works sounds like a reasonable and well-delineated addition to Bazel.

re: @vdye 's comment that this would then require a third-party tool outside of the Bazel and git ecosystem, I thought that there was no way around that: I don't want to embed knowledge about git into Bazel, you presumably don't want to embed knowledge about Bazel into git, that knowledge must live somewhere so a third place is the only option, isn't it?

@newren : what is suboptimal about the vivifying approach? If magicdeps() doesn't work, that one looks like a reasonably small feature which could be made to work (modulo symlinks), unlike the FileSystem approach (which as @vdye said, would be quite an undertaking to implement) and the git cat-file approach which AFAICT is very difficult to implement correctly, if possible at all.

lberki avatar Nov 18 '22 11:11 lberki

I've caught up with the entire thread! I want to try to get to everything --

  • WORKSPACE can't load because not all transitive .bzl files are present
  • Build/test target can't be loaded because not all transitive .bzl files are present
  • How to download fewer bytes

WORKSPACE can't load because not all transitive .bzl files are present

I don't think we can make do with an invalid workspace - I would treat the WORKSPACE contents and everything it references as "toolchain" source that simply must be present for Bazel to behave in a predictable manner.

I can followup and check if we produce BEP that could be used to determine which .bzl file was missing. Then you could maybe write a wrapper script that fetches the required repo based on the BEP output (the magicdeps() loop idea).

Build/test target can't load because not all transitive .bzl files are present

We currently need all the transitive .bzl files to be available, even if some of them are not relevant for the build. (For example, a BUILD file containing a foo_library(), foo_binary(), and foo_test(). Even if you only ask to build //foo:foo_bin, Bazel will load the foo_test symbol from a foo_test.bzl.) We should fail with an error in BEP - I'll followup on that too.

It's not true today, but it's conceivable that we could be tolerant to the failure to load rule definitions not in the transitive deps of the requested build targets. Typically it's easier to just partition unrelated rule definitions into parallel directory trees.

How to download fewer bytes

Google's monorepo tooling is famously based on virtual filesystems where tools may assume all files are available, and the filesystem lazily-loads the source content of unmodified files as needed. This is the way we download the fewest .bzl files during loading at scale, so that's the path of least resistance.

As has been discussed, to run bazel build //myteam:foo_bin the build tool may need to do filesystem operations outside of the myteam directory. It may read .bzl files, check for BUILD existence.

There's two broad categories of solutions here: FUSE and non-FUSE.

FUSE Solution

I'm not familiar with any non-Googley FUSE filesystems. But. If there is a read-only FUSE filesystem for your VCS, I think the following approach is essentially what Git users at Google used for many years. You use the VCS to manage the files you want checked out, and wrapper scripts maintain a symlink to a read-only FUSE filesystem that provides a consistent baseline snapshot of all other files. This solution looks like this:

  1. At boot, mount a read-only FUSE filesystem for the repo at /company/src, such that some path (eg. /company/src/$VERSION) yields the root tree of the repo at $VERSION.
  2. Change monorepo tooling to maintain a symlink at your workspace's root called REPO that points to /company/src/$VERSION.
  3. Add the REPO symlink to --package_path in your site blazerc:
$ readlink REPO
/company/src/a2604cdcce22f1337b582dfc093851d2015c0c15
$ cat .bazelrc
build --package_path=%workspace%:%workspace%/REPO
clean --package_path=%workspace%:%workspace%/REPO
fetch --package_path=%workspace%:%workspace%/REPO
info --package_path=%workspace%:%workspace%/REPO
query --package_path=%workspace%:%workspace%/REPO

Any BUILD/bzl files not found in the workspace will be retried under /company/src/a2604cdcce22f1337b582dfc093851d2015c0c15.

If you're patching together a bunch of repos, you can do this with multiple symlinks to different mount points, though they will be checked sequentially, so loading performance may degrade if the additional failed FUSE filesystem reads are slow.

I'm not 100% sure if we're incrementally correct with changes to the REPO symlink - it would likely depend also on details of the FUSE filesystem, eg. mtimes.

Custom FileSystem paths carved out for non-FUSE, on-demand reads

In this scenario, a new Bazel FileSystem is introduced that can resolve files on-demand (somehow) without implementing an OS-level filesystem (ala FUSE). The version being queried might be in-band in the path or out-of-band in a flag. This would primarily be useful for host environments where FUSE is not available.

For example, the same symlink pattern would be used as before, only with fake absolute paths carved out by new configuration settings:

$ ln -s /virtual/git/github.com/bazelbuild/bazel/<sha> REPO
$ grep git_virtual_repo WORKSPACE
git_virtual_repo(prefix = "/virtual/git/")  # errors if /virtual/ exists to encourage unique prefixes
$ grep REPO .bazelrc | head -n 1
build --package_path=%workspace%:%workspace%/REPO

This would require some Git logic in Bazel, but the basic Git HTTP protocol isn't so bad to be honest. I am a Mercurial user myself and would want to support Mercurial as well. Really it should be able to read files 1-1 from any HTTP object source, eg. S3-compatible storage, as long as the source snapshot is consistent. It sounds like a WORKSPACE feature, and it would be vulnerable to scope creep to capture basically all the ideas discussed that I've otherwise not mentioned. I'm not sure I totally buy into it just yet, it would be a heavy lift, but if there's a lot of hosts that can't use FUSE, maybe we should consider it.

michaeledgar avatar Nov 21 '22 19:11 michaeledgar