[WIP] Remote caching support
This change introduces remote caching support to SCons. Currently SCons has support for caching built objects in directories, but remote caching adds support for sharing built objects across multiple machines. For example at VMware, we have our official builds push to cache and all developers will fetch from cache.
The end result of this is that developers are able to only build what they have changed locally. All other cacheable build actions will be fetched from the remote cache server, dramatically speeding up builds.
One piece introduced as part of this change is a new scheduler, implemented as a new class ParallelV2, which is used if remote caching is enabled or if a developer passes --use-scheduler-v2. Whereas the existing parallel scheduler waits on jobs when the job queue is full, the new scheduler attempts to continue scanning for new jobs whenever possible. In addition, it supports remote caching by draining the "pending remote cache fetches" queue just as it drains the "pending jobs" queue. I believe this scheduler is an alternative to https://github.com/SCons/scons/pull/3386 and should have the same effect of keeping job-thread utilization as high as possible. I introduced it as a separate class for now, but it could replace the Parallel class if we wanted.
The new RemoteCache class owns the job of fetching from and pushing to a Bazel remote cache server, or any other similar server that supports /ac/ and /cas/ GET and PUT requests using SHA-256 file names. See https://github.com/buchgr/bazel-remote for more details on the server. This class uses urllib3 for network requests (see the acknowledgement below that this violates SCons guidelines). I chose it because it has good support for concurrency across threads via its ConnectionPool class. If remote caching is requested and urllib3 is not present, SCons raises an exception; if remote caching is not requested, urllib3 is not required.
The change to the scheduler is needed because cache fetches must be asynchronous due to network latency. If a request can take 200ms, we can't hang the main thread while waiting on it. We also can't do the requests inside job threads (as CacheDir does), because we want more active network requests than job threads: the requests are bound by network I/O, not disk I/O or CPU.
Given the aforementioned change to the scheduler, I don't see a way to make this into a Tool, which was a possibility mentioned on Discord a while back. That is, I wasn't able to find a good way to make this self-contained enough to be a Tool.
As part of implementing this functionality, the following new parameters are introduced:
- --remote-cache-fetch-enabled: Enables fetch of build output from the server
- --remote-cache-push-enabled: Enables push of build output to the server
- --remote-cache-url: URL of the remote cache server (required if fetch or push is enabled)
- --remote-cache-connections: Connection count (defaults to 100)
I understand that this patch violates the SCons guidelines in two main ways:
- The new functionality requires urllib3, while the guidelines say that we must not require any packages other than pywin32. I don't believe I can redo this change without urllib3, so the [WIP] tag is about vetting the approach with the team to see whether this dependency is a fatal flaw.
- There is no documentation for the new functionality. If the approach is accepted, I would resolve this as part of removing the [WIP] tag.
Contributor Checklist:
- [ ] I have created a new test or updated the unit tests to cover the new/changed functionality.
- [ ] I have updated CHANGES.txt (and read the README.rst)
- [ ] I have updated the appropriate documentation
Any traction on this?
Not yet. It's on the list, just requires a big chunk of time to review and likely modify before being able to pull into the repo.
@bdbaddog I am going to hold off on doing any future updates or merges until you think there is a shot of finalizing a design and landing this. Let me know if you want me to start doing any of that, but I am fine holding off or even closing this if the team doesn't want it.
I think the biggest reason to not include remote caching in SCons is that SCons has no dependency enforcement functionality. We implemented dependency enforcement internally at VMware but it was done in such a VMware-specific way that I don't have any way to upstream it without significant support from the SCons team. I am open to finding a way to upstream it by providing the code we have written in VMware-specific layers, but it would take a decent amount of effort from VMware+SCons maintainers to adapt and integrate. That said, the benefit of this logic is that you can run SCons in a mode where it actually fails compilation if (1) a build action reads from a file that is not in its dependencies or side effects or (2) a build action writes to a file that is not in its targets or side effects. Right now SCons can easily be wrong about its dependencies.
Remote caching without dependency enforcement is risky: if SCons is missing a dependency (e.g. the scanner didn't find a .h file) and that dependency changes, we will continue to get remote cache hits when they should be cache misses. So in that example, if you change the .h file that SCons doesn't know about, you'll still get cache hits when you shouldn't. This could cause a lot of confusion because the SCons build output is different than you expect.
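To make the risk concrete, here is a small illustrative sketch (this is not SCons's real signature scheme): the cache key is derived only from the dependencies the scanner found, so editing an undeclared header leaves the key, and therefore the cache lookup, unchanged.

```python
import hashlib


def action_key(command, known_deps):
    # Hash the command line plus the contents of every *known* dependency.
    h = hashlib.sha256(command.encode())
    for name, contents in sorted(known_deps.items()):
        h.update(name.encode())
        h.update(contents)
    return h.hexdigest()


# The scanner only recorded main.c; util.h was missed.
deps = {"main.c": b'#include "util.h"\nint main() { return 0; }\n'}
before = action_key("cc -c main.c", deps)
# util.h changes on disk, but none of the key's inputs changed...
after = action_key("cc -c main.c", deps)
assert before == after  # ...so a remote lookup still reports a (stale) hit
```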
Notably, this is also the case with the existing SCons "cache dir" functionality, which is why we never used it internally.
Dependency enforcement hasn't been an issue for most users, or at least none that anyone has let the project know about. In general, if the scanner's missing things, it should get fixed.
That said, for some environments it's worth the cost to do so. (VMware being one of those environments, I'm guessing... ;)
Does your dependency enforcement implementation work on Windows, Linux, and Mac? Or only some of those? Or does it require admin-type access on some of those platforms?
Probably worth moving any discussion of dependency enforcement to a discussion or user's mailing list as IMHO it's not really a required part of this remote caching functionality.
Ideally you'd push whatever updates you have to this functionality to this PR. Spending time reviewing outdated code would be a waste.
As an aside, it's likely the new parallel scheduler will become the default in the near future.
Realistically reviewing this code will take a solid day or two, ideally with access to you @grossag to answer questions. Any chance we can get you to join our Discord server for such consultation?
Sorry for the delay in responding. I've been on the Discord server for a while, although I don't pay real close attention. I'm on Discord as @adamgross.
For the dependency enforcement, it depends on the platform:
- Windows uses tracker.exe from Visual Studio to wrap command outputs.
- Linux uses a kernel driver which will hopefully be open sourced at some point.
- Mac uses a wrapper script which invokes dtrace; the wrapper script needs a sudoers entry to work properly.
Dependency enforcement would be really hard to upstream, so I am fine not focusing on that. But you said "In general if the scanner's missing things, it should get fixed." and the issue is that it's nearly impossible to figure out if the scanner is really missing things without any guarantees from SCons.
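Whatever the platform mechanism (tracker.exe, a kernel driver, dtrace), the check those tools enable boils down to comparing observed file accesses against what the build graph declared. A hedged sketch of that comparison, with illustrative names (not the VMware implementation):

```python
def check_action(observed_reads, observed_writes, declared):
    """Fail if an action touched files outside its declared dependencies.

    declared is a dict with 'sources', 'targets', and 'side_effects' sets.
    """
    allowed_reads = declared["sources"] | declared["side_effects"]
    allowed_writes = declared["targets"] | declared["side_effects"]
    undeclared_reads = set(observed_reads) - allowed_reads
    undeclared_writes = set(observed_writes) - allowed_writes
    if undeclared_reads or undeclared_writes:
        raise RuntimeError(
            f"undeclared reads: {sorted(undeclared_reads)}, "
            f"undeclared writes: {sorted(undeclared_writes)}")


declared = {"sources": {"main.c", "util.h"},
            "targets": {"main.o"},
            "side_effects": set()}
# Passes: everything the action touched was declared.
check_action({"main.c", "util.h"}, {"main.o"}, declared)
```

With a check like this failing the build, a remote cache hit can be trusted: any file that could change the output is guaranteed to be part of the cache key.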
I am open to updating the pull request if you think this is mostly unblocked and worth proceeding with. But I vaguely recall from our last conversation that you wanted to wait until some of the Task logic was refactored before this could be considered. I am also unsure whether it's a dealbreaker that this uses urllib3. I just don't want to burn a day updating this PR if it's not worth proceeding with. Let me know how you'd like me to proceed and I'll do my best to accommodate, thanks.
What do we need to be able to test against a remote caching server? How do we set one up, etc.?
Sorry I never responded to your question, @bdbaddog. This PR is designed to use https://github.com/buchgr/bazel-remote with no additional modifications to that project. That repo's README has a "Docker" section which describes how to install and run it. It's effectively content-addressable storage: the ac/ folder stores the action metadata and the cas/ folder stores the binary. ac/ is addressed by a unique action signature and cas/ by the SHA-256 hash of the binary.
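The addressing scheme can be sketched as follows (URL layout inferred from the description above; the helper names are illustrative):

```python
import hashlib


def cas_url(server, data):
    # cas/ entries are named by the SHA-256 of the blob's contents.
    return f"{server}/cas/{hashlib.sha256(data).hexdigest()}"


def ac_url(server, action_signature):
    # ac/ entries are named by the build action's unique signature.
    return f"{server}/ac/{action_signature}"
```

A fetch then becomes: GET the ac/ entry for the action's signature, read the SHA-256 digests of the outputs from the metadata, and GET each output from cas/.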
Thanks! Any chance you'd be willing to resolve the conflicts in this PR? I'd also like to see if there's a way to use Bazel's remote execution protocol with SCons as well. Leveraging the work other projects have done would make sense for SCons, if it's workable.
@bdbaddog : I was able to resolve all the conflicts other than the big one: integrating remote cache functionality into the `NewParallel` class. It makes sense for me to drop the new scheduler I implemented and fold its functionality into `NewParallel`, but the leader/follower pattern is throwing me off.

When remote caching is enabled, any time a task is pulled off of the todo list (the line `task = self.taskmaster.next_task()` in SCons/Taskmaster/Job.py), we first need to dispatch it to the `RemoteCache` to evaluate whether it can be fulfilled with a cache hit. This is different from the existing `CacheDir` class, which can use a synchronous disk call.

In my PR, this was implemented using the local variable `fetch_response_queue`. Tasks are sent to the `RemoteCache` class, and once they are confirmed as cache hits or misses, they end up as results in `fetch_response_queue`. The rest of the implementation was about managing two queues: `fetch_response_queue` (RemoteCache responses) and `self.tp.resultsQueue` (executions of tasks that were cache misses).

Can you let me know if you have any thoughts on how best to implement asynchronous `RemoteCache` requests in the `NewParallel` scheduler?
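For reference, the two-queue bookkeeping described above can be sketched like this. This is illustrative only, not code from the PR or from `NewParallel`; the queue names follow the discussion.

```python
import queue

fetch_response_queue = queue.Queue()  # RemoteCache hit/miss results
results_queue = queue.Queue()         # finished local executions


def drain(q):
    """Pull everything currently available from a queue without blocking."""
    out = []
    while True:
        try:
            out.append(q.get_nowait())
        except queue.Empty:
            return out


def scheduler_step():
    """One pass of the scheduler loop: drain both queues."""
    completed = []
    # Cache hits complete the task immediately; misses fall back to a
    # local execution (modeled here by pushing onto the results queue).
    for task, hit in drain(fetch_response_queue):
        if hit:
            completed.append(task)
        else:
            results_queue.put(task)  # stand-in for dispatching a job thread
    completed.extend(drain(results_queue))
    return completed
```

The scheduler's main loop would interleave steps like this with handing out new work, so neither queue is ever left to back up while job threads sit idle.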
@acmorrow - perhaps you can assist @grossag with NewParallel?
@bdbaddog and @grossag - Yes, definitely, but it'll be next week.
@acmorrow Can you provide some guidance here as to how to resolve the conflicts in the scheduler as described in https://github.com/SCons/scons/pull/3971#issuecomment-1998353396 ?
So one design question is whether to favor command-line arguments or internal project configuration. It feels to me like if you're going to make use of a supplemental remote cache, you generally want to do so (almost) all the time, and typing long options repeatedly is cumbersome. Yes, you can hide that in something that launches SCons for you, but still... at the least the options should be settable so you can `SetOption` them.
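For example, if the options were made settable, a project could pin them once in its SConstruct. A sketch only: `SetOption` is the real SCons API, but these option names are hypothetical, mirroring the proposed flags.

```python
# SConstruct sketch -- hypothetical option names
SetOption('remote_cache_fetch_enabled', True)
SetOption('remote_cache_url', 'http://cache.example.com:8080')
```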
Here's how `ccache` describes their setup (and yes, they do have command-line options): https://ccache.dev/manual/latest.html#_storage_interaction