Provide Bazel cache for TensorFlow builds

Open angerson opened this issue 5 years ago • 21 comments

Providing a TensorFlow build cache could be very helpful to external developers, and lower the barrier to entry of contributing to TF.

Some ideas for this we've discussed before are:

  • Offer Bazel RBE resources on behalf of SIG Build. This service is in alpha on GCP.
  • Provide a read-only build cache in a GCP bucket.
  • Provide devel_cache Docker images containing a build cache (these could be very large)
  • Provide code-and-cache volumes for the docker devel images.
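
As a rough illustration of the read-only bucket idea, here is a minimal sketch of the Bazel flags a contributor might pass. The bucket name `example-tf-build-cache` and the helper names are hypothetical; no such bucket is confirmed in this thread, and the build target is only an example. The flags `--remote_cache` and `--remote_upload_local_results` are existing Bazel options.

```python
from typing import List

# Hypothetical bucket name, used only for illustration.
EXAMPLE_BUCKET = "example-tf-build-cache"


def readonly_cache_flags(bucket: str = EXAMPLE_BUCKET) -> List[str]:
    """Return Bazel flags that read from a GCS-hosted cache without uploading."""
    return [
        # Bazel can talk to a GCS bucket over its HTTP endpoint.
        f"--remote_cache=https://storage.googleapis.com/{bucket}",
        # Never push local results, so the client treats the cache as read-only.
        "--remote_upload_local_results=false",
    ]


def bazel_build_command(target: str, bucket: str = EXAMPLE_BUCKET) -> List[str]:
    """Assemble a `bazel build` invocation that uses the read-only cache."""
    return ["bazel", "build", *readonly_cache_flags(bucket), target]


if __name__ == "__main__":
    # Example target only; substitute whatever you actually build.
    print(" ".join(bazel_build_command("//tensorflow/tools/pip_package:build_pip_package")))
```

Cache hits would still depend on the contributor's toolchain and environment matching whatever populated the cache, which is why the Docker-based options above pair naturally with this one.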

See also:

  • https://github.com/tensorflow/tensorflow/issues/39560
  • https://github.com/tensorflow/tensorflow/issues/4116
  • https://github.com/tensorflow/addons/issues/1414

angerson avatar May 14 '20 23:05 angerson

I'm looking into the feasibility of providing GCP resources (likely a long-term discussion) and devel_cache images as an evaluation (short-term, but no ETA).

angerson avatar May 14 '20 23:05 angerson

I just want to add a note that we may also need to solve this so that users can adopt the new GitHub Codespaces/VS Code Remote (https://github.com/tensorflow/addons/pull/1309) or Gitpod (https://github.com/tensorflow/tensorflow/pull/38755).

bhack avatar May 15 '20 11:05 bhack

It would also be nice, since many SIGs build on the GitHub Actions CI infrastructure (especially the ones with C++/CUDA custom ops), if we could find a way to reuse the Bazel cache to speed up CI builds. We tried storing the Bazel cache in the Actions cache for CI (https://github.com/tensorflow/addons/issues/1655), but it is not working. As you can see in that ticket, we also have an external request open on the GitHub Actions repo.

bhack avatar May 15 '20 13:05 bhack

It would be also nice as many SIGs builds using github Actions CI infra

This would be excellent! For reference, some time ago there was a discussion about improving Bazel cache support in GitHub Actions at https://github.com/actions/cache/issues/109

lgeiger avatar May 15 '20 13:05 lgeiger

@lgeiger Our ticket was https://github.com/actions/cache/issues/260. I don't know whether they could be merged or not.

bhack avatar May 15 '20 13:05 bhack

This will be orthogonal to the approved TF modularization RFC.

bhack avatar May 15 '20 16:05 bhack

We have started exploring internally whether we can share our RBE cache. We will also look into whether we can share a GCS cache.

gunan avatar May 29 '20 17:05 gunan

@gunan Thanks. I've spotted this candidate duplicate: https://github.com/tensorflow/tensorflow/issues/34719. You can probably find others in the TF repo.

bhack avatar May 29 '20 17:05 bhack

Yes, this has been a long-running problem for TF. And as TF gets bigger, it will only get worse.

gunan avatar May 29 '20 18:05 gunan

If this is going to take too long, can we find an intermediate goal, such as supporting Python-only PRs? I think that would be easier as a first step. What do you think?

bhack avatar Jun 15 '20 21:06 bhack

See what kind of bad hack I have to suggest: https://github.com/tensorflow/tensorflow/pull/41701#discussion_r460587524

bhack avatar Jul 26 '20 23:07 bhack

/cc @perfinion for @gunan's post in https://groups.google.com/a/tensorflow.org/forum/m/#!topic/developers/1OJLv2ew7pA

bhack avatar Aug 02 '20 20:08 bhack

I've tested your initial cache inside the official TF Docker devel image, but that image does not have the cross tools (d7/d8) that the RBE and custom-ops Dockerfiles/images have.

We have a thread in the SIG Build Gitter channel.

bhack avatar Aug 29 '20 16:08 bhack

This would be great. It is very frustrating that I have to spin up docker images and compile C++ code overnight just to test a single line of code change to a Python function. The barrier to entry to contributing is extremely high. What I often end up doing is copying test_xyz.py as test.py, editing the tensorflow install in my virtual env and running test.py then crossing my fingers that CI passes.
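
The workaround described above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not an endorsed workflow: the helper names `installed_package_dir` and `overlay_file` are hypothetical, it assumes the edited file is pure Python, and it assumes the file lives at the same relative path in the source checkout and in the installed package.

```python
import importlib.util
import shutil
from pathlib import Path
from typing import Optional


def installed_package_dir(package: str = "tensorflow") -> Optional[Path]:
    """Locate an installed top-level package directory without importing it."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


def overlay_file(source_root: Path, relative_path: str, install_dir: Path) -> Path:
    """Copy one edited file from the checkout over its installed counterpart."""
    src = source_root / relative_path
    dst = install_dir / relative_path
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return dst


if __name__ == "__main__":
    target = installed_package_dir()
    # Prints None when the package is not installed in this environment.
    print(target)
```

Changes that touch C++ code or generated files cannot be tested this way, so CI remains the only real check.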

adriangb avatar Nov 23 '20 05:11 adriangb

Also, when we mount the Bazel cache inside the official TensorFlow Docker devel container, we need to improve the stale-cache handling. Too often I see "Deleting stale sandbox base". Is it related to https://github.com/bazelbuild/bazel/issues/8525? It seems that one was closed in Bazel 3.4.0.

bhack avatar Nov 24 '20 12:11 bhack

In the meantime can we reply to https://groups.google.com/a/tensorflow.org/g/developers/c/1OJLv2ew7pA?

Is there a quick solution to iterate and modify the source code and run an example in the source dir without building and installing the wheel?

bhack avatar Jan 20 '21 13:01 bhack

It seems that we now have a read-only cache for TF IO, but still none for TensorFlow contributors:

https://github.com/tensorflow/io/pull/1294

bhack avatar Apr 01 '21 16:04 bhack

@bhack Given this situation, what is the best way to build TensorFlow while making small changes to the codebase? Can you please outline the procedure? TIA

AdityaKane2001 avatar Apr 19 '21 17:04 AdityaKane2001

With @angerson and @perfinion we are prototyping with https://github.com/tensorflow/tensorflow/pull/48421 (and https://github.com/tensorflow/build/pull/24) to continuously execute and monitor the external developer contribution experience/overhead (compile, lint and test).

/cc @theadactyl @nikitamaia

bhack avatar Apr 19 '21 17:04 bhack

I think we could close this and monitor the build reproducibility and cache efficiency in https://github.com/tensorflow/build/pull/48

bhack avatar Nov 30 '21 19:11 bhack

We now have a PR at https://github.com/tensorflow/tensorflow/pull/57630 if you want to support/review/improve this baseline.

bhack avatar Sep 06 '22 22:09 bhack