rules_cuda

Initial remote hermetic cuda toolchain

Open jsharpe opened this issue 2 years ago • 7 comments

Depends on #66

The BUILD files generated for each of the downloaded repos are a bit hacky at the moment and BUILD.remote_cuda is probably not fully updated (I only updated the bits I needed).

CUDA 12 requires bringing in libcu++ as an external dep; otherwise nv/target is missing.

I've not checked this for reproducibility, but anecdotally I've seen source paths in the debugger that are RBE worker dependent, so I suspect there are some reproducibility issues with the current setup.

The other thing likely missing is runfiles for runtime dependencies from the remote toolchain.

jsharpe avatar Mar 20 '23 23:03 jsharpe

What's the status here? This would be a very welcome feature for more than one project I'm contributing to! So far the solution is a custom archive with the CUDA SDK and some local patch to rules_cuda that makes it download that first and then set it up as if it was local. Having this work out of the box with rules_cuda and also only download what is actually used would be much nicer, of course!

ahans avatar Jun 28 '24 11:06 ahans

The code in this PR works (although it has bit-rotted a little; the branch I'm actually using is remote_toolchain in my fork of the repo), but it breaks support for the local setup use case. I don't really have the time at the moment to make both work in a single repo, so some help getting this working would be appreciated. There are likely some bits that can be broken out into separate PRs and landed independently to get us there in smaller steps, as this is a rather large change otherwise.

jsharpe avatar Jun 28 '24 12:06 jsharpe

I think this should be split into multiple steps.

  1. support composing multiple components (say, local_cccl, local_cublas, local_thrust, local_cub) into a unified local_cuda
    • this might allow reusing pip and conda installs
  2. support instantiating those local_* from local tarballs
  3. support downloading tarballs
  4. support parsing json at https://developer.download.nvidia.cn/compute/cuda/redist/
    • this effectively supports https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#tarball-and-zip-archive-deliverables
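
As a rough illustration of steps 3 and 4, here is a minimal Python sketch of picking only the needed components out of a redist manifest. The manifest excerpt, the component names, and the `relative_path` / `sha256` field names are assumptions made for the example, not a confirmed description of the published JSON, and `select_components` is a hypothetical helper rather than anything in rules_cuda.

```python
# Hypothetical manifest excerpt; the layout and field names are assumptions.
SAMPLE_MANIFEST = {
    "release_label": "12.x.y",
    "cuda_cccl": {
        "linux-x86_64": {"relative_path": "cuda_cccl/linux-x86_64/cccl.tar.xz", "sha256": "aa"},
    },
    "cuda_nvcc": {
        "linux-x86_64": {"relative_path": "cuda_nvcc/linux-x86_64/nvcc.tar.xz", "sha256": "bb"},
    },
    "libcublas": {
        "linux-x86_64": {"relative_path": "libcublas/linux-x86_64/cublas.tar.xz", "sha256": "cc"},
    },
}


def select_components(manifest, components, platform):
    """Return {component: {relative_path, sha256}} for the requested components only."""
    selected = {}
    for name in components:
        try:
            entry = manifest[name][platform]
        except KeyError as exc:
            # Fail loudly if a component or platform is absent from the manifest.
            raise KeyError(f"{name} has no archive for {platform}") from exc
        selected[name] = {
            "relative_path": entry["relative_path"],
            "sha256": entry["sha256"],
        }
    return selected


needed = select_components(SAMPLE_MANIFEST, ["cuda_nvcc", "cuda_cccl"], "linux-x86_64")
```

Only the listed components end up in `needed`, so a repository rule built along these lines would download just those archives.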

cloudhan avatar Jun 28 '24 12:06 cloudhan

Ah yes, I remember now: https://github.com/NVIDIA/cccl/issues/622 was the issue I raised so that I could effectively get CCCL into a bzlmod support repository, as a fully hermetic toolchain will require downloading these separately. IMO the cccl / thrust / cub targets shouldn't be inside local_cuda; they're just libraries, and the fact that they can be provided by a local_cuda repo is incidental. Note that thrust in particular can be used independently of a CUDA install; it works just as well in an OpenMP context on a host.

jsharpe avatar Jun 28 '24 12:06 jsharpe

local_cuda is a name inherited from the tf_runtime impl; it should have been called local_cuda_toolkit, so that every component stated in the doc has a position (maybe overridable).

The last step might not be as trivial as it seems. For example, in CI, you might want to override all those links with your own server. If we fetch the json instead of checking it in directly, we don't need the URL to be stable. And if we let the user provide the json url, we don't even require the URL itself to be stable (we do need the json schema to be stable, though).
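
One way to picture the user-provided json url idea: resolve each archive path relative to wherever the manifest was fetched from, so pointing at a mirror's manifest automatically redirects the archive downloads too. This is only a sketch under the assumption that manifest entries carry paths relative to the manifest's directory; the mirror host and file names below are made up.

```python
from urllib.parse import urljoin


def archive_url(manifest_url, relative_path):
    """Resolve an archive path against the directory the manifest lives in."""
    # urljoin replaces the last path segment of manifest_url (the JSON file
    # name) with relative_path, so archives follow the manifest's location.
    return urljoin(manifest_url, relative_path)


# Hypothetical CI mirror; no request is made here, this only builds the URL.
url = archive_url(
    "https://ci-mirror.example.com/cuda/redist/redistrib.json",
    "cuda_nvcc/linux-x86_64/nvcc.tar.xz",
)
```

With this shape, the manifest URL is the only location the user has to configure.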

cloudhan avatar Jun 28 '24 12:06 cloudhan

Unfortunately the json schema hasn't proven to be stable: it changed during the 12 series of releases. It was only the addition of some extra keys, but it was enough to break the logic I had in here.
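
A parser that reads only the fields it needs and skips everything else should survive that kind of key addition. Below is a minimal Python sketch of the idea, not the logic from this PR; the `relative_path` / `sha256` field names and the per-platform nesting are assumptions, and `new_key` / `another_new_key` stand in for whatever keys a later release might add.

```python
def index_archives(manifest, platform):
    """Map component name -> (relative_path, sha256), ignoring unknown keys."""
    archives = {}
    for name, entry in manifest.items():
        if not isinstance(entry, dict):
            continue  # top-level metadata such as a release label
        plat = entry.get(platform)
        if not isinstance(plat, dict):
            continue  # component not shipped for this platform
        path, digest = plat.get("relative_path"), plat.get("sha256")
        if isinstance(path, str) and isinstance(digest, str):
            archives[name] = (path, digest)
    return archives


# Hypothetical manifest with extra keys at both nesting levels.
manifest = {
    "release_label": "12.x.y",
    "cuda_cudart": {
        "version": "12.x.y",
        "new_key": [1, 2],
        "linux-x86_64": {
            "relative_path": "cudart.tar.xz",
            "sha256": "ab",
            "another_new_key": {},
        },
    },
    "cuda_docs": {"linux-sbsa": {"relative_path": "docs.tar.xz", "sha256": "cd"}},
}
archives = index_archives(manifest, "linux-x86_64")
```

The trade-off is that a genuinely incompatible schema change would silently produce an empty or partial index, so a caller would probably still want to check that every required component was found.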

jsharpe avatar Jun 28 '24 13:06 jsharpe

Thanks for the update, @jsharpe and @cloudhan, much appreciated! I will look at the mentioned branch in @jsharpe's fork. I don't care about supporting anything locally installed too much myself, but since that has been the only option for rules_cuda, I understand that it shouldn't be taken away. I will see if I can help with anything, but no promises.

ahans avatar Jun 28 '24 14:06 ahans