
First step of adding nccl headers to the build.

stellaraccident opened this issue 3 years ago • 13 comments

  • I ended up reworking the CUDA toolkit download because we really need the full toolkit now (or at least more headers from it), and the support that was implemented had some issues.
  • I made it conditional to use an external toolkit if CUDAToolkit_ROOT is specified and download/use an internal toolkit otherwise. This is a change of behavior but probably reasonable: I've never had much success with the pure auto-detection without setting a CUDAToolkit_ROOT to a pre-installed version.
  • Just uses the nccl make machinery to auto-gen the header files and make them accessible. This may be overkill: it is just doing token replacement of some version bits. Open to feedback here.

I suspect that before landing this, I will need to make a corresponding change on the CI to include more of the toolkit.

stellaraccident · Jun 18 '22 02:06

Thanks for helping iterate on this - great cleanup!

I think we may have some larger implications coming along with nccl, though, that we'll have to work through. May be worth a gvc with you, @mattwalsh, and me to brainstorm.

One is that NCCL doesn't work on Windows, meaning that if we start to use it unconditionally we lose Windows support. I'm not a fan of that, obviously :P I think we may want a dedicated CMake configuration option for enabling NCCL (IREE_HAL_DRIVER_CUDA_NCCL or something?), and when it's not available we'd have fallbacks, either at runtime or emitted in the compiler, just as we need to do with other backends; so all-reduce would also emit a plain reduce, and if NCCL wasn't compiled in or enabled at runtime we'd use that instead. There are people who have had it working under WSL (and others who haven't) and even some ports to Windows, but having a clean way to disable it seems universally good regardless (and easier).
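
To make that concrete, a quick sketch of the shape I'm imagining - the flag is the strawman name above and all three helper functions are invented for illustration, not real IREE API:

#include <stdbool.h>
#include "iree/hal/api.h"

// Invented helpers: the NCCL-backed collective, the emulated fallback, and a
// runtime availability check (e.g. did the libnccl dlopen succeed?).
iree_status_t iree_hal_cuda_nccl_all_reduce(iree_hal_buffer_t* buffer);
iree_status_t iree_hal_cuda_emulated_all_reduce(iree_hal_buffer_t* buffer);
bool iree_hal_cuda_nccl_is_available(void);

iree_status_t iree_hal_cuda_all_reduce(iree_hal_buffer_t* buffer) {
#if defined(IREE_HAL_DRIVER_CUDA_NCCL)
  // Compiled in, but still optional at runtime.
  if (iree_hal_cuda_nccl_is_available()) {
    return iree_hal_cuda_nccl_all_reduce(buffer);
  }
#endif  // IREE_HAL_DRIVER_CUDA_NCCL
  return iree_hal_cuda_emulated_all_reduce(buffer);
}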

The other big one is that NCCL only supports CUDA static linkage (or dynamic linkage, but not via dlopen, by way of CUDARTLIB https://github.com/NVIDIA/nccl/pull/220/files). That puts us in a weird situation as we use the driver API dynamically while NCCL uses the runtime API statically. Worse, it means we are violating ODR with any binary that embeds us and also uses NCCL. The implication is that we need to be set up for users to specify the NCCL library, rather than using our own except as a fallback for our own binaries - if we've got a submodule pinned to some version, it's unlikely to match the version the user's system is pinned to.

The static linkage and runtime API usage are rather unfortunate. If we're going to use NCCL it may radically change the way we have to integrate CUDA, and that's a little scary :(

benvanik · Jun 21 '22 14:06

Yeah, I was figuring it needed to be optional at some point, which is why I have the C guard -- it is just unconditionally true now, and could be driven by a CMake flag instead if needed, but I don't think we're going to need that. See below.

We can chat over VC, but I think I'm seeing the dynamic linking situation differently than you -- it looks OK to me. We're really only using the build-time nccl hookup to reference the headers for enums/types/etc. I just hadn't typed in the actual trampoline.

Since it is just a header dep, I think I can just ifdef it out on Windows -- no build goo needed. I may not even need to do that, since the headers should be fine on Windows, but obviously the library won't be findable at runtime there.

I was thinking that we could have a hard switch when creating the driver: if you create it with a "usenccl" or "ncclcommunicator" option, then we do all of the extra steps to initialize it, failing if it's not available. Then, yeah, generated code could switch on that somehow.
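
Roughly, hypothetically (the options struct, loader, and function names here are all invented; the real driver-creation API would differ):

#include <stdbool.h>
#include "iree/base/api.h"

// Invented options struct: the caller opts in to NCCL explicitly.
typedef struct {
  bool use_nccl;
} iree_hal_cuda_driver_options_t;

// Invented loader: dlopens the system libnccl and resolves the symbols we use.
iree_status_t iree_hal_cuda_nccl_load(void);

iree_status_t iree_hal_cuda_driver_initialize(
    const iree_hal_cuda_driver_options_t* options) {
  if (options->use_nccl) {
    // The user asked for NCCL: fail driver creation outright if it can't load.
    IREE_RETURN_IF_ERROR(iree_hal_cuda_nccl_load());
  }
  return iree_ok_status();
}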

stellaraccident · Jun 21 '22 15:06

Ah, so you're thinking of building your own nccl shared library with exported symbols we can dlopen? I think that's what we talked about, but looking through the nccl code pre-coffee this morning I didn't connect the dots, as they don't manage their symbol visibility (we'd probably need a linker script or something).

It doesn't solve the unfortunate situation where nccl will then be mutating thread-local state and loading the full cuda runtime, but if it stays optional unless we dlopen it, that's fine. We have #9301 we need to do regardless.

benvanik · Jun 21 '22 15:06

No, I was just going to dlopen the system one at runtime. I haven't had my coffee yet either, so I may be missing something, but this all seems set up correctly to work if I install the package or (as a user) build from source. NCCL dynamically links to cudart (it can also statically link, but down that path lies madness for us).
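
For concreteness, a minimal sketch of that runtime hookup, assuming POSIX dlopen and using ncclGetVersion as a stand-in for the full symbol table (function name invented):

#include <dlfcn.h>
#include <stddef.h>

// Resolved at runtime instead of linking against libnccl at build time.
// (A real trampoline would cover the whole set of nccl* symbols we call.)
static int (*nccl_get_version)(int* version) = NULL;

// Returns 0 on success; nonzero if the system libnccl isn't usable.
int iree_hal_cuda_nccl_try_load(void) {
  void* handle = dlopen("libnccl.so.2", RTLD_LAZY | RTLD_LOCAL);
  if (!handle) return 1;  // Not installed: callers keep NCCL paths disabled.
  nccl_get_version = (int (*)(int*))dlsym(handle, "ncclGetVersion");
  return nccl_get_version == NULL ? 1 : 0;
}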

JAX currently vendors a copy of nccl locally, but I think they would like to change to what I'm proposing.

stellaraccident · Jun 21 '22 16:06

Ahhh, I think maybe the disconnect is that us submoduling nccl makes me think we're pinned to the nccl version we're building ourselves, because that's what you'd generally use a submodule for. If we just need the header, and it's version-independent from the actual library we'll be using, I think we want to avoid the submodule and use FetchContent or something instead. That is to say, we shouldn't have a version-specified dependency unless we are actually using that version (which I think we agree we don't want to).

benvanik · Jun 21 '22 16:06

Sorry - I could have mentioned that. I was thinking the better way may be to just let ExternalProject pull it in. The pinning is more so that builds don't mysteriously start failing, etc.

I was also thinking of rolling a new iree-cuda-extras Python package with all of these bits built to our specifications, and then making the Python APIs set some flags to use those binaries somehow... But that would be more for user supportability in the field (i.e. "do this one simple thing and it will work") without polluting the core build: we would always have the core runtime be able to use the system-installed nccl (if stars align).

stellaraccident · Jun 21 '22 16:06

We only need the header for a handful of tiny enums - we should just replicate those and then we don't even need the header file. We can't use the function declarations in the header as we'll be dlsyming them so doing all this just to get 4 enums seems like overkill. This is what pytorch does: https://sourcegraph.com/github.com/pytorch/pytorch/-/blob/torch/csrc/cuda/nccl.h?L43

Here's what we need: https://github.com/NVIDIA/nccl/blob/7aa1c46fd551d5514474cfc55848219dc67ca683/src/nccl.h.in#L35-L41 https://github.com/NVIDIA/nccl/blob/7aa1c46fd551d5514474cfc55848219dc67ca683/src/nccl.h.in#L105-L150

We'll be wrapping the results up in iree_status_t anyway and mapping from HAL buffer types to the nccl types, so they're going to be sequestered in a switch statement in some helper file, and we may as well just redefine what we need.

Alternatively, given the size of the header (~350 lines) and license (BSD), I think it could just get checked in under third_party/ to avoid cmake goo, but I think I'm happier to rewrite those enums if it saves us the submodule or extra external header and removes the confusion :)

benvanik · Jun 21 '22 16:06

I was debating that point myself: convinced. I'll keep the cuda configuration goo on this patch as it is an improvement over what we had before, which I couldn't get to actually work right.

stellaraccident · Jun 21 '22 16:06

> Alternatively, given the size of the header (~350 lines) and license (BSD), I think it could just get checked in under third_party/ to avoid cmake goo, but I think I'm happier to rewrite those enums if it saves us the submodule or extra external header and removes the confusion :)

I'm confused about this bit. Wouldn't we still want to check this in under third_party/? Or do you mean that we'd just have a file containing a couple of enums rather than copying the whole thing? Given that it's <400 LOC, that doesn't really seem worth it, but :shrug:

GMNGeoffrey · Jun 21 '22 17:06

We would just add the handful of enums/values inline with the function declarations that use them. Given the PyTorch precedent, this is pretty straightforward, and the simplicity of this library argues for just doing that directly vs the more exaggerated route (i.e. as is used for cuda.h).
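
Something like the following, with the values copied from the nccl.h.in ranges linked above (worth double-checking against whichever NCCL version we actually load, since they're part of its ABI; the iree_nccl_* names are just illustrative):

// Local mirror of ncclResult_t: no nccl.h include required.
typedef enum {
  IREE_NCCL_SUCCESS = 0,               // ncclSuccess
  IREE_NCCL_UNHANDLED_CUDA_ERROR = 1,  // ncclUnhandledCudaError
  IREE_NCCL_SYSTEM_ERROR = 2,          // ncclSystemError
  IREE_NCCL_INTERNAL_ERROR = 3,        // ncclInternalError
  IREE_NCCL_INVALID_ARGUMENT = 4,      // ncclInvalidArgument
  IREE_NCCL_INVALID_USAGE = 5,         // ncclInvalidUsage
} iree_nccl_result_t;

// Same idea for the reduction ops used when mapping HAL collectives.
typedef enum {
  IREE_NCCL_SUM = 0,   // ncclSum
  IREE_NCCL_PROD = 1,  // ncclProd
  IREE_NCCL_MAX = 2,   // ncclMax
  IREE_NCCL_MIN = 3,   // ncclMin
} iree_nccl_red_op_t;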

stellaraccident · Jun 21 '22 17:06

Yeah agreed -- what we're talking about here, taken to its extreme, is effectively something like:

iree_status_t iree_hal_nccl_result_to_status(int result) {
  switch (result) {
    case /*ncclSuccess=*/0: return iree_ok_status();
    case /*ncclUnhandledCudaError=*/1:
      return iree_make_status(IREE_STATUS_INTERNAL, "unhandled CUDA error");
    // ... and so on for the remaining ncclResult_t values ...
    default:
      return iree_make_status(IREE_STATUS_UNKNOWN, "unknown nccl result");
  }
}

(though we still probably want enums for readability - but the return codes are part of their API and we don't have to copy their code verbatim to have enums with the same values in our client code)

benvanik · Jun 21 '22 17:06

Hey @stellaraccident, the final piece of https://github.com/google/iree/issues/8971 is to enable CUDA by default.

Given this change, is this still relevant, since this is changing the behavior of the CUDA dependencies?

KoolJBlack · Jun 21 '22 19:06

> Hey @stellaraccident, the final piece of #8971 is to enable CUDA by default.

> Given this change, is this still relevant, since this is changing the behavior of the CUDA dependencies?

I'd like to break just the CMake changes out of this PR and land them before enabling CUDA by default. I found and fixed a few issues that I wouldn't want to inflict on others without this patch (i.e. it was re-downloading on every cmake invocation, and I saw a lot of instability in the detection).

stellaraccident · Jun 24 '22 01:06