NNlib.jl icon indicating copy to clipboard operation
NNlib.jl copied to clipboard

register NNlibCUDA as a subpackage of NNlib

Open CarloLucibello opened this issue 4 years ago • 30 comments

CarloLucibello avatar Mar 16 '21 16:03 CarloLucibello

@JuliaRegistrator register() subdir=lib/NNlibCUDA

CarloLucibello avatar Mar 16 '21 16:03 CarloLucibello

Registration pull request created: JuliaRegistries/General/32119

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a NNlibCUDA-v0.1.0 -m "<description of version>" ca82fb23928c7ee7d08afb722718cf93be13f81c
git push origin NNlibCUDA-v0.1.0

JuliaRegistrator avatar Mar 16 '21 16:03 JuliaRegistrator

@JuliaRegistrator register() subdir=lib/NNlibCUDA

DilumAluthge avatar Mar 16 '21 17:03 DilumAluthge

Registration pull request updated: JuliaRegistries/General/32119

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a NNlibCUDA-v0.1.0 -m "<description of version>" fb5377811b3bd37d87a504d5506cc676826b9648
git push origin NNlibCUDA-v0.1.0

JuliaRegistrator avatar Mar 16 '21 17:03 JuliaRegistrator

@DilumAluthge there is also failing import NNlibCUDA with CUDA 2.6, while CUDA 3.0.0 not being released yet. I just wanted to start the registration process to make testing against CUDA#master easier, but if that's a problem we can just wait for CUDA 3.0 release

CarloLucibello avatar Mar 16 '21 17:03 CarloLucibello

Yeah no worries at all! We can merge it manually in 3 days.

DilumAluthge avatar Mar 16 '21 17:03 DilumAluthge

It might make sense to register https://github.com/FluxML/NNlibCUDA.jl instead. We wouldn't have to deal with registering it oddly and it keeps the projects separate in their own little silos. Currently NNlibCUDA doesn't get used by NNlib internally and developing it means getting unintended code alongside.

DhairyaLGandhi avatar Mar 16 '21 17:03 DhairyaLGandhi

We wouldn't have to deal with registering it oddly

don't see any oddity here, this subpackage thing is actually quite cool and already well supported by julia's toolings

it keeps the projects separate in their own little silos

That's exactly what we don't want since the two packages are tightly coupled. When we change NNlib we want to make sure everything is fine with NNlibCUDA by running its tests, and sometimes we have to make changes that involve both packages (adding some function requiring also a cuda kernel). Having NNlibCUDA as a subpackage is much safer since it allows combined CI very easily (I already put that in place), makes contributing easier, and relieves a lot of the maintenance burden.

Currently NNlibCUDA doesn't get used by NNlib internally and developing it means getting unintended code alongside.

I don't see what's the problem in also seeing the folder lib/NNlibCUDA when developing, if anyone gets confused by that he probably shouldn't be developing NNlib at all. Actually, as I said above, the possibility to make changes to both packages in a single PR is great.

CarloLucibello avatar Mar 16 '21 18:03 CarloLucibello

Not really, since most of the work in NNlib proper is not focused on CuDNN at all, in fact, the code in the two achieve very different goals. You typically either work on the kernels or the plumbing (NNlibCUDA is a bridge to CUDNN), rarely both. Most new kernels shouldn't be written twice anyway. Waiting around for GPU ci when not changing any GPU code is also a thing here, since most nnlib code isn't meant for GPUs.

if anyone gets confused by that he probably shouldn't be developing NNlib at all.

I don't think that's the message we want to send to contributors...

DhairyaLGandhi avatar Mar 16 '21 18:03 DhairyaLGandhi

I'm a little late to this, but can someone explain why NNlibCUDA exists as a subdir and a separate repo and why there wasn't a consensus made about which one it should be before now? Getting a whiff of some communications breakdown or personal enmity being involved and I don't like it.

ToucheSir avatar Mar 16 '21 18:03 ToucheSir

I needed something to get testing with CUDA#master, Julia 1.6 and share that work easily and quickly. See also https://github.com/FluxML/Flux.jl/tree/dg/cuda16 which gets flux+ cuda + 1.6 usable.

Either way is fine in the long run as long as we get our stack functional :)

DhairyaLGandhi avatar Mar 16 '21 19:03 DhairyaLGandhi

I like to be able to do ] add NNlibCUDA#branch can we do that with subdirs as well?

DhairyaLGandhi avatar Mar 16 '21 19:03 DhairyaLGandhi

@ToucheSir Dhairya and I disagree on a lot of stuff, which is of course unfortunate for two people working on the same project. There is nothing about "personal" and "enmity" involved here though.

can someone explain why NNlibCUDA exists as a subdir and a separate repo and why there wasn't a consensus made about which one it should be before now?

The first comment pointing about the possibility of using multi-package repos for this stuff was https://github.com/FluxML/NNlib.jl/issues/224#issuecomment-659186464 and then a follow-up at the end of that thread. Then there was discussion in CUDA.jl's PRs/issues and zulip, then there was #286 and here we are. So discussion was scattered but there were some entry points to it, no one at any point before now expressed negative feedback against multi-package repo, so I volunteered a few hours of my time and did this very boring work that needed to be done in the best way I could think of. I did make some decisions and I could have communicated more or better along the way, but I think the process was perfectly reasonable by community-driven project standards.

As someone doing a lot of maintenance here, I know very well that having a separate NNlibCUDA.jl would be an additional maintenance hassle (a finite and non-small fraction of which weighing on me), not a big deal but something I'd rather avoid if possible.

I see a few pros in having the multi-repo in one package approach. The only potential con I see is

I like to be able to do ] add NNlibCUDA#branch can we do that with subdirs as well?

but probably that actually works.

Finally, as Dhairyia said, we can split out a repo at any point, nothing here is irreversible, but the fact that someone already put the things in a functional state and with a nicely integrated CI (so better situation than before) is something that should be exploited.

Most new kernels shouldn't be written twice anyway. Waiting around for GPU ci when not changing any GPU code is also a thing here, since most nnlib code isn't meant for GPUs.

We have a relevant fraction of code (e.g. activation functions) that works on both gpu and cpu and should be tested on both

CarloLucibello avatar Mar 17 '21 03:03 CarloLucibello

I like to be able to do ] add NNlibCUDA#branch can we do that with subdirs as well?

Yes.

GunnarFarneback avatar Apr 01 '21 20:04 GunnarFarneback

I think setting up ci isn't something new. We also shouldn't have nnlib depending on so many heavy deps like cuda. That makes it very heavy for the rest of the ecosystem to use.

DhairyaLGandhi avatar Apr 01 '21 20:04 DhairyaLGandhi

For what it's worth, having a package in a subdir doesn't imply a dependency. Personally I wouldn't place a subdir package inside another package (rather place them as siblings in the repo) since it means that the subdir package code will be distributed with all copies of the main package, but the dependencies of the subdir package still don't affect the main package at all. Well, unless the subdir package is a dependency of the main package, but then you would get exactly the same effect in a separate repository.

GunnarFarneback avatar Apr 01 '21 20:04 GunnarFarneback

I don't think the cuda version is a dep, so it's good that those are separate

DhairyaLGandhi avatar Apr 01 '21 20:04 DhairyaLGandhi

I think setting up ci isn't something new. We also shouldn't have nnlib depending on so many heavy deps like cuda. That makes it very heavy for the rest of the ecosystem to use.

As @GunnarFarneback said, NNlib's doesn't depend on anything CUDA-related, NNlibCUDA is its own package (depending on NNlib).

Here we just have the same scaffolding of KernelAbstractions with CUDAKernels and ROCKernels https://github.com/JuliaGPU/KernelAbstractions.jl/tree/master/lib which seems nicely scalable, since we may soon have NNlibCUDA and NNlibAMDGPU.

CarloLucibello avatar Apr 02 '21 20:04 CarloLucibello

So I´d like to give this package-in-a-subdir a try, there is no problem in pulling out at any moment anyway. Maybe the people from KernelAbstractions @vchuravy @jpsamaroo can tell us whether their experience with this has been positive or not?

CarloLucibello avatar Apr 10 '21 12:04 CarloLucibello

there is no problem in pulling out at any moment anyway

As I've mentioned elsewhere, moving the location of a subdirectory package is nontrivial. You have to make sure that the new location of the package includes all of the Git tree SHAs that have already been registered in the General registry.

So I would recommend that you first make a definitive decision on which of the two options you want:

  1. Subdirectory package
  2. Package in its own repo

And then register based on that decision, with the knowledge that switching from option 1 to option 2 requires a nontrivial amount of work.

DilumAluthge avatar Apr 10 '21 12:04 DilumAluthge

@DilumAluthge there is something I don´t understand, from the answer to this question, the relocation process seemed quite easy...

CarloLucibello avatar Apr 10 '21 12:04 CarloLucibello

Idk, I've never done it myself, but here's one data point: SnoopCompileBot was originally a subdirectory package. At one point, the maintainer wanted to move it to its own repo. The maintainer found the process difficult enough that instead of moving it to its own repo, they just made a new package called CompileBot.

You can probably find that conversation by searching for SnoopCompileBot and CompileBot in PRs to the General registry.

Again, I've never done this personally, but if it was painful enough that one package author gave up and just made a new package, I find that concerning.

I would recommend at least looking over the historical discussion for SnoopCompileBot/CompileBot.

DilumAluthge avatar Apr 10 '21 13:04 DilumAluthge

ok, that seems something to consider. Let´s wait for some feedback from @vchuravy and @jpsamaroo

CarloLucibello avatar Apr 10 '21 13:04 CarloLucibello

I personally don't think we've had enough time to evaluate the merits of the subpackage approach, since KA 0.6 was only released 11 days ago. I'm personally in favor of the separate package approach, which you can always "promote" to a subpackage later without much difficulty (there are tools to combine git histories of different repos too).

jpsamaroo avatar Apr 10 '21 17:04 jpsamaroo

it seems I'm the only one hyped by this subdir approach, let's move everything to the separate repo https://github.com/FluxML/NNlibCUDA.jl then. I'll make a PR removing NNlibCUDA from here soon

CarloLucibello avatar Apr 12 '21 09:04 CarloLucibello

Oh! Unfortunately I didn't see this discussion in time. Now it's probably too late (hopefully not?), but I am very much in favor of keeping things as a single package with sub-packages. I maybe don't overlook the whole picture, but here are two reasons / examples:

  1. I just ported a C cuda kernel to julia into the NNlibCUDA sub-package. Before this was 1 fork, 1 branch, 1 PR. Now it's two each, which is annoying.
  2. This gets even worse further down the road: Consider we have NNlib, NNlibCUDA, NNlibAMD & NNlibKA. Let's suppose I have written two special GPU kernels for CUDA & AMD and now want to merge them into a single one using kernel abstractions. Maybe I even wrote a KA kernel for CPU, so I can remove the core CPU code from NNlib. So I'll have to make 4 forks, 4 PRs etc and each of the packages has its own version. I fear this will be versioning and maintenance hell. You'd have to merge everything more or less at once and tag new versions & compat entries or the packages become incompatible.

Did I get sth completely wrong?

At the danger of riding a dead horse here, but maybe @timholy can share his experiences first hand? Maybe his reasons for moving it out don't apply here.

maxfreu avatar Apr 21 '21 09:04 maxfreu

As the maintainer of this package and others, I understand these concerns, deal with the consequences and think this was the right move. You see:

  1. setting up the development environment is nontrivial esp over subpackages with different or conflicting needs (thus discouraging contributions) (see https://github.com/FluxML/NNlib.jl/pull/296)
  2. The disruptions to NNlib itself over something trivial (https://github.com/JuliaRegistries/General/pull/32119#issuecomment-816976264)
  3. That recent NNlib releases contained GPU dependencies (thus saying the setup is prone to errors)(https://github.com/FluxML/NNlib.jl/commit/5c2690bf9fbd67b34d86b9f7cb02f2ebd91b8e5b#commitcomment-49262320) its fair to say that this kind of change requires some very strong gains of whether it is worth sacrificing the stability of over 250 packages in the ecosystem.

Remeber too that a design that expects an NNlibX necessitates some amount of code/ logic repetition. It's also fair to say that NNlib devs shouldn't be expected to maintain every single spin-off, thus also making the monorepo difficult to scale and QA.

The GPU backend really does achieve different end goals so it makes sense to be separate. In fact, when the time comes, it would be better to remove the CUDA moniker and have it depend on GPUCompiler only, thus obviating the need for an AMD and KA flavour.

The versioning and following proper semver can be handled like we do with so many dependencies in the ecosystem. Yes it would mean working over multiple repos, but the versioning standard isn't any different for sub packages, so it's the same regardless of where the code lives.

Recently, a faulty NNlib was also the reason behind the breakage of the tutorials that involve plotting, which also includes all the SciML ones, so I would want to ensure some amount of stability, which does require us to separate the concerns of different sections of code.

DhairyaLGandhi avatar Apr 21 '21 09:04 DhairyaLGandhi

Ok, thank you for the write-up. I'm still not convinced, but let's see where it takes us :)

maxfreu avatar Apr 21 '21 10:04 maxfreu

reopening this after a few months of working with NNlibCUDA as a separate package. In the work for gather/scatter we add to ping-pong between the two libraries multiple times, it has been very annoying, so I encourage people to reconsider the option of having NNlibCUDA as a subpackage. Also see #332.

Moreover, according to https://github.com/FluxML/Flux.jl/pull/1566/files we are going to have FluxCUDA, FluxAMDGPU etc.. as subpackages of Flux, so NNlib should reflect that.

CarloLucibello avatar Jul 11 '21 10:07 CarloLucibello

No decision has been made on that front. The issue seems to be that there isn't so much a need for the cuda package at all but that the code for gather should have been vetted to work with the Flux counterpart before merging.

DhairyaLGandhi avatar Jul 11 '21 11:07 DhairyaLGandhi