
Initial work on CUDA-compat

Open torfjelde opened this issue 2 years ago • 6 comments

It seems overloading an external package's methods from an extension doesn't work (which is probably for the better), so atm the CUDA tests are failing.

But if we move the overloads into the main package, they run. So we should probably do that from now on.

torfjelde avatar Aug 10 '23 15:08 torfjelde

I think the CUDA extension now works properly; all the current CUDA tests pass. The following code runs properly:

using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux  # provides `f32` and `gpu`
using NormalizingFlows

rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)

CUDA.functional()  # confirm a CUDA device is available
ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])  # planar layers on CPU
ts_g = gpu(ts)                                                # move parameters to GPU
flow_g = transformed(q0_g, ts_g)

x = rand(rng, q0_g) # good

However, there are still issues to fix: drawing multiple samples at once, and sampling from a Bijectors.TransformedDistribution. Minimal examples are as follows:

  • sample multiple samples in one batch
xs = rand(rng, q0_g, 10) # ambiguous 

error message:

ERROR: MethodError: rand(::CUDA.RNG, ::MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::Int64) is ambiguous.

Candidates:
  rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate, Continuous}, n::Int64)
    @ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:23
  rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate}, n::Int64)
    @ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:21
  rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous}, n::Int64)
    @ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:16

Possible fix, define
  rand(::CUDA.RNG, ::Sampleable{Multivariate, Continuous}, ::Int64)

Stacktrace:
 [1] top-level scope
   @ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:42
  • sample from Bijectors.TransformedDistribution:
y = rand(rng, flow_g) # ambiguous

error message:

ERROR: MethodError: rand(::CUDA.RNG, ::MultivariateTransformed{MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ComposedFunction{PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}) is ambiguous.

Candidates:
  rand(rng::Random.AbstractRNG, td::MultivariateTransformed)
    @ Bijectors ~/.julia/packages/Bijectors/cvMxj/src/transformed_distribution.jl:160
  rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous})
    @ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:7

Possible fix, define
  rand(::CUDA.RNG, ::MultivariateTransformed)

Stacktrace:
 [1] top-level scope
   @ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:40

This is partially because we are overloading methods and types that are not owned by this package. Any thoughts about how to address this @torfjelde @sunxd3?

zuhengxu avatar Aug 15 '23 23:08 zuhengxu

I don't have an immediate solution other than the suggested fixes. It is indeed a bit annoying, maybe we don't dispatch on rng?

sunxd3 avatar Aug 16 '23 17:08 sunxd3

It is indeed a bit annoying, maybe we don't dispatch on rng?

Yeah, I agree. As a temporary solution, I'm thinking of adding an additional argument to Distributions.rand, something like device, to indicate whether we're on CPU or GPU. But as a long-term fix, I'm now leaning towards your previous attempts, although this will require resolving some compatibility issues with Bijectors.

zuhengxu avatar Aug 16 '23 21:08 zuhengxu

Honestly, IMO, the best solution right now is just to add our own rand to avoid ambiguity errors.

If we want to properly support all of this, we'll have to go down the path of specializing the methods further, i.e. not use a Union as we do now, which will take time and effort.

For now, just make a NormalizingFlows.rand_device or something, that just calls rand by default, but which we can then overload to our liking without running into ambiguity-errors.
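A minimal sketch of what that indirection could look like (illustrative only, not the final implementation):

using Random

# NormalizingFlows.jl owns this function, so the CUDA extension can add methods
# for CUDA.RNG without colliding with Distributions' or Bijectors' `rand` methods.
rand_device(rng::Random.AbstractRNG, args...) = rand(rng, args...)

# In ext/NormalizingFlowsCUDAExt.jl one could then define, e.g.,
#   NormalizingFlows.rand_device(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate,Continuous}) = ...
# with no ambiguity, since no other package defines methods of rand_device.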

How does that sound?

torfjelde avatar Aug 17 '23 12:08 torfjelde

For now, just make a NormalizingFlows.rand_device or something, that just calls rand by default, but which we can then overload to our liking without running into ambiguity-errors.

Yeah, after thinking about it, I agree that this is probably the best way to go at this point. Working on it now!

zuhengxu avatar Aug 20 '23 00:08 zuhengxu

I have adopted the NF.rand_device() approach, so I think we now have a workaround. The following code runs properly:

using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux
import NormalizingFlows as NF

rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)

CUDA.functional()
ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])
ts_g = gpu(ts)
flow_g = transformed(q0_g, ts_g)
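
Presumably sampling then goes through the new entry point rather than plain rand, something along these lines (hypothetical usage; the exact signatures may differ):

x = NF.rand_device(rng, q0_g)    # draw from the base distribution on the GPU
y = NF.rand_device(rng, flow_g)  # draw from the transformed flow on the GPU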

@torfjelde @sunxd3 Let me know if this attempt looks good to you. If so, I'll update the docs.

zuhengxu avatar Aug 22 '23 05:08 zuhengxu

@torfjelde @zuhengxu, can you try to finish this PR if nothing major stands in the way?

yebai avatar Mar 05 '25 21:03 yebai

@yebai @torfjelde @sunxd3 I’m wondering if it might be better to start by integrating Lux.jl and Reactant.jl (#40). This would require some modifications to the interface and how we interact with Bijectors.jl. Specifically, should we keep everything wrapped within the Bijectors.jl interface as we currently do, or should we consider relaxing the existing NF API? (If we can achieve the former without too much effort, that would be ideal!)

My main argument for full integration with Lux.jl and Reactant.jl is that it would make switching between different devices much more seamless.

What are your thoughts on this?

zuhengxu avatar Mar 06 '25 01:03 zuhengxu

I am for starting work on the integration with Lux (and Reactant, because Lux seems to work best with it). I would need to do some experiments before I can give more informed opinions. But for this PR, ~I think it'll take too much effort to push it anywhere near complete~ (edit: this was referring to GPU support, but after a closer look, I think the scope of this PR is good, so I can help tidy up).

sunxd3 avatar Mar 07 '25 08:03 sunxd3

@sunxd3, can you extract the rand_device part into a new PR? We need compatibility with CUDA.jl since it is the most mature GPU library in Julia.

Integration with other libraries is a good idea, but it should be done after fixing the compatibility of CUDA.jl.

yebai avatar Mar 07 '25 10:03 yebai

Some benchmark: https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc

sunxd3 avatar Mar 14 '25 10:03 sunxd3

I'm very sorry - I don't think I have the knowledge to review this. Happy to try to look at specific questions if you have any though.

penelopeysm avatar Mar 14 '25 14:03 penelopeysm

I'm very sorry - I don't think I have the knowledge to review this.

No worries, @penelopeysm. It might take you a while to become familiar with TuringLang libraries. In the meantime, please feel free to comment from the RSE perspective!

yebai avatar Mar 14 '25 14:03 yebai

Sorry for being late to the party; in my experience making AdvancedVI GPU-compatible, it was sufficient to simply swap the RNGs with the CUDA ones to make things work seamlessly. When this actually makes it into AdvancedVI, I will probably add a simple indirection for selecting which RNG gets supplied to rand. Here, on the other hand, was there any reason for defining CUDA-specific rand functions?

Red-Portal avatar Mar 15 '25 18:03 Red-Portal

Thank you for all the comments and feedback! Here are some of my thoughts.

Sorry for being late to the party; in my experience making AdvancedVI GPU-compatible, it was sufficient to simply swap the RNGs with the CUDA ones to make things work seamlessly. When this actually makes it into AdvancedVI, I will probably add a simple indirection for selecting which RNG gets supplied to rand. Here, on the other hand, was there any reason for defining CUDA-specific rand functions?

I don't think that works, or maybe I'm not doing it the right way. The following code doesn't work:

using Random, Distributions, LinearAlgebra
using CUDA

T = Float32
rng_g = CUDA.default_rng()
q0_g = MvNormal(CuArray(zeros(T, 2)), CuArray(ones(T, 2)))

rand(rng_g, q0_g, 10)

This returns a CPU array:

julia> rand(rng_g, q0_g, 10)
2×10 Matrix{Float32}:
 -0.322776   0.129745  1.08983   1.06144   -1.09644  -0.206741  0.273979   -0.518334  -1.35439    -1.79912
  0.259409  -0.583322  0.601871  0.571779   1.41544  -1.19601   0.0845693   0.766666  -0.0962037   0.0355074

That's why we need this _device_specific_rand function that can dispatch on the RNG type.

Since we're only using Gaussians, randn should be more useful to be precise.

I agree with this part. @sunxd3, maybe we can leverage the existing GPU compatibility of randn, e.g.:

julia> rng_g = CUDA.default_rng()
CUDA.RNG(0xd97bf5d1, 0x0000004a)

julia> T = Float64
Float64

julia> randn(rng_g, T, 2, 10)
2×10 CuArray{Float64, 2, CUDA.DeviceMemory}:
 -1.26503  -0.152555   0.921247  -1.05752  0.581864  -2.30691    0.26545   -1.1928     0.061966  -1.42031
 -1.81811  -0.393087  -1.85974   -1.39682  0.349741  -0.624886  -0.191652  -0.347422  -1.4505     1.74004
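
For example, a device-aware sampler for a mean-field Gaussian could be a thin wrapper around randn; a rough sketch (sample_mvnormal is a made-up name, not part of the package):

using Random

# Draw n samples from N(μ, Diagonal(σ.^2)) on whatever device μ and σ live on.
# With rng::CUDA.RNG, randn returns a CuArray, so everything stays on the GPU.
function sample_mvnormal(rng::Random.AbstractRNG, μ::AbstractVector, σ::AbstractVector, n::Int)
    z = randn(rng, eltype(μ), length(μ), n)
    return μ .+ σ .* z
end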

zuhengxu avatar Mar 17 '25 01:03 zuhengxu

Yeah avoiding Distributions.jl like the plague is necessary.

Red-Portal avatar Mar 17 '25 01:03 Red-Portal

yep, randn is used at https://github.com/TuringLang/NormalizingFlows.jl/blob/3f07fe54784e795cc3669a78363fc46c4a0cdffb/ext/NormalizingFlowsCUDAExt.jl#L43-L48

The benchmark (https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc) shows that sampling from MvNormal scales well, but sampling from flows does not yet.

sunxd3 avatar Mar 17 '25 09:03 sunxd3

yep, randn is used at

https://github.com/TuringLang/NormalizingFlows.jl/blob/3f07fe54784e795cc3669a78363fc46c4a0cdffb/ext/NormalizingFlowsCUDAExt.jl#L43-L48

The benchmark (https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc) shows that sampling from MvNormal scales well, but sampling from flows does not yet.

Yeah, overall the code under the hood looks great to me (thanks again @sunxd3!). My only concern is that the sampling function that users get to call is _device_specific_rand, which is not very pleasing. I think wrapping this in a more intuitive name that dispatches on the RNG type would be cleaner.

Aside from this, since the sampling and logpdf functions work with CUDA, let's add some tests of the whole pipeline to check that planar flow training and evaluation on GPU work properly. For example, run this planar flow test on GPU; a rough sketch of such a smoke test is below.
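
A rough sketch of what such a smoke test could look like (this assumes _device_specific_rand is the internal entry point from the CUDA extension; exact signatures may differ):

using CUDA, Flux, Bijectors, Distributions, LinearAlgebra, Test
import NormalizingFlows as NF

rng = CUDA.default_rng()
q0_g = MvNormal(CUDA.zeros(Float32, 2), I)
flow_g = transformed(q0_g, gpu(reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])))

y = NF._device_specific_rand(rng, flow_g)  # hypothetical call, dispatching on CUDA.RNG
@test y isa CuArray
@test isfinite(logpdf(flow_g, y))          # density evaluation should also stay on GPU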

zuhengxu avatar Mar 17 '25 18:03 zuhengxu

My only concern is that the sampling function that users get to call is _device_specific_rand, which is not very pleasing.

I agree, my thought is that for now we can keep _device_specific_rand strictly internal. And we can expose it when we have better GPU support.

Re: flow test, happy to add it later.

sunxd3 avatar Mar 18 '25 17:03 sunxd3

My thought is that we don't have to generalize to allow any Distributions.jl distribution and can just use randn directly, no? I am aware that using heavier-tailed base distributions has been proposed before; is that what we are trying to explicitly support?

Red-Portal avatar Mar 18 '25 17:03 Red-Portal

My thought is that we don't have to generalize to allow any Distributions.jl distribution and just use randn directly no? I am aware that using heavier-tailed base distributions have been proposed before, is that what we are trying to explicitly support?

For now, randn is probably sufficient for most things we do. However, my taste is that an NF library should be as simple and flexible as just specifying the base dist q0, the bijection B and its inverse (and maybe the loss function), with the rest handled by the library. Being restricted to a Gaussian base dist is not super ideal.

zuhengxu avatar Mar 18 '25 17:03 zuhengxu

Let's introduce randn in a separate PR and merge this PR so we can move on.

yebai avatar Mar 20 '25 21:03 yebai