NormalizingFlows.jl
Initial work on CUDA-compat
It seems overloading an external package's methods from within an extension doesn't work (which is probably for the better), so at the moment the CUDA tests are failing.
But if we move the overloads into the main package, they run, so we should probably do that from now on.
I think the CUDA extension now works properly; all current CUDA tests pass. The following code runs properly:
using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux # provides f32 and gpu
using NormalizingFlows
rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)
CUDA.functional() # check that a usable GPU is present
ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2]) # two planar layers
ts_g = gpu(ts)
flow_g = transformed(q0_g, ts_g)
x = rand(rng, q0_g) # good
However, there are still issues to fix: drawing multiple samples at once, and sampling from a Bijectors.TransformedDistribution. Minimal examples are as follows:
- sample multiple samples in one batch
xs = rand(rng, q0_g, 10) # ambiguous
error message:
ERROR: MethodError: rand(::CUDA.RNG, ::MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::Int64) is ambiguous.
Candidates:
rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate, Continuous}, n::Int64)
@ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:23
rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate}, n::Int64)
@ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:21
rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous}, n::Int64)
@ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:16
Possible fix, define
rand(::CUDA.RNG, ::Sampleable{Multivariate, Continuous}, ::Int64)
Stacktrace:
[1] top-level scope
@ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:42
- sample from Bijectors.TransformedDistribution:
y = rand(rng, flow_g) # ambiguous
error message:
ERROR: MethodError: rand(::CUDA.RNG, ::MultivariateTransformed{MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ComposedFunction{PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}) is ambiguous.
Candidates:
rand(rng::Random.AbstractRNG, td::MultivariateTransformed)
@ Bijectors ~/.julia/packages/Bijectors/cvMxj/src/transformed_distribution.jl:160
rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous})
@ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:7
Possible fix, define
rand(::CUDA.RNG, ::MultivariateTransformed)
Stacktrace:
[1] top-level scope
@ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:40
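For concreteness, the disambiguating methods the two errors suggest would look roughly like the following. This is a hedged sketch: the bodies are illustrative delegations (the second mirrors what Bijectors does for a generic AbstractRNG), and defining them means adding methods to functions and types owned by Distributions and Bijectors:
import Random
using CUDA, Distributions, Bijectors
# Batch case: naively delegate to the single-sample path (not optimized).
function Random.rand(rng::CUDA.RNG, s::Sampleable{Multivariate,Continuous}, n::Int)
    return reduce(hcat, [rand(rng, s) for _ in 1:n])
end
# Transformed case: sample the base distribution, then push through the flow.
function Random.rand(rng::CUDA.RNG, td::Bijectors.MultivariateTransformed)
    return td.transform(rand(rng, td.dist))
end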
This is partly because we are overloading methods and types that are not owned by this package. Any thoughts about how to address this @torfjelde @sunxd3?
I don't have an immediate solution other than the suggested fixes.
It is indeed a bit annoying, maybe we don't dispatch on rng?
> It is indeed a bit annoying, maybe we don't dispatch on rng?
Yeah, I agree. As a temporary solution, I'm thinking of adding an additional argument to Distributions.rand, something like device, to indicate whether to sample on CPU or GPU. But as a long-term fix, I'm now leaning towards your previous attempt, although that will require resolving some compatibility issues with Bijectors.
Honestly, IMO, the best solution right now is just to add our own rand to avoid the ambiguity errors.
If we want to properly support all of this, we'll have to go down the path of specializing the methods further, i.e. not do a Union as we've done now, which will take time and effort.
For now, just make a NormalizingFlows.rand_device or something, that just calls rand by default, but which we can then overload to our liking without running into ambiguity-errors.
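A minimal sketch of that indirection (the name and signatures are illustrative, not a final API):
using Random
# Owned by NormalizingFlows, so extensions can overload it freely;
# by default it just forwards to rand, leaving CPU behavior unchanged.
rand_device(rng::Random.AbstractRNG, args...) = rand(rng, args...)
# e.g. in NormalizingFlowsCUDAExt, with no ambiguity risk:
# NormalizingFlows.rand_device(rng::CUDA.RNG, s::MvNormal, n::Int) = <GPU-aware sampling>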
How does that sound?
> For now, just make a NormalizingFlows.rand_device or something, that just calls rand by default, but which we can then overload to our liking without running into ambiguity-errors.
Yeah, after thinking about it, I agree that this is probably the best way to go at this point. Working on it now!
I have adopted the NF.rand_device() approach. I think we now have a workaround. The following code runs properly:
using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux
import NormalizingFlows as NF
rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)
CUDA.functional()
ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])
ts_g = gpu(ts)
flow_g = transformed(q0_g, ts_g)
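Sampling then goes through the package-owned indirection, e.g. (assuming the batch signature mirrors rand):
x = NF.rand_device(rng, flow_g)       # single draw, stays on the GPU
xs = NF.rand_device(rng, flow_g, 10)  # batch of 10 draws, also a CuArray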
@torfjelde @sunxd3 Let me know if this attempt looks good to you. If so, I'll update the docs.
@torfjelde @zuhengxu, can you try to finish this PR if nothing major is outstanding?
@yebai @torfjelde @sunxd3 I’m wondering if it might be better to start by integrating Lux.jl and Reactant.jl (#40). This would require some modifications to the interface and how we interact with Bijectors.jl. Specifically, should we keep everything wrapped within the Bijectors.jl interface as we currently do, or should we consider relaxing the existing NF API? (If we can achieve the former without too much effort, that would be ideal!)
My main argument for full integration with Lux.jl and Reactant.jl is that it would make switching between different devices much more seamless.
What are your thoughts on this?
I am for starting work on integration with Lux (and Reactant, because Lux seems to work best with it). I would need to do some experiments before I can give a more informed opinion. But for this PR, ~I think it'll take too much effort to push it to anywhere complete~ (edit: this was referring to GPU support, but after a closer look, I think the scope of this PR is good, so I can help tidy up).
@sunxd3, can you extract the rand_device part into a new PR? We need compatibility with CUDA.jl, since it is the most mature GPU library in Julia.
Integration with other libraries is a good idea, but it should come after fixing compatibility with CUDA.jl.
Some benchmark: https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc
I'm very sorry - I don't think I have the knowledge to review this. Happy to try to look at specific questions if you have any though.
> I'm very sorry - I don't think I have the knowledge to review this.
No worries, @penelopeysm. It might take you a while to become familiar with TuringLang libraries. In the meantime, please feel free to comment from the RSE perspective!
Sorry for being late to the party; in my experience making AdvancedVI GPU-compatible, it was sufficient to simply swap the RNGs with the CUDA ones to make things work seamlessly. When this actually makes it into AdvancedVI, I will probably add a simple indirection for selecting which RNG gets supplied to rand. Here, on the other hand, was there any reason for defining CUDA-specific rand functions?
Thank you for all the comments and feedback! Here are some of my thoughts.
> Sorry for being late to the party; in my experience making AdvancedVI GPU-compatible, it was sufficient to simply swap the RNGs with the CUDA ones to make things work seamlessly. When this actually makes it into AdvancedVI, I will probably add a simple indirection for selecting which RNG gets supplied to rand. Here, on the other hand, was there any reason for defining CUDA-specific rand functions?
I don't think it works, or maybe I'm not doing it the right way. The following code doesn't work:
using Random, Distributions, LinearAlgebra
using CUDA
rng_g = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CuArray(zeros(T, 2)), CuArray(ones(T, 2)))
rand(rng_g, q0_g, 10)
This returns a CPU array:
julia> rand(rng_g, q0_g, 10)
2×10 Matrix{Float32}:
-0.322776 0.129745 1.08983 1.06144 -1.09644 -0.206741 0.273979 -0.518334 -1.35439 -1.79912
0.259409 -0.583322 0.601871 0.571779 1.41544 -1.19601 0.0845693 0.766666 -0.0962037 0.0355074
That's why we need this _device_specific_rand function that can dispatch on the RNG type.
Since we're only using Gaussians, randn should be more useful, to be precise.
I agree with this part. @sunxd3, maybe we can leverage the existing GPU compatibility of randn, e.g.:
julia> rng_g = CUDA.default_rng()
CUDA.RNG(0xd97bf5d1, 0x0000004a)
julia> T = Float64
Float64
julia> randn(rng_g, T, 2, 10)
2×10 CuArray{Float64, 2, CUDA.DeviceMemory}:
-1.26503 -0.152555 0.921247 -1.05752 0.581864 -2.30691 0.26545 -1.1928 0.061966 -1.42031
-1.81811 -0.393087 -1.85974 -1.39682 0.349741 -0.624886 -0.191652 -0.347422 -1.4505 1.74004
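For instance, an MvNormal sampler with identity covariance (as in the q0_g examples above) could just reparameterize a device-side randn draw. Illustrative sketch only; the name gpu_mvnormal_rand is hypothetical, and the actual extension code is linked in the next comment:
using CUDA, Distributions
# shift a device-side standard-normal draw by the mean;
# assumes identity covariance
gpu_mvnormal_rand(rng::CUDA.RNG, q::MvNormal, n::Int) =
    mean(q) .+ randn(rng, eltype(mean(q)), length(q), n)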
Yeah avoiding Distributions.jl like the plague is necessary.
yep, randn is used at https://github.com/TuringLang/NormalizingFlows.jl/blob/3f07fe54784e795cc3669a78363fc46c4a0cdffb/ext/NormalizingFlowsCUDAExt.jl#L43-L48
The benchmark (https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc) shows that sampling from MvNormal scales well, but sampling from flows does not yet.
> yep, randn is used at https://github.com/TuringLang/NormalizingFlows.jl/blob/3f07fe54784e795cc3669a78363fc46c4a0cdffb/ext/NormalizingFlowsCUDAExt.jl#L43-L48
> The benchmark (https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc) shows that sampling from MvNormal scales well, but sampling from flows does not yet.
Yeah, overall the code under the hood looks great to me (thanks again @sunxd3!). My only concern is that the sampling function that users get to call is _device_specific_rand, which is not very pleasing. I think wrapping this in a more intuitive name and dispatching on the rng type would be better.
Aside from this, since the sampling and logpdf functions work with CUDA, let's add some tests of the whole pipeline to check that planar flow training and evaluation on GPU all work properly. For example, run this planar flow test on GPU; a rough sketch follows.
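A hedged sketch of what that GPU test could look like. It assumes the rng-first method of train_flow, and the target logp and hyperparameters are placeholders, not the actual test's:
using CUDA, Flux, Bijectors, Distributions, LinearAlgebra
import NormalizingFlows as NF
rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)
flow_g = transformed(q0_g, gpu(reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])))
# placeholder standard-normal target, just for illustration
logp(x) = -sum(abs2, x) / 2
flow_trained, stats, _ = NF.train_flow(rng, NF.elbo, flow_g, logp, 10; max_iters=1_000)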
> My only concern is that the sampling function that users get to call is _device_specific_rand, which is not very pleasing.
I agree. My thought is that for now we can keep _device_specific_rand strictly internal, and we can expose it once we have better GPU support.
Re: flow test, happy to add it later.
My thought is that we don't have to generalize to allow any Distributions.jl distribution and can just use randn directly, no? I am aware that heavier-tailed base distributions have been proposed before; is that what we are trying to explicitly support?
> My thought is that we don't have to generalize to allow any Distributions.jl distribution and can just use randn directly, no? I am aware that heavier-tailed base distributions have been proposed before; is that what we are trying to explicitly support?
For now, randn is probably sufficient for most things we do. However, my taste is that an NF library should be as simple and flexible as just specifying the base distribution q0, the bijection B and its inverse (and maybe the loss function), with everything else handled by the library. The fact that we are restricted to a Gaussian base distribution is not ideal.
Let's introduce randn in a separate PR and merge this PR so we can move on.