
GPU support?

haampie opened this issue 5 years ago • 19 comments

I was experimenting with Lanczos interpolation (convolution with the kernel sinc(x)sinc(x/a)) of 3D data. In my use case I typically have O(n^3) data points and O(n^2) interpolation points. So far, a very mediocre CUDA implementation seems to be faster than Interpolations.jl's quadratic B-spline:

https://nextjournal.com/stabbles/cuda-interpolation

I'm sure the CUDA implementation can be improved quite a bit. Would it be possible to run Interpolations.jl's functions on the GPU? I figure the bottleneck is mostly the LU decomposition?
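For concreteness, this is the kind of kernel I mean (a minimal 1D CPU sketch; the function names are mine, not from any package):

```julia
# Lanczos-a kernel: L(x) = sinc(x) * sinc(x/a) for |x| < a, else 0.
# Julia's sinc is the normalized sinc, sin(pi*x)/(pi*x), so sinc(0) == 1.
lanczos(x, a) = abs(x) < a ? sinc(x) * sinc(x / a) : 0.0

# 1D Lanczos interpolation of samples y, assumed to sit at integer positions 1:length(y).
function lanczos_interp(y::AbstractVector, x::Real; a::Integer = 3)
    lo, hi = floor(Int, x) - a + 1, floor(Int, x) + a
    s = 0.0
    for i in max(lo, 1):min(hi, length(y))
        s += y[i] * lanczos(x - i, a)
    end
    return s
end
```

Since the kernel vanishes at nonzero integers, this reproduces the samples exactly at the sample points; the 3D version is just a separable product of three such sums.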

haampie avatar Feb 11 '20 00:02 haampie

Similar: #356. Have you profiled? Single-threaded I see slightly more time for the interpolation itself than for the prefiltering; the random access pattern is pretty murderous on cache performance.

Would be great to have GPU versions under a @require.

timholy avatar Feb 25 '20 09:02 timholy

There's been some recent work on (nearest neighbor/linear/cubic) texture interpolation (https://github.com/JuliaGPU/CUDA.jl/pull/460#issuecomment-701441464), although the API is still pretty rough/experimental. It might already be enough to be worth hooking into Interpolations.jl, though.

stillyslalom avatar Jan 15 '21 21:01 stillyslalom

CUDA.jl appears to import Interpolations.jl. That is, CUDA.jl appears to hook into Interpolations.jl rather than vice versa.

Are you proposing going the other way around? Perhaps there should be a distinct CUDAInterpolations.jl package which implements the intersection of the two packages?

mkitti avatar Jan 15 '21 21:01 mkitti

No, right now Interpolations.jl is only used for testing: https://github.com/JuliaGPU/CUDA.jl/search?q=using+interpolations

The standard interface design seems to be, as @timholy noted, siloing additional functionality within a module that's loaded by @require.

stillyslalom avatar Jan 15 '21 23:01 stillyslalom

I see, Interpolations is imported there just for the tests.

https://github.com/JuliaGPU/CUDA.jl/blob/465535069383ce66b137c6c64f188e4b9164ec15/test/texture.jl

I'm still leaning towards a distinct CUDAInterpolations.jl package that would depend on Interpolations and CUDA but extend the Interpolations interface for CuArray. The main reason is that such a package would require distinct continuous-integration infrastructure that this package does not currently need.

mkitti avatar Jan 15 '21 23:01 mkitti

It's totally reasonable to try to keep the CI infra simple for this package. Even with the functionality in a separate package, though, it should still be possible to use Requires.jl to automatically load any relevant CUDAInterpolations.jl methods without needing to explicitly load the package.

stillyslalom avatar Jan 16 '21 00:01 stillyslalom

> It's totally reasonable to try to keep the CI infra simple for this package. Even with the functionality in a separate package, though, it should still be possible to use Requires.jl to automatically load any relevant CUDAInterpolations.jl methods without needing to explicitly load the package.

I'll have to see what CUDAInterpolations.jl ends up looking like. I usually have to manage host <-> device transfers pretty carefully so that the overhead does not overwhelm the processing speedup.

mkitti avatar Jan 16 '21 01:01 mkitti

For my use case, the data would be on the GPU in the first place - no need for transfer. It might be enough to provide constructors like interpolate(a::CuArray{T}, options...) - I don't know if you can load textures directly into texture memory from the CPU.

cc @maleadt

stillyslalom avatar Jan 16 '21 01:01 stillyslalom

> For my use case, the data would be on the GPU in the first place - no need for transfer. It might be enough to provide constructors like interpolate(a::CuArray{T}, options...) - I don't know if you can load textures directly into texture memory from the CPU.

Yes, you can bind a texture object from a pre-existing CuArray. If you're uploading from the host, it's better to create a special CuTextureArray (which is padded so that fetches perform better). It's also possible to copy a CuArray to a CuTextureArray (this doesn't leave the device, but there's still some overhead of course). See https://github.com/JuliaGPU/CUDA.jl/blob/0679e2feba777ac54299c1fbd693263186bead53/test/texture.jl#L41-L100 for examples.
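Schematically, it looks something like this (a sketch against CUDA.jl's texture API as of the linked revision; treat the exact constructor keywords and coordinate convention as assumptions, requires a GPU to run):

```julia
using CUDA

a   = CUDA.rand(Float32, 256, 256)   # data already on the device
ta  = CuTextureArray(a)              # padded copy in texture memory (device-to-device)
tex = CuTexture(ta; interpolation = CUDA.LinearInterpolation())

# Fetches happen inside a kernel; the texture hardware blends neighboring texels,
# so fractional coordinates give (bi)linear interpolation for free.
function fetch_kernel!(dst, tex, n)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= n
        x = 1f0 + (i - 1) * 0.5f0
        @inbounds dst[i] = tex[x, x]
    end
    return
end

dst = CUDA.zeros(Float32, 512)
@cuda threads=256 blocks=2 fetch_kernel!(dst, tex, length(dst))
```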

maleadt avatar Jan 18 '21 07:01 maleadt

Great. I'm pretty sure a distinct CUDAInterpolations.jl approach still makes the most sense. It would need to run under the JuliaGPU testing infrastructure.

mkitti avatar Jan 18 '21 08:01 mkitti

Nearest-neighbor, bilinear, and kernel-y (Lanczos et al) interpolants should be pretty straightforward, but quadratic/cubic B-spline interpolants will require reworking the prefiltering step for GPU compatibility. The authors of this paper on prefiltering made their code available here under a BSD license - not sure whether there's any advantage to rewriting it.
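For context, the prefiltering in question is a pair of short recursive (IIR) filters. A minimal CPU sketch of the cubic case, following Unser's classic algorithm (pole z = sqrt(3) - 2; my own code, not Interpolations.jl's internals):

```julia
# Direct cubic B-spline transform: convert samples into B-spline coefficients by
# inverting the sampled-basis filter (z + 4 + 1/z)/6 with a causal + anticausal pass.
function prefilter_cubic(x::AbstractVector{<:Real})
    z = sqrt(3) - 2                 # the single pole of the cubic B-spline
    n = length(x)
    c = Float64.(x) .* 6            # overall gain (1 - z)(1 - 1/z) = 6
    s, zk = c[1], z                 # causal initialization: truncated sum of c[k]*z^(k-1)
    for k in 2:n
        s += zk * c[k]
        zk *= z
    end
    c[1] = s
    for k in 2:n                    # causal recursion
        c[k] += z * c[k-1]
    end
    # anticausal initialization (exact under mirror boundary conditions)
    c[n] = (z / (z^2 - 1)) * (c[n] + z * c[n-1])
    for k in n-1:-1:1               # anticausal recursion
        c[k] = z * (c[k+1] - c[k])
    end
    return c
end
```

The sequential dependence in each pass is exactly what makes this awkward on a GPU: the usual approach (as in the linked paper) is to run many independent rows/columns in parallel rather than parallelizing within a single recursion.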

stillyslalom avatar Jan 18 '21 20:01 stillyslalom

As a first step, it would be nice to even have an interpolation function that runs on the GPU but is constructed on the CPU. I assume that's probably a simpler process to vectorize? I got a bit confused trying to walk through the code myself.

jmulcahy avatar Mar 02 '21 05:03 jmulcahy

@jmulcahy ExBroadcast.jl offers such support (interpolation on the GPU with the interpolant constructed on the CPU) via the broadcast interface. If you use CUDA.jl, example code is:

using ExBroadcast, Interpolations, CUDA, Adapt
a = randn(ComplexF32,401,401)
itp = CubicSplineInterpolation((-100:0.5f0:100,-100:0.5f0:100), a) # only BSpline is tested (Lanczos should also work)
cuitp = adapt(CuArray,itp)
cuitp.(-150:0.5f0:100,(-100:0.5f0:100)') # 20001×20001 CuArray{ComplexF32, 2}
cuitp.(CUDA.randn(100),CUDA.randn(100)) # 100 CuArray{ComplexF32, 1}
cuitp.(randn(100),randn(100)) # error

If you use another GPU package, you can find the related code in /src/other_support/interp.jl and make your own patch.

N5N3 avatar May 26 '21 08:05 N5N3

I would be very interested to have something for AMDGPU. Any pointers are appreciated :)

koehlerson avatar Nov 13 '21 10:11 koehlerson

If you just want to do interpolation on the GPU with an interpolant constructed on the CPU, defining adapt_structure for the relevant AbstractInterpolation subtypes is enough, i.e.:

using Interpolations: BSplineInterpolation, Extrapolation, itptype
import Adapt: adapt_structure, adapt
# Move the coefficient array to the device, then rebuild the wrapper with the new type:
adapt_structure(to, itp::BSplineInterpolation{T,N}) where {T,N} = begin
    coefs, parentaxes, it = itp.coefs, itp.parentaxes, itp.it
    coefs′ = adapt(to, coefs)
    Para = map(typeof, (coefs′, it, parentaxes))
    BSplineInterpolation{T,N,Para...}(coefs′, parentaxes, it)
end
# Extrapolation just wraps an interpolant; adapt the inner object and re-wrap:
adapt_structure(to, itp::Extrapolation{T,N}) where {T,N} = begin
    et = itp.et
    itp′ = adapt(to, itp.itp)
    IT = itptype(itp)
    Extrapolation{T,N,typeof(itp′),IT,typeof(et)}(itp′, et)
end

Then use the broadcast system to compute the result, as above.

N5N3 avatar Nov 13 '21 11:11 N5N3

ExBroadcast fails for me in precompilation and I guess it's needed in order to work?

https://github.com/N5N3/ExBroadcast.jl/blob/main/src/util.jl#L106 Base.@invoke is not defined

koehlerson avatar Nov 13 '21 12:11 koehlerson

Well, ExBroadcast.jl is a package for self-use, so I didn't ensure compatibility. Base.@invoke was introduced in a newer Julia release. The code you need is all included in https://github.com/N5N3/ExBroadcast.jl/blob/main/src/other_support/interp.jl. Put it in a .jl file --> remove L52 and L110 --> include the .jl file, and it should work.

N5N3 avatar Nov 13 '21 12:11 N5N3

So 1.7? I'm on 1.6.1.

I'll give it a try, thanks for sharing the information!

koehlerson avatar Nov 13 '21 12:11 koehlerson

I took a look at this, and it seems that at least for BSpline it's just a missing adapt method; with that defined, it works and is able to saturate the GPU (an A100 here) to 100%, giving a 50X speed boost:

using Interpolations, CUDA, CUDA.Adapt, BenchmarkTools

function Adapt.adapt_structure(to, itp::Interpolations.BSplineInterpolation{T,N,<:Any,IT,Axs}) where {T,N,IT,Axs}
    coefs = Adapt.adapt_structure(to, itp.coefs)
    Tcoefs = typeof(coefs)
    Interpolations.BSplineInterpolation{T,N,Tcoefs,IT,Axs}(coefs, itp.parentaxes, itp.it)
end

y = rand(10)
itp = interpolate(y, BSpline(Quadratic(Reflect(OnCell()))))
cu_itp = cu(itp);
new_x = collect(range(1,10,length=1000000))
cu_new_x = cu(new_x)

@btime itp.(new_x) # 2.5ms
@btime CUDA.@sync cu_itp.(cu_new_x) # 46μs

I unfortunately don't have the time right now to do this for everything / PR, but hopefully someone finds this helpful.

marius311 avatar Jun 14 '22 06:06 marius311

closed by #504 ?

longemen3000 avatar May 20 '23 05:05 longemen3000