
GPU support via CuArrays.jl

kose-y opened this issue · 12 comments

Thanks for the nice package. I think it would be great to add GPU support (along with multi-threaded evaluation on the CPU, as discussed in #82).

Edited by @lostella on Dec 7th, 2019

Note: similar optimizations could also be applied to the evaluation of f and its gradient, besides prox.

kose-y commented on Nov 24, 2019

Hi @kose-y, I'm not familiar with CuArrays, but from what I understand the array type it provides is a subtype of AbstractArray, which should be sufficient for many operators in ProximalOperators to work.
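
For instance, calling an operator on GPU data would look roughly like the following. This is a hypothetical sketch: CuArrays.jl has since been folded into CUDA.jl (used below), and whether the call runs efficiently, or at all without scalar-indexing warnings, depends on how the specific operator's prox is implemented, which is exactly the question here.

```julia
using ProximalOperators
using CUDA  # successor of CuArrays.jl; provides `cu` and the CuArray type

f = NormL1(1.0f0)
x = cu(randn(Float32, 1000))   # move the input to the GPU
y, fy = prox(f, x, 0.1f0)      # the call itself only requires the AbstractArray interface
```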

Could you provide a specific example of a case where that doesn’t work as expected?

lostella commented on Nov 24, 2019

Looping with direct elementwise access (scalar indexing), e.g. https://github.com/kul-forbes/ProximalOperators.jl/blob/master/src/functions/normL2.jl#L37, is expected to be very slow on GPUs; there, prox_naive would be much more efficient.

Adding prox_naive! and defining prox and prox! for CuArrays in terms of it would be a good way forward.
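
A rough sketch of that delegation idea, under the assumption that the existing prox_naive (allocating but loop-free) is reachable as an internal function; prox_gpu! below is a hypothetical helper name, not the package's API:

```julia
using ProximalOperators
using CUDA  # successor of CuArrays.jl

# Compute the prox on the GPU by delegating to the loop-free prox_naive,
# which uses whole-array operations and therefore avoids scalar indexing,
# then copy its (freshly allocated) result into the output buffer.
function prox_gpu!(y::CuArray, f, x::CuArray, gamma)
    y_naive, fy = ProximalOperators.prox_naive(f, x, gamma)
    copyto!(y, y_naive)
    return fy
end
```

Hooking this into prox! itself would need per-function methods (or some care with method ambiguities), but the delegation idea stays the same.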

kose-y commented on Nov 24, 2019

Well in that specific case I think it makes sense to just remove that loop and instead rely on broadcasting. Other cases may be just as simple. Thanks for pointing this out!
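
For reference, a loop-free version of that prox could look roughly like this (a sketch: prox_normL2! is an illustrative name, and it assumes NormL2 stores its coefficient in the field lambda):

```julia
using LinearAlgebra: norm

# prox of f(x) = λ‖x‖₂: shrink x towards the origin, returning zero when
# ‖x‖₂ ≤ γλ. The update is a single fused broadcast, with no explicit loop.
function prox_normL2!(y, f, x, gamma)
    normx = norm(x)
    scale = max(0, 1 - f.lambda * gamma / normx)
    y .= scale .* x
    return f.lambda * scale * normx   # value of f at the prox point
end
```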

lostella commented on Nov 24, 2019

Thank you for the comment. I have to point out that this is not an isolated case: I could quickly find many more under the functions/ directory. I will try to list them when I have time.

kose-y commented on Nov 25, 2019

That would be nice, thanks. You can add the list as items to this issue by editing the original post, so that progress towards solving it can be tracked.

lostella commented on Nov 25, 2019

@kose-y I've edited your post, adding a list of proxes that could be improved. I'm not 100% sure about all of them, but I think they're worth a try.

I guess any attempt at solving this issue would need to come with evidence that performance improves with CuArray on the GPU but doesn't degrade with Array on the CPU.
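
One way such evidence could be gathered with BenchmarkTools.jl (a sketch; the choice of IndBox and the problem size are just examples):

```julia
using ProximalOperators, BenchmarkTools, CUDA

f = IndBox(-1.0, 1.0)
x_cpu = randn(100_000); y_cpu = similar(x_cpu)
x_gpu = cu(x_cpu);      y_gpu = similar(x_gpu)

@btime prox!($y_cpu, $f, $x_cpu, 1.0)             # CPU path: must not regress
@btime CUDA.@sync prox!($y_gpu, $f, $x_gpu, 1.0)  # GPU path: should improve
```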

lostella commented on Dec 7, 2019

@lostella Thank you for the list. I was thinking of separate implementations for the GPU because, with multi-threading support (#82) in mind, the efficient implementations for CPU and GPU will not coincide. CuArrays would then be an optional dependency.
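
What keeping the two code paths separate could look like, using IndBox with scalar bounds for brevity (an illustration: prox_indbox! is not the package's method, and at the time Requires.jl would be the usual mechanism for loading the CuArray methods only when CuArrays is installed):

```julia
using ProximalOperators, CUDA

# CPU path: a tight loop that touches the data once and can use SIMD.
function prox_indbox!(y::AbstractArray, f::IndBox, x::AbstractArray, gamma)
    @inbounds @simd for i in eachindex(x, y)
        y[i] = x[i] < f.lb ? f.lb : (x[i] > f.ub ? f.ub : x[i])
    end
    return zero(eltype(x))
end

# GPU path: a single fused broadcast kernel, no scalar indexing.
function prox_indbox!(y::CuArray, f::IndBox, x::CuArray, gamma)
    y .= clamp.(x, f.lb, f.ub)
    return zero(eltype(x))
end
```

Since CuArray is itself an AbstractArray, dispatch picks the more specific GPU method automatically when both arrays live on the device.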

kose-y commented on Dec 7, 2019

That's a good idea, even before any multi-threading is introduced: after a quick benchmark, I observed an increase in time of ~10% for NormL2 and ~270% (!!!) for IndBox, when removing loops in favor of broadcasting on my machine (using the standard Array, on CPU).

lostella commented on Dec 7, 2019

> That's a good idea, even before any multi-threading is introduced: after a quick benchmark, I observed an increase in time of ~10% for NormL2 and ~270% (!!!) for IndBox, when removing loops in favor of broadcasting on my machine (using the standard Array, on CPU).

How did you write the broadcast? The simple version on the CPU:

y .= max.(f.lb, min.(f.ub, x))

takes, for me, with scalar lb/ub:

  • broadcast: 7.837 μs
  • loop: 43.034 μs (32 μs with @inbounds @simd)

However, for vector lb/ub I get:

  • broadcast: 75.665 μs
  • loop: 47.566 μs (42 μs with @inbounds @simd)

i.e. broadcasting is slower in the second case. Ran with size 10000.

Edit: Interestingly enough, with Float32 I get:

  • scalar bounds: broadcast 3.708 μs, loop 43.810 μs (1.564 μs with @inbounds @simd)
  • vector bounds: broadcast 122.975 μs, loop 98.015 μs (95.285 μs with @inbounds @simd)
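
A rough reconstruction of that comparison (variable names and setup are guesses at what was run, not the actual benchmark script):

```julia
using BenchmarkTools

n = 10_000
x = randn(n); y = similar(x)
lb = fill(-1.0, n); ub = fill(1.0, n)   # vector bounds, as in the second case above

# broadcast version
broadcast_clip!(y, x, lb, ub) = (y .= max.(lb, min.(ub, x)); y)

# loop version (written for vector bounds)
function loop_clip!(y, x, lb, ub)
    @inbounds @simd for i in eachindex(x)
        y[i] = x[i] < lb[i] ? lb[i] : (x[i] > ub[i] ? ub[i] : x[i])
    end
    return y
end

@btime broadcast_clip!($y, $x, $lb, $ub)
@btime loop_clip!($y, $x, $lb, $ub)
```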

mfalt commented on Dec 7, 2019

@mfalt I used exactly the line you mention, with vector lb/ub. I tried with sizes 10k and 100k, and observed a slowdown similar to yours. In the 100k case:

  • loop: ~100 μs
  • broadcast: ~370 μs

lostella commented on Dec 7, 2019

My takeaway so far: for serial computation, looping through the data just once wins, and should be preferred over broadcasting.

For IndBox with Array bounds (size = 100k), I'm getting no significant difference between single and double precision. My CPU: 2.5 GHz Intel Core i7.

  • Single loop (current implementation):
    • Float32: 103.812 μs (0 allocations: 0 bytes)
    • Float64: 104.059 μs (0 allocations: 0 bytes)
  • Broadcasting min/max: y .= min.(f.ub, max.(f.lb, x))
    • Float32: 376.113 μs (0 allocations: 0 bytes)
    • Float64: 376.062 μs (0 allocations: 0 bytes)
  • Broadcasting clamp: y .= clamp.(x, f.lb, f.ub)
    • Float32: 195.196 μs (0 allocations: 0 bytes)
    • Float64: 195.527 μs (0 allocations: 0 bytes)

With scalar bounds, things are a bit faster in all cases:

  • Single loop:
    • Float32: 84.856 μs (0 allocations: 0 bytes)
    • Float64: 84.885 μs (0 allocations: 0 bytes)
  • Broadcasting min/max:
    • Float32: 286.705 μs (0 allocations: 0 bytes)
    • Float64: 286.718 μs (0 allocations: 0 bytes)
  • Broadcasting clamp:
    • Float32: 121.591 μs (0 allocations: 0 bytes)
    • Float64: 114.609 μs (0 allocations: 0 bytes)

lostella commented on Dec 8, 2019

It may be helpful to use @. to coalesce the broadcasting operations. https://docs.julialang.org/en/v1/base/arrays/#Base.Broadcast.@dot

For example: https://github.com/JuliaFirstOrder/ProximalOperators.jl/blob/5b2845a2d34f61a4e002ee28801ae1de5da8639f/src/functions/normL1.jl#L127
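
As an illustration (a generic soft-thresholding line, i.e. the prox of λ‖x‖₁, not the exact line linked above):

```julia
x = randn(1000); y = similar(x)
lambda, gamma = 1.0, 0.1

# explicit dots on every call and on the assignment
y .= sign.(x) .* max.(abs.(x) .- gamma .* lambda, 0)

# the same computation, with all the dots inserted by @.
@. y = sign(x) * max(abs(x) - gamma * lambda, 0)
```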

JeffFessler commented on Oct 6, 2022