ProximalOperators.jl
GPU support via CuArrays.jl
Thanks for the nice package. I think it would be better to have GPU support (along with multi-thread evaluation for CPU in #82).
Edited by @lostella on Dec 7th, 2019
Note: similar optimizations could be used in the computation of `f` and `gradient`, besides `prox` (see the sketch after the list below).
- [ ] NormL2
- [ ] NormL1
- [ ] SqrNormL2
- [ ] HuberLoss
- [ ] SumPositive
- [ ] SqrHingeLoss
- [ ] IndBox
- [ ] CrossEntropy
- [ ] ElasticNet
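As a rough illustration of what such loop-free evaluation could look like (a sketch only, with hypothetical helper names, not the package's code), a function value and gradient can often be written as a single reduction and a single fused broadcast:

```julia
# Sketch: loop-free evaluation for a SqrNormL2-like function f(x) = lambda/2 * ||x||^2.
# Hypothetical helpers; reductions and broadcasts like these work for both Array and CuArray.
sqrnorml2_value(lambda, x) = lambda / 2 * sum(abs2, x)

function sqrnorml2_gradient!(g, lambda, x)
    g .= lambda .* x                  # gradient as one fused broadcast
    return lambda / 2 * sum(abs2, x)  # also return f(x)
end
```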
Hi @kose-y, I’m not familiar with CuArrays, but from what I understand the type of arrays it provides is compatible with AbstractArray, and that should be sufficient for many operators in ProximalOperators to work.
Could you provide a specific example of a case where that doesn’t work as expected?
Looping with direct elementwise access (e.g. https://github.com/kul-forbes/ProximalOperators.jl/blob/master/src/functions/normL2.jl#L37) is expected to be very slow on GPUs, and `prox_naive` would be much more efficient. Adding `prox_naive!` and defining `prox` and `prox!` for CuArrays based on `prox_naive` would be a good approach.
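A rough sketch of that idea (hedged: it assumes `prox!(y, f, x, gamma)` returns the function value at the result and `prox_naive(f, x, gamma)` returns the pair `(y, fy)`; the `NormL2`/CuArrays combination here is just for illustration, not actual package code):

```julia
# Sketch only: dispatch prox! on CuArray and fall back to prox_naive,
# which (as suggested above) avoids elementwise loops and so avoids scalar indexing on GPU.
using ProximalOperators, CuArrays
import ProximalOperators: prox!, prox_naive

function prox!(y::CuArray, f::NormL2, x::CuArray, gamma::Real=1.0)
    y_new, fy = prox_naive(f, x, gamma)  # assumed to return (prox point, function value)
    copyto!(y, y_new)
    return fy
end
```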
Well in that specific case I think it makes sense to just remove that loop and instead rely on broadcasting. Other cases may be just as simple. Thanks for pointing this out!
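For that `NormL2` case, a loop-free version could look something like this (a minimal sketch with a hypothetical helper name, not the package's actual `prox!`):

```julia
# Sketch: the prox of f(x) = lambda * ||x||_2 scales x by max(0, 1 - gamma*lambda/||x||_2).
using LinearAlgebra

function prox_norml2_broadcast!(y, lambda, x, gamma)
    normx = norm(x)
    scale = max(1 - gamma * lambda / normx, 0)
    y .= scale .* x                # one fused broadcast, no explicit loop
    return lambda * scale * normx  # f evaluated at the result
end
```

Since `norm` and fused broadcasts both have GPU methods, the same code should also run on a CuArray without scalar indexing.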
Thank you for the comment. I have to point out that this is not one specific case: I could quickly find many of them under the `functions/` directory. I will try to list them when I have time.
That would be nice, thanks. You can add the list as items to this issue, by editing the original post, so that progress towards solving this is tracked.
@kose-y I’ve edited your post adding a list of proxes that could be improved. I’m not sure 100% on all of them, but I guess they’re worth a try.
I guess any attempt at solving this issue would need to come with evidence that performance improves with CuArray on GPU but doesn't degrade with Array on CPU.
@lostella Thank you for the listing. I was thinking about separate implementations for GPU because, with multi-threading support (#82) in consideration, efficient implementations for CPU and GPU would not coincide. `CuArrays` would be an optional dependency in that case.
That's a good idea, even before any multi-threading is introduced: after a quick benchmark, I observed an increase in time of ~10% for NormL2 and ~270% (!!!) for IndBox, when removing loops in favor of broadcasting on my machine (using the standard Array, on CPU).
How did you write the broadcast? The simple version on the CPU,
`y .= max.(f.lb, min.(f.ub, x))`
takes for me with scalar lb/ub:
- broadcast: 7.837 μs
- loop: 43.034 μs (32 μs with `@inbounds @simd`)
However, for vector lb/ub I get
- broadcast: 75.665 μs
- loop: 47.566 μs (42 μs with `@inbounds @simd`)
i.e. slower in the second case. Ran with size 10000.
Edit: Interestingly enough, with Float32, I get
Scalar:
- broadcast: 3.708 μs
- loop: 43.810 μs (1.564 μs with `@inbounds @simd`)
Vector:
- broadcast: 122.975 μs
- loop: 98.015 μs (95.285 μs with `@inbounds @simd`)
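For reference, the comparison above can be reproduced with something along these lines (a hypothetical script; `clamp_loop!` is just an illustrative name, and timings will vary by machine):

```julia
using BenchmarkTools

n = 10_000
x, y = randn(n), zeros(n)
lb, ub = fill(-1.0, n), fill(1.0, n)  # vector bounds; replace with scalars for the other case

# broadcast version
@btime $y .= max.($lb, min.($ub, $x))

# loop version
function clamp_loop!(y, lb, ub, x)
    @inbounds @simd for i in eachindex(x)
        y[i] = max(lb[i], min(ub[i], x[i]))
    end
    return y
end
@btime clamp_loop!($y, $lb, $ub, $x)
```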
@mfalt I used exactly the line you mentioned, and vector lb/ub. I tried with size 10k and 100k, and observed a slowdown similar to yours. In the 100k case:
- loop: ~100 μs
- broadcast: ~370 μs
My takeaway so far: for serial computation, looping through the data just once wins, and should be preferred over broadcasting.
For `IndBox` with Array bounds (size = 100k), I'm getting no significant difference between single and double precision. My CPU: 2.5 GHz Intel Core i7.
- Single loop (current implementation):
- Float32: 103.812 μs (0 allocations: 0 bytes)
- Float64: 104.059 μs (0 allocations: 0 bytes)
- Broadcasting min/max, `y .= min.(f.ub, max.(f.lb, x))`:
- Float32: 376.113 μs (0 allocations: 0 bytes)
- Float64: 376.062 μs (0 allocations: 0 bytes)
- Broadcasting clamp, `y .= clamp.(x, f.lb, f.ub)`:
- Float32: 195.196 μs (0 allocations: 0 bytes)
- Float64: 195.527 μs (0 allocations: 0 bytes)
With scalar bounds, things are a bit faster in all cases:
- Single loop:
- Float32: 84.856 μs (0 allocations: 0 bytes)
- Float64: 84.885 μs (0 allocations: 0 bytes)
- Broadcasting min/max:
- Float32: 286.705 μs (0 allocations: 0 bytes)
- Float64: 286.718 μs (0 allocations: 0 bytes)
- Broadcasting clamp:
- Float32: 121.591 μs (0 allocations: 0 bytes)
- Float64: 114.609 μs (0 allocations: 0 bytes)
It may be helpful to use `@.` to fuse the broadcasting operations: https://docs.julialang.org/en/v1/base/arrays/#Base.Broadcast.@dot
For example: https://github.com/JuliaFirstOrder/ProximalOperators.jl/blob/5b2845a2d34f61a4e002ee28801ae1de5da8639f/src/functions/normL1.jl#L127
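For instance, the soft-thresholding step in that kind of `prox!` could be written as a single fused broadcast (a hypothetical helper for illustration, where `gl` stands for the elementwise threshold `gamma .* lambda`):

```julia
# Sketch: @. applies broadcasting to every call in the expression,
# so the whole right-hand side fuses into one loop (or one GPU kernel on a CuArray).
soft_threshold!(y, x, gl) = @. y = sign(x) * max(abs(x) - gl, 0)
```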