Faster implementations
As we discussed, there are several ways to speed up the implementations of the prox operators. I ran some tests that may be useful when rewriting them. The results below are for the NormL1 prox, timing 100 prox evaluations on a vector of 1,000,000 elements.

Standard implementation:
function prox(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    y = sign.(x) .* max.(0.0, abs.(x) .- gamma*f.lambda)
    return y
end
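This is just elementwise soft-thresholding, so it is easy to sanity-check in isolation; a minimal self-contained sketch, where the NormL1 struct below is only a stand-in for the package's actual type:

```julia
# Stand-in for the package's NormL1 type (assumption: it just stores lambda).
struct NormL1
    lambda::Float64
end

# Soft-thresholding, elementwise: sign(x) * max(0, |x| - gamma*lambda).
prox(f::NormL1, x::Vector{Float64}, gamma::Float64=1.0) =
    sign.(x) .* max.(0.0, abs.(x) .- gamma*f.lambda)

# Entries with |x| <= gamma*lambda are zeroed, the rest shrink toward zero.
prox(NormL1(0.5), [2.0, -0.3, -1.0])
```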
In-place update to avoid big allocations:
function prox1!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    for i in eachindex(x)
        x[i] = sign(x[i]) * max(0.0, abs(x[i]) - gamma*f.lambda)
    end
    return x
end
Avoid bounds checks, hoist gamma*f.lambda out of the loop, and use SIMD:
function prox2!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    gf = gamma*f.lambda
    @inbounds @simd for i in eachindex(x)
        x[i] = sign(x[i]) * max(0.0, abs(x[i]) - gf)
    end
    return x
end
Avoid function calls and all but one multiplication:
function prox3!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    gf = gamma*f.lambda
    @inbounds @simd for i in eachindex(x)
        x[i] += (x[i] <= -gf ? gf : (x[i] >= gf ? -gf : -x[i]))
    end
    return x
end
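Incidentally, that ternary chain is the same update as x[i] += clamp(-x[i], -gf, gf). A sketch of that variant (the name prox3_clamp! and the bare lambda argument are mine, and I have not benchmarked whether it vectorizes equally well):

```julia
# Same soft-threshold update written with Base.clamp:
# adding clamp(-x, -gf, gf) moves x toward zero by at most gf,
# which zeroes any entry with |x| <= gf.
function prox3_clamp!(lambda::Float64, x::Vector{Float64}, gamma::Float64=1.0)
    gf = gamma * lambda
    @inbounds @simd for i in eachindex(x)
        x[i] += clamp(-x[i], -gf, gf)
    end
    return x
end
```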
Use ifelse, which evaluates both arguments unconditionally, so the compiler does not have to generate branches:
function prox4!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    gf = gamma*f.lambda
    @inbounds @simd for i in eachindex(x)
        x[i] += ifelse(x[i] <= -gf, gf, ifelse(x[i] >= gf, -gf, -x[i]))
    end
    return x
end
Results:
prox:   1.557599 seconds (1.70 k allocations: 3.725 GB, 8.86% gc time)
prox1!: 0.235333 seconds (100 allocations: 1.563 KB) (6.5x speedup)
prox2!: 0.085790 seconds (100 allocations: 1.563 KB) (2.7x speedup)
prox3!: 0.080576 seconds (100 allocations: 1.563 KB) (1.05x speedup, 19x total)
prox4!: 0.074251 seconds (100 allocations: 1.563 KB) (1.08x speedup, 21x total)
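For reference, a self-contained sketch of how timings of this shape can be reproduced (the NormL1 struct here is a stand-in for the package's type, and the numbers will of course vary by machine):

```julia
# Stand-in for the package's NormL1 type (assumption: it just stores lambda).
struct NormL1
    lambda::Float64
end

# The ifelse variant from above, restated so this snippet runs on its own.
function prox4!(f::NormL1, x::Vector{Float64}, gamma::Float64=1.0)
    gf = gamma * f.lambda
    @inbounds @simd for i in eachindex(x)
        x[i] += ifelse(x[i] <= -gf, gf, ifelse(x[i] >= gf, -gf, -x[i]))
    end
    return x
end

f = NormL1(1.0)
x = randn(1_000_000)
prox4!(f, copy(x))        # warm up so JIT compilation is not timed
@time for _ in 1:100      # same setup as the timings above
    prox4!(f, x)
end
```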
What exactly does @simd do? The following remark in the Julia documentation makes me a bit wary:
This feature is experimental and could change or disappear in future versions of Julia.
Thank you for noting down all these tips!
Most processors today can perform an operation on a chunk of data at once, typically 4 or 8 elements (e.g. four Float64s with 256-bit AVX registers), and this macro mainly tells the compiler that it is safe to convert the loop to these "Single Instruction, Multiple Data" instructions. I am not sure how experimental it actually is, and it would be easy to remove if the syntax ever changes. It was also possible to get a substantial speedup without @simd by manually unrolling the loop, something like:
function prox3_unrolled!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    gf = gamma*f.lambda
    n = 4*div(length(x), 4)  # largest multiple of 4 that fits
    @inbounds for i in 1:4:n
        x[i]   += (x[i]   <= -gf ? gf : (x[i]   >= gf ? -gf : -x[i]))
        x[i+1] += (x[i+1] <= -gf ? gf : (x[i+1] >= gf ? -gf : -x[i+1]))
        x[i+2] += (x[i+2] <= -gf ? gf : (x[i+2] >= gf ? -gf : -x[i+2]))
        x[i+3] += (x[i+3] <= -gf ? gf : (x[i+3] >= gf ? -gf : -x[i+3]))
    end
    @inbounds for i in (n+1):length(x)  # handle the remaining elements
        x[i] += (x[i] <= -gf ? gf : (x[i] >= gf ? -gf : -x[i]))
    end
    return x
end
but I think @simd is a better option (and faster).
After reading some more about SIMD (https://software.intel.com/en-us/articles/vectorization-in-julia), I was able to speed it up a bit more and added the example to the original post.