Faster implementations
As we discussed, there are several ways to speed up the implementations of the prox operators. I ran some tests that may be useful when rewriting them. The results below are for the NormL1 prox, timing 100 prox evaluations on a vector of 1,000,000 elements.

Standard implementation:
function prox(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    y = sign.(x) .* max.(0.0, abs.(x) .- gamma*f.lambda)
    return y
end
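This is just elementwise soft-thresholding, so it is easy to sanity-check in isolation; a minimal self-contained sketch, where the NormL1 struct below is only a stand-in for the package's actual type:

```julia
# Stand-in for the package's NormL1 type (assumption: it just stores lambda).
struct NormL1
    lambda::Float64
end

# Soft-thresholding, elementwise: sign(x) * max(0, |x| - gamma*lambda).
prox(f::NormL1, x::Vector{Float64}, gamma::Float64=1.0) =
    sign.(x) .* max.(0.0, abs.(x) .- gamma*f.lambda)

# Entries with |x| <= gamma*lambda are zeroed, the rest shrink toward zero.
prox(NormL1(0.5), [2.0, -0.3, -1.0])
```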
In-place update to avoid big allocations:
function prox1!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    for i in eachindex(x)
        x[i] = sign(x[i]) * max(0.0, abs(x[i]) - gamma*f.lambda)
    end
    return x
end
Avoid bounds checks, hoist gamma*f.lambda out of the loop, and use SIMD:
function prox2!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    gf = gamma*f.lambda
    @inbounds @simd for i in eachindex(x)
        x[i] = sign(x[i]) * max(0.0, abs(x[i]) - gf)
    end
    return x
end
Avoid function calls and all but one multiplication:
function prox3!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    gf = gamma*f.lambda
    @inbounds @simd for i in eachindex(x)
        x[i] += (x[i] <= -gf ? gf : (x[i] >= gf ? -gf : -x[i]))
    end
    return x
end
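Incidentally, that ternary chain is the same update as x[i] += clamp(-x[i], -gf, gf). A sketch of that variant (the name prox3_clamp! and the bare lambda argument are mine, and I have not benchmarked whether it vectorizes equally well):

```julia
# Same soft-threshold update written with Base.clamp:
# adding clamp(-x, -gf, gf) moves x toward zero by at most gf,
# which zeroes any entry with |x| <= gf.
function prox3_clamp!(lambda::Float64, x::Vector{Float64}, gamma::Float64=1.0)
    gf = gamma * lambda
    @inbounds @simd for i in eachindex(x)
        x[i] += clamp(-x[i], -gf, gf)
    end
    return x
end
```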
Use ifelse, which evaluates both arguments unconditionally, so the compiler does not have to generate branches:
function prox4!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    gf = gamma*f.lambda
    @inbounds @simd for i in eachindex(x)
        x[i] += ifelse(x[i] <= -gf, gf, ifelse(x[i] >= gf, -gf, -x[i]))
    end
    return x
end
Results:
prox:   1.557599 seconds (1.70 k allocations: 3.725 GB, 8.86% gc time)
prox1!: 0.235333 seconds (100 allocations: 1.563 KB) (6.5x speedup)
prox2!: 0.085790 seconds (100 allocations: 1.563 KB) (2.7x speedup)
prox3!: 0.080576 seconds (100 allocations: 1.563 KB) (1.05x speedup, 19x total)
prox4!: 0.074251 seconds (100 allocations: 1.563 KB) (1.08x speedup, 21x total)
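For reference, a self-contained sketch of how timings of this shape can be reproduced (the NormL1 struct here is a stand-in for the package's type, and the numbers will of course vary by machine):

```julia
# Stand-in for the package's NormL1 type (assumption: it just stores lambda).
struct NormL1
    lambda::Float64
end

# The ifelse variant from above, restated so this snippet runs on its own.
function prox4!(f::NormL1, x::Vector{Float64}, gamma::Float64=1.0)
    gf = gamma * f.lambda
    @inbounds @simd for i in eachindex(x)
        x[i] += ifelse(x[i] <= -gf, gf, ifelse(x[i] >= gf, -gf, -x[i]))
    end
    return x
end

f = NormL1(1.0)
x = randn(1_000_000)
prox4!(f, copy(x))        # warm up so JIT compilation is not timed
@time for _ in 1:100      # same setup as the timings above
    prox4!(f, x)
end
```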
What exactly does @simd do? The following remark in the Julia documentation makes me a bit wary:
This feature is experimental and could change or disappear in future versions of Julia.
Thank you for noting down all these tips!
Most processors today can perform an operation on a chunk of data at once, typically 4 or 8 elements (e.g. four Float64s with 256-bit AVX registers), and this macro mainly tells the compiler that it is safe to convert the loop to these "Single Instruction, Multiple Data" instructions. I am not sure how experimental it actually is, and it would be easy to remove if the syntax ever changes. It was also possible to get a substantial speedup without @simd by manually unrolling the loop, something like:
function prox3_unrolled!(f::NormL1{Float64}, x::RealOrComplexArray, gamma::Float64=1.0)
    gf = gamma*f.lambda
    n = 4*div(length(x), 4)  # largest multiple of 4 that fits
    @inbounds for i in 1:4:n
        x[i]   += (x[i]   <= -gf ? gf : (x[i]   >= gf ? -gf : -x[i]))
        x[i+1] += (x[i+1] <= -gf ? gf : (x[i+1] >= gf ? -gf : -x[i+1]))
        x[i+2] += (x[i+2] <= -gf ? gf : (x[i+2] >= gf ? -gf : -x[i+2]))
        x[i+3] += (x[i+3] <= -gf ? gf : (x[i+3] >= gf ? -gf : -x[i+3]))
    end
    @inbounds for i in (n+1):length(x)  # handle the remaining elements
        x[i] += (x[i] <= -gf ? gf : (x[i] >= gf ? -gf : -x[i]))
    end
    return x
end
but I think @simd is a better option (and faster).
After reading some more about SIMD (https://software.intel.com/en-us/articles/vectorization-in-julia), I was able to speed it up a bit more and added the example to the original post.