
Function call overhead

barche opened this issue 8 years ago • 4 comments

In CxxWrap.jl there is a test to measure function call overhead in a tight loop. I added a Cxx.jl version for comparison: https://github.com/barche/CxxWrap.jl/blob/master/test/functions.jl#L99-L106

The test loops over a large array, calling the function on each element to divide it by 2. I expected Cxx.jl to be on par with pure Julia and C++ (about 0.06 s on my machine), but instead it takes about 2 s. Any ideas what could be causing this anomaly?
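For reference, a minimal sketch of the kind of test described above: call a function on every element of a large array to halve it. The helper name and array size here are illustrative, not copied from the linked CxxWrap.jl test file.

```julia
# Pure-Julia baseline for the "divide by 2 in a tight loop" benchmark.
half_julia(d::Float64) = 0.5 * d

function bench(f, a)
    map(f, a)        # first call compiles; discard its timing
    @time map(f, a)  # second call measures the steady-state cost
end

bench(half_julia, rand(50000, 1000));
```

The Cxx.jl version would swap `half_julia` for a wrapper around the C++ function; only the callee changes, so any timing difference is call overhead.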

The test was done on the rc3 binaries from the main site, with Cxx.jl installed via Pkg.checkout and then Pkg.build. versioninfo() output:

Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, sandybridge)

barche avatar Sep 06 '16 20:09 barche

I can't reproduce that. I see about a 1.5x overhead:

cxx"""
double half_cxx(const double d)
{
  return 0.5 * d;
}
"""
half_cxxjl(d::Float64) = @cxx half_cxx(d)
half_julia = x -> 0.5x

julia> @time map(half_julia, rand(Float64, 1000, 1000));
  0.003053 seconds (18 allocations: 15.260 MB, 29.04% gc time)

julia> @time map(half_cxxjl, rand(Float64, 1000, 1000));
  0.004764 seconds (18 allocations: 15.260 MB, 14.58% gc time)

Keno avatar Sep 07 '16 18:09 Keno

It seems to be related to the size of the array: using 1000x1000 I get results similar to yours, but with a bigger array I get:

@time map(half_julia, rand(50000,1000))
0.164535 seconds (11 allocations: 762.940 MB, 23.45% gc time)

@time map(half_cxxjl, rand(50000,1000))
2.137991 seconds (11 allocations: 762.940 MB, 0.09% gc time)

barche avatar Sep 07 '16 20:09 barche

I tried this. The overhead looks a bit worse, but still nothing like the >10x you're seeing. Do make sure you're running things twice to avoid counting compilation overhead:

a = rand(100000,1000);
julia> @time map(half_cxxjl, a);
  0.330216 seconds (8 allocations: 762.940 MB, 0.50% gc time)
julia> @time map(half_julia, a);
  0.101759 seconds (8 allocations: 762.940 MB, 1.76% gc time)

I also looked at the LLVM IR and there's nothing suspicious in there. The performance overhead here is because Cxx.jl functions are currently not inlined at the LLVM level, since Julia does not run the inlining pass. This benchmark is pretty much the worst case anyway, so I'm not super concerned.
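The inlining difference can be inspected from Julia directly. A hedged sketch (only the pure-Julia side is shown, since the Cxx.jl side needs a working Cxx.jl install; `apply_all` is an illustrative stand-in for the benchmark loop):

```julia
# Inspect the IR Julia hands to LLVM. For the pure-Julia function the
# multiply appears inline in the caller; for a Cxx.jl wrapper one would
# instead see a call instruction that LLVM never gets a chance to inline.
half_julia(d::Float64) = 0.5 * d

@code_llvm half_julia(1.0)                  # IR for a single call

apply_all(f, a) = map(f, a)                 # stand-in for the benchmark loop
@code_llvm apply_all(half_julia, rand(4))
```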

Keno avatar Sep 07 '16 21:09 Keno

Yes, the results were the fastest out of 3 consecutive calls. I just tried on my machine at work, which is almost identical except for the CPU and Linux distro (Arch vs. Fedora 23), and the two are much closer now. Cxx.jl:

0.401070 seconds (11 allocations: 762.940 MB, 10.58% gc time)

Pure Julia:

0.343312 seconds (11 allocations: 762.940 MB, 27.57% gc time)

The versioninfo for this machine:

Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, sandybridge)

Probably not worth pursuing this any further.

barche avatar Sep 08 '16 07:09 barche