Cxx.jl
Function call overhead
In CxxWrap.jl there is a test to measure function call overhead in a tight loop. I added a Cxx.jl version for comparison:
https://github.com/barche/CxxWrap.jl/blob/master/test/functions.jl#L99-L106
The test loops over a large array, calling the function on each element to divide it by 2. I expected Cxx.jl to be on par with pure Julia and C++ (0.06 s on my machine), but instead it takes about 2 s. Any ideas what could be causing this anomaly?
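For reference, here is a minimal sketch of what the benchmark does (illustrative names, not the actual test code, which is at the link above):

```julia
# Sketch of the benchmark: map a halving function over a large array
# and time the tight loop, warming up first so compilation isn't counted.
half_julia(x) = 0.5 * x       # pure-Julia baseline

a = rand(1_000_000)
map(half_julia, a)            # warm-up call to trigger compilation
@time map(half_julia, a)      # measure the steady-state loop
```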
The test was done on the rc3 binaries from the main site, with Cxx.jl installed via Pkg.checkout and then Pkg.build. Versioninfo:
Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
System: Linux (x86_64-unknown-linux-gnu)
CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, sandybridge)
I can't reproduce that. I see about a 1.5x overhead:
cxx"""
double half_cxx(const double d)
{
    return 0.5 * d;
}
"""
half_cxxjl(d::Float64) = @cxx half_cxx(d)
half_julia = x->0.5x
julia> @time map(half_julia, rand(Float64, 1000, 1000));
0.003053 seconds (18 allocations: 15.260 MB, 29.04% gc time)
julia> @time map(half_cxxjl, rand(Float64, 1000, 1000));
0.004764 seconds (18 allocations: 15.260 MB, 14.58% gc time)
It seems to be related to the size of the array: using 1000x1000 I get results similar to yours, but with a bigger array I get:
@time map(half_julia, rand(50000,1000))
0.164535 seconds (11 allocations: 762.940 MB, 23.45% gc time)
@time map(half_cxxjl, rand(50000,1000))
2.137991 seconds (11 allocations: 762.940 MB, 0.09% gc time)
I tried this. The overhead looks a bit worse, but still nothing like the >10x you're seeing. Do make sure you're running things twice to avoid counting compilation overhead:
a = rand(100000,1000);
julia> @time map(half_cxxjl, a);
0.330216 seconds (8 allocations: 762.940 MB, 0.50% gc time)
julia> @time map(half_julia, a);
0.101759 seconds (8 allocations: 762.940 MB, 1.76% gc time)
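The warm-up pattern mentioned above, as a minimal sketch (assuming the `half_julia` definition from earlier in the thread):

```julia
half_julia(x) = 0.5 * x

a = rand(100_000, 1_000)   # ~763 MB array, as in the timings above
map(half_julia, a)         # first call includes JIT compilation; discard it
@time map(half_julia, a)   # subsequent calls measure only the loop itself
```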
I also looked at the LLVM IR and there's nothing suspicious in there. The performance overhead here is because Cxx.jl functions are currently not inlined at the LLVM level, since Julia does not run the inlining pass. This benchmark is pretty much the worst case for that anyway, so I'm not super concerned.
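One way to check this from the Julia side (a sketch; `half_julia` as defined above) is to print the IR Julia emits for the function:

```julia
using InteractiveUtils  # provides @code_llvm outside the REPL on Julia 1.x

half_julia(x) = 0.5 * x

# Prints the LLVM IR for the pure-Julia version; for the Cxx.jl wrapper
# the C++ function instead shows up as an opaque, non-inlined call.
@code_llvm half_julia(1.0)
```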
Yes, the results were the fastest out of 3 consecutive calls. I just tried on my machine at work, which is almost identical except for the CPU and Linux distro (Arch vs. Fedora 23), and the two are much closer now: Cxx.jl:
0.401070 seconds (11 allocations: 762.940 MB, 10.58% gc time)
Pure Julia:
0.343312 seconds (11 allocations: 762.940 MB, 27.57% gc time)
The versioninfo for this machine:
Julia Version 0.5.0-rc3+0
Commit e6f843b (2016-08-22 23:43 UTC)
Platform Info:
System: Linux (x86_64-unknown-linux-gnu)
CPU: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, sandybridge)
Probably not worth pursuing this any further.