
Simple benchmarking for Tracker, ReverseDiff, and Zygote

Open yebai opened this issue 4 years ago • 27 comments

The following is a simple benchmark of a loop-based sum, run with 3 different reverse-mode AD implementations in Julia. At the moment, in terms of efficiency: ReverseDiff > Tracker > Zygote.

julia> mysum(x) = begin z = 0; for i =1:length(x); z += x[i]; end; return z; end
mysum (generic function with 1 method)

julia> using Zygote, Tracker, ReverseDiff, BenchmarkTools

julia> x = rand(10000);

julia> @btime ReverseDiff.gradient(mysum, x);
  2.915 ms (70022 allocations: 2.92 MiB)

julia> @btime Tracker.gradient(mysum, x);
  491.095 ms (150016 allocations: 767.59 MiB)

julia> @btime Zygote.gradient(mysum, x);
  976.786 ms (280102 allocations: 1.50 GiB)

Benchmark setup:


(v1.2) pkg> st
    Status `~/.julia/environments/v1.2/Project.toml`
  [0bf59076] AdvancedHMC v0.2.20
  [131c737c] ArviZ v0.2.4
  [c52e3926] Atom v0.11.3
  [6e4b80f9] BenchmarkTools v0.4.3
  [5d742f6a] CSVFiles v1.0.0
  [8f4d0f93] Conda v1.3.0
  [a93c6f00] DataFrames v0.20.0
  [163ba53b] DiffResults v1.0.2
  [31c24e10] Distributions v0.21.12
  [bbc10e6e] DynamicHMC v2.1.3
  [f6369f11] ForwardDiff v0.10.8
  [7073ff75] IJulia v1.20.2
  [e5e0dc1b] Juno v0.7.2
  [c7f686f2] MCMCChains v1.0.0
  [91a5bcdd] Plots v0.28.4
  [438e738f] PyCall v1.91.2
  [37e2e3b7] ReverseDiff v1.1.0
  [2913bbd2] StatsBase v0.32.0
  [4c63d2b9] StatsFuns v0.9.3
  [f3b207a7] StatsPlots v0.13.0
  [1d978283] TensorFlow v0.11.0
  [9f7883ad] Tracker v0.2.6
  [fce5fe82] Turing v0.8.2
  [0f1e0344] WebIO v0.8.13
  [e88e6eb3] Zygote v0.4.9
  [ade2ca70] Dates 


yebai avatar Mar 02 '20 15:03 yebai

cc: @phipsgabler

trappmartin avatar Mar 04 '20 13:03 trappmartin

There is a more comprehensive one available here: https://github.com/TuringLang/TuringExamples/blob/kx/improve_functionality/benchmarks/auto_diff/loop-vs-unrolled.ipynb

xukai92 avatar Mar 04 '20 13:03 xukai92

Numbers using PyTorch via ThArrays:

julia> using ThArrays

julia> @btime  ThArrays.gradient(mysum, x);
  477.310 ms (160017 allocations: 7.32 MiB)

yebai avatar Mar 26 '20 13:03 yebai

This example might be over-favourable to ReverseDiff. @mohamed82008 @xukai92

yebai avatar Mar 26 '20 13:03 yebai

@KDr2 can you check whether ThArrays has any type-instability issues?

julia> @code_warntype ThArrays.gradient(mysum, x);
Variables
  #self#::Core.Compiler.Const(ThArrays.gradient, false)
  f::Core.Compiler.Const(mysum, false)
  data::Tuple{Array{Float64,1}}

Body::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
1 ─ %1 = Core.tuple(ThArrays.C_NULL, #self#, f)::Core.Compiler.Const((Ptr{Nothing} @0x0000000000000000, ThArrays.gradient, mysum), false)
│   %2 = Core._apply(ThArrays.:(#gradient#30), %1, data)::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
└──      return %2

yebai avatar Mar 26 '20 13:03 yebai

@yebai Defining

+tensor_from_ptr(p::Ptr{Tensor{T, N}}) where {T, N} = Tensor{T, N}(p, nothing)

and changing grad as follows

-function grad(self::Tensor)
+function grad(self::Tensor{T, N}) where {T, N}
     outputs__ = Int[0]
     __cret = ccall((:atg_grad, :libtorch_capi),
                  Cvoid, (Ptr{Cvoid}, Ptr{Cvoid}),
                  outputs__, self.pointer)
-    __o_1 = tensor_from_ptr(Ptr{Cvoid}(outputs__[1]))
+    __o_1 = tensor_from_ptr(Ptr{Tensor{T, N}}(outputs__[1]))
     return __o_1
 end

erases the type warnings, but I think (and from my experiment on this example) this has little influence on the benchmark results.
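
One way to check that the patch actually removes the instability is Test.@inferred from the standard library, which throws if the inferred return type is still abstract. A minimal sketch, assuming a patched ThArrays build and the mysum/x from the top of the thread:

using Test

# passes silently once the return type is concretely inferred,
# throws an error while the Union is still there
Test.@inferred ThArrays.gradient(mysum, x)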

Maybe do a comparison between ThArrays and PyTorch to see if the result is sensible?

KDr2 avatar Mar 26 '20 14:03 KDr2

Maybe do a comparison between ThArrays and PyTorch to see if the result is sensible?

Can you code this example in C++, and report the run time?

yebai avatar Mar 26 '20 14:03 yebai

#include <torch/torch.h>
#include <torch/csrc/autograd/variable.h>
#include <torch/csrc/autograd/function.h>

#include <typeinfo>
#include <iostream>

int main() {
    // mirror mysum: accumulate t[i] element by element, then backprop
    torch::Tensor t = torch::rand({10000}, torch::requires_grad(1));
    torch::Tensor r(t[0].clone());
    r.set_requires_grad(1);

    for (int i = 0; i < 10000; i++) {
        r += t[i];
    }
    r.backward();
    // std::cout << t.grad() << std::endl;
    t.grad();
    return 0;
}

you can replace csrc/example_app.cpp with this code, then run make in deps/lib, there will be an example.exe.

On my machine, it takes 0.5s while the Julia version takes 1.25s.

I think in the Julia version most of the time is spent on the indexing ops. In C++, each of these ops is a single function call (and maybe inlined), while in the Julia version each op involves many function calls.

If you use sum instead of mysum, Julia and C++ are at the same level.

I don't know if we can compile our function to TorchScript; if we can, these time-consuming ops would disappear at runtime.

KDr2 avatar Mar 26 '20 15:03 KDr2

you can replace csrc/example_app.cpp with this code, then run make in deps/lib, there will be an example.exe.

Are you running this on Windows?

yebai avatar Mar 26 '20 15:03 yebai

On my machine, it takes 0.5s while the Julia version takes 1.25s.

This result (0.5s) seems consistent with the Julia number when using BenchmarkTools:

julia> @btime  ThArrays.gradient(mysum, x);
  477.310 ms (160017 allocations: 7.32 MiB)

yebai avatar Mar 26 '20 15:03 yebai

No, on Linux; that was a typo, it's an ELF binary. Sorry for the confusion.

$ file example-app
example-app: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=4cbe6786f7f899d38c1cc6be65e7435aba5faa0b, for GNU/Linux 3.2.0, not stripped

On my (old) machine, it takes 1.25s in Julia and 0.5s in C++.

KDr2 avatar Mar 26 '20 15:03 KDr2

julia> @btime ThArrays.gradient(mysum, x);
  1.257 s (188995 allocations: 7.92 MiB)


$ time ./example-app >/dev/null

real    0m0.519s
user    0m0.491s
sys     0m0.013s

KDr2 avatar Mar 26 '20 15:03 KDr2

thanks, @KDr2

julia> @btime ReverseDiff.gradient(mysum, x);
  2.915 ms (70022 allocations: 2.92 MiB)

Based on the C++ runtime for PyTorch, maybe the superiority of ReverseDiff mainly comes from tape caching, which avoids duplicated memory allocation?
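
For reference, ReverseDiff makes this caching explicit through its tape API: record once, compile, then replay. A minimal sketch (the @btime numbers above use the plain gradient call, which re-records the tape on every call), reusing mysum and x from the top of the thread:

using ReverseDiff

# record the operations of mysum once, on a representative input
tape = ReverseDiff.GradientTape(mysum, x)

# compile the tape into a reusable, specialised program
ctape = ReverseDiff.compile(tape)

# replay the cached tape: no re-recording, far fewer allocations
out = similar(x)
ReverseDiff.gradient!(out, ctape, x)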

yebai avatar Mar 26 '20 15:03 yebai

I don't know if we can compile our function to TorchScript; if we can, these time-consuming ops would disappear at runtime.

This sounds plausible for a subset of Julia, given that Julia's syntax is so close to Python's (TorchScript is a subset of Python).

yebai avatar Mar 26 '20 15:03 yebai

I think in the Julia version most of the time is spent on the indexing ops. In C++, each of these ops is a single function call (and maybe inlined), while in the Julia version each op involves many function calls.

I'm wondering if adding @inbounds to mysum could improve the performance of the Julia version?

devmotion avatar Mar 26 '20 15:03 devmotion

In real-world practice, we may not index at such a high frequency; e.g. we should use sum instead of mysum.
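
For reference, the sum-based version of the benchmark would look something like this (a sketch only, timings not reproduced here; sum hits a single vectorised adjoint instead of tracking every x[i] access):

using BenchmarkTools, ReverseDiff, Tracker, Zygote

x = rand(10000)

@btime ReverseDiff.gradient(sum, $x);
@btime Tracker.gradient(sum, $x);
@btime Zygote.gradient(sum, $x);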

About TorchScript, I didn't mean a source-to-source translation; I think we can construct the computational graph using Julia, save the graph as TorchScript or some internal format, and then call it from Turing.

KDr2 avatar Mar 26 '20 15:03 KDr2

About TorchScript, I didn't mean a source-to-source translation; I think we can construct the computational graph using Julia, save the graph as TorchScript or some internal format, and then call it from Turing.

That works too, but at the cost of losing control flow: loops are unrolled, and only the branches actually taken during recording are captured.
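
The same caveat already shows up with ReverseDiff's compiled tapes; a minimal sketch of the branch problem (not TorchScript-specific):

using ReverseDiff

f(x) = x[1] > 0 ? sum(x) : -sum(x)

# the tape only records the branch taken for this particular input
ctape = ReverseDiff.compile(ReverseDiff.GradientTape(f, [1.0, 2.0]))

# replaying on an input that should take the other branch silently
# returns the gradient of the recorded branch, i.e. [1.0, 1.0]
ReverseDiff.gradient!(zeros(2), ctape, [-1.0, 2.0])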

yebai avatar Mar 26 '20 15:03 yebai

The numbers are slightly better with @inbounds:


julia> mysum_inbounds(x) = begin z = 0; @inbounds for i =1:length(x); z += x[i]; end; return z; end
mysum_inbounds (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum_inbounds, x);
  2.436 ms (70022 allocations: 2.92 MiB)

julia> @btime Tracker.gradient(mysum_inbounds, x);
  439.518 ms (150016 allocations: 767.59 MiB)

julia> @btime Zygote.gradient(mysum_inbounds, x);
  856.985 ms (280102 allocations: 1.50 GiB)

julia> @btime ThArrays.gradient(mysum_inbounds, x);
  465.540 ms (160017 allocations: 7.32 MiB)

yebai avatar Mar 26 '20 17:03 yebai

I guess you should also use $ to interpolate x. It might affect the results since you're benchmarking with a global variable.
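
That is, something like the following (a sketch; $x splices the value into the benchmarked expression, so the timed code doesn't pay the cost of a non-constant global):

@btime ReverseDiff.gradient(mysum, $x);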

devmotion avatar Mar 26 '20 17:03 devmotion

Thanks @devmotion, I reran the benchmarks with $x. The results aren't very different from the ones above.

Interestingly, I did another test to separate the effects of indexing and loops, which produces:

julia> mysum2(x) = begin z = 0;_x=x[1]; for i =1:10000; z += _x; end; return z; end
mysum2 (generic function with 1 method)

julia> @btime ReverseDiff.gradient(mysum2, [1.0]);
  2.396 ms (60021 allocations: 2.23 MiB)

julia> @btime Tracker.gradient(mysum2, [1.0]);
  945.712 μs (60022 allocations: 1.83 MiB)

julia> @btime Zygote.gradient(mysum2, [1.0]);
  7.260 ms (100108 allocations: 3.54 MiB)

julia> @btime ThArrays.gradient(mysum2, [1.0]);
  69.599 ms (40026 allocations: 1.68 MiB)

This result, together with the above ones, suggests:

  • Zygote isn't that bad with loops (see https://github.com/FluxML/Zygote.jl/issues/371 and https://github.com/FluxML/Zygote.jl/issues/312), but bad at indexing (see https://github.com/FluxML/Zygote.jl/issues/365)
  • ReverseDiff is ok with both loops and indexing
  • ~~Tracker is quite bad with loops, not sure about indexing.~~
  • Tracker is good with loops, but quite bad with indexing
  • PyTorch / ThArrays is ok with loops but bad with indexing.

PS:

  • Zygote is slow with broadcasting: https://github.com/FluxML/Zygote.jl/issues/592

yebai avatar Mar 26 '20 20:03 yebai

Tracker is quite bad with loops, not sure about indexing.

Tracker is actually the fastest in the loop benchmark, isn't it?

devmotion avatar Mar 26 '20 20:03 devmotion

Tracker is actually the fastest in the loop benchmark, isn't it?

You're right! I didn't notice the unit!

yebai avatar Mar 26 '20 20:03 yebai

It's interesting that ReverseDiff is so good with indexing; I wonder whether we can apply the same trick from ReverseDiff to improve Tracker and Zygote.

yebai avatar Mar 26 '20 20:03 yebai

On the flip side, ReverseDiff and Zygote are both slower than Tracker in code with broadcasting.
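
A broadcast-heavy toy case to check this would be something along these lines (a sketch only, no timings reproduced here; f_bcast is just a made-up example):

using BenchmarkTools, ReverseDiff, Tracker, Zygote

x = rand(10000)
f_bcast(x) = sum(exp.(x) .+ x .^ 2)  # fused broadcast, no explicit loop

@btime ReverseDiff.gradient(f_bcast, $x);
@btime Tracker.gradient(f_bcast, $x);
@btime Zygote.gradient(f_bcast, $x);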

mohamed82008 avatar Mar 26 '20 23:03 mohamed82008

But I am trying to make RD faster.

mohamed82008 avatar Mar 27 '20 00:03 mohamed82008

I did some optimizations on ThArrays (https://github.com/TuringLang/ThArrays.jl/commit/c0217179fcc99e237e04ccb9161c2d0c3fc96003); indexing is a little faster (~10%?) now.

KDr2 avatar Mar 27 '20 01:03 KDr2

I just re-ran the same benchmarks. Here is an update:

julia> @btime ReverseDiff.gradient(mysum, x);
  1.194 ms (70023 allocations: 3.07 MiB)

julia> @btime Tracker.gradient(mysum, x);
  86.576 ms (190012 allocations: 769.27 MiB)

julia> @btime Zygote.gradient(mysum, x);
  90.346 ms (180107 allocations: 769.40 MiB)

yebai avatar Jun 19 '21 10:06 yebai