Simple benchmarking for Tracker, ReverseDiff, and Zygote
The following is a simple benchmark of loops on 3 different reverse-mode AD implementations in Julia. At the moment, in terms of efficiency: `ReverseDiff` > `Tracker` > `Zygote`.
julia> mysum(x) = begin z = 0; for i = 1:length(x); z += x[i]; end; return z; end
mysum (generic function with 1 method)
julia> using Zygote, Tracker, ReverseDiff, BenchmarkTools
julia> x = rand(10000);
julia> @btime ReverseDiff.gradient(mysum, x);
2.915 ms (70022 allocations: 2.92 MiB)
julia> @btime Tracker.gradient(mysum, x);
491.095 ms (150016 allocations: 767.59 MiB)
julia> @btime Zygote.gradient(mysum, x);
976.786 ms (280102 allocations: 1.50 GiB)
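An aside on the harness: `mysum` starts the accumulator with `z = 0`, an `Int`, so `z` changes type on the first iteration. If anyone wants to rule out that type instability, a type-stable variant could look like this sketch:

```julia
# Type-stable variant: start the accumulator at zero(eltype(x)) so z keeps
# the element type (or the tracked type under AD) throughout the loop.
mysum_stable(x) = begin
    z = zero(eltype(x))
    for i in 1:length(x)
        z += x[i]
    end
    return z
end
```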
Benchmark setup:
(v1.2) pkg> st
Status `~/.julia/environments/v1.2/Project.toml`
[0bf59076] AdvancedHMC v0.2.20
[131c737c] ArviZ v0.2.4
[c52e3926] Atom v0.11.3
[6e4b80f9] BenchmarkTools v0.4.3
[5d742f6a] CSVFiles v1.0.0
[8f4d0f93] Conda v1.3.0
[a93c6f00] DataFrames v0.20.0
[163ba53b] DiffResults v1.0.2
[31c24e10] Distributions v0.21.12
[bbc10e6e] DynamicHMC v2.1.3
[f6369f11] ForwardDiff v0.10.8
[7073ff75] IJulia v1.20.2
[e5e0dc1b] Juno v0.7.2
[c7f686f2] MCMCChains v1.0.0
[91a5bcdd] Plots v0.28.4
[438e738f] PyCall v1.91.2
[37e2e3b7] ReverseDiff v1.1.0
[2913bbd2] StatsBase v0.32.0
[4c63d2b9] StatsFuns v0.9.3
[f3b207a7] StatsPlots v0.13.0
[1d978283] TensorFlow v0.11.0
[9f7883ad] Tracker v0.2.6
[fce5fe82] Turing v0.8.2
[0f1e0344] WebIO v0.8.13
[e88e6eb3] Zygote v0.4.9
[ade2ca70] Dates
cc: @phipsgabler
There is a more comprehensive one available here: https://github.com/TuringLang/TuringExamples/blob/kx/improve_functionality/benchmarks/auto_diff/loop-vs-unrolled.ipynb
Numbers using PyTorch via ThArrays
julia> using ThArrays
julia> @btime ThArrays.gradient(mysum, x);
477.310 ms (160017 allocations: 7.32 MiB)
This example might be over-favourable to `ReverseDiff`. @mohamed82008 @xukai92
@KDr2 can you check whether ThArrays has any type-instability issues?
julia> @code_warntype ThArrays.gradient(mysum, x);
Variables
#self#::Core.Compiler.Const(ThArrays.gradient, false)
f::Core.Compiler.Const(mysum, false)
data::Tuple{Array{Float64,1}}
Body::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
1 ─ %1 = Core.tuple(ThArrays.C_NULL, #self#, f)::Core.Compiler.Const((Ptr{Nothing} @0x0000000000000000, ThArrays.gradient, mysum), false)
│ %2 = Core._apply(ThArrays.:(#gradient#30), %1, data)::Tuple{Union{Int64, Tensor{_A,_B} where _B where _A},Tensor{_A,_B} where _B where _A}
└── return %2
@yebai Defining
+tensor_from_ptr(p::Ptr{Tensor{T, N}}) where {T, N} = Tensor{T, N}(p, nothing)
and patching `grad` as
-function grad(self::Tensor)
+function grad(self::Tensor{T, N}) where {T, N}
outputs__ = Int[0]
__cret = ccall((:atg_grad, :libtorch_capi),
Cvoid, (Ptr{Cvoid}, Ptr{Cvoid}),
outputs__, self.pointer)
- __o_1 = tensor_from_ptr(Ptr{Cvoid}(outputs__[1]))
+ __o_1 = tensor_from_ptr(Ptr{Tensor{T, N}}(outputs__[1]))
return __o_1
end
erases the type warnings, but I think (and my experiment on this example confirms) it has little influence on the benchmark results.
Maybe do a comparison between ThArrays and PyTorch to see if the result is sensible? Can you code this example in C++ and report the run time?
#include <torch/torch.h>
#include <torch/csrc/autograd/variable.h>
#include <torch/csrc/autograd/function.h>
#include <typeinfo>
#include <iostream>
int main() {
  torch::Tensor t = torch::rand({10000}, torch::requires_grad(true));
  torch::Tensor r(t[0].clone());
  r.set_requires_grad(true);
  // start at 1: r already holds t[0], so starting at 0 would add it twice
  for (int i = 1; i < 10000; i++) {
    r += t[i];
  }
  r.backward();
  // std::cout << t.grad() << std::endl;
  t.grad();
  return 0;
}
You can replace `csrc/example_app.cpp` with this code, then run `make` in `deps/lib`; there will be an `example.exe`.
On my machine, it takes 0.5s while the Julia version takes 1.25s.
I think in the Julia version most of the time is spent on the indexing ops. In C++, each of these ops is a single function call (and may be inlined), while in the Julia version each op involves many function calls.
If you use `sum` instead of `mysum`, Julia and C++ are at the same level.
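For reference, the `sum` version of the comparison would be run like this (a sketch, assuming `sum` has methods for the respective tracked types; no timings recorded here):

```julia
# `sum` hits a single vectorized adjoint instead of 10_000 scalar indexing
# ops, so the recorded tape/graph stays tiny.
julia> @btime ReverseDiff.gradient(sum, $x);
julia> @btime Zygote.gradient(sum, $x);
julia> @btime ThArrays.gradient(sum, $x);
```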
I don't know if we can compile our function to TorchScript; if we can, these time-consuming ops would disappear at runtime.
> You can replace `csrc/example_app.cpp` with this code, then run `make` in `deps/lib`; there will be an `example.exe`.
Are you running this on Windows?
> On my machine, it takes 0.5s while the Julia version takes 1.25s.
This result (0.5s) seems consistent with Julia when using `BenchmarkTools`:
julia> @btime ThArrays.gradient(mysum, x);
477.310 ms (160017 allocations: 7.32 MiB)
No, on Linux. The `.exe` was a typo; it's an ELF binary, sorry for the confusion.
$ file example-app
example-app: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=4cbe6786f7f899d38c1cc6be65e7435aba5faa0b, for GNU/Linux 3.2.0, not stripped
On my (old) machine, it takes 1.25s in Julia and 0.5s in C++.
julia> @btime ThArrays.gradient(mysum, x);
1.257 s (188995 allocations: 7.92 MiB)
$ time ./example-app >/dev/null
real 0m0.519s
user 0m0.491s
sys 0m0.013s
Thanks, @KDr2.
> julia> @btime ReverseDiff.gradient(mysum, x);
> 2.915 ms (70022 allocations: 2.92 MiB)

Based on the C++ runtime for PyTorch, maybe the superiority of `ReverseDiff` comes mainly from tape caching, which avoids duplicate memory allocation?
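One way to test that hypothesis: ReverseDiff can pre-record and compile the tape once and then replay it, which makes the caching explicit (a sketch using the `GradientTape`/`compile` API; numbers not recorded here):

```julia
using ReverseDiff: GradientTape, compile, gradient!

# Record mysum's operations once on a sample input and compile the tape;
# replaying the compiled tape avoids re-recording (and its allocations)
# on every gradient call.
const mysum_tape = compile(GradientTape(mysum, rand(10000)))
out = similar(x)
@btime gradient!($out, $mysum_tape, $x);
```

Note that a compiled tape freezes control flow, so this is only valid when branches and loop lengths don't depend on the input values.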
> I don't know if we can compile our function to TorchScript; if we can, these time-consuming ops would disappear at runtime.
This sounds plausible for a subset of Julia, given that Julia's syntax is so close to Python's (TorchScript is a subset of Python).
> I think in the Julia version most of the time is spent on the indexing ops. In C++, each of these ops is a single function call (and may be inlined), while in the Julia version each op involves many function calls.
I'm wondering if adding `@inbounds` to `mysum` could improve the performance of the Julia version?
In real-world practice, we may not index at such a frequency; e.g. we should use `sum` instead of `mysum`.
About TorchScript, I didn't mean a source-to-source translation; I think we can construct the computational graph using Julia, save the graph as TorchScript or some internal format, then call it in Turing.
> About TorchScript, I didn't mean a source-to-source translation; I think we can construct the computational graph using Julia, save the graph as TorchScript or some internal format, then call it in Turing.
That works too, but at the cost of losing control flow: loops are unrolled, and branches are recorded only for the branch actually taken.
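ReverseDiff's compiled tapes have the same limitation, which makes it easy to illustrate (a sketch; `f` is a made-up toy function):

```julia
using ReverseDiff

# The tape records only the branch taken at recording time (x[1] > 0 here);
# replaying it on an input that takes the other branch silently returns the
# gradient of the recorded branch.
f(x) = x[1] > 0 ? sum(x) : -sum(x)
tape = ReverseDiff.compile(ReverseDiff.GradientTape(f, [1.0, 2.0]))
ReverseDiff.gradient!(zeros(2), tape, [-1.0, 2.0])  # gives [1.0, 1.0], not [-1.0, -1.0]
```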
The numbers are slightly better with `@inbounds`:
julia> mysum_inbounds(x) = begin z = 0; @inbounds for i = 1:length(x); z += x[i]; end; return z; end
mysum_inbounds (generic function with 1 method)
julia> @btime ReverseDiff.gradient(mysum_inbounds, x);
2.436 ms (70022 allocations: 2.92 MiB)
julia> @btime Tracker.gradient(mysum_inbounds, x);
439.518 ms (150016 allocations: 767.59 MiB)
julia> @btime Zygote.gradient(mysum_inbounds, x);
856.985 ms (280102 allocations: 1.50 GiB)
julia> @btime ThArrays.gradient(mysum_inbounds, x);
465.540 ms (160017 allocations: 7.32 MiB)
I guess you should also use `$` to interpolate `x`; it might affect the results, since you're benchmarking with a global variable.
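Concretely (sketch):

```julia
# `$x` splices the value into the benchmarked expression, so @btime doesn't
# measure the cost of looking up the untyped global x on every evaluation.
julia> @btime ReverseDiff.gradient(mysum, $x);
```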
Thanks @devmotion, I reran the benchmarks with `$x`. The results aren't very different from the ones above.
Interestingly, I did another test to separate the effects of `indexing` and `loops`, which produces:
julia> mysum2(x) = begin z = 0; _x = x[1]; for i = 1:10000; z += _x; end; return z; end
mysum2 (generic function with 1 method)
julia> @btime ReverseDiff.gradient(mysum2, [1.0]);
2.396 ms (60021 allocations: 2.23 MiB)
julia> @btime Tracker.gradient(mysum2, [1.0]);
945.712 μs (60022 allocations: 1.83 MiB)
julia> @btime Zygote.gradient(mysum2, [1.0]);
7.260 ms (100108 allocations: 3.54 MiB)
julia> @btime ThArrays.gradient(mysum2, [1.0]);
69.599 ms (40026 allocations: 1.68 MiB)
This result, together with the ones above, suggests:

- `Zygote` isn't that bad with `loops` (see https://github.com/FluxML/Zygote.jl/issues/371 and https://github.com/FluxML/Zygote.jl/issues/312), but is bad at `indexing` (see https://github.com/FluxML/Zygote.jl/issues/365)
- `ReverseDiff` is OK with both `loops` and `indexing`
- ~~`Tracker` is quite bad with `loops`, not sure about `indexing`.~~
- `Tracker` is good with `loops`, but quite bad with `indexing`
- `PyTorch`/`ThArrays` is OK with `loops` but bad with `indexing`.
PS:
- Zygote is slow with broadcasting: https://github.com/FluxML/Zygote.jl/issues/592
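A minimal broadcasting case to reproduce that (a sketch; `f_bc` is a made-up example, no numbers recorded here):

```julia
julia> f_bc(x) = sum(abs2.(x));  # the broadcast abs2.(x) goes through Zygote's broadcast adjoint

julia> @btime Zygote.gradient(f_bc, $x);
```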
> Tracker is quite bad with loops, not sure about indexing.
Tracker is actually the fastest in the loop benchmark, isn't it?
> Tracker is actually the fastest in the loop benchmark, isn't it?
You're right! I didn't notice the unit (μs vs ms)!
Interesting that `ReverseDiff` is so good with indexing; I wonder whether we can apply the same trick from `ReverseDiff` to improve `Tracker` and `Zygote`.
On the flip side, ReverseDiff and Zygote are both slower than Tracker in code with broadcasting.
But I am trying to make RD faster.
I did some optimizations on ThArrays (https://github.com/TuringLang/ThArrays.jl/commit/c0217179fcc99e237e04ccb9161c2d0c3fc96003); indexing is a little faster (~10%) now.
I just re-ran the same benchmarks. Here is an update:
julia> @btime ReverseDiff.gradient(mysum, x);
1.194 ms (70023 allocations: 3.07 MiB)
julia> @btime Tracker.gradient(mysum, x);
86.576 ms (190012 allocations: 769.27 MiB)
julia> @btime Zygote.gradient(mysum, x);
90.346 ms (180107 allocations: 769.40 MiB)