RigidBodyDynamics.jl
RigidBodyDynamics.jl copied to clipboard
`BLAS.set_num_threads(1)` drastically improves `dynamics!` performance for Atlas benchmark
Found this out this week. Baseline on current master:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 32.997 μs (0.00% GC)
median time: 34.839 μs (0.00% GC)
mean time: 36.525 μs (0.00% GC)
maximum time: 95.380 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
(a bit slower than http://www.juliarobotics.org/RigidBodyDynamics.jl/latest/benchmarks.html for some reason, by the way)
After using LinearAlgebra; BLAS.set_num_threads(1)
:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 21.929 μs (0.00% GC)
median time: 22.224 μs (0.00% GC)
mean time: 23.062 μs (0.00% GC)
maximum time: 40.309 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
So multiple BLAS threads aren't doing us any favor (with the default OpenBLAS anyway), at least for fewer-than-Atlas degrees of freedom.
Not sure how to act on this since this is a global setting. At least BLAS.set_num_threads
is fast now (about 4 ns, no allocations), so it could be called before and after dynamics_solve!
. But is that thread safe? And how do we get the number of BLAS threads before calling BLAS.set_num_threads(1)
so that it can be reset to the original value? Maybe it should just be a performance tip?
With MKL instead of OpenBLAS this is not an issue. I suspect https://github.com/xianyi/OpenBLAS/issues/1622.
Can you get 'perf record "command" ; perf report" from both cases? ...zoom to see which functions in libopenblas.so.0 are being touched (probably add iterations so that more is seen)
Just in case it is something else than _GEMM, ones with less dimensions can be calibrated faster. There are some improvements for small GEMM lately for recent Intel series (if that applies)
Julia's versioninfo()
should tell enough about BLAS & CPU core in use.
@brada4, thank you for looking at this.
Here's what I did (verbose, mostly just to make this reproducible for myself). I created a test.jl
file containing
using RigidBodyDynamics
using AtlasRobot
using Random
function run_benchmark(result, state, n)
for i = 1 : n
rand!(state)
dynamics!(result, state)
end
return nothing
end
function main()
mechanism = AtlasRobot.mechanism()
state = MechanismState(mechanism)
result = DynamicsResult(mechanism)
@time run_benchmark(result, state, 1_000_000)
end
main()
Running this benchmark with julia -O3 --check-bounds=no
takes about 35 seconds; compilation time should only be a small fraction of this time.
I then ran sudo cpupower frequency-set --governor performance
, followed by
sudo perf record julia --project -O3 --check-bounds=no test.jl # sudo to avoid 'Couldn't record kernel reference relocation symbol' warning
(after running sudo sh -c 'echo -1 >/proc/sys/kernel/perf_event_paranoid'
).
Output:
36.554262 seconds (12.29 M allocations: 557.993 MiB, 0.81% gc time)
[ perf record: Woken up 160 times to write data ]
[ perf record: Captured and wrote 43.549 MB perf.data (1141059 samples) ]
Top entries from perf report
:
12.07% julia libopenblas64_.so [.] dtrsm_kernel_RN_HASWELL
10.85% julia [kernel.kallsyms] [k] syscall_return_via_sysret
4.57% julia libpthread-2.27.so [.] __pthread_mutex_lock
4.32% julia [kernel.kallsyms] [k] __schedule
3.81% julia libopenblas64_.so [.] dgemm_kernel_HASWELL
3.41% julia libopenblas64_.so [.] dtrsm_oltncopy_HASWELL
3.21% julia [kernel.kallsyms] [k] update_curr
2.68% julia [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
2.44% julia [kernel.kallsyms] [k] do_syscall_64
2.30% julia [kernel.kallsyms] [k] pick_next_task_fair
1.83% julia libc-2.27.so [.] __sched_yield
1.77% julia [kernel.kallsyms] [k] __calc_delta
1.41% julia libopenblas64_.so [.] blas_thread_server
1.40% julia [kernel.kallsyms] [k] yield_task_fair
1.35% julia [unknown] [k] 0xfffffe00000e201b
1.34% julia [unknown] [k] 0xfffffe000016601b
1.33% julia [unknown] [k] 0xfffffe000019201b
1.33% julia [unknown] [k] 0xfffffe000013a01b
1.25% julia [kernel.kallsyms] [k] sys_sched_yield
1.21% julia [kernel.kallsyms] [k] pick_next_entity
1.17% julia [unknown] [k] 0xfffffe000005e01b
1.15% julia [kernel.kallsyms] [k] rcu_note_context_switch
1.08% julia libpthread-2.27.so [.] __pthread_mutex_unlock
1.06% julia [unknown] [k] 0xfffffe000008a01b
1.04% julia [unknown] [k] 0xfffffe00000b601b
1.04% julia libopenblas64_.so [.] dgemm_itcopy_HASWELL
1.04% julia [kernel.kallsyms] [k] _raw_spin_lock
1.01% julia [kernel.kallsyms] [k] __indirect_thunk_start
0.93% julia [kernel.kallsyms] [k] native_sched_clock
0.89% julia [kernel.kallsyms] [k] cpuacct_charge
0.87% julia [kernel.kallsyms] [k] update_min_vruntime
0.84% julia [kernel.kallsyms] [k] entry_SYSCALL_64_stage2
0.76% julia [kernel.kallsyms] [k] schedule
0.75% julia [kernel.kallsyms] [k] update_rq_clock
0.75% julia [kernel.kallsyms] [k] cgroup_cpu_stat_updated
0.66% julia [kernel.kallsyms] [k] __cgroup_account_cputime
0.51% julia [kernel.kallsyms] [k] sched_clock_cpu
The results above are all without calling BLAS.set_num_threads(1)
. For comparison, here are the results, with BLAS.set_num_threads(1)
.
4.92% julia libopenblas64_.so [.] dgemm_kernel_HASWELL
2.08% julia libjulia.so.1.0 [.] jl_gc_pool_alloc
1.84% julia libjulia.so.1.0 [.] forall_exists_subtype
1.83% julia libjulia.so.1.0 [.] obviously_disjoint.part.14
1.79% julia libjulia.so.1.0 [.] subtype
1.71% julia libjulia.so.1.0 [.] gc_sweep_pool
1.63% julia libjulia.so.1.0 [.] gc_mark_loop
1.52% julia libjulia.so.1.0 [.] jl_unwrap_unionall
1.40% julia libopenblas64_.so [.] dgemv_n_HASWELL
1.34% julia libopenblas64_.so [.] dtrsm_kernel_RN_HASWELL
1.18% julia libjulia.so.1.0 [.] jl_apply_generic
1.06% julia libjulia.so.1.0 [.] jl_f_getfield
1.01% julia libopenblas64_.so [.] dsyrk_kernel_L
0.96% julia libopenblas64_.so [.] dgemm_beta_HASWELL
0.92% julia libjulia.so.1.0 [.] jl_field_index
0.89% julia libopenblas64_.so [.] dgemm_otcopy_HASWELL
0.85% julia libjulia.so.1.0 [.] jl_alloc_array_1d
0.78% julia libopenblas64_.so [.] dtrsm_iltncopy_HASWELL
0.71% julia libc-2.27.so [.] cfree@GLIBC_2.2.5
0.69% julia libopenblas64_.so [.] dtrsm_ilnncopy_HASWELL
0.67% julia libopenblas64_.so [.] dpotf2_L
0.65% julia libc-2.27.so [.] __memmove_avx_unaligned_erms
0.65% julia sys.so [.] japi1_cache_lookup_658.clone_1.clone_2
0.59% julia libjulia.so.1.0 [.] has_free_typevars
0.58% julia libjulia.so.1.0 [.] jl_typemap_entry_assoc_exact
0.56% julia libjulia.so.1.0 [.] lookup_type_idx.isra.12
0.56% julia libopenblas64_.so [.] dtrsm_RN_solve_opt
0.55% julia libopenblas64_.so [.] dgemm_itcopy_HASWELL
0.55% julia libjulia.so.1.0 [.] forall_exists_equal.part.23
0.54% julia libjulia.so.1.0 [.] typekey_compare.part.11
0.53% julia libopenblas64_.so [.] dtrsm_kernel_LN_HASWELL
0.50% julia libopenblas64_.so [.] dot_compute
Julia version info:
julia> using InteractiveUtils; versioninfo()
Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, broadwell)
julia> using LinearAlgebra; BLAS.openblas_get_config()
"USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16"
This is on an i7-6950X CPU @ 3.00GHz (10 physical cores, 20 hyperthreaded).
Let me know if you need the perf.data
file.
There is no need for perf.data, it is nearly impossible to decode on even slightly different system, does not look very specific to your setup, just that 10/20 cores dramaticize excess parallelism effects better than more common CPUs.
Looks like trsm
threads start to early
https://github.com/xianyi/OpenBLAS/issues/1886
Best you can do before after new year - add #undef SMP
early in interface/trsm.c
and rebuild OpenBLAS , i.e disable trsm threading but leave GEMM threading intact.
OK, thanks, I'll track that issue.
Hello! I am wondering if this is still an issue? Here are my results on an Apple MacBook Air (M1, 2020):
--- results-1-thread.txt 2023-05-06 12:49:34
+++ results-4-threads.txt 2023-05-06 12:49:22
@@ -1,105 +1,105 @@
dynamics_bias!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 3.054 μs … 8.408 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 3.129 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 3.145 μs ± 107.831 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 3.054 μs … 8.871 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 3.112 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 3.124 μs ± 120.414 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
dynamics!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 10.492 μs … 262.429 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 10.608 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 10.710 μs ± 2.535 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 10.508 μs … 26.962 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 10.600 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 10.612 μs ± 181.451 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
momentum:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 1.771 μs … 4.638 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 1.821 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 1.823 μs ± 57.144 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 1.767 μs … 4.458 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 1.850 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 1.850 μs ± 60.572 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
mass_matrix!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 3.800 μs … 12.929 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 3.946 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 3.971 μs ± 210.354 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 3.808 μs … 6.621 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 3.900 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 3.933 μs ± 132.276 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
momentum_matrix!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 2.112 μs … 11.562 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 2.188 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 2.190 μs ± 114.719 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 2.104 μs … 4.821 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 2.167 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 2.170 μs ± 59.499 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
center_of_mass:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 1.171 μs … 4.283 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 1.221 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 1.223 μs ± 65.128 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 1.175 μs … 4.196 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 1.225 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 1.228 μs ± 57.789 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
geometric_jacobian!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 1.600 μs … 4.229 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 1.650 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 1.656 μs ± 99.179 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 1.583 μs … 3.050 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 1.637 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 1.641 μs ± 37.095 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
constraint_bias!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 5.871 μs … 14.075 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 5.967 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 6.060 μs ± 253.963 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 5.921 μs … 762.712 μs ┊ GC (min … max): 0.00% … 98.98%
+ Time (median): 6.017 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 6.172 μs ± 7.569 μs ┊ GC (mean ± σ): 1.22% ± 0.99%
Memory estimate: 496 bytes, allocs estimate: 1.
gravitational_potential_energy:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 1.183 μs … 2.342 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 1.242 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 1.241 μs ± 22.244 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 1.175 μs … 3.083 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 1.238 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 1.240 μs ± 56.878 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
inverse_dynamics!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 2.683 μs … 13.863 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 2.771 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 2.788 μs ± 162.449 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 2.683 μs … 5.279 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 2.746 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 2.754 μs ± 96.378 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
simulate tree:
BenchmarkTools.Trial: 9 samples with 10 evaluations.
- Range (min … max): 49.602 ms … 50.468 ms ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 49.889 ms ┊ GC (median): 0.00%
- Time (mean ± σ): 49.976 ms ± 284.056 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 49.606 ms … 52.434 ms ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 50.114 ms ┊ GC (median): 0.00%
+ Time (mean ± σ): 50.338 ms ± 820.355 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
constraint_jacobian!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 14.696 μs … 832.163 μs ┊ GC (min … max): 0.00% … 98.02%
- Time (median): 15.104 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 15.251 μs ± 8.177 μs ┊ GC (mean ± σ): 0.53% ± 0.98%
+ Range (min … max): 14.762 μs … 8.446 ms ┊ GC (min … max): 0.00% … 99.78%
+ Time (median): 15.196 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 17.100 μs ± 88.417 μs ┊ GC (mean ± σ): 4.93% ± 1.00%
Memory estimate: 496 bytes, allocs estimate: 1.
mass_matrix! and geometric_jacobian!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 4.071 μs … 7.154 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 4.188 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 4.207 μs ± 96.458 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 4.050 μs … 17.783 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 4.225 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 4.273 μs ± 343.992 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
momentum_rate_bias:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 2.554 μs … 5.183 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 2.617 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 2.624 μs ± 105.863 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 2.575 μs … 5.446 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 2.654 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 2.670 μs ± 101.424 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
kinetic_energy:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 1.758 μs … 3.658 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 1.817 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 1.818 μs ± 44.967 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 1.771 μs … 4.075 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 1.854 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 1.855 μs ± 64.257 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
Here's the output of versioninfo()
:
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin22.1.0)
CPU: 8 × Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 1 on 4 virtual cores
And here is how I ran the benchmarks:
cd ~/git/RigidBodyDynamics.jl/perf
julia --project -O3 --check-bounds=no
include("runbenchmarks.jl")
I ran this once with the contents of perf/runbenchmarks.jl
as is, and then a second time after commenting line 13 of perf/runbenchmarks.jl
(the one that sets BLAS.set_num_threads(1)
).
On my old (but trustworthy) Dell XPS 13 9343 (2015) using Julia 1.8.5, I also cannot observe any noticeable differences in performance:
--- results-single-thread.txt 2023-05-06 21:07:21.302449867 +0100
+++ results-multi-thread.txt 2023-05-06 21:07:26.526504507 +0100
@@ -1,59 +1,59 @@
dynamics_bias!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 7.248 μs ± 454.342 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 7.252 μs ± 295.422 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
dynamics!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 24.448 μs ± 1.159 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 24.113 μs ± 1.609 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
momentum:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 4.582 μs ± 435.294 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 4.694 μs ± 276.963 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
mass_matrix!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 12.483 μs ± 553.176 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 11.935 μs ± 523.177 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
momentum_matrix!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 5.256 μs ± 250.141 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 5.262 μs ± 517.004 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
center_of_mass:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 3.007 μs ± 280.680 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 3.013 μs ± 222.686 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
geometric_jacobian!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 3.950 μs ± 198.062 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 4.064 μs ± 493.502 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
constraint_bias!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 14.426 μs ± 17.260 μs ┊ GC (mean ± σ): 1.19% ± 0.99%
+ Time (mean ± σ): 14.329 μs ± 16.452 μs ┊ GC (mean ± σ): 1.14% ± 0.99%
gravitational_potential_energy:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 3.001 μs ± 273.534 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 3.047 μs ± 243.480 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
inverse_dynamics!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 6.627 μs ± 410.959 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 6.737 μs ± 501.527 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
simulate tree:
BenchmarkTools.Trial: 4 samples with 10 evaluations.
- Time (mean ± σ): 116.726 ms ± 766.823 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 114.619 ms ± 1.122 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
constraint_jacobian!:
-BenchmarkTools.Trial: 9410 samples with 10 evaluations.
- Time (mean ± σ): 50.597 μs ± 19.226 μs ┊ GC (mean ± σ): 0.39% ± 1.00%
+BenchmarkTools.Trial: 9236 samples with 10 evaluations.
+ Time (mean ± σ): 51.511 μs ± 25.033 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
mass_matrix! and geometric_jacobian!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 12.391 μs ± 452.979 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 12.118 μs ± 715.884 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
momentum_rate_bias:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 6.308 μs ± 521.010 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 6.290 μs ± 357.568 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
kinetic_energy:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 4.585 μs ± 357.360 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 4.571 μs ± 251.657 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Here's the output of versioninfo()
on this laptop:
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 4 × Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
Threads: 1 on 4 virtual cores
I did some further investigation and I've found that there was a massive leap in performance introduced by Julia version 1.8.
The plots below should demonstrate my findings a bit better.
The benchmark data and the script to generate the plot above can be found in this GitHub Gist.
Interesting. I'd be excited to see things get that much faster from a newer Julia version, but I wonder if something else is going on here. As an example, the pre-1.8 results for dynamics_bias!
look really slow. I would expect that call to take much less than 1ms on any Julia version. For example, here's what I get on my ~5 year old laptop running Julia 1.7.1:
dynamics_bias!:
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.957 μs (0.00% GC)
median time: 6.221 μs (0.00% GC)
mean time: 6.223 μs (0.00% GC)
maximum time: 10.473 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 10
That's much more in line with what you saw in 1.8 and later, and it's more like what I would have expected to begin with.
You're right! There's something else going on... I guess I got a bit too carried away with the idea of having a big improvement from a major Julia version upgrade. Let's focus on just using Julia 1.7.3, otherwise there's too many changing environment factors.
I've been trying to identify where the performance is taking a hit, and I've found that, using Julia v1.7.3, the degradation takes place somewhere between RBD.jl versions 2.3.0 and 2.3.1 ─ but it may not necessarily be a problem with RBD.jl (it may be one of its dependencies).
Note Because the
runbenchmarks.jl
script lives in a folder with its own Project.toml, we need to take special care changing versions in that environment before getting the results.
So, here is what I did step-by-step on my Dell XPS 13:
~/git/RigidBodyDynamics.jl/perf # Navigate into the `perf` folder of RBD.jl
julia-1.7.3 --project -O3 --check-bounds=no # Start a Julia 1.7.3 REPL
] add ..#b3a632064f1e2b731c5bb532201dcdb7a3567ea5 # Explicitly add RigidBodyDynamics v2.3.0
] up # Update all the dependencies
CTRL + D # Stop the Julia REPL session for the changes to deps versions to take effect
julia-1.7.3 --project -O3 --check-bounds=no # Start a Julia 1.7.3 REPL
include("runbenchmarks.jl") # Run the benchmarks
Then I did the same for RBD.jl 2.3.1:
~/git/RigidBodyDynamics.jl/perf # Navigate into the `perf` folder of RBD.jl
julia-1.7.3 --project -O3 --check-bounds=no # Start a Julia 1.7.3 REPL
] add ..#b7388e086d6cd1b8d38002d692de6d7acd0570b7 # Explicitly add RigidBodyDynamics v2.3.1
] up # Update all the dependencies
CTRL + D # Stop the Julia REPL session for the changes to deps versions to take effect
julia-1.7.3 --project -O3 --check-bounds=no # Start a Julia 1.7.3 REPL
include("runbenchmarks.jl") # Run the benchmarks
The dependencies' version changes look like this: (Posting as a picture because the colourful output is helpful.)
Here are the results for those two different versions:
Table I. Time (mean) per method, for RBD.jl 2.3.0 vs 2.3.1.
Method | RBD.jl 2.3.0 | RBD.jl 2.3.1 |
---|---|---|
dynamics_bias! | 7.586 μs | 821.496 μs |
dynamics! | 19.928 μs | 664.098 μs |
momentum | 4.854 μs | 205.602 μs |
mass_matrix! | 6.972 μs | 258.167 μs |
momentum_matrix! | 5.528 μs | 249.821 μs |
center_of_mass | 3.049 μs | 101.816 μs |
geometric_jacobian! | 4.173 μs | 142.554 μs |
constraint_bias! | 15.036 μs | 1.368 ms |
gravitational_potential_energy | 3.034 μs | 102.590 μs |
inverse_dynamics! | 6.885 μs | 443.869 μs |
constraint_jacobian! | 33.388 μs | 762.573 μs |
mass_matrix! and geometric_jacobian! | 7.480 μs | 261.588 μs |
momentum_rate_bias | 6.851 μs | 495.052 μs |
kinetic_energy | 4.850 μs | 206.755 μs |
Note
simulate tree
is excluded from the table above because on RBD.jl 2.3.1 it took too long to evaluate and provided only one sample for the benchmark.
Julia 1.7.3 with RBD.jl 2.3.0 produced the values which I think make the most sense (given Twan's previous benchmarks and Robin's comment above) ─ and so I am going to take those as my "ground truth" from now on.
Next, I focused on investigating changes to dependencies versions, while keeping Julia at v1.7.3 and RBD.jl at 2.3.1. The table below shows the results of those benchmarks. RBD.jl 2.3.1 with its default dependencies (1st results column) is the version where the performance has taken a hit, so what I tried was to downgrade some of its dependencies based on the image I shared above. I tried downgrading Rotations.jl (2nd results column), and then StaticArrays.jl as well (3rd results column).
Table II. Time (mean) per method, for RBD.jl 2.3.1 with different dependencies versions.
Julia | 1.7.3 | 1.7.3 | 1.7.3 | 1.7.3 | 1.7.3 |
RBD.jl | 2.3.1 | 2.3.1 | 2.3.1 | 2.3.1 | 2.3.1 |
Rotations.jl | 1.4.0 | 1.3.4 | 1.3.0 | 1.2.0 | 1.2.0 |
StaticArrays.jl | 1.5.24 | 1.5.24 | 1.5.24 | 1.5.24 | 0.12.6 |
dynamics_bias! | 821.496 μs | 845.163 μs | 892.266 μs | 7.421 μs | 7.690 μs |
dynamics! | 664.098 μs | 668.992 μs | 732.807 μs | 25.610 μs | 19.421 μs |
momentum | 205.602 μs | 207.841 μs | 219.271 μs | 4.732 μs | 5.990 μs |
mass_matrix! | 258.167 μs | 258.367 μs | 288.175 μs | 15.182 μs | 7.858 μs |
momentum_matrix! | 249.821 μs | 251.832 μs | 265.107 μs | 5.466 μs | 6.053 μs |
center_of_mass | 101.816 μs | 100.351 μs | 105.490 μs | 3.138 μs | 3.085 μs |
geometric_jacobian! | 142.554 μs | 144.602 μs | 149.416 μs | 4.184 μs | 4.185 μs |
constraint_bias! | 1.368 ms | 1.345 ms | 1.299 ms | 14.565 μs | 15.719 μs |
gravitational_potential_energy | 102.590 μs | 103.186 μs | 105.668 μs | 3.127 μs | 3.063 μs |
inverse_dynamics! | 443.869 μs | 447.642 μs | 472.323 μs | 6.909 μs | 6.755 μs |
constraint_jacobian! | 762.573 μs | 680.845 μs | 691.560 μs | 54.571 μs | 34.138 μs |
mass_matrix! and geometric_jacobian! | 261.588 μs | 261.593 μs | 274.236 μs | 12.337 μs | 7.484 μs |
momentum_rate_bias | 495.052 μs | 512.251 μs | 522.313 μs | 6.510 μs | 6.504 μs |
kinetic_energy | 206.755 μs | 208.422 μs | 220.578 μs | 4.957 μs | 4.692 μs |
Looking at the results in Table II, it seems the main performance hit is from Rotations.jl 1.2
–1.4
, but there is something happening also between StaticArrays.jl 0.12.6
–1.5.24
.
And that's it for now... :slightly_frowning_face: This is where I am at with my current investigation. I will try to dig a bit further tomorrow.
It is now clear to me that the performance hit was introduced with Rotations.jl 1.3.0
, as the immediate previous version to that (1.2.0
) produces results close to our "ground truth". (Let's set aside the smaller performance hit taken somewhere between StaticArrays.jl 0.12.6
–1.5.24
.)
Another dependency that changes when upgrading from Rotations.jl 1.2.0
to 1.3.0
is Quaternions.jl, from 0.4.9
to 0.5.7
. However, I manually tested the combination of Rotations.jl 1.3.0
with Quaternions.jl 0.4.9
, and the benchmark results were the same as the ones from using Rotations.jl 1.3.0
with Quaternions.jl 0.5.7
─ meaning that the changes between those two versions of Quaternions.jl did not affect RBD.jl.
At this point, I went through the commits that make up Rotations.jl release 1.3.0
. I found that the commit that introduced the "issue" was commit 345d0d9 ─ which is the merging of PR https://github.com/JuliaGeometry/Rotations.jl/pull/219.
Using some intuition I tried to revert the changes to the RotMatrix
constructor, and using that as a patch on top of version 1.3.0 ─ see https://github.com/ferrolho/Rotations.jl/commit/9bea25b2407a580de3343f33b1a8f34305c81ab8 ─ allowed me to recover the good benchmark results from previous versions.
Table III. Time (mean) per method, for RBD.jl 2.3.1 and StaticArrays.jl 1.5.24, without and with a patched Rotations.jl 1.3.0.
Julia | 1.7.3 | 1.7.3 | 1.8.5 | 1.7.3 | 1.8.5 |
RBD.jl | 2.3.1 | 2.3.1 | 2.3.1 | 2.3.1 | 2.3.1 |
Rotations.jl | 1.3.0 | 1.3.0 (patch) | 1.3.0 (patch) | 1.3.0 (patch) | 1.3.0 (patch) |
StaticArrays.jl | 1.5.24 | 1.5.24 | 1.5.24 | 1.5.24 | 1.5.24 |
Number of Threads | Single | Single | Single | Multi | Multi |
dynamics_bias! | 892.266 μs | 7.860 μs | 7.318 μs | 7.679 μs | 7.651 μs |
dynamics! | 732.807 μs | 25.106 μs | 24.347 μs | 66.197 μs | 24.443 μs |
momentum | 219.271 μs | 4.694 μs | 4.648 μs | 4.817 μs | 4.730 μs |
mass_matrix! | 288.175 μs | 11.654 μs | 12.146 μs | 12.618 μs | 12.280 μs |
momentum_matrix! | 265.107 μs | 5.594 μs | 5.356 μs | 5.429 μs | 5.731 μs |
center_of_mass | 105.490 μs | 3.085 μs | 3.147 μs | 3.196 μs | 3.214 μs |
geometric_jacobian! | 149.416 μs | 4.286 μs | 4.082 μs | 4.027 μs | 4.048 μs |
constraint_bias! | 1.299 ms | 14.783 μs | 15.263 μs | 15.230 μs | 15.238 μs |
gravitational_potential_energy | 105.668 μs | 3.041 μs | 3.977 μs | 3.147 μs | 3.392 μs |
inverse_dynamics! | 472.323 μs | 6.660 μs | 7.745 μs | 7.352 μs | 6.821 μs |
constraint_jacobian! | 691.560 μs | 53.780 μs | 54.215 μs | 53.991 μs | 51.407 μs |
mass_matrix! and geometric_jacobian! | 274.236 μs | 13.455 μs | 12.106 μs | 12.284 μs | 12.252 μs |
momentum_rate_bias | 522.313 μs | 6.411 μs | 6.292 μs | 6.517 μs | 6.409 μs |
kinetic_energy | 220.578 μs | 4.702 μs | 4.871 μs | 4.781 μs | 4.601 μs |
Here's how to reproduce the results in the 1st timings column:
cd ~/git/RigidBodyDynamics.jl/perf # Navigate into the `perf` folder of RBD.jl
julia-1.7.3 --project -O3 --check-bounds=no # Start a Julia 1.7.3 REPL
] add ..#b7388e086d6cd1b8d38002d692de6d7acd0570b7 # Explicitly add RigidBodyDynamics.jl 2.3.1
] up # Update all the dependencies
include("runbenchmarks.jl") # Run the benchmarks
And here is how to reproduce the results in the 2nd timings column:
cd ~/git/RigidBodyDynamics.jl/perf # Navigate into the `perf` folder of RBD.jl
julia-1.7.3 --project -O3 --check-bounds=no # Start a Julia 1.7.3 REPL
] add ..#b7388e086d6cd1b8d38002d692de6d7acd0570b7 # Explicitly add RigidBodyDynamics.jl 2.3.1
] up # Update all the dependencies
] add https://github.com/ferrolho/Rotations.jl#rbdjl-issue-500 # Add patch to Rotations.jl 1.3.0
include("runbenchmarks.jl") # Run the benchmarks
As for the remaining columns, I collected them in a similar fashion but using Julia 1.8.5. As you can see, there is no difference in performance between single and multi-threaded BLAS in Julia 1.8.5 (in contrast to what we see for Julia 1.7.3) ─ so maybe we can close this issue?
As for the difference in performance due to Rotations.jl, it seems that on Julia 1.8.5 and with the latest version of RBD.jl (on master), we don't observe the close-to-1-ms times, so the patch that I've applied in this discovery has probably already been applied in one form or another in recent versions of that package.
So, having both those things sorted, the only outstanding issue I can see is with StaticArrays.jl, which on my Dell XPS 13 has increased the mean time of dynamics! in this benchmark from ~19 μs to ~24 μs.
Wow, this is excellent. Thanks for the detailed investigation!
I'm a little confused about what you said here:
As for the difference in performance due to Rotations.jl, it seems that on Julia 1.8.5 and with the latest version of RBD.jl (on master), we don't observe the close-to-1-ms times, so the patch that I've applied in this discovery has probably already been applied in one form or another in recent versions of that package.
I wonder if that's just because the master version of RBD.jl is holding Rotations.jl back to some version prior to the bad 1.3.0 version? In any case, I think we should:
- Try to get your patch merged into Rotations.jl and get them to tag a new release
- Restrict RBD.jl to Rotations.jl < 1.3 until that happens.