
Allocations and GC being ignored?

Open andyferris opened this issue 7 years ago • 8 comments

I'm finding different results using BenchmarkTools as opposed to my own loops and @time, particularly regarding allocating operations.

I'm benchmarking Base.Array vs the types in StaticArrays.jl. Some of these arrays are immutable, and don't perform GC allocations, and others are mutable. If you are doing lots of small, e.g., matrix multiplications, then the allocations and GC can dominate the cost, and an immutable array is much faster. I have a non-allocating SArray and a (mutable) allocating MArray.
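The mutable/immutable distinction can be illustrated with a small sketch (using the `@SMatrix`/`@MMatrix` convenience macros from StaticArrays.jl's current API, which may differ from the 2016-era package):

```julia
using StaticArrays

s = @SMatrix [1.0 2.0; 3.0 4.0]  # immutable; no GC allocation needed
m = @MMatrix [1.0 2.0; 3.0 4.0]  # mutable; heap-allocated

m[1, 1] = 10.0                   # fine: MMatrix supports setindex!
# s[1, 1] = 10.0                 # would error: SMatrix is immutable
```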

The typical results I was getting using loops and @time showed that when new copies of Array or MArray were created, a lot of time was spent on allocation and GC (but not for SArray):

=====================================
   Benchmarks for 4×4 matrices
=====================================

Matrix multiplication
---------------------
Array               ->  6.526125 seconds (31.25 M allocations: 3.492 GB, 6.61% gc time)
SArray              ->  0.369290 seconds (5 allocations: 304 bytes)
MArray              ->  1.964021 seconds (15.63 M allocations: 2.095 GB, 12.05% gc time)

Matrix multiplication (mutating)
--------------------------------
Array               ->  4.540372 seconds (6 allocations: 576 bytes)
MArray              ->  0.748238 seconds (6 allocations: 448 bytes)

However, I switched my tests to BenchmarkTools.jl and now the difference between SArray and MArray has disappeared. It appears almost as if the allocation costs have been eliminated somehow. Perhaps I'm using the package wrong, but I get:

=====================================================================
 Matrices of size 4×4 and eltype Float64
=====================================================================
SArray:    m3 = m1 * m2 takes 52.00 ns, 144.00 bytes (GC 0.00 ns)
Array:     m3 = m1 * m2 takes 196.00 ns, 240.00 bytes (GC 0.00 ns)
MArray:    m3 = m1 * m2 takes 55.00 ns, 144.00 bytes (GC 0.00 ns)

Array:     A_mul_B!(m3, m1, m2) takes 150.00 ns, 0.00 bytes (GC 0.00 ns)
MArray:    A_mul_B!(m3, m1, m2) takes 20.00 ns, 0.00 bytes (GC 0.00 ns)

The two calls I make are @benchmark *($(copy(m)), $(copy(m))) and @benchmark A_mul_B!($(copy(m)), $(copy(m)), $(copy(m))), where m is some random 4×4 matrix I made out of the above types. Is that the right way to use @benchmark?
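For reference, the interpolation pattern described above looks roughly like this; `$(...)` evaluates its argument once, before timing, so setup cost is excluded from each sample (a sketch using the thread's Julia 0.5-era A_mul_B!, which on modern Julia is LinearAlgebra.mul!):

```julia
using BenchmarkTools, StaticArrays

m = @MMatrix rand(4, 4)   # hypothetical 4×4 test matrix

# $(copy(m)) runs once, outside the timed region, so each sample
# measures only the multiplication itself:
@benchmark *($(copy(m)), $(copy(m)))
@benchmark A_mul_B!($(copy(m)), $(copy(m)), $(copy(m)))
```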

andyferris avatar Nov 02 '16 11:11 andyferris

Here I use mean instead of the (default) median:

=====================================================================
 Matrices of size 4×4 and eltype Float64
=====================================================================
SMatrix:    m3 = m1 * m2 takes 87.07 ns, 144.00 bytes (GC 19.19 ns)
Array:      m3 = m1 * m2 takes 258.94 ns, 240.00 bytes (GC 32.41 ns)
MMatrix:    m3 = m1 * m2 takes 87.09 ns, 144.00 bytes (GC 19.11 ns)

Array:      A_mul_B!(m3, m1, m2) takes 153.44 ns, 0.00 bytes (GC 0.00 ns)
MMatrix:    A_mul_B!(m3, m1, m2) takes 20.16 ns, 0.00 bytes (GC 0.00 ns)

Now we see GC time (it seems that in many cases GC simply doesn't run, so the median doesn't detect it), but it still doesn't agree with my benchmark, which showed more GC for MArray and none for SArray.
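The mean/median distinction can be checked directly on a trial object; a minimal sketch with an arbitrary allocating workload:

```julia
using BenchmarkTools, Statistics

t = @benchmark sum(rand(1000))

# The median is robust to outliers, so samples in which GC happened to
# run can vanish from it; the mean folds occasional GC pauses back in.
median(t)
mean(t)
```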

andyferris avatar Nov 02 '16 11:11 andyferris

The BenchmarkTools results seem correct to me; it makes sense that the in-place matrix multiply isn't allocating, and that memory usage is small for small arrays.

Also, it seems that you're directly comparing aggregate measurements vs. BenchmarkTools' individual execution estimates, which are totally different measurements (e.g. these numbers seem to be comparing estimates of a single matmul with direct measurements of a large number of matmuls).
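Side by side, the two measurement styles being conflated might look like this (a sketch; the loop count is arbitrary):

```julia
using BenchmarkTools

A = rand(4, 4); B = rand(4, 4)

# Aggregate: one total for a million calls, including loop overhead,
# compilation on the first call, and whatever GC happens to run.
@time for _ in 1:10^6
    A * B
end

# Per-execution estimate: BenchmarkTools tunes the evaluation count
# and reports statistics for a single call.
@benchmark $A * $B
```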

I might be missing something though, since I don't have the actual code. If you believe this is unexpected behavior, can you share the actual code used to generate these numbers (preferably self-contained + whatever packages are required)?

jrevels avatar Nov 02 '16 14:11 jrevels

I still don't know about the OP here, but found a case where this issue is real (https://github.com/JuliaLang/julia/issues/19257).

It comes from the nothing return block added in #4. If I remove that, it resolves the issue. I'll open a PR shortly.

jrevels avatar Nov 08 '16 00:11 jrevels

By resolves the issue, do you mean reports more allocations and longer run-times?

EDIT: Yes, I see that in the PR. Looks great.

andyferris avatar Nov 08 '16 01:11 andyferris

I don't know how https://github.com/JuliaCI/BenchmarkTools.jl/pull/28 affects the issue in the OP here because I can't verify the issue in the OP (I don't have the code).

https://github.com/JuliaCI/BenchmarkTools.jl/pull/28 fixes an issue where the function BenchmarkTools was constructing was "overly" amenable to code elision (either from LLVM or Julia). Even after that PR, however, it's still simple to construct cases where the elision will occur (see that PR for an example). I think in such cases, BenchmarkTools is correct to report no allocation, since running such a function "in the wild" should have the same effect given a similar calling environment.
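One way such elision can still occur (a hypothetical illustration, not the example from that PR): a pure expression on constants can be folded away entirely, whereas hiding the inputs behind `Ref` and interpolating keeps the computation live:

```julia
using BenchmarkTools

# The compiler may constant-fold this away, so the benchmark can
# report near-zero time and no work done:
@benchmark 2.0 + 3.0

# Wrapping inputs in Ref and interpolating defeats constant folding,
# so the addition is actually measured:
x = Ref(2.0); y = Ref(3.0)
@benchmark $x[] + $y[]
```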

jrevels avatar Nov 08 '16 01:11 jrevels

So #28 doesn't seem to affect my results in any meaningful way.

I'm going to have a deeper look sometime but I can't right now. The code is at https://github.com/JuliaArrays/StaticArrays.jl/blob/use-size/perf/benchmark3.jl but that might not be too useful to you (I'll try make a minimal example when I have more time).

andyferris avatar Nov 08 '16 05:11 andyferris

Just a note that #28 fixes a problem we had where some functions were being completely elided. I was just going to file an issue, only to find it already fixed. What a luxury.

KristofferC avatar Nov 08 '16 13:11 KristofferC