AutoPreallocation.jl icon indicating copy to clipboard operation
AutoPreallocation.jl copied to clipboard

preallocate actually has more (while not by number of bytes) allocations!

Open PallHaraldsson opened this issue 5 years ago • 13 comments

FYI: I was looking into this package, and ran the code from the README, and (at least on Julia 1.5) it has two extra allocations.

[From the tag-line: "Remember what memory we needed last time and use it gain every time after", better to change to "to gain" and finish with a period?]

Since preallocation wasn't clearly (much) faster, I was thinking isn't that the point (and to have less/or not variability for real-time), maybe the timing wasn't reliable, so I checked:

julia> @btime f(A, B, C);
  33.333 ms (4 allocations: 15.26 MiB)

julia> @btime preallocated_f(A, B, C);
  27.981 ms (6 allocations: 272 bytes)

julia> @btime preallocated_f(A, B, C);
  38.364 ms (6 allocations: 272 bytes)

julia> @btime preallocated_f(A, B, C);
  21.913 ms (6 allocations: 272 bytes)

julia> @btime f(A, B, C);
  22.533 ms (4 allocations: 15.26 MiB)

I guess this is also useful for GPUs, some such way needed there?

Is a better example needed to show the usefulness? FYI2, I tried with a larger arrays (and I think it mostly shows how fast/good Julia's GC is):

julia> A, B, C = (rand(10000, 10000) for _ in 1:3)
Base.Generator{UnitRange{Int64},var"#14#15"}(var"#14#15"(), 1:3)

[..]

julia> @btime f(A, B, C);
  23.648 s (4 allocations: 1.49 GiB)

julia> @btime preallocated_f(A, B, C);
  22.597 s (6 allocations: 272 bytes)

julia> @time preallocated_f(A, B, C);
 22.931816 seconds (6 allocations: 272 bytes)

julia> @time f(A, B, C);
 22.840137 seconds (4 allocations: 1.490 GiB)

PallHaraldsson avatar Apr 20 '20 11:04 PallHaraldsson

Correct, preallocation has more allocations, but they are much smaller. They should not be there, and I think can be removed, it just required a fair bit of careful digging around to find and fix them. I think they are related to type inference failing. Getting down to zero allocations in using a thing is fiddly

I believe the performance changes are not visiable in small benchmarks. A better example is needed. I suspect this would do quite well on Optim (not sure how dynamic optims memory allocation pattern is though) @Roger-luo reports big speedups with Zygote/Flux

oxinabox avatar Apr 20 '20 11:04 oxinabox

small benchmarks

See me addition above. Yes, I figured the actual number of allocations are not a big deal, still surprising that the amount of bytes doesn't matter much. [Most useful comment here might be about the tagline?]

PallHaraldsson avatar Apr 20 '20 11:04 PallHaraldsson

small benchmarks See me addition above. Yes, I figured the actual number of allocations, not a big deal, still surprising that the amount of bytes doesn't matter much.

Yeah not just size of memory allcoation, i think this shows up most when you have a large number of largish allocations. In particular you need to be actually triggering GC for it to really matter I think.

(and I think it mostly shows how fast/good Julia's GC is):

You are probably finding that julia GC is not being triggered. GC only runs when you start to run a bit low on RAM.

But it is showing how good Julia's allocator is at creating memory

oxinabox avatar Apr 20 '20 11:04 oxinabox

The D language has @nogc for functions (and those functions can only call such functions). I've thought such might be useful for Julia (should be able to do in a macro? While tricky since it has to limit calling to calling such functions).

But it is showing how good Julia's allocator is at creating memory

Right, I was trying to exercise it, and yes, seeing GC triggered (might work on your machine to test). Since I've gotten 128 GB in my desktop, I've probably not seen it since...

As triggering GC often actually happens on other machines, being able to simulate such an environment would be nice and good for knowing the slowdown, maybe I (or could BenchmarkTools simulate such an environment?) should sometimes time with some VM or some settings (ulimit); prlimit which wasn't too helpful:

prlimit -d700000000:700000000 ~/julia-1.5-012b270df6/bin/julia

PallHaraldsson avatar Apr 20 '20 12:04 PallHaraldsson

I think somehow there is a performance regression on Julia 1.5, although I thought https://github.com/JuliaLang/julia/pull/34126 would fix our last a few small allocations due to RefValue, now we are having more allocations even in the README example...

I'm not sure where the allocation actually happens, @oxinabox any idea how to actually find the allocations? seems quite nasty to read the IR code directly to me.

Roger-luo avatar May 07 '20 13:05 Roger-luo

I do not have a good way

oxinabox avatar May 07 '20 15:05 oxinabox

This is supposed to do it, but it's not generating .mem files for me. Not sure what I'm doing wrong: https://docs.julialang.org/en/v1/manual/profile/#Memory-allocation-analysis

cscherrer avatar Mar 17 '21 16:03 cscherrer

Ok so this is not a great way to do it, but it's a start...

  1. Make sure the functions you want to inspect are in a file. They might even need to be in a package, I'm not sure about that. But the REPL isn't enough.
  2. Start Julia with julia --track-allocation=all
  3. Call the functions you're interested in
  4. Call Profile.clear_malloc_data()
  5. Call them again, this time for the actual measurement
  6. Exit Julia

Now, you can use fd "\.mem" using the fd utility (or find is fine though a bit slower) to get the files.

To get the actual lines, you can do fd "\.mem" | xargs sort -n

In this case, the ones with actual numbers (allocations) were

       32 println(io::IO, xs...) = print(io, xs..., "\n")
       48     sort!(p, alg, Perm(ordr,v))
       80     b = Vector{T}(undef, length(a))
       96 println(xs...) = println(stdout::IO, xs...)
      128         y = iterate(itr)
      144         show(io, x)
      224         collect_to_with_first!(_array_for(typeof(v1), itr.iter, isz), v1, itr, st)
      256         y = iterate(itr, st)
      432             print(io, x)
      448 indexed_iterate(t::Tuple, i::Int, state=1) = (@_inline_meta; (getfield(t, i), i+1))
      480                 let t = t, val = val; _all(i->val[i] isa fieldtype(t, i), 1:n); end
      528     return :($(esc(ex)) ? $(nothing) : throw(AssertionError($msg)))
     1456     esc(isa(ex, Expr) ? pushmeta!(ex, :noinline) : ex)
     2000     p = similar(Vector{eltype(ax)}, ax)
     4000     dest = similar(A, shape)

Note that if you run it several times you'll end up with several timestams. You can do ls -alth and see which ones are the newest. The number before mem is a sort of tag.

In my case, the newest one is 987335. So

➤ fd "\.mem" | grep 987335
julia/julia-1.6.0-rc1/share/julia/base/Base.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/abstractarray.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/array.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/arrayshow.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/boot.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/channels.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/compiler/abstractinterpretation.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/compiler/stmtinfo.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/coreio.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/deepcopy.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/deprecated.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/dict.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/error.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/errorshow.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/essentials.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/expr.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/float.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/generator.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/gmp.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/io.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/iterators.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/loading.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/meta.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/mpfr.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/multidimensional.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/multimedia.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/operators.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/ordering.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/process.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/reduce.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/reducedim.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/reflection.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/regex.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/set.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/show.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/sort.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/stream.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/strings/io.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/subarray.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/task.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/threads.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/tuple.jl.987335.mem
julia/julia-1.6.0-rc1/share/julia/base/util.jl.987335.mem

and then

➤ fd "\.mem" | grep 987335 | xargs sort -n | tail -n 20
        - "∀ ε > 0, ∃ δ > 0: |x-y| < δ ⇒ |f(x)-f(y)| < ε"
        - "εδxyδfxfyε"
        - "π"
        - "ℵ\\\\0"
        - "ℵ\\\\x000"
       32 println(io::IO, xs...) = print(io, xs..., "\n")
       48     sort!(p, alg, Perm(ordr,v))
       80     b = Vector{T}(undef, length(a))
       96 println(xs...) = println(stdout::IO, xs...)
      128         y = iterate(itr)
      144         show(io, x)
      224         collect_to_with_first!(_array_for(typeof(v1), itr.iter, isz), v1, itr, st)
      256         y = iterate(itr, st)
      432             print(io, x)
      448 indexed_iterate(t::Tuple, i::Int, state=1) = (@_inline_meta; (getfield(t, i), i+1))
      480                 let t = t, val = val; _all(i->val[i] isa fieldtype(t, i), 1:n); end
      528     return :($(esc(ex)) ? $(nothing) : throw(AssertionError($msg)))
     1456     esc(isa(ex, Expr) ? pushmeta!(ex, :noinline) : ex)
     2000     p = similar(Vector{eltype(ax)}, ax)
     4000     dest = similar(A, shape)

cscherrer avatar Mar 19 '21 00:03 cscherrer

With julia 1.8, this sample case generates 1k allocations instead of 1

julia> using AutoPreallocation

julia> function f()
                       x = rand(100)
                       nothing
                     end
f (generic function with 1 method)

julia> x, preallocated_f = preallocate(f);

julia> preallocated_f()


julia> @timev  preallocated_f()
  0.000053 seconds (1.20 k allocations: 24.906 KiB)
elapsed time (ns):  52981
gc time (ns):       0
bytes allocated:    25504
pool allocs:        1196
non-pool GC allocs: 0
minor collections:  0
full collections:   0

julia> @timev  f()
  0.000004 seconds (1 allocation: 896 bytes)
elapsed time (ns):  4020
gc time (ns):       0
bytes allocated:    896
pool allocs:        1
non-pool GC allocs: 0
minor collections:  0
full collections:   0

dpinol avatar Oct 17 '22 11:10 dpinol

This can't really be fixed here (except in special case), it needs to be fixed in Cassette or more likely by rewriting this with MixTape.jl

oxinabox avatar Oct 28 '22 18:10 oxinabox

I don't know much about implementation of this package or Cassette.jl it depends on (or Mixtape.jl). Should some issue be opened at Cassette.jl? Is that getting to be a redundant package?

Should this issue be kept open or not planned to be fixed? I'm not rushing anyone, I was just exploring and opened the issue, I'm not (really) a user. Possibly the status should be documented.

PallHaraldsson avatar Oct 28 '22 18:10 PallHaraldsson

This whole package should be documented as an experiment and largely as a deadend. The idea is fine but the ability to use it in practice is severely limited

oxinabox avatar Dec 08 '22 20:12 oxinabox

The same here:

julia> using Distributions, AutoPreallocation

julia> L = rand(2, 2);

julia> S = 2I + L * L';

julia> d = MvNormal(rand(2), S);

julia> x = rand(2);

julia> logpdf(d, x)
-2.8535848089575633

julia> @btime logpdf($d, $x)
  118.435 ns (3 allocations: 176 bytes)
-2.8535848089575633

julia> _, pf = preallocate(logpdf, d, x)
(-2.8535848089575633, preallocate(logpdf, ::FullNormal, ::Vector{Float64}))

julia> @btime $pf($d, $x)
  266.003 ns (9 allocations: 272 bytes)
-2.8535848089575633

bvdmitri avatar Jul 31 '23 20:07 bvdmitri