DSP.jl

Performance regression in conv for images

Open baggepinnen opened this issue 4 years ago • 42 comments

From Slack: I recently noticed a performance regression in conv:

using Images, DSP

# before pulling master
img = randn(1000,1000)
kernel = Kernel.LoG(4)
@btime DSP.conv($img, $(kernel)) # 42.050 ms (861 allocations: 61.68 MiB)
@btime imfilter!($storage, $img, $(kernel)) # 56.002 ms (967 allocations: 101.53 MiB)

After pulling DSP master to make use of the new algorithm I noticed that it is actually way slower

@btime DSP.conv($img, $(kernel)) # 73.747 ms (460183 allocations: 47.83 MiB)

baggepinnen avatar Jan 14 '20 14:01 baggepinnen

Kernel.LoG(4) returns a 35x35 OffsetArray. @galenlynch is that near the threshold where conv switches between time-domain and overlap-save convolution? (Also I'd be curious whether the OffsetArray matters or if you get the same with rand(35,35))

ssfrr avatar Jan 14 '20 14:01 ssfrr
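A minimal way to check both questions (a sketch; the 35x35 size is taken from the comment above):

using BenchmarkTools, DSP, Images

img = randn(1000, 1000)
kernel = Kernel.LoG(4)           # 35x35 OffsetArray
plain = randn(35, 35)            # plain Matrix of the same size

@btime DSP.conv($img, $kernel)   # OffsetArray kernel
@btime DSP.conv($img, $plain)    # plain-array kernel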

Today's timings are the following; kernel.parent extracts the array underlying the OffsetArray:

kernel  96.269 ms (472334 allocations: 48.01 MiB)
kernel.parent  102.519 ms (472762 allocations: 48.02 MiB)

baggepinnen avatar Jan 14 '20 15:01 baggepinnen

For Kernel.LoG(2), which yields a 17x17 matrix, I see a substantial speedup, but not for LoG(3), which yields 27x27. I believe the step occurs at a kernel size of 25-26.

baggepinnen avatar Jan 14 '20 15:01 baggepinnen
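A rough sweep over kernel sizes can locate where that step happens (a sketch; the size range is an assumption based on the sizes mentioned above):

using BenchmarkTools, DSP

img = randn(1000, 1000)
for n in 17:2:35                       # odd kernel sizes spanning LoG(2) to LoG(4)
    k = randn(n, n)
    t = @belapsed DSP.conv($img, $k)
    println(n, "x", n, ": ", round(1000t, digits = 2), " ms")
end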

[benchmark plot omitted]

Actually, I get this slightly confusing behaviour.

baggepinnen avatar Jan 14 '20 16:01 baggepinnen

Uh oh, I'll have to tweak the decision boundary between overlap-save and fast Fourier convolution.

galenlynch avatar Jan 14 '20 17:01 galenlynch
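For intuition on why such a boundary exists, here is a very rough 1-D cost model (illustrative only, not DSP.jl's actual heuristic): overlap-save pays roughly log2(L) FFT work per output sample but wastes K - 1 samples of padding per block of length L, while a single padded FFT pays roughly log2(N + K - 1) per sample.

# Very rough per-output-sample FFT work, in arbitrary units (illustrative only).
os_cost(L, K) = log2(L) * L / (L - K + 1)   # overlap-save with FFT block length L
fft_cost(N, K) = log2(N + K - 1)            # one big padded FFT over an N-sample signal

# Example for a 1000-sample signal; the block length 8K is an arbitrary choice.
for K in (17, 27, 35)
    println("K = ", K, ": overlap-save ~ ", round(os_cost(8K, K), digits = 2),
            ", full FFT ~ ", round(fft_cost(1000, K), digits = 2))
end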

I actually don't have the same pattern of results as @baggepinnen.

With master:

julia> using BenchmarkTools, Images, DSP
julia> img = randn(1000,1000);
julia> kernel = Kernel.LoG(4);
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  10.18 MiB
  allocs estimate:  155
  --------------
  minimum time:     16.047 ms (0.00% GC)
  median time:      17.332 ms (0.00% GC)
  mean time:        18.028 ms (0.93% GC)
  maximum time:     25.073 ms (0.00% GC)
  --------------
  samples:          278
  evals/sample:     1

With the last release of DSP.jl (v0.6.2):

julia> using BenchmarkTools, Images, DSP
julia> img = randn(1000,1000);
julia> kernel = Kernel.LoG(4);
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  61.63 MiB
  allocs estimate:  169
  --------------
  minimum time:     40.499 ms (1.22% GC)
  median time:      42.498 ms (1.22% GC)
  mean time:        42.732 ms (1.66% GC)
  maximum time:     47.796 ms (2.30% GC)
  --------------
  samples:          117
  evals/sample:     1

@ssfrr could you try it on your machine?

galenlynch avatar Jan 14 '20 17:01 galenlynch

@baggepinnen the memory allocation estimate for your master version is exactly the same as for my version 0.6.2... could you make sure you're using the most recent commit (f6c051d) of master?

galenlynch avatar Jan 14 '20 17:01 galenlynch

Also 460183 allocations for your master is way too many. Something is wrong.

galenlynch avatar Jan 14 '20 17:01 galenlynch

Here are the versions I used to run the benchmarks on master:

(dspdev) pkg> st
    Status `~/.julia/environments/dspdev/Project.toml`
  [6e4b80f9] BenchmarkTools v0.4.3
  [717857b8] DSP v0.6.1 [`~/Documents/Code/DSP`]
  [7a1cc6ca] FFTW v1.2.0
  [916415d5] Images v0.20.1
  [295af30f] Revise v2.5.0

And:

$ cd ~/Documents/Code/DSP/
$ git rev-parse HEAD
f6c051d562dd7422c52f9628fccf391c6413723b

galenlynch avatar Jan 14 '20 17:01 galenlynch

The tip of master is now 92f4b2920196b54abde6fda7db97578c39e2f4b7

galenlynch avatar Jan 14 '20 18:01 galenlynch

[plot omitted: image conv benchmarks on master]

galenlynch avatar Jan 14 '20 19:01 galenlynch

does it make a difference if one or both of the arrays are OffsetArrays?

ssfrr avatar Jan 14 '20 19:01 ssfrr

The benchmarks that I posted were offset arrays, and the graph used regular arrays.

galenlynch avatar Jan 14 '20 19:01 galenlynch

julia> using BenchmarkTools, Images, DSP
julia> img = randn(1000,1000);
julia> kernel = Kernel.LoG(4);
julia> non_offset_kernel = kernel.parent;
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  10.18 MiB
  allocs estimate:  155
  --------------
  minimum time:     16.183 ms (0.00% GC)
  median time:      16.463 ms (0.00% GC)
  mean time:        16.940 ms (0.56% GC)
  maximum time:     25.176 ms (0.00% GC)
  --------------
  samples:          295
  evals/sample:     1

julia> @benchmark DSP.conv($img, $non_offset_kernel)
BenchmarkTools.Trial: 
  memory estimate:  10.18 MiB
  allocs estimate:  152
  --------------
  minimum time:     16.014 ms (0.00% GC)
  median time:      16.227 ms (0.00% GC)
  mean time:        16.370 ms (0.59% GC)
  maximum time:     19.397 ms (0.00% GC)
  --------------
  samples:          306
  evals/sample:     1

EDIT: these benchmarks were on f6c051d

galenlynch avatar Jan 14 '20 20:01 galenlynch

I recompiled Julia v1.4 from scratch, pulled the latest DSP master, and ran the benchmark again; results below:

@btime DSP.conv($img, $(kernel)) # 658.782 ms (529855 allocations: 62.47 MiB)
julia> VERSION
v"1.4.0-DEV.672"

nichols ~/.julia/dev/DSP> git log
commit 92f4b2920196b54abde6fda7db97578c39e2f4b7 (HEAD -> master, origin/master, cache/heads/master)

baggepinnen avatar Jan 15 '20 05:01 baggepinnen

Can you post your Pkg status? I was testing on Julia 1.3.1.

galenlynch avatar Jan 15 '20 12:01 galenlynch

I still can't replicate your issue with Julia nightlies (v1.5)

julia> using DSP, Images, BenchmarkTools

julia> img = randn(1000,1000);

julia> kernel = Kernel.LoG(4);

julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  10.18 MiB
  allocs estimate:  155
  --------------
  minimum time:     16.336 ms (0.00% GC)
  median time:      16.706 ms (0.00% GC)
  mean time:        16.967 ms (0.65% GC)
  maximum time:     21.964 ms (0.00% GC)
  --------------
  samples:          295
  evals/sample:     1

julia> VERSION
v"1.5.0-DEV.67"

(@v1.5) pkg> st
Status `~/.julia/environments/v1.5/Project.toml`
  [6e4b80f9] BenchmarkTools v0.4.3
  [717857b8] DSP v0.6.2 [`~/.julia/dev/DSP`]
  [916415d5] Images v0.20.1
$ cd ~/.julia/dev/DSP
$ git rev-parse HEAD
92f4b2920196b54abde6fda7db97578c39e2f4b7

Your @btime output suggests type instability. If the issue is not with the package versions that you're using, then maybe try using the official binaries and seeing if the problem persists.

galenlynch avatar Jan 15 '20 16:01 galenlynch
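One quick way to look for that on the affected machine (a sketch; @code_warntype only inspects the outermost call, so it may not catch instability deeper in the call tree):

using DSP, Images, InteractiveUtils

img = randn(1000, 1000)
kernel = Kernel.LoG(4)
@code_warntype DSP.conv(img, kernel)   # Any-typed or red-highlighted entries would suggest instability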

The reports I posted initially were produced on Julia v1.3.1; only the last one was on 1.4.

(@v1.4) pkg> st
Status `~/.julia/environments/v1.4/Project.toml`
  [c52e3926] Atom v0.11.3
  [6e4b80f9] BenchmarkTools v0.4.3
  [717857b8] DSP v0.6.2 [`~/.julia/dev/DSP`]
  [28b8d3ca] GR v0.44.0
  [916415d5] Images v0.20.1
  [e5e0dc1b] Juno v0.7.2
  [6e84ca90] LengthChannels v0.1.0 [`~/.julia/dev/LengthChannels`]
  [91a5bcdd] Plots v0.27.1
  [438e738f] PyCall v1.91.2
  [d330b81b] PyPlot v2.8.2
  [8ce77f84] Soss v0.7.0
  [90137ffa] StaticArrays v0.12.1
  [dbf13d8f] ThreadTools v0.1.1 [`~/.julia/dev/ThreadTools`]
  [8149f6b0] WAV v1.0.3

baggepinnen avatar Jan 16 '20 13:01 baggepinnen

I tried to recreate your environment (on Julia 1.3.1), but still do not see your performance problems nor do I see signs of type instability.

(issue339) pkg> st
    Status `~/.julia/environments/issue339/Project.toml`
  [c52e3926] Atom v0.11.3
  [6e4b80f9] BenchmarkTools v0.4.3
  [717857b8] DSP v0.6.2 [`~/.julia/dev/DSP`]
  [28b8d3ca] GR v0.44.0
  [916415d5] Images v0.20.1
  [e5e0dc1b] Juno v0.7.2
  [6e84ca90] LengthChannels v0.1.0
  [91a5bcdd] Plots v0.27.1
  [438e738f] PyCall v1.91.2
  [d330b81b] PyPlot v2.8.2
  [8ce77f84] Soss v0.7.0
  [90137ffa] StaticArrays v0.12.1
  [dbf13d8f] ThreadTools v0.1.1
  [8149f6b0] WAV v1.0.3

julia> using DSP, Images, BenchmarkTools
[ Info: Precompiling Images [916415d5-f1e6-5110-898d-aaa5f9f070e0]

julia> img = randn(1000,1000);

julia> kernel = Kernel.LoG(4);

julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  10.18 MiB
  allocs estimate:  155
  --------------
  minimum time:     16.037 ms (0.00% GC)
  median time:      16.733 ms (0.00% GC)
  mean time:        17.199 ms (0.64% GC)
  maximum time:     22.798 ms (2.69% GC)
  --------------
  samples:          291
  evals/sample:     1

julia> VERSION
v"1.3.1"
$ cd ~/.julia/dev/DSP
$ git rev-parse HEAD
92f4b2920196b54abde6fda7db97578c39e2f4b7
$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean

I don't really know how to proceed since I haven't been able to replicate your problem. I will try on a second computer to see if there's something weird with my workstation. If you're interested in chasing this issue down, then maybe we could see what's going on using Cthulhu to locate the type instability on your system.

galenlynch avatar Jan 16 '20 15:01 galenlynch
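For reference, the Cthulhu workflow would look roughly like this (assuming Cthulhu.jl is installed):

using Cthulhu, DSP, Images

img = randn(1000, 1000)
kernel = Kernel.LoG(4)
@descend DSP.conv(img, kernel)   # interactively descend into callees and inspect inferred types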

Hmm, that is indeed strange. I have tried on both a laptop and an old desktop computer but see rather similar timings and the same memory footprint. The results below are on my desktop computer with Julia v1.3.1 and both matrices in Float64:


julia> using DSP, Images, BenchmarkTools

julia> img = randn(1000,1000);

julia> kernel = Kernel.LoG(4);

julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  118.43 MiB
  allocs estimate:  1309310
  --------------
  minimum time:     94.210 ms (0.00% GC)
  median time:      128.897 ms (26.02% GC)
  mean time:        123.987 ms (20.00% GC)
  maximum time:     156.321 ms (24.46% GC)
  --------------
  samples:          41
  evals/sample:     1

(v1.3) pkg> st
    Status `~/.julia/environments/v1.3/Project.toml`
  [c52e3926] Atom v0.11.3
  [6e4b80f9] BenchmarkTools v0.4.3
  [738e0a8b] BlobTracking v0.1.1 [`~/.julia/dev/BlobTracking`]
  [3cb15238] ChangePrecision v1.0.0
  [a6e380b2] ControlSystems v0.5.5 [`~/.julia/dev/ControlSystems`]
  [717857b8] DSP v0.6.2 #master (https://github.com/JuliaDSP/DSP.jl.git)
  [587475ba] Flux v0.10.1
  [28b8d3ca] GR v0.44.0
  [6a3955dd] ImageFiltering v0.6.9
  [916415d5] Images v0.20.1
  [e5e0dc1b] Juno v0.7.2
  [eff96d63] Measurements v2.1.1
  [0987c9cc] MonteCarloMeasurements v0.6.4 [`~/.julia/dev/MonteCarloMeasurements`]
  [91a5bcdd] Plots v0.28.4
  [476501e8] SLEEFPirates v0.1.3
  [90137ffa] StaticArrays v0.12.1

julia> VERSION
v"1.3.1"

If I change to Float32 for both matrices, I get a result much closer to yours:

julia> @benchmark DSP.conv($img, $Float32.(kernel.parent))
BenchmarkTools.Trial: 
  memory estimate:  7.61 MiB
  allocs estimate:  31744
  --------------
  minimum time:     14.720 ms (0.00% GC)
  median time:      15.139 ms (0.00% GC)
  mean time:        15.896 ms (1.29% GC)
  maximum time:     38.615 ms (0.00% GC)
  --------------
  samples:          315
  evals/sample:     1

baggepinnen avatar Jan 16 '20 16:01 baggepinnen

Is there some place where views are used that is inlined on your machine but not on mine? That could potentially explain the many allocations, since view allocations are mostly only elided if the function call is inlined.

baggepinnen avatar Jan 16 '20 16:01 baggepinnen
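One way to probe that hypothesis in isolation (a generic sketch; take_view_sum is a hypothetical helper, not DSP.jl code, and the measured allocation depends on whether the compiler elides the SubArray wrapper):

A = randn(1000, 1000)
take_view_sum(A) = sum(view(A, 2:999, 2:999))   # hypothetical helper wrapping a view
take_view_sum(A)                                # warm up / compile first
@allocated take_view_sum(A)                     # nonzero if the SubArray wrapper is heap-allocated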

I just installed Julia on a (slower) second computer and got the same results:

(v1.3) pkg> st
    Status `~/.julia/environments/v1.3/Project.toml`
  [6e4b80f9] BenchmarkTools v0.4.3
  [717857b8] DSP v0.6.2 #master (DSP)
  [916415d5] Images v0.20.1

julia> using DSP, BenchmarkTools, Images
[ Info: Precompiling DSP [717857b8-e6f2-59f4-9121-6e50c889abd2]
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]
[ Info: Precompiling Images [916415d5-f1e6-5110-898d-aaa5f9f070e0]

julia> img = randn(1000,1000);

julia> kernel = Kernel.LoG(4);

julia> typeof(img)
Array{Float64,2}

julia> typeof(kernel)
OffsetArrays.OffsetArray{Float64,2,Array{Float64,2}}

julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  10.18 MiB
  allocs estimate:  155
  --------------
  minimum time:     19.372 ms (0.00% GC)
  median time:      19.679 ms (0.00% GC)
  mean time:        19.872 ms (0.66% GC)
  maximum time:     24.839 ms (0.00% GC)
  --------------
  samples:          252
  evals/sample:     1

julia> VERSION
v"1.3.1"

Maybe we could try with a fresh environment to rule out a few things. Could you make a new empty environment on the official Julia 1.3.1 binaries with just DSP master (by running ]add DSP#master, let's avoid dev), Images, and BenchmarkTools? After that, could you run the benchmark, and post both the project and manifest files for that environment?

galenlynch avatar Jan 16 '20 16:01 galenlynch
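Equivalently, with the Pkg API and a throwaway directory (a sketch):

using Pkg
Pkg.activate(mktempdir())                           # fresh, empty environment
Pkg.add(PackageSpec(name = "DSP", rev = "master"))  # DSP#master
Pkg.add(["BenchmarkTools", "Images"])
Pkg.status()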

It would be quite disturbing if the inlining was significantly different between machines...

galenlynch avatar Jan 16 '20 16:01 galenlynch

It's worth noting, I guess, that all of my tests have been on Linux. Are you using a different operating system?

galenlynch avatar Jan 16 '20 16:01 galenlynch

I'm on Ubuntu on both of my machines; versioninfo(verbose=true) output is below. Here's a completely fresh environment with the official binary:

/tmp/julia-1.3.1> bin/julia 
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.3.1 (2019-12-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(v1.3) pkg> activate /tmp
Activating new environment at `/tmp/Project.toml`

(tmp) pkg> add DSP#master
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Updating git-repo `https://github.com/JuliaDSP/DSP.jl.git`
 Resolving package versions...
 Installed SpecialFunctions ─ v0.10.0
  Updating `/tmp/Project.toml`
  [717857b8] + DSP v0.6.2 #master (https://github.com/JuliaDSP/DSP.jl.git)
  Updating `/tmp/Manifest.toml`
  [621f4979] + AbstractFFTs v0.5.0
  [717857b8] + DSP v0.6.2 #master (https://github.com/JuliaDSP/DSP.jl.git)
  [7a1cc6ca] + FFTW v1.2.0
  [f5851436] + FFTW_jll v3.3.9+3
  [1d5cc7b8] + IntelOpenMP_jll v2018.0.3+0
  [c8e1da08] + IterTools v1.3.0
  [856f044c] + MKL_jll v2019.0.117+2
  [efe28fd5] + OpenSpecFun_jll v0.5.3+1
  [f27b6e38] + Polynomials v0.6.0
  [3cdcf5f2] + RecipesBase v0.7.0
  [189a3867] + Reexport v0.2.0
  [276daf66] + SpecialFunctions v0.10.0
  [2a0f44e3] + Base64 
  [ade2ca70] + Dates 
  [8ba89e20] + Distributed 
  [b77e0a4c] + InteractiveUtils 
  [76f85450] + LibGit2 
  [8f399da3] + Libdl 
  [37e2e46d] + LinearAlgebra 
  [56ddb016] + Logging 
  [d6f4376e] + Markdown 
  [44cfe95a] + Pkg 
  [de0858da] + Printf 
  [3fa0cd96] + REPL 
  [9a3f8284] + Random 
  [ea8e919c] + SHA 
  [9e88b42a] + Serialization 
  [6462fe0b] + Sockets 
  [2f01184e] + SparseArrays 
  [10745b16] + Statistics 
  [8dfed614] + Test 
  [cf7118a7] + UUIDs 
  [4ec0a83e] + Unicode 

(tmp) pkg> add BenchmarkTools
 Resolving package versions...
  Updating `/tmp/Project.toml`
  [6e4b80f9] + BenchmarkTools v0.4.3
  Updating `/tmp/Manifest.toml`
  [6e4b80f9] + BenchmarkTools v0.4.3
  [682c06a0] + JSON v0.21.0
  [69de0a69] + Parsers v0.3.10
  [a63ad114] + Mmap 

julia> using DSP, BenchmarkTools
[ Info: Precompiling DSP [717857b8-e6f2-59f4-9121-6e50c889abd2]

julia> img = randn(1000,1000);

julia> kernel = randn(35,35);

julia> typeof(img)
Array{Float64,2}

julia> typeof(kernel)
Array{Float64,2}

julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  118.44 MiB
  allocs estimate:  1309744
  --------------
  minimum time:     88.884 ms (0.00% GC)
  median time:      152.335 ms (22.79% GC)
  mean time:        229.554 ms (12.58% GC)
  maximum time:     554.478 ms (9.91% GC)
  --------------
  samples:          22
  evals/sample:     1

julia> versioninfo(verbose=true)
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      Ubuntu 18.04.1 LTS
  uname: Linux 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64
  CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz: 
              speed         user         nice          sys         idle          irq
       #1  2442 MHz     413888 s      38233 s      84143 s    3291817 s          0 s
       #2  2329 MHz     402000 s      31602 s      84474 s    1964856 s          0 s
       #3  2335 MHz     309327 s      11461 s     106453 s    2008096 s          0 s
       #4  2477 MHz     356923 s      36650 s      85812 s    1956186 s          0 s
       
  Memory: 15.564483642578125 GB (502.1171875 MB free)
  Uptime: 113146.0 sec
  Load Avg:  0.5146484375  0.63671875  0.728515625
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
Environment:
  JULIA_EDITOR = atom
  JULIA_NUM_THREADS = 4
  MANDATORY_PATH = /usr/share/gconf/ubuntu.mandatory.path
  DEFAULTS_PATH = /usr/share/gconf/ubuntu.default.path
  HOME = /home/fredrikb
  WINDOWPATH = 2
  TERM = xterm-256color
  PATH = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

baggepinnen avatar Jan 16 '20 16:01 baggepinnen

Found it =) I usually start Julia with 4 threads. If I change to 1 thread I get your results:

/tmp/julia-1.3.1> export JULIA_NUM_THREADS=1
/tmp/julia-1.3.1> bin/julia 
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.3.1 (2019-12-30)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using DSP, BenchmarkTools

julia> img = randn(1000,1000);

julia> kernel = randn(35,35);

julia> typeof(img)
Array{Float64,2}

julia> typeof(kernel)
Array{Float64,2}

julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial: 
  memory estimate:  10.50 MiB
  allocs estimate:  3722
  --------------
  minimum time:     24.705 ms (0.00% GC)
  median time:      24.869 ms (0.00% GC)
  mean time:        24.967 ms (0.37% GC)
  maximum time:     25.530 ms (1.67% GC)
  --------------
  samples:          201
  evals/sample:     1

baggepinnen avatar Jan 16 '20 16:01 baggepinnen

Weird, 2 threads seems to be okay and it gets slightly faster; 4 threads causes a massive slowdown though.

baggepinnen avatar Jan 16 '20 16:01 baggepinnen

????????

galenlynch avatar Jan 16 '20 16:01 galenlynch

Pretty sure this code doesn't explicitly use threading, and I thought FFTW and BLAS threads were handled separately from Julia threads? So weird.

galenlynch avatar Jan 16 '20 17:01 galenlynch

Seems FFTW.jl uses Julia threading now: https://github.com/JuliaMath/FFTW.jl/blob/20b5373d59cf4ce57442b3ac86258784700d6796/src/FFTW.jl#L29

baggepinnen avatar Jan 16 '20 17:01 baggepinnen
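A quick way to test the FFTW connection (whether pinning FFTW to one thread actually avoids the slowdown is an assumption to verify):

using FFTW
Threads.nthreads()        # number of Julia threads in this session
FFTW.set_num_threads(1)   # limit FFTW to a single thread, then re-run the conv benchmark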