DSP.jl
Performance regression in conv for images
From slack: I recently noticed a performance regression in conv
using Images, DSP
# before pulling master
img = randn(1000,1000)
kernel = Kernel.LoG(4)
@btime DSP.conv($img, $(kernel)) # 42.050 ms (861 allocations: 61.68 MiB)
@btime imfilter!($storage, $img, $(kernel)) # 56.002 ms (967 allocations: 101.53 MiB)
After pulling DSP master to make use of the new algorithm, I noticed that it is actually way slower:
@btime DSP.conv($img, $(kernel)) # 73.747 ms (460183 allocations: 47.83 MiB)
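For reference, here is a plain-Julia direct 2D convolution with the same output-size convention as DSP.conv (a sketch to pin down the semantics being benchmarked, not DSP.jl's implementation; conv2_direct is a made-up name):

```julia
# Direct ("full") 2D convolution: an (m,n) image convolved with a (p,q)
# kernel yields an (m+p-1, n+q-1) output, matching DSP.conv's convention.
function conv2_direct(a::AbstractMatrix, k::AbstractMatrix)
    m, n = size(a)
    p, q = size(k)
    out = zeros(promote_type(eltype(a), eltype(k)), m + p - 1, n + q - 1)
    for j in 1:n, i in 1:m, jj in 1:q, ii in 1:p
        out[i + ii - 1, j + jj - 1] += a[i, j] * k[ii, jj]
    end
    return out
end
```

This O(m·n·p·q) quadruple loop is what any time-domain branch ultimately has to beat.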
Kernel.LoG(4) returns a 35x35 OffsetArray. @galenlynch is that near the threshold where conv switches between time-domain and overlap-save convolution? (Also I'd be curious whether the OffsetArray matters, or if you get the same with rand(35,35).)
Today's timings are the following; kernel.parent extracts the array underlying the OffsetArray:
kernel 96.269 ms (472334 allocations: 48.01 MiB)
kernel.parent 102.519 ms (472762 allocations: 48.02 MiB)
For Kernel.LoG(2), which yields a 17x17 matrix, I see a substantial speedup, but not for LoG(3), which yields 27x27; I believe the step occurs at 25-26. Actually, I get this slightly confusing behaviour.
Uh oh, I'll have to tweak the decision boundary between overlap-save and fast-fourier convolution.
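Roughly, the trade-off being tuned looks like the cost model below (hypothetical constants, not DSP.jl's actual decision rule; direct_cost and fft_cost are made-up names):

```julia
# Rough operation counts for choosing between direct and FFT-based
# convolution of an image with a kernel (sizes given as tuples).
# Direct: every output point costs prod(kernel size) multiply-adds.
# FFT: a few transforms of the padded output size, ~ c*N*log2(N) each.
direct_cost(imgsize, ksize) = prod(imgsize .+ ksize .- 1) * prod(ksize)
function fft_cost(imgsize, ksize; c = 3)
    n = prod(imgsize .+ ksize .- 1)
    return c * n * log2(n)
end

# A 35x35 kernel on a 1000x1000 image strongly favors the FFT route,
# while a 3x3 kernel favors the direct loop.
direct_cost((1000, 1000), (35, 35)) > fft_cost((1000, 1000), (35, 35))
```

Overlap-save sits between the two extremes, since it amortizes FFTs over blocks; a crossover around 25x26 kernels is where such cost curves intersect.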
I actually don't have the same pattern of results as @baggepinnen.
With master:
julia> using BenchmarkTools, Images, DSP
julia> img = randn(1000,1000);
julia> kernel = Kernel.LoG(4);
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 10.18 MiB
allocs estimate: 155
--------------
minimum time: 16.047 ms (0.00% GC)
median time: 17.332 ms (0.00% GC)
mean time: 18.028 ms (0.93% GC)
maximum time: 25.073 ms (0.00% GC)
--------------
samples: 278
evals/sample: 1
With the last release of DSP.jl (v0.6.2):
julia> using BenchmarkTools, Images, DSP
julia> img = randn(1000,1000);
julia> kernel = Kernel.LoG(4);
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 61.63 MiB
allocs estimate: 169
--------------
minimum time: 40.499 ms (1.22% GC)
median time: 42.498 ms (1.22% GC)
mean time: 42.732 ms (1.66% GC)
maximum time: 47.796 ms (2.30% GC)
--------------
samples: 117
evals/sample: 1
@ssfrr could you try it on your machine?
@baggepinnen the memory allocation estimate for your master version is exactly the same as for my v0.6.2... could you make sure you're using the most recent commit (f6c051d) of master?
Also 460183 allocations for your master is way too many. Something is wrong.
Here are the versions I used to run the benchmarks on master:
(dspdev) pkg> st
Status `~/.julia/environments/dspdev/Project.toml`
[6e4b80f9] BenchmarkTools v0.4.3
[717857b8] DSP v0.6.1 [`~/Documents/Code/DSP`]
[7a1cc6ca] FFTW v1.2.0
[916415d5] Images v0.20.1
[295af30f] Revise v2.5.0
And:
$ cd ~/Documents/Code/DSP/
$ git rev-parse HEAD
f6c051d562dd7422c52f9628fccf391c6413723b
The tip of master is now 92f4b2920196b54abde6fda7db97578c39e2f4b7
does it make a difference if one or both of the arrays are OffsetArrays?
The benchmarks that I posted were offset arrays, and the graph used regular arrays.
julia> using BenchmarkTools, Images, DSP
julia> img = randn(1000,1000);
julia> kernel = Kernel.LoG(4);
julia> non_offset_kernel = kernel.parent;
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 10.18 MiB
allocs estimate: 155
--------------
minimum time: 16.183 ms (0.00% GC)
median time: 16.463 ms (0.00% GC)
mean time: 16.940 ms (0.56% GC)
maximum time: 25.176 ms (0.00% GC)
--------------
samples: 295
evals/sample: 1
julia> @benchmark DSP.conv($img, $non_offset_kernel)
BenchmarkTools.Trial:
memory estimate: 10.18 MiB
allocs estimate: 152
--------------
minimum time: 16.014 ms (0.00% GC)
median time: 16.227 ms (0.00% GC)
mean time: 16.370 ms (0.59% GC)
maximum time: 19.397 ms (0.00% GC)
--------------
samples: 306
evals/sample: 1
EDIT: these benchmarks were on f6c051d
I recompiled Julia v1.4 from scratch, pulled the latest DSP master, and ran the benchmark again; results below:
@btime DSP.conv($img, $(kernel)) # 658.782 ms (529855 allocations: 62.47 MiB)
julia> VERSION
v"1.4.0-DEV.672"
nichols ~/.julia/dev/DSP> git log
commit 92f4b2920196b54abde6fda7db97578c39e2f4b7 (HEAD -> master, origin/master, cache/heads/master)
Can you post your Pkg status? I was testing on Julia 1.3.1.
I still can't replicate your issue with Julia nightlies (v1.5)
julia> using DSP, Images, BenchmarkTools
julia> img = randn(1000,1000);
julia> kernel = Kernel.LoG(4);
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 10.18 MiB
allocs estimate: 155
--------------
minimum time: 16.336 ms (0.00% GC)
median time: 16.706 ms (0.00% GC)
mean time: 16.967 ms (0.65% GC)
maximum time: 21.964 ms (0.00% GC)
--------------
samples: 295
evals/sample: 1
julia> VERSION
v"1.5.0-DEV.67"
(@v1.5) pkg> st
Status `~/.julia/environments/v1.5/Project.toml`
[6e4b80f9] BenchmarkTools v0.4.3
[717857b8] DSP v0.6.2 [`~/.julia/dev/DSP`]
[916415d5] Images v0.20.1
$ cd ~/.julia/dev/DSP
$ git rev-parse HEAD
92f4b2920196b54abde6fda7db97578c39e2f4b7
Your @btime output suggests type instability. If the issue is not with the package versions you're using, then maybe try the official binaries and see if the problem persists.
The reports I posted initially were produced on Julia v1.3.1; only the last one was on 1.4.
(@v1.4) pkg> st
Status `~/.julia/environments/v1.4/Project.toml`
[c52e3926] Atom v0.11.3
[6e4b80f9] BenchmarkTools v0.4.3
[717857b8] DSP v0.6.2 [`~/.julia/dev/DSP`]
[28b8d3ca] GR v0.44.0
[916415d5] Images v0.20.1
[e5e0dc1b] Juno v0.7.2
[6e84ca90] LengthChannels v0.1.0 [`~/.julia/dev/LengthChannels`]
[91a5bcdd] Plots v0.27.1
[438e738f] PyCall v1.91.2
[d330b81b] PyPlot v2.8.2
[8ce77f84] Soss v0.7.0
[90137ffa] StaticArrays v0.12.1
[dbf13d8f] ThreadTools v0.1.1 [`~/.julia/dev/ThreadTools`]
[8149f6b0] WAV v1.0.3
I tried to recreate your environment (on Julia 1.3.1), but still do not see your performance problems nor do I see signs of type instability.
(issue339) pkg> st
Status `~/.julia/environments/issue339/Project.toml`
[c52e3926] Atom v0.11.3
[6e4b80f9] BenchmarkTools v0.4.3
[717857b8] DSP v0.6.2 [`~/.julia/dev/DSP`]
[28b8d3ca] GR v0.44.0
[916415d5] Images v0.20.1
[e5e0dc1b] Juno v0.7.2
[6e84ca90] LengthChannels v0.1.0
[91a5bcdd] Plots v0.27.1
[438e738f] PyCall v1.91.2
[d330b81b] PyPlot v2.8.2
[8ce77f84] Soss v0.7.0
[90137ffa] StaticArrays v0.12.1
[dbf13d8f] ThreadTools v0.1.1
[8149f6b0] WAV v1.0.3
julia> using DSP, Images, BenchmarkTools
[ Info: Precompiling Images [916415d5-f1e6-5110-898d-aaa5f9f070e0]
julia> img = randn(1000,1000);
julia> kernel = Kernel.LoG(4);
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 10.18 MiB
allocs estimate: 155
--------------
minimum time: 16.037 ms (0.00% GC)
median time: 16.733 ms (0.00% GC)
mean time: 17.199 ms (0.64% GC)
maximum time: 22.798 ms (2.69% GC)
--------------
samples: 291
evals/sample: 1
julia> VERSION
v"1.3.1"
$ cd ~/.julia/dev/DSP
$ git rev-parse HEAD
92f4b2920196b54abde6fda7db97578c39e2f4b7
$ git status
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
I don't really know how to proceed since I haven't been able to replicate your problem. I will try on a second computer to see if there's something weird with my work station. If you're interested in chasing this issue down, then maybe we could see what's going on using Cthulhu to locate the type instability on your system.
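As a lighter-weight first check than Cthulhu, the stdlib's Base.return_types can already flag whether a call infers to a single concrete type (toy example; unstable and stable are made-up functions):

```julia
# A call is type-stable when inference narrows its return type to one
# concrete type; a Union or Any here is a red flag.
unstable(flag) = flag ? 1 : 1.0   # Int or Float64 depending on the value
stable(flag)  = flag ? 1.0 : 2.0  # always Float64

Base.return_types(unstable, (Bool,))  # [Union{Float64, Int64}]
Base.return_types(stable, (Bool,))    # [Float64]
```

@code_warntype on the concrete call does the same interactively, highlighting the offending variables.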
Hmm that is indeed strange. I have tried on both a laptop and an old desktop computer but see rather similar timings and the same memory footprint. The results below are on my desktop computer with julia v1.3.1 and both matrices in Float64
julia> using DSP, Images, BenchmarkTools
julia> img = randn(1000,1000);
julia> kernel = Kernel.LoG(4);
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 118.43 MiB
allocs estimate: 1309310
--------------
minimum time: 94.210 ms (0.00% GC)
median time: 128.897 ms (26.02% GC)
mean time: 123.987 ms (20.00% GC)
maximum time: 156.321 ms (24.46% GC)
--------------
samples: 41
evals/sample: 1
(v1.3) pkg> st
Status `~/.julia/environments/v1.3/Project.toml`
[c52e3926] Atom v0.11.3
[6e4b80f9] BenchmarkTools v0.4.3
[738e0a8b] BlobTracking v0.1.1 [`~/.julia/dev/BlobTracking`]
[3cb15238] ChangePrecision v1.0.0
[a6e380b2] ControlSystems v0.5.5 [`~/.julia/dev/ControlSystems`]
[717857b8] DSP v0.6.2 #master (https://github.com/JuliaDSP/DSP.jl.git)
[587475ba] Flux v0.10.1
[28b8d3ca] GR v0.44.0
[6a3955dd] ImageFiltering v0.6.9
[916415d5] Images v0.20.1
[e5e0dc1b] Juno v0.7.2
[eff96d63] Measurements v2.1.1
[0987c9cc] MonteCarloMeasurements v0.6.4 [`~/.julia/dev/MonteCarloMeasurements`]
[91a5bcdd] Plots v0.28.4
[476501e8] SLEEFPirates v0.1.3
[90137ffa] StaticArrays v0.12.1
julia> VERSION
v"1.3.1"
If I change to Float32 for both matrices, I get a result much closer to yours:
julia> @benchmark DSP.conv($img, $Float32.(kernel.parent))
BenchmarkTools.Trial:
memory estimate: 7.61 MiB
allocs estimate: 31744
--------------
minimum time: 14.720 ms (0.00% GC)
median time: 15.139 ms (0.00% GC)
mean time: 15.896 ms (1.29% GC)
maximum time: 38.615 ms (0.00% GC)
--------------
samples: 315
evals/sample: 1
Is there some place where views are used that gets inlined on your machine but not on mine? That could potentially explain the many allocations, since view allocations are mostly only elided when the function call is inlined.
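To illustrate the point about views, a sketch with toy functions (f and g are hypothetical; whether the SubArray allocation is actually elided depends on the Julia version and optimizer):

```julia
# Both functions compute the same sum over a view; @noinline blocks the
# second call from being inlined, which can force the SubArray wrapper
# to be heap-allocated instead of optimized away.
f(x) = sum(view(x, 1:10))
@noinline g(x) = sum(view(x, 1:10))

y = collect(1.0:100.0)
f(y); g(y)            # warm up so compilation is not measured
@allocated f(y)       # typically 0 when the view is elided
@allocated g(y)       # may report the SubArray allocation
```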
I just installed Julia on a (slower) second computer and got the same results:
(v1.3) pkg> st
Status `~/.julia/environments/v1.3/Project.toml`
[6e4b80f9] BenchmarkTools v0.4.3
[717857b8] DSP v0.6.2 #master (DSP)
[916415d5] Images v0.20.1
julia> using DSP, BenchmarkTools, Images
[ Info: Precompiling DSP [717857b8-e6f2-59f4-9121-6e50c889abd2]
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]
[ Info: Precompiling Images [916415d5-f1e6-5110-898d-aaa5f9f070e0]
julia> img = randn(1000,1000);
julia> kernel = Kernel.LoG(4);
julia> typeof(img)
Array{Float64,2}
julia> typeof(kernel)
OffsetArrays.OffsetArray{Float64,2,Array{Float64,2}}
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 10.18 MiB
allocs estimate: 155
--------------
minimum time: 19.372 ms (0.00% GC)
median time: 19.679 ms (0.00% GC)
mean time: 19.872 ms (0.66% GC)
maximum time: 24.839 ms (0.00% GC)
--------------
samples: 252
evals/sample: 1
julia> VERSION
v"1.3.1"
Maybe we could try with a fresh environment to rule out a few things. Could you make a new empty environment on the official Julia 1.3.1 binaries with just DSP master (by running ]add DSP#master; let's avoid dev), Images, and BenchmarkTools? After that, could you run the benchmark and post both the project and manifest files for that environment?
It would be quite disturbing if the inlining was significantly different between machines...
It's worth noting, I guess, that all of my tests have been on Linux. Are you using a different operating system?
I'm on Ubuntu on both of my machines; versioninfo(verbose=true) below. Here's a completely fresh environment with the official binary:
/tmp/julia-1.3.1> bin/julia
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.3.1 (2019-12-30)
_/ |\__'_|_|_|\__'_| | Official https://julialang.org/ release
|__/ |
(v1.3) pkg> activate /tmp
Activating new environment at `/tmp/Project.toml`
(tmp) pkg> add DSP#master
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Updating git-repo `https://github.com/JuliaDSP/DSP.jl.git`
Resolving package versions...
Installed SpecialFunctions ─ v0.10.0
Updating `/tmp/Project.toml`
[717857b8] + DSP v0.6.2 #master (https://github.com/JuliaDSP/DSP.jl.git)
Updating `/tmp/Manifest.toml`
[621f4979] + AbstractFFTs v0.5.0
[717857b8] + DSP v0.6.2 #master (https://github.com/JuliaDSP/DSP.jl.git)
[7a1cc6ca] + FFTW v1.2.0
[f5851436] + FFTW_jll v3.3.9+3
[1d5cc7b8] + IntelOpenMP_jll v2018.0.3+0
[c8e1da08] + IterTools v1.3.0
[856f044c] + MKL_jll v2019.0.117+2
[efe28fd5] + OpenSpecFun_jll v0.5.3+1
[f27b6e38] + Polynomials v0.6.0
[3cdcf5f2] + RecipesBase v0.7.0
[189a3867] + Reexport v0.2.0
[276daf66] + SpecialFunctions v0.10.0
[2a0f44e3] + Base64
[ade2ca70] + Dates
[8ba89e20] + Distributed
[b77e0a4c] + InteractiveUtils
[76f85450] + LibGit2
[8f399da3] + Libdl
[37e2e46d] + LinearAlgebra
[56ddb016] + Logging
[d6f4376e] + Markdown
[44cfe95a] + Pkg
[de0858da] + Printf
[3fa0cd96] + REPL
[9a3f8284] + Random
[ea8e919c] + SHA
[9e88b42a] + Serialization
[6462fe0b] + Sockets
[2f01184e] + SparseArrays
[10745b16] + Statistics
[8dfed614] + Test
[cf7118a7] + UUIDs
[4ec0a83e] + Unicode
(tmp) pkg> add BenchmarkTools
Resolving package versions...
Updating `/tmp/Project.toml`
[6e4b80f9] + BenchmarkTools v0.4.3
Updating `/tmp/Manifest.toml`
[6e4b80f9] + BenchmarkTools v0.4.3
[682c06a0] + JSON v0.21.0
[69de0a69] + Parsers v0.3.10
[a63ad114] + Mmap
julia> using DSP, BenchmarkTools
[ Info: Precompiling DSP [717857b8-e6f2-59f4-9121-6e50c889abd2]
julia> img = randn(1000,1000);
julia> kernel = randn(35,35);
julia> typeof(img)
Array{Float64,2}
julia> typeof(kernel)
Array{Float64,2}
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 118.44 MiB
allocs estimate: 1309744
--------------
minimum time: 88.884 ms (0.00% GC)
median time: 152.335 ms (22.79% GC)
mean time: 229.554 ms (12.58% GC)
maximum time: 554.478 ms (9.91% GC)
--------------
samples: 22
evals/sample: 1
julia> versioninfo(verbose=true)
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
Ubuntu 18.04.1 LTS
uname: Linux 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64
CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz:
speed user nice sys idle irq
#1 2442 MHz 413888 s 38233 s 84143 s 3291817 s 0 s
#2 2329 MHz 402000 s 31602 s 84474 s 1964856 s 0 s
#3 2335 MHz 309327 s 11461 s 106453 s 2008096 s 0 s
#4 2477 MHz 356923 s 36650 s 85812 s 1956186 s 0 s
Memory: 15.564483642578125 GB (502.1171875 MB free)
Uptime: 113146.0 sec
Load Avg: 0.5146484375 0.63671875 0.728515625
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, sandybridge)
Environment:
JULIA_EDITOR = atom
JULIA_NUM_THREADS = 4
MANDATORY_PATH = /usr/share/gconf/ubuntu.mandatory.path
DEFAULTS_PATH = /usr/share/gconf/ubuntu.default.path
HOME = /home/fredrikb
WINDOWPATH = 2
TERM = xterm-256color
PATH = /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
Found it =) I usually start Julia with 4 threads. If I change to 1 thread, I get your results:
/tmp/julia-1.3.1> export JULIA_NUM_THREADS=1
/tmp/julia-1.3.1> bin/julia
_
_ _ _(_)_ | Documentation: https://docs.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Type "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Version 1.3.1 (2019-12-30)
_/ |\__'_|_|_|\__'_| | Official https://julialang.org/ release
|__/ |
julia> using DSP, BenchmarkTools
julia> img = randn(1000,1000);
julia> kernel = randn(35,35);
julia> typeof(img)
Array{Float64,2}
julia> typeof(kernel)
Array{Float64,2}
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 10.50 MiB
allocs estimate: 3722
--------------
minimum time: 24.705 ms (0.00% GC)
median time: 24.869 ms (0.00% GC)
mean time: 24.967 ms (0.37% GC)
maximum time: 25.530 ms (1.67% GC)
--------------
samples: 201
evals/sample: 1
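For completeness, the thread count a session was started with can be checked directly (it is controlled by the JULIA_NUM_THREADS environment variable):

```julia
# Number of Julia threads available to this session.
using Base.Threads
nthreads()
```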
Weird, 2 threads seems to be okay and even slightly faster; 4 threads causes a massive slowdown though.
????????
Pretty sure this code doesn't explicitly use threading, and I thought FFTW and BLAS threads were handled separately from Julia threads? So weird.
It seems FFTW.jl uses Julia threading now: https://github.com/JuliaMath/FFTW.jl/blob/20b5373d59cf4ce57442b3ac86258784700d6796/src/FFTW.jl#L29
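If the regression really does come from FFTW planning and executing on all Julia threads, one possible workaround to test (assuming the FFTW.jl package is available; untested here) would be to cap FFTW's own thread count:

```julia
using FFTW
FFTW.set_num_threads(1)  # restrict FFTW's transforms to a single thread
```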