Poor performance of `getkey` compared to docs benchmarks
I have rerun the first set of benchmarks from /docs/speed.jl comparing getindex to getkey. In many instances getkey performs worse (both relative to getindex and relative to the values in speed.jl comments). For getkey in many instances I saw substantially worse performance (> 1 order of magnitude) compared to the getindex calls on my machine, including some allocations where there seemed to be none for the results in the comments.
I have tested this on both 1.4.2 and 1.6.1. 1.6.1 gave me ~2x speedup over the 1.4.2 results, but the numbers were still far off of what's present in the comments for getkey. The environment I ran this in was just the AxisKeys project environment with the latest version of all dependencies + BenchmarkTools. Machine information is included at the end.
Here are my results of running the benchmarks. I included the original values, and then appended my results from @btime afterwards in parentheses.
using AxisKeys, BenchmarkTools # Julia 1.4.2
#==============================#
#===== getkey vs getindex =====#
#==============================#
mat = wrapdims(rand(3,4), 11:13, 21:24)
bothmat = wrapdims(mat.data, x=11:13, y=21:24)
bothmat2 = wrapdims(mat.data, x=collect(11:13), y=collect(21:24))
@btime $mat[3, 4] # 1.699 ns (1.699 ns)
@btime $mat(13, 24) # 5.312 ns (124.232 ns (1 allocation: 32 bytes))
@btime $bothmat[3,4] # 1.700 ns (1.649 ns)
@btime $bothmat[x=3, y=4] # 1.701 ns (1.965 ns)
@btime $bothmat(13, 24) # 5.874 ns (124.299 ns (1 allocation: 32 bytes))
@btime $bothmat(x=13, y=24) # 14.063 ns (132.624 ns (1 allocation: 32 bytes))
@btime $bothmat2(13, 24) # 16.719 ns (13.606 ns)
ind_collect(A) = [@inbounds(A[ijk...]) for ijk in Iterators.ProductIterator(axes(A))]
key_collect(A) = [@inbounds(A(vals...)) for vals in Iterators.ProductIterator(axiskeys(A))]
bigmat = wrapdims(rand(100,100), 1:100, 1:100);
bigmat2 = wrapdims(rand(100,100), collect(1:100), collect(1:100));
@btime ind_collect($(bigmat.data)); # 9.117 μs (4 allocations: 78.25 KiB) (12.646 μs (4 allocations: 78.25 KiB))
@btime ind_collect($bigmat); # 11.530 μs (4 allocations: 78.25 KiB) (9.996 μs (4 allocations: 78.25 KiB))
@btime key_collect($bigmat); # 64.064 μs (4 allocations: 78.27 KiB) (1.248 ms (10004 allocations: 390.77 KiB))
@btime key_collect($bigmat2); # 718.804 μs (5 allocations: 78.27 KiB) (625.003 μs (5 allocations: 78.27 KiB))
twomat = wrapdims(mat.data, x=[:a, :b, :c], y=21:24)
@btime $twomat(x=:a, y=24) # 36.734 ns (2 allocations: 64 bytes) (137.505 ns (2 allocations: 64 bytes))
@btime $twomat(24.0) # 26.686 ns (4 allocations: 112 bytes) (44.374 ns (5 allocations: 144 bytes))
@btime $twomat(y=24.0) # 33.860 ns (4 allocations: 112 bytes) (49.411 ns (6 allocations: 160 bytes))
@btime view($twomat, :,3) # 24.951 ns (4 allocations: 112 bytes) (22.302 ns (4 allocations: 112 bytes))
Here's the version of everything from my instantiation:
(AxisKeys) pkg> st
Project AxisKeys v0.1.16
Status `~/Documents/AxisKeys.jl/Project.toml`
[621f4979] AbstractFFTs v1.0.1
[587fd27a] CovarianceEstimation v0.2.6
[8197267c] IntervalSets v0.5.3
[41ab1584] InvertedIndices v1.0.0
[1fad7336] LazyStack v0.0.7
[356022a1] NamedDims v0.2.29
[6fe1bfb0] OffsetArrays v1.10.0
[2913bbd2] StatsBase v0.33.8
[bd369af6] Tables v1.4.3
[37e2e46d] LinearAlgebra
[10745b16] Statistics
I'm on a Thinkpad X1 Carbon 7th gen running Ubuntu 20.04. Here's the CPU/RAM info from lshw:
H/W path Device Class Description
===============================================================
system 20QDS3DQ00 (LENOVO_MT_20QD_BU_Think_FM_ThinkPad X1 Carbon
/0 bus 20QDS3DQ00
/0/2 memory 16GiB System Memory
/0/2/0 memory 8GiB Row of chips LPDDR3 Synchronous 2133 MHz (0.5 ns)
/0/2/1 memory 8GiB Row of chips LPDDR3 Synchronous 2133 MHz (0.5 ns)
/0/c memory 256KiB L1 cache
/0/d memory 1MiB L2 cache
/0/e memory 8MiB L3 cache
/0/f processor Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
/0/11 memory 128KiB BIOS
/0/100 bridge Coffee Lake HOST and DRAM Controller
/0/100/2 display UHD Graphics 620 (Whiskey Lake)
/0/100/4 generic Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal
/0/100/8 generic Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Pr
That's not good! Thanks for the careful investigation.
It looks like this is the key difference -- the lookup from a Vector is similar, but the lookup from a range 1:100 (without collect) has become very slow. I get something similar, locally, too:
@btime key_collect($bigmat); # 64.064 μs (4 allocations: 78.27 KiB) (1.248 ms (10004 allocations: 390.77 KiB))
@btime key_collect($bigmat2); # 718.804 μs (5 allocations: 78.27 KiB) (625.003 μs (5 allocations: 78.27 KiB))
I rolled back to 0.1.0 and tested that as well
@btime key_collect($bigmat); # 64.064 μs (4 allocations: 78.27 KiB) (1.264 ms (10004 allocations: 390.77 KiB))
So it seems this is an issue with some dependency rather than an AxisKeys issue :)
I'll make sure we avoid UnitRange indices for the time being, I don't think that'll cause any problems for us. Thanks!