More efficient hvcat of scalars and arrays of numbers
First attempt to address #39713
Original:
```julia
const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools
@btime [a c ; b d] # 31 allocations and 1.25 KiB
```
New:
```julia
@btime [a c ; b d] # 15 allocations and 656 bytes
```
Others unchanged, as expected. ~~Though if different types of numbers are mixed, it still takes the longer path. I tried expanding the definition but there's some weird stuff going on that increases allocations in the other situations I posted in that issue.~~ Works for any Number element type.
Fixes #39713
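For context (a sketch, not part of the PR diff): the bracket syntax used throughout these benchmarks is sugar for `Base.hvcat`, which is the function this PR optimizes. The equivalence can be checked directly:

```julia
# The rows-of-columns syntax lowers to Base.hvcat; (2, 2) gives the number
# of elements in each row. This is the call path being made faster here.
a, b, c, d = zeros(Int, 2, 2), [3 4], [2; 4], 5

m = Base.hvcat((2, 2), a, c, b, d)  # equivalent to [a c; b d]
@assert m == [a c; b d]
@assert size(m) == (3, 3)  # 2×2 block + column vector + row vector + scalar
```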
Oops, yep, completely broken. Should have guessed.
```julia
const a, b, c, d, e, f = zeros(Int, 2, 2), [3 4], [2 ; 4], 5, 5.0, "string"
using BenchmarkTools

e19() = [a c ; b d]
e20() = [a c ; b e]
e21() = [a c ; b f]
e22() = [a c ; b [d]]

@btime e19() # 12 alloc: 528 bytes ; 1.6.0-rc1: 31 alloc: 1.25 KiB
@btime e20() # 13 alloc: 544 bytes ; 1.6.0-rc1: 39 alloc: 1.38 KiB
@btime e21() # <broken>            ; 1.6.0-rc1: 31 alloc: 1.25 KiB
@btime e22() # 10 alloc: 512 bytes ; 1.6.0-rc1: 10 alloc: 512 bytes

const x = [1 2; 3 4;;; 5 6; 7 8] # cat([1 2; 3 4], [5 6; 7 8], dims=3)
const y = x .+ 1
e6() = [x y ; x y]
@btime e6() # 4 alloc: 672 bytes ; 1.6.0-rc1: 104 alloc: 4.98 KiB
```
This is what I achieved by consolidating with `hvncat` (once it's ready). Note it integrates #39725, which is currently broken for assigning strings into a range with `setindex!`.
I haven't had time to look at this, but in case you're interested and if it helps, I recently sped up `cat`: #39292, #39294, #39314
Cool! @timholy I think I noticed some of that effect without knowing the cause. Don't worry about reviewing this yet; it depends on my other PR #33697.
Updated this, now that the `hvncat` mechanism is merged:
```julia
julia> @btime e19()
  904.651 ns (8 allocations: 576 bytes)
3×3 Matrix{Int64}:
 0  0  2
 0  0  4
 3  4  5
# compare with 1.6.1: 2.611 μs (31 allocations: 1.25 KiB)
# master: 2.156 μs (30 allocations: 1.44 KiB)

julia> @btime e20()
  910.811 ns (9 allocations: 592 bytes)
3×3 Matrix{Float64}:
 0.0  0.0  2.0
 0.0  0.0  4.0
 3.0  4.0  5.0
# compare with 1.6.1: 2.811 μs (39 allocations: 1.38 KiB)
# master: 2.267 μs (38 allocations: 1.56 KiB)

julia> @btime e21()
  847.368 ns (5 allocations: 432 bytes)
3×3 Matrix{Any}:
 0  0  2
 0  0  4
 3  4   "string"
# compare with 1.6.1: 2.678 μs (31 allocations: 1.25 KiB)
# master: 2.167 μs (26 allocations: 1.25 KiB)

julia> @btime e22()
  873.333 ns (6 allocations: 528 bytes)
3×3 Matrix{Int64}:
 0  0  2
 0  0  4
 3  4  5
# compare with 1.6.1: 1.390 μs (10 allocations: 512 bytes)
# master: 1.240 μs (10 allocations: 512 bytes)

julia> @btime e6() # 4 alloc: 672 bytes
  515.104 ns (5 allocations: 640 bytes)
4×4×2 Array{Int64, 3}:
[:, :, 1] =
 1  2  2  3
 3  4  4  5
 1  2  2  3
 3  4  4  5

[:, :, 2] =
 5  6  6  7
 7  8  8  9
 5  6  6  7
 7  8  8  9
# compare with 1.6.1: 4.414 μs (56 allocations: 2.41 KiB)
# master: 3.337 μs (50 allocations: 2.39 KiB)

julia> @btime [1 2; 4 5]
  29.879 ns (1 allocation: 112 bytes)
2×2 Matrix{Int64}:
 1  2
 4  5
# compare with 1.6.1: 91.208 ns (2 allocations: 160 bytes)
# master: 93.194 ns (2 allocations: 160 bytes)

julia> @btime [1 2; 4 fill(1, 1, 1, 1)]
  877.551 ns (9 allocations: 624 bytes)
2×2 Matrix{Int64}:
 1  2
 4  1
# compare with 1.6.1: 4.188 μs (43 allocations: 1.62 KiB)
# master: 3.487 μs (42 allocations: 1.81 KiB)
```
With respect to the original motivating examples:
```julia
const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]     # 2.922 μs (31 allocations: 1.25 KiB); now 897.778 ns (8 allocations: 576 bytes)
@btime [a c ; [b d]]   # 2.322 μs (21 allocations: 880 bytes); now 1.690 μs (26 allocations: 1.30 KiB)
@btime [[a c] ; [b d]] # 975.000 ns (16 allocations: 816 bytes) -- explicit hcat nested within vcat (should be worst case)

# scalars wrapped in arrays
@btime [a c ; b [d]]     # 1.410 μs (10 allocations: 512 bytes); now 865.957 ns (6 allocations: 528 bytes)
@btime [a c ; [b [d]]]   # 1.230 μs (9 allocations: 560 bytes); now 816.250 ns (14 allocations: 1008 bytes)
@btime [[a c] ; [b [d]]] # 142.069 ns (4 allocations: 496 bytes) -- explicit hcat nested within vcat (should be worst case)
```
So better. Still some more improvements to make. Not entirely sure why the middle case is relatively poor.
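For reference, the three bracket forms in each group are semantically equivalent; they differ only in which concatenation path runs (a quick sanity check, using the same `a`, `b`, `c`, `d` as in the benchmarks):

```julia
a, b, c, d = zeros(Int, 2, 2), [3 4], [2; 4], 5

r1 = [a c; b d]      # hvcat of the raw pieces, scalar d included
r2 = [a c; [b d]]    # bottom row pre-hcat'ed into a 1×3 matrix
r3 = [[a c]; [b d]]  # explicit hcat nested within vcat
@assert r1 == r2 == r3
@assert r1 == [0 0 2; 0 0 4; 3 4 5]
```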
Checking in on what this is like on the current master:
```julia
const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]     # 1.350 μs (17 allocations: 944 bytes)
@btime [a c ; [b d]]   # 1.070 μs (8 allocations: 400 bytes)
@btime [[a c] ; [b d]] # 144.934 ns (3 allocations: 320 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]]     # 1.260 μs (10 allocations: 448 bytes)
@btime [a c ; [b [d]]]   # 1.070 μs (9 allocations: 464 bytes)
@btime [[a c] ; [b [d]]] # 155.417 ns (4 allocations: 384 bytes)
```
And with this PR:
```julia
const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]     # 703.497 ns (23 allocations: 1.22 KiB)
@btime [a c ; [b d]]   # 1.120 μs (19 allocations: 928 bytes)
@btime [[a c] ; [b d]] # 144.577 ns (3 allocations: 320 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]]     # 412.000 ns (5 allocations: 400 bytes)
@btime [a c ; [b [d]]]   # 1.120 μs (20 allocations: 992 bytes)
@btime [[a c] ; [b [d]]] # 155.168 ns (4 allocations: 384 bytes)
```
This is before additional needed work.
Current status:
```julia
const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]     # 703.497 ns (23 allocations: 1.22 KiB)
@btime [a c ; [b d]]   # 595.531 ns (12 allocations: 720 bytes)
@btime [[a c] ; [b d]] # 144.577 ns (3 allocations: 320 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]]     # 412.000 ns (5 allocations: 400 bytes)
@btime [a c ; [b [d]]]   # 603.409 ns (13 allocations: 784 bytes)
@btime [[a c] ; [b [d]]] # 155.168 ns (4 allocations: 384 bytes)
```
Current status of this PR vs. Master:
(Some timing differences from previous test likely due to my new computer)
This PR:
```julia
const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]     # 259.371 ns (22 allocations: 1.17 KiB) +++
@btime [a c ; [b d]]   # 1.612 μs (26 allocations: 1.12 KiB) ---
@btime [[a c] ; [b d]] # 497.052 ns (16 allocations: 720 bytes) ---

# scalars wrapped in arrays
@btime [a c ; b [d]]     # 189.981 ns (4 allocations: 352 bytes)
@btime [a c ; [b [d]]]   # 1.161 μs (14 allocations: 816 bytes) ---
@btime [[a c] ; [b [d]]] # 84.902 ns (4 allocations: 384 bytes)
```
Master:
```julia
const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]     # 628.402 ns (17 allocations: 944 bytes)
@btime [a c ; [b d]]   # 1.110 μs (21 allocations: 800 bytes)
@btime [[a c] ; [b d]] # 480.513 ns (16 allocations: 720 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]]     # 801.111 ns (10 allocations: 448 bytes)
@btime [a c ; [b [d]]]   # 665.161 ns (9 allocations: 464 bytes)
@btime [[a c] ; [b [d]]] # 79.321 ns (4 allocations: 384 bytes)
```
+++ = Improved since last test
--- = Regression since last test
So currently, master is actually faster than or equivalent to this PR for 4 of the 6 cases, which is great. I'll need to think about whether this PR is still necessary, and see what can improve the remaining cases.
Updating based on current master:
This PR:
```julia
const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]     # 2.473 μs (43 allocations: 2.03 KiB)
@btime [a c ; [b d]]   # 1.561 μs (33 allocations: 1.17 KiB)
@btime [[a c] ; [b d]] # 467.454 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]]     # 176.155 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]]   # 1.088 μs (22 allocations: 864 bytes)
@btime [[a c] ; [b [d]]] # 57.834 ns (8 allocations: 432 bytes)
```
Master:
```julia
const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]     # 564.249 ns (17 allocations: 816 bytes)
@btime [a c ; [b d]]   # 1.041 μs (23 allocations: 832 bytes)
@btime [[a c] ; [b d]] # 470.286 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]]     # 708.369 ns (12 allocations: 464 bytes)
@btime [a c ; [b [d]]]   # 585.099 ns (12 allocations: 496 bytes)
@btime [[a c] ; [b [d]]] # 59.694 ns (8 allocations: 432 bytes)
```
All cases on master are faster than they were, even with the additional allocations in some of them. I'll see whether this PR can be wrapped up one way or the other this weekend, I think.
New, simpler approach: just dispatching `hvcat` to the `hvncat_shape` methods. Back to being similar to or faster than master, but I need to verify I didn't slow down any other cases.
EDIT: updated timings after better addressing dynamic dispatch of `cat_similar`
```julia
const a, b, c, d = zeros(Int, 2, 2), [3 4], [2 ; 4], 5
using BenchmarkTools

# mixed arrays and scalars
@btime [a c ; b d]     # 150.041 ns (25 allocations: 1.19 KiB)
@btime [a c ; [b d]]   # 565.120 ns (31 allocations: 1.11 KiB)
@btime [[a c] ; [b d]] # 472.211 ns (19 allocations: 768 bytes)

# scalars wrapped in arrays
@btime [a c ; b [d]]     # 85.399 ns (8 allocations: 368 bytes)
@btime [a c ; [b [d]]]   # 148.468 ns (20 allocations: 800 bytes)
@btime [[a c] ; [b [d]]] # 58.048 ns (8 allocations: 432 bytes)
```
The remaining slowdown I'm seeing relates to the `hcat` methods. In 1.10, `hcat` for an array and a scalar dispatches to a SparseArrays method (even though I haven't loaded SparseArrays?), which then redirects to `typed_hcat` and then to `_cat_t(Val(2), T, X...)`, which is very fast. On master, the fallback abstractarray.jl `hcat` is called instead, which feeds into the slower `cat(A...; dims=Val(2))`.
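One way to confirm which method each case lands on is `which` from InteractiveUtils (a diagnostic sketch; the exact method shown depends on the Julia build and loaded packages):

```julia
using InteractiveUtils  # provides `which`

# Which method handles hcat(array, scalar)? On master this resolves to the
# abstractarray.jl fallback; on 1.10 a more specific method wins.
m = which(hcat, Tuple{Matrix{Int}, Int})
println(m)  # prints the method signature and its source location
@assert m isa Method
```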
Possibly related to #49322? Looks like a regression in `cat`; I haven't fully investigated:
```julia
# 1.10:
julia> @btime cat(b, d, dims = 2)
# 24.975 ns (1 allocation: 80 bytes)

# master:
julia> @btime cat(b, d, dims = 2)
# 554.371 ns (19 allocations: 576 bytes)
```
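For reference, the regressed call concatenates a matrix and a bare scalar along the second dimension; the result itself is unchanged between versions (a quick check):

```julia
b, d = [3 4], 5

# cat accepts a mix of arrays and scalars; dims = 2 appends d as a new column
r = cat(b, d, dims = 2)
@assert r == [3 4 5]
@assert size(r) == (1, 3)
```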
But we can probably ignore that for the purposes of this PR, which is about hvcat.
I wonder if this resolves the following issue:
```julia
using LinearAlgebra

function measure()
    data = rand(3, 3)
    mat = [
        data                  ones(size(data, 1))
        ones(size(data, 2))'  0
    ]
    det(mat)
end

@code_warntype measure() # -> can't infer the type of `mat`
```
There has been some discussion at https://julialang.zulipchat.com/#narrow/stream/225542-helpdesk/topic/Matrix.20comprehension.20failed.20type.20inference.
Seems like it does!
```julia
MethodInstance for measure()
  from measure() @ Main REPL[2]:1
Arguments
  #self#::Core.Const(measure)
Locals
  mat::Matrix{Float64}
  data::Matrix{Float64}
Body::Float64
1 ─ %1  = Main.rand::Core.Const(rand)
│         (data = (%1)(3, 3))
│   %3  = Core.tuple(2, 2)::Core.Const((2, 2))
│   %4  = data::Matrix{Float64}
│   %5  = Main.ones::Core.Const(ones)
│   %6  = Main.size::Core.Const(size)
│   %7  = data::Matrix{Float64}
│   %8  = (%6)(%7, 1)::Int64
│   %9  = (%5)(%8)::Vector{Float64}
│   %10 = Main.:var"'"::Core.Const(adjoint)
│   %11 = Main.ones::Core.Const(ones)
│   %12 = Main.size::Core.Const(size)
│   %13 = data::Matrix{Float64}
│   %14 = (%12)(%13, 2)::Int64
│   %15 = (%11)(%14)::Vector{Float64}
│   %16 = (%10)(%15)::Adjoint{Float64, Vector{Float64}}
│         (mat = Base.hvcat(%3, %4, %9, %16, 0))
│   %18 = Main.det::Core.Const(LinearAlgebra.det)
│   %19 = mat::Matrix{Float64}
│   %20 = (%18)(%19)::Float64
└── return %20
```
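A complementary check is `Test.@inferred`, which throws if the call's return type is not concretely inferred (a sketch; on builds where this `hvcat` call is still type-unstable it would error instead of returning):

```julia
using Test, LinearAlgebra

function measure()
    data = rand(3, 3)
    mat = [
        data                  ones(size(data, 1))
        ones(size(data, 2))'  0
    ]
    det(mat)
end

# Returns the value when the return type is inferred concretely,
# throws an ErrorException otherwise.
@inferred measure()
```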