SparseArrays.jl
Regression with arithmetic operations on sparse matrices
Possibly related to JuliaLang/julia#21370:
On 0.5.1:
julia> A = sprand(10,10,0.2);
julia> @benchmark 3.0*A
BenchmarkTools.Trial:
memory estimate: 672 bytes
allocs estimate: 4
--------------
minimum time: 139.775 ns (0.00% GC)
median time: 153.360 ns (0.00% GC)
mean time: 174.615 ns (9.75% GC)
maximum time: 1.132 μs (79.94% GC)
--------------
samples: 10000
evals/sample: 870
julia> @benchmark 3.0+A
BenchmarkTools.Trial:
memory estimate: 1.75 KiB
allocs estimate: 2
--------------
minimum time: 237.948 ns (0.00% GC)
median time: 267.440 ns (0.00% GC)
mean time: 303.627 ns (8.86% GC)
maximum time: 1.451 μs (66.51% GC)
--------------
samples: 10000
evals/sample: 464
julia> @benchmark A/3.0
BenchmarkTools.Trial:
memory estimate: 672 bytes
allocs estimate: 4
--------------
minimum time: 187.120 ns (0.00% GC)
median time: 197.874 ns (0.00% GC)
mean time: 220.143 ns (8.00% GC)
maximum time: 1.349 μs (81.50% GC)
--------------
samples: 10000
evals/sample: 723
On 0.6:
julia> A = sprand(10,10,0.2);
julia> @benchmark 3.0*A
BenchmarkTools.Trial:
memory estimate: 720 bytes
allocs estimate: 7
--------------
minimum time: 242.706 ns (0.00% GC)
median time: 261.936 ns (0.00% GC)
mean time: 296.619 ns (9.39% GC)
maximum time: 3.182 μs (80.41% GC)
--------------
samples: 10000
evals/sample: 469
julia> @benchmark 3.0+A
BenchmarkTools.Trial:
memory estimate: 2.02 KiB
allocs estimate: 7
--------------
minimum time: 618.483 ns (0.00% GC)
median time: 659.125 ns (0.00% GC)
mean time: 727.738 ns (7.16% GC)
maximum time: 5.575 μs (81.81% GC)
--------------
samples: 10000
evals/sample: 176
julia> @benchmark A/3.0
BenchmarkTools.Trial:
memory estimate: 720 bytes
allocs estimate: 7
--------------
minimum time: 281.785 ns (0.00% GC)
median time: 303.701 ns (0.00% GC)
mean time: 339.489 ns (8.10% GC)
maximum time: 4.909 μs (86.33% GC)
--------------
samples: 10000
evals/sample: 298
Julia Version 0.6.0-pre.beta.253
Commit 6e70552* (2017-04-22 13:14 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, sandybridge)
We have had several reports of sparse matrix regressions. Probably worth taking a good look at what is going on and expanding the sparse matrix benchmarks in test/perf.
test/perf is not especially useful; it very rarely gets run. BaseBenchmarks.jl is where we should add things so that nanosoldier tracks them.
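For reference, a minimal sketch of the kind of entries one could add (the group and key names below are made up for illustration; the actual BaseBenchmarks.jl layout differs), using BenchmarkTools' BenchmarkGroup and @benchmarkable:

using BenchmarkTools
using SparseArrays   # stdlib on current Julia; part of Base at the time of this issue

A = sprand(1000, 1000, 0.01)
g = BenchmarkGroup()
g["scalar-mul"] = @benchmarkable 3.0 * $A
g["scalar-div"] = @benchmarkable $A / 3.0
g["scalar-add"] = @benchmarkable 3.0 .+ $A   # dot form, since scalar + array is gone on 1.x
results = run(g)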
Results are similar when interpolating A into the benchmarks:
julia> versioninfo()
Julia Version 0.5.1
Commit 6445c82* (2017-03-05 13:25 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin15.6.0)
CPU: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)
julia> A = sprand(10, 10, 0.2);
julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
samples: 10000
evals/sample: 849
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 736.00 bytes
allocs estimate: 4
minimum time: 146.00 ns (0.00% GC)
median time: 166.00 ns (0.00% GC)
mean time: 215.95 ns (15.66% GC)
maximum time: 2.27 μs (88.80% GC)
julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
samples: 10000
evals/sample: 282
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 1.75 kb
allocs estimate: 2
minimum time: 290.00 ns (0.00% GC)
median time: 319.00 ns (0.00% GC)
mean time: 390.04 ns (10.10% GC)
maximum time: 4.27 μs (0.00% GC)
julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
samples: 10000
evals/sample: 685
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 736.00 bytes
allocs estimate: 4
minimum time: 186.00 ns (0.00% GC)
median time: 206.00 ns (0.00% GC)
mean time: 260.94 ns (13.07% GC)
maximum time: 2.91 μs (85.06% GC)
and
julia> versioninfo()
Julia Version 0.6.0-pre.beta.305
Commit f56147d* (2017-04-24 18:38 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin15.6.0)
CPU: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)
julia> A = sprand(10, 10, 0.2);
julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
memory estimate: 784 bytes
allocs estimate: 7
--------------
minimum time: 238.269 ns (0.00% GC)
median time: 258.608 ns (0.00% GC)
mean time: 329.116 ns (12.51% GC)
maximum time: 4.580 μs (86.26% GC)
--------------
samples: 10000
evals/sample: 438
julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
memory estimate: 2.02 KiB
allocs estimate: 7
--------------
minimum time: 644.115 ns (0.00% GC)
median time: 707.595 ns (0.00% GC)
mean time: 858.236 ns (8.50% GC)
maximum time: 8.213 μs (80.75% GC)
--------------
samples: 10000
evals/sample: 174
julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
memory estimate: 784 bytes
allocs estimate: 7
--------------
minimum time: 256.019 ns (0.00% GC)
median time: 281.244 ns (0.00% GC)
mean time: 355.654 ns (11.75% GC)
maximum time: 5.543 μs (89.03% GC)
--------------
samples: 10000
evals/sample: 362
Also note that discrepancies persist in the multiplication and division cases for more reasonably sized (though still small) sparse matrices. On 0.5.1:
julia> A = sprand(1000, 1000, 0.01);
julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 167.58 kb
allocs estimate: 6
minimum time: 16.95 μs (0.00% GC)
median time: 25.26 μs (0.00% GC)
mean time: 50.63 μs (29.92% GC)
maximum time: 3.43 ms (77.29% GC)
julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
samples: 795
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 15.26 mb
allocs estimate: 4
minimum time: 4.10 ms (0.00% GC)
median time: 6.38 ms (28.68% GC)
mean time: 6.28 ms (27.32% GC)
maximum time: 9.35 ms (42.09% GC)
julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 167.58 kb
allocs estimate: 6
minimum time: 28.98 μs (0.00% GC)
median time: 36.61 μs (0.00% GC)
mean time: 61.14 μs (22.96% GC)
maximum time: 2.96 ms (75.51% GC)
And on master:
julia> A = sprand(1000, 1000, 0.01);
julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
memory estimate: 167.75 KiB
allocs estimate: 9
--------------
minimum time: 28.489 μs (0.00% GC)
median time: 37.700 μs (0.00% GC)
mean time: 63.382 μs (23.70% GC)
maximum time: 3.182 ms (97.13% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
memory estimate: 15.27 MiB
allocs estimate: 9
--------------
minimum time: 3.022 ms (0.00% GC)
median time: 5.242 ms (33.51% GC)
mean time: 5.098 ms (31.28% GC)
maximum time: 76.732 ms (94.91% GC)
--------------
samples: 978
evals/sample: 1
julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
memory estimate: 167.75 KiB
allocs estimate: 9
--------------
minimum time: 40.635 μs (0.00% GC)
median time: 50.794 μs (0.00% GC)
mean time: 78.106 μs (19.19% GC)
maximum time: 3.396 ms (95.61% GC)
--------------
samples: 10000
evals/sample: 1
The overarching reason for these performance changes: in 0.5, arithmetic operations between sparse matrices and numbers translate to equivalent dot operations, e.g. 3.0 + A translates to 3.0 .+ A. Each such dot operation has a dedicated specialization, e.g. 3.0 .+ A is implemented as 3.0 .+ full(A). In 0.6, however, all arithmetic operations between sparse matrices and numbers translate to equivalent broadcast calls and thereby hit generic sparse broadcast. Consequently both semantics and performance have changed.
Note that the former specializations were extremely simple and not universally correct. For example, the specializations that the multiplication and division examples above hit just applied the corresponding operations to the sparse matrix's nonzero vector, yielding incorrect results in corner cases such as the scalar being Inf, NaN, 0, et cetera. Similarly, as noted above, the addition example formerly fell back to dense methods.
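A minimal sketch of one such corner case (the helper nzonly_scale is mine, not code from 0.5, and the using line reflects current Julia, where sparse arrays live in the SparseArrays stdlib): scaling only the stored values leaves structural zeros at zero, whereas IEEE arithmetic says Inf * 0.0 is NaN, so the elementwise-correct result differs.

using SparseArrays

nzonly_scale(s, A) = SparseMatrixCSC(A.m, A.n, copy(A.colptr), copy(A.rowval), s .* A.nzval)

A = sparse([1], [1], [2.0], 2, 2)   # one stored entry, three structural zeros
nnz(nzonly_scale(Inf, A))           # 1: the structural zeros silently stay 0.0
nnz(Inf .* A)                       # 4: the elementwise-correct result stores NaN at the former zeros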
Given the semantic changes, I'm not certain the performance comparison is particularly meaningful. (That said, some tweaks to generic sparse broadcast I have in mind might improve performance --- though by a couple tens of percent at best, I wager.) Best!
Couldn't we have something like this for multiplication and division:
julia> sp_mul_number(A,n) = SparseMatrixCSC(A.m,A.n,A.colptr,A.rowval,n*A.nzval)
sp_mul_number (generic function with 1 method)
julia> @benchmark 3.26*$A
BenchmarkTools.Trial:
memory estimate: 3.67 KiB
allocs estimate: 7
--------------
minimum time: 551.559 ns (0.00% GC)
median time: 614.414 ns (0.00% GC)
mean time: 685.501 ns (8.68% GC)
maximum time: 3.800 μs (68.65% GC)
--------------
samples: 10000
evals/sample: 186
julia> @benchmark sp_mul_number($A,3.26)
BenchmarkTools.Trial:
memory estimate: 48 bytes
allocs estimate: 1
--------------
minimum time: 51.266 ns (0.00% GC)
median time: 51.561 ns (0.00% GC)
mean time: 53.618 ns (2.20% GC)
maximum time: 776.661 ns (83.50% GC)
--------------
samples: 10000
evals/sample: 992
julia> isequal(3.26*A,sp_mul_number(A,3.26))
true
That definition is similar in spirit to the above-mentioned specializations existing in 0.5:
(.*)(A::SparseMatrixCSC, B::Number) = SparseMatrixCSC(A.m, A.n, copy(A.colptr), copy(A.rowval), A.nzval .* B)
(.*)(A::Number, B::SparseMatrixCSC) = SparseMatrixCSC(B.m, B.n, copy(B.colptr), copy(B.rowval), A .* B.nzval)
(./)(A::SparseMatrixCSC, B::Number) = SparseMatrixCSC(A.m, A.n, copy(A.colptr), copy(A.rowval), A.nzval ./ B)
(You need to copy the colptr and rowval arrays to avoid incorrect aliasing.) Unfortunately such definitions suffer from the semantic issues also mentioned above (e.g. failure in corner cases). Best!
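A small sketch of the aliasing pitfall those copy calls guard against (variable names here are just for illustration):

using SparseArrays

A = sprand(4, 4, 0.5)
B = SparseMatrixCSC(A.m, A.n, A.colptr, A.rowval, 3.0 .* A.nzval)   # no copies
B.rowval === A.rowval   # true: A and B share the very same index arrays
# Any later structural mutation of B (e.g. inserting a new stored entry via setindex!)
# would therefore also rewrite A's colptr/rowval and silently corrupt A.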
Right, sorry I hadn't understood your reply.
What would be the correct way to deal with the number being infinite or NaN (etc)? What is broadcast now doing?
I'm assuming that simply coding in the behaviour for these special cases would not work.
Right, sorry I hadn't understood your reply.
No worries! :)
What would be the correct way to deal with the number being infinite or NaN (etc)? What is broadcast now doing?
broadcast returns the result of performing the scalar arithmetic operation elementwise.
I'm assuming that simply coding in the behaviour for these special cases would not work.
Writing specializations of one form or another for each such operation is of course possible. But the net result would be a collection of ad hoc methods closely resembling both each other and sparse broadcast (avoiding which was part of the purpose of the changes between 0.5 and 0.6 relevant to this issue). And how much performance improvement those specializations might yield (while retaining the correct semantics) is not clear. Chances are effort would be better spent refining sparse broadcast.
For example, improving the mechanism that handles scalars in sparse broadcast could yield ~20% runtime reduction on these benchmarks:
julia> @benchmark broadcast(x -> 3.0 * x, $A)
BenchmarkTools.Trial:
memory estimate: 896 bytes
allocs estimate: 4
--------------
minimum time: 225.231 ns (0.00% GC)
median time: 239.024 ns (0.00% GC)
mean time: 297.230 ns (11.20% GC)
maximum time: 3.230 μs (84.41% GC)
--------------
samples: 10000
evals/sample: 455
julia> @benchmark broadcast(*, 3.0, $A)
BenchmarkTools.Trial:
memory estimate: 944 bytes
allocs estimate: 7
--------------
minimum time: 279.579 ns (0.00% GC)
median time: 295.338 ns (0.00% GC)
mean time: 374.913 ns (11.91% GC)
maximum time: 5.898 μs (85.46% GC)
--------------
samples: 10000
evals/sample: 318
julia> @benchmark broadcast(x -> x / 3.0, $A)
BenchmarkTools.Trial:
memory estimate: 896 bytes
allocs estimate: 4
--------------
minimum time: 263.210 ns (0.00% GC)
median time: 280.946 ns (0.00% GC)
mean time: 342.693 ns (9.78% GC)
maximum time: 3.941 μs (74.48% GC)
--------------
samples: 10000
evals/sample: 372
julia> @benchmark broadcast(/, $A, 3.0)
BenchmarkTools.Trial:
memory estimate: 944 bytes
allocs estimate: 7
--------------
minimum time: 300.787 ns (0.00% GC)
median time: 320.508 ns (0.00% GC)
mean time: 400.012 ns (10.91% GC)
maximum time: 6.291 μs (86.89% GC)
--------------
samples: 10000
evals/sample: 258
Best!
@Sacha0 Would it be possible to get back to 0.5 performance if structural zeros didn't follow IEEE in broadcast?
@Sacha0 how is scaling the vector of nonzero values different from applying the same operation elementwise?
If multiplying by zeros, all the nonzero values become 0 and the resulting sparse matrix is empty. If multiplying by infinity, nonzero values become infinite. The only case where I can see a difference would be for NaN, where the whole matrix should become filled with NaNs. Is that correct?
If multiplying by zeros, all the nonzero values become 0 and the resulting sparse matrix is empty.
Not quite: Inf and NaN entries become NaN.
The only case I can see a difference would be for NaN, where the whole matrix should become filled with NaNs.
Also when multiplying with Inf, zero becomes NaN, so the whole matrix will be filled with a mixture of Infs and NaNs.
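A small demonstration of the first point (my own example, shown densely for readability): a stored Inf times the scalar 0.0 becomes NaN under the elementwise semantics, while everything else stays zero.

using SparseArrays

A = sparse([1, 2], [1, 2], [2.0, Inf], 2, 2)   # stored entries 2.0 and Inf
Array(0.0 .* A)   # the stored Inf becomes NaN rather than 0, exactly as noted above
# 2×2 matrix:
#  0.0  0.0
#  0.0  NaN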
Would it be possible to get back to 0.5 performance if structural zeros didn't follow IEEE in broadcast?
Without also nixing the other semantic changes? My (approximately meaningless) guess is no: Outperforming [uniformly scaling a vector] while [doing more than uniformly scaling a vector] seems tricky :wink:.
It might be worth noting that, insofar as I am aware, these performance regressions are tightly confined, specifically to operations that have (over-)aggressive specializations in 0.5 (e.g. * and / between pairs of numbers and sparse matrices); outside such cases, 0.6's performance should meet or best 0.5's performance.
Additionally, using dot ops directly (rather than via * and /) in the benchmarks above for * and / closes most of the gap. To illustrate, on 0.5
julia> median(@benchmark 3.0 * $A)
BenchmarkTools.TrialEstimate:
time: 161.932 ns
gctime: 0.000 ns (0.00%)
memory: 640 bytes
allocs: 4
while on 0.6
julia> median(@benchmark 3.0 .* $A)
BenchmarkTools.TrialEstimate:
time: 189.452 ns
gctime: 0.000 ns (0.00%)
memory: 640 bytes
allocs: 4
julia> median(@benchmark 3.0 * $A)
BenchmarkTools.TrialEstimate:
time: 233.724 ns
gctime: 0.000 ns (0.00%)
memory: 688 bytes
allocs: 7
Furthermore, if you make the operation even marginally more complex, 0.6 should win. To illustrate, on 0.5
julia> median(@benchmark 3.0 .* $A .* 3.0)
BenchmarkTools.TrialEstimate:
time: 336.352 ns
gctime: 0.000 ns (0.00%)
memory: 1.25 KiB
allocs: 8
while on 0.6
julia> median(@benchmark 3.0 .* $A .* 3.0)
BenchmarkTools.TrialEstimate:
time: 192.012 ns
gctime: 0.000 ns (0.00%)
memory: 640 bytes
allocs: 4
And there still exist opportunities to whittle away at any such gaps without changing semantics. Best!
using dot ops directly (rather than via * and /) in the benchmarks above for * and / closes most of the gap
This is because of the literal. With a variable, it doesn't matter. I'm not sure why the two-argument broadcast with a scalar would be slower than the one-argument broadcast.
This is because of the literal.
Yes, exactly :). Sparse broadcast presently handles scalars by: (1) capturing scalars in a closure (rather, a series of nested closures) while collecting the non-scalar arguments; and (2) calling sparse broadcast to apply that closure to the collected non-scalar arguments. For reasons that remain somewhat mysterious to me, introducing such closures degrades performance somewhat. Inlining the numeric literal avoids going through the closure-generating code, so using dot ops achieves better performance as shown above. (The relevant code is here, and the ins and outs of incorporating scalars via this approach were discussed at length in the pull request and linked prior threads.)
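A very rough sketch of the scalar-capturing idea described above (the helper name is mine, not the actual Base internals):

using SparseArrays

capture_scalar(f, s::Number, A::SparseMatrixCSC) = broadcast(x -> f(s, x), A)

A = sprand(10, 10, 0.2)
capture_scalar(*, 3.0, A) == broadcast(*, 3.0, A)   # true: same elementwise result,
# but going through the closure-building path is what adds the overhead discussed here.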
If you could shed light on the closure-associated performance issues, I would much appreciate your insight!
With a variable, it doesn't matter.
Though not in the case of s .* A or s * A with s = 3.0, it does matter in even marginally more complex cases. For example, on 0.5
julia> @benchmark $s .* $A .* $s
BenchmarkTools.Trial:
memory estimate: 1.38 KiB
allocs estimate: 8
--------------
minimum time: 313.444 ns (0.00% GC)
median time: 339.560 ns (0.00% GC)
mean time: 429.646 ns (14.17% GC)
maximum time: 5.812 μs (89.75% GC)
--------------
samples: 10000
evals/sample: 252
julia> @benchmark $s * $A * $s
BenchmarkTools.Trial:
memory estimate: 1.38 KiB
allocs estimate: 8
--------------
minimum time: 319.040 ns (0.00% GC)
median time: 342.202 ns (0.00% GC)
mean time: 432.209 ns (14.11% GC)
maximum time: 6.319 μs (89.28% GC)
--------------
samples: 10000
evals/sample: 247
while on 0.6
julia> @benchmark $s .* $A .* $s
BenchmarkTools.Trial:
memory estimate: 672 bytes
allocs estimate: 7
--------------
minimum time: 266.410 ns (0.00% GC)
median time: 284.621 ns (0.00% GC)
mean time: 346.397 ns (10.20% GC)
maximum time: 5.074 μs (88.84% GC)
--------------
samples: 10000
evals/sample: 383
julia> @benchmark $s * $A * $s
BenchmarkTools.Trial:
memory estimate: 1.28 KiB
allocs estimate: 14
--------------
minimum time: 435.910 ns (0.00% GC)
median time: 467.676 ns (0.00% GC)
mean time: 576.297 ns (11.24% GC)
maximum time: 9.801 μs (80.05% GC)
--------------
samples: 10000
evals/sample: 199
I'm not sure why the two argument broadcast with a scalar would be slower than the one argument broadcast.
Were the closure mechanism above zero-cost, then the former and latter should be comparable (given the former wraps the scalar into a closure and then calls the latter). Any insight would be much appreciated :). Best!
I am assuming that this is related, otherwise I can open a new issue, but it seems like structural zeros are not preserved in some cases, for example in scale2! two below.
function scale1!(A,b)
    A .= A.*(1.0./b)
end
function scale2!(A,b)
    tmp = 1.0./b
    A .= A.*tmp
end
julia> A0 = sprandn(100,100000,0.001);
julia> b = randn(100);
julia> A = copy(A0);
julia> @time nnz(scale1!(A,b))
0.433009 seconds (225.66 k allocations: 167.108 MiB, 16.55% gc time)
10000000 #Note full matrix!!!
julia> A = copy(A0);
julia> @benchmark scale1!(A,b)
BenchmarkTools.Trial:
memory estimate: 6.44 KiB
allocs estimate: 11
--------------
minimum time: 55.652 ms (0.00% GC)
median time: 56.000 ms (0.00% GC)
mean time: 56.305 ms (0.00% GC)
maximum time: 61.173 ms (0.00% GC)
--------------
samples: 89
evals/sample: 1
julia> A = copy(A0);
julia> @time nnz(scale2!(A,b))
0.034621 seconds (15.75 k allocations: 729.528 KiB)
10028
julia> A = copy(A0);
julia> @benchmark scale2!(A,b)
BenchmarkTools.Trial:
memory estimate: 8.34 KiB
allocs estimate: 36
--------------
minimum time: 13.837 ms (0.00% GC)
median time: 13.986 ms (0.00% GC)
mean time: 14.674 ms (0.00% GC)
maximum time: 21.433 ms (0.00% GC)
--------------
samples: 341
evals/sample: 1
I was not able to understand why @benchmark reports good values when the code is slow in practice. The issue is not related to the assignment to A; a non-in-place version has the same drawback, but the performance loss is not as significant because of the extra allocations.
A related problem, maybe even worse (if possible), is that
function scale1!(A,b)
    A .*= (1.0./b)
end
will result in a full NaN matrix!
Ran on
Julia Version 0.6.0-pre.beta.397
Commit 991b581* (2017-04-28 08:52 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
Edit: Verified on the latest version of Julia (Julia Version 0.7.0-DEV.2, Commit b6e4e70* (2017-05-02 15:35 UTC)) with the same behavior, although slightly less significant time-wise.
it seems like structural zeros are not preserved in some cases, for example in scale2! two below.
Did you mean scale1!? (scale2! preserves structural zeros, whereas scale1! does not.) Simpler illustration of the difference in behavior:
julia> A = sprandn(10, 10, 0.1); nnz(A)
9
julia> b = randn(10);
julia> nnz(A ./ b) # like scale1!
100
julia> nnz(A .* b) # like scale2!
9
Explanation: / is not strictly zero-preserving, i.e. 0/0 == NaN != 0. Generic sparse broadcast yields a dense matrix in sparse form for functions that are not strictly zero-preserving. On the other hand, * is strictly zero-preserving, i.e. 0*0 == 0. Generic sparse broadcast yields a truly sparse result for functions that are strictly zero-preserving.
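A rough sketch (not the actual Base implementation) of the zero-preservation test that picks between those two output shapes:

using SparseArrays

preserves_zeros(f, args...) = iszero(f(map(x -> zero(eltype(x)), args)...))

A = sprandn(10, 10, 0.1); b = randn(10)
preserves_zeros(*, A, b)   # true:  0.0 * 0.0 == 0.0, so the result stays truly sparse
preserves_zeros(/, A, b)   # false: 0.0 / 0.0 is NaN, so the result is dense in sparse form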
In essence you've hit the issues I mention in the second section of https://github.com/JuliaLang/julia/pull/18590#issuecomment-249401774 and again in https://github.com/JuliaLang/julia/pull/18590#issuecomment-249447622.
I was not able to understand why @benchmark reports good values when the code is slow in practice.
Not certain I follow --- could you clarify? :)
will result in a full NaN matrix!
Yes, the issue with e.g. A ./= b is that the input arguments alias the output argument, and sparse broadcast does not yet guard against madness ensuing in that case. (Opened an issue to track that behavior: JuliaLang/julia#21693.)
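In the meantime, one conservative workaround (my suggestion, at the cost of allocating a fresh result) is to avoid aliasing the destination entirely and to keep the fused kernel zero-preserving, much as scale2! above already does:

using SparseArrays

A = sprandn(100, 1000, 0.001); b = randn(100)
tmp = 1.0 ./ b   # materialize the dense factor first, so the fused kernel is just *
A = A .* tmp     # freshly allocated result rebound to A: no input/output aliasing
nnz(A)           # the sparsity pattern is preserved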
Best!
Did you mean scale1!? (scale2! preserves structural zeros, whereas scale1! does not.)
Yes, sorry.
Simpler illustration of the difference in behavior:
Sure, but part of the point was to also show that the expression A.*(1.0./b) will not give the same result as using a temp variable, even though it should in some sense be equivalent.
Explanation: / is not strictly zero-preserving, i.e. 0/0 == NaN != 0. Generic sparse broadcast yields a dense matrix in sparse form for functions that are not strictly zero-preserving. On the other hand, * is strictly zero-preserving, i.e. 0*0 == 0. Generic sparse broadcast yields a truly sparse result for functions that are strictly zero-preserving.
Thanks for the explanation; I was able to figure out that this is what's probably being done, but the code is not trivial to read. My understanding is that the whole resulting matrix is NaN filled and that everything that is not a result of / between structural zeros is calculated and updated.
Is there a good reason why we are allocating and filling the whole result with NaN instead of only the parts corresponding to structural zeros (in this case structural zeros in b)? I understand the reasoning behind assuming that most elements in sparse arrays are zeros, for efficiency reasons, but then I don't think that a dense b should be converted to be sparse and run through this method.
I was not able to understand why @benchmark reports good values when the code is slow in practice.
Not certain I follow --- could you clarify? :)
Sure, what I meant is that @benchmark estimates approximately a factor 4 slowdown and a reduction of memory usage for scale1! compared to scale2!, but @time is able to measure the actual slowdown of a factor 10+ and the increase in memory allocation of a factor 200+.
Might be related to benchmarking at global scope, which generally is to be avoided.
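One way (my suggestion) to sidestep both the global-scope effects and the first-call reallocation when benchmarking these is to use BenchmarkTools' interpolation together with a setup that hands every sample a fresh copy of A:

using BenchmarkTools
using SparseArrays

# scale1! and scale2! as defined earlier in this thread:
scale1!(A, b) = (A .= A .* (1.0 ./ b))
scale2!(A, b) = (tmp = 1.0 ./ b; A .= A .* tmp)

A0 = sprandn(100, 100000, 0.001); b = randn(100)
@benchmark scale1!(A, $b) setup=(A = copy($A0)) evals=1
@benchmark scale2!(A, $b) setup=(A = copy($A0)) evals=1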
My understanding is that the whole resulting matrix is NaN filled and that everything that is not a result of / between structural zeros is calculated and updated.
This understanding is correct :).
Is there a good reason why we are allocating and filling the whole result with NaN instead of only the parts corresponding to structural zeros (in this case structural zeros in b)? I understand the reasoning behind assuming that most elements in sparse arrays are zeros, for efficiency reasons, but then I don't think that a dense b should be converted to be sparse and run through this method.
The tl;dr is that (1) these methods were primarily designed for combinations involving sparse/structured vectors/matrices and scalars, (2) the mechanisms allowing combinations including dense vectors/matrices were something of an afterthought, and (3) consequently these methods are suboptimal for combinations including dense vectors/matrices. For such combinations, substantial room for improvement exists :).
Sure, what I meant is that @benchmark estimates approximately a factor 4 slowdown and a reduction of memory usage for scale1! compared to scale2!, but @time is able to measure the actual slowdown of a factor 10+ and the increase in memory allocation of a factor 200+.
Benchmarking is tricky :). Note that (1) as used above, @time also captures compilation time and memory consumption; (2) as martinholters mentioned, benchmarking at global scope is problematic; (3) the operations (and so timings) above are suspect due to the input-output aliasing; and (4) on the first call of scale?!(A, b) after the A = copy(A0) calls above, A gets reallocated to accommodate worst-case storage needs, but such reallocation does not necessarily occur on subsequent calls (A having previously been expanded), so @time captures that reallocation cost whereas @benchmark does not necessarily.
Best!
For such combinations, substantial room for improvement exists :).
I see. Hopefully I will be able to find some time to contribute more to Julia in the near future; I have several issues related to sparse matrices and factorization that I will try to take a stab at when I find the time, and your explanations are very useful for me to understand the language better.
Benchmarking is tricky :)
Yes, I actually took several steps to create more reliable results, which I didn't show; I really think that the actual performance is closer to the one reported by @time (although I might be wrong).
(1) as used above, @time also captures compilation time and memory consumption;
I did run @time a couple of times to avoid this, and printed the result after a few runs.
(2) as martinholters mentioned, benchmarking at global scope is problematic; (3) the operations (and so timings) above are suspect due to the input-output aliasing;
Could you elaborate? I did put the main code in functions; I thought this would remove most of the problems.
(4) on the first call
See (1)
Note also that for some reason @benchmark ran 4 times more tests on scale2!; I'm guessing this is closely related to the extra time it takes? Anyway, I think the focus should be on how we can make the code better, and not on how the benchmark was performed :)
hopefully I will be able to find some time to contribute more to julia in the close future, I have several issues related to sparse matrices and factorization I will try to take a stab at when i find the time
That would be great! :)
Could you elaborate? I did put the main code in functions, I thought this would remove most of the problems
Ref. the relevant section of the BenchmarkTools.jl manual.
Best!
Just got hit by the literal vs. variable discrepancy. This issue, along with JuliaLang/julia#21693, makes me think that for now dot ops should really just throw an error instead of silently giving the wrong result...
julia> X = sprand(10,0.1)
10-element SparseVector{Float64,Int64} with 2 stored entries:
[4 ] = 0.492868
[8 ] = 0.0962979
julia> a=0
0
julia> X./0
10-element SparseVector{Float64,Int64} with 10 stored entries:
[1 ] = NaN
[2 ] = NaN
[3 ] = NaN
[4 ] = Inf
[5 ] = NaN
[6 ] = NaN
[7 ] = NaN
[8 ] = Inf
[9 ] = NaN
[10] = NaN
julia> X./a
10-element SparseVector{Float64,Int64} with 2 stored entries:
[4 ] = Inf
[8 ] = Inf
Argh, actually this is another issue, since the generic sparse broadcast was not done for SparseVector, and the definition still only applies the operation to nonzeros.
Argh, actually this is another issue, since the generic sparse broadcast was not done for SparseVector, and the definition still only applies the operation to nonzeros.
Generic sparse broadcast handles SparseVectors. Rather, the X ./ a case in your example hits the dot-operator deprecation here, which in turn calls broadcast(/, X, a). broadcast(/, X, a) hits this set of broadcast specializations instead of generic sparse broadcast, yielding the observed behavior.
That set of methods looks like it should be removed in the interest of correctness. I will eke out time for a pull request removing those methods this weekend if possible.
Thanks for catching those stragglers @jebej! :)
Hm, how do you call the generic sparse broadcast with sparse vectors?
edit: I can manually call broadcast_c to bypass the specializations.
Hm how do you call the generic sparse broadcast with sparse vectors?
broadcast over combinations of sparse/structured vectors/matrices, scalars, and Vectors/Matrices should hit generic sparse broadcast (where no specializations exist for the particular combination of arguments). Please see NEWS.md for more.
I had figured that there was no broadcast for SparseVector because the methods here act only on the nonzeros. broadcast itself is hijacked to call those nonzero-only methods.
As above (https://github.com/JuliaLang/julia/issues/21515#issuecomment-313869996), those specializations are incorrect and should be removed. Best!