
`@fastmath` support for `sum`, `prod`, `extrema` and `extrema!`

Open • matthias314 opened this pull request 2 years ago • 14 comments

This PR extends #48153 by adding `@fastmath` support for `sum`, `prod`, `extrema` and `extrema!`. This would provide a solution to #47216.

Currently, `@fastmath` has no effect for the functions mentioned above. With this PR, the speedup factors are:

| Function | Float16 | Float32 | Float64 |
| --- | --- | --- | --- |
| `sum(v)` | 1.0 | 1.0 | 1.0 |
| `sum(v; init=T(0))` | 28.9 | 23.8 | 14.2 |
| `prod(v)` | 1.0 | 1.0 | 1.0 |
| `prod(v; init=T(1))` | 29.1 | 22.5 | 36.1 |
| `extrema(v)` | N/A | 38.2 | 20.4 |
| `extrema(v; init=(T(1),T(0)))` | N/A | 40.9 | 21.1 |
| `extrema!(rc, a)` | N/A | 1.2 | 1.1 |
| `extrema!(rr, a)` | N/A | 21.0 | 7.3 |

Here `T` is the given float type, `n = 1000`, `v = rand(T, n)`, `a = rand(T, n, n)`, `rc = fill((T(0),T(0)), n)` and `rr = fill((T(0),T(0)), 1, n)`. At present, the `extrema` functions crash for Float16 because of #49907. For the same reason, the corresponding tests are currently skipped in the test file. I have also tried fast versions of `sum!` and `prod!`, but I couldn't get any improvement there.

matthias314 avatar May 21 '23 21:05 matthias314

A @fastmath version is IMO not a solution to the non-@fastmath version being slower than necessary - after all, the regular versions still have to uphold the assumptions that @fastmath is allowed to violate.

Since this is quite hardware dependent, can you share your benchmarking code so others can corroborate the speedups?

Seelengrab avatar May 21 '23 22:05 Seelengrab

Could one always use `@fastmath` for `sum` and `prod`? When I tried it out, the fast versions dealt with NaN correctly. Rearranging terms should be allowed for `sum` and `prod` anyway.
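
(For context, a check of the NaN behaviour described here might look like the following; this is an illustrative reconstruction, not the actual test from the PR.)

```julia
v = [1.0, NaN, 2.0]
sum(v)            # NaN, since x + NaN is NaN
@fastmath sum(v)  # also returned NaN in the reporter's testing, although
                  # @fastmath does not guarantee NaN handling in general
```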

Here is the benchmark code:
using BenchmarkTools, Printf

TT = (Float16, Float32, Float64)

n = 1000

println("| T | ", join(TT, " | "), " |")

@printf "| sum(v) |"
for T in TT
    v = rand(T, n)
    a = minimum(@benchmark sum($v))
    b = minimum(@benchmark @fastmath sum($v))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| sum(v; init=T(0)) |"
for T in TT
    v = rand(T, n)
    i = zero(T)
    a = minimum(@benchmark sum($v; init=$i))
    b = minimum(@benchmark @fastmath sum($v; init=$i))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| prod(v) |"
for T in TT
    v = rand(T, n)
    a = minimum(@benchmark prod($v))
    b = minimum(@benchmark @fastmath prod($v))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| prod(v; init=T(1)) |"
for T in TT
    v = rand(T, n)
    i = one(T)
    a = minimum(@benchmark prod($v; init=$i))
    b = minimum(@benchmark @fastmath prod($v; init=$i))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema(v) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    v = rand(T, n)
    a = minimum(@benchmark extrema($v))
    b = minimum(@benchmark @fastmath extrema($v))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema(v; init=(T(1),T(0))) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    v = rand(T, n)
    i = (one(T), zero(T))
    a = minimum(@benchmark extrema($v; init=$i))
    b = minimum(@benchmark @fastmath extrema($v; init=$i))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema!(rc, a) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    A = rand(T, n, n)
    ra = fill((T(0),T(0)), n)
    rb = copy(ra)
    a = minimum(@benchmark extrema!($ra, $A))
    b = minimum(@benchmark @fastmath extrema!($rb, $A))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema!(rr, a) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    A = rand(T, n, n)
    ra = fill((T(0),T(0)), 1, n)
    rb = copy(ra)
    a = minimum(@benchmark extrema!($ra, $A))
    b = minimum(@benchmark @fastmath extrema!($rb, $A))
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

matthias314 avatar May 21 '23 22:05 matthias314

A @fastmath version is IMO not a solution to the non-@fastmath version being slower than necessary

:point_up:

giordano avatar May 21 '23 23:05 giordano

Could one always use `@fastmath` for `sum` and `prod`? When I tried it out, the fast versions dealt with NaN correctly.

If that were possible, we'd already do that. It's not just NaN though - naive use of @fastmath (especially on things like reductions) can lead to horrible numerical accuracy, due to the reassociations that may not be numerically good. I'd wager that using @fastmath by default breaks this, for example, but there are bound to be others.
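
(A minimal stand-alone illustration, not from the thread, of how reassociation can change a sum: the exact sum of the four terms below is 2.0.)

```julia
v = [1.0, 1e100, 1.0, -1e100]

foldl(+, v)                    # 0.0 -- strict left-to-right loses both 1.0 terms
(1.0 + 1.0) + (1e100 - 1e100)  # 2.0 -- same terms, different association
```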

Rearranging terms should be allowed for sum and prod anyway.

Yeah, but since floating point addition is not associative, the reassociations may lead to catastrophic cancellation and subsequent loss of accuracy. It "works" for sum and prod because those are (on x86 anyway) already doing the correct thing and using SIMD, due to Intel actually implementing addition correctly (and this is also why you don't see a speedup for these two). For other things like min, Intel got lazy and thought they could save some hardware resources by only propagating the right operand if either one is a NaN (which is incorrect - NaN should be propagated either way).
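
(Illustration, not from the thread, of the NaN semantics being described: Base's min/minimum propagate NaN from either operand.)

```julia
min(NaN, 1.0), min(1.0, NaN)  # (NaN, NaN) -- Base propagates NaN from either side
minimum([1.0, NaN, 2.0])      # NaN
# Under @fastmath the compiler may assume NaN never occurs, so a fastmath/SIMD
# reduction is free to return 1.0 here instead; the exact behavior is
# unspecified and hardware dependent.
```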

That's why @fastmath is opt-in, why the global flag became a no-op, and why there's discussion about further exposing the LLVM flags it uses under the hood.

The issue with https://github.com/JuliaLang/julia/issues/47216 is more about figuring out why exactly our implementation with a default value doesn't end up emitting vector instructions, rather than slapping @fastmath on it and calling it a day. It should have enough information to do that after all.

Seelengrab avatar May 22 '23 05:05 Seelengrab

I've modified your benchmarking script a bit, to ensure each evaluation starts on fresh & newly generated data, so that there's no chance of caching effects arbitrarily inflating some numbers. It's quite unlikely that repeatedly summing/multiplying the same array over and over again is a regular occurrence.

I've also noticed that n = 1000 is quite small - that's just 2/4/8 KB of data, which fits comfortably into my L1 cache (384 KiB) many times over.

using BenchmarkTools, Printf

TT = (Float16, Float32, Float64)

n = 1000

println("| T | ", join(TT, " | "), " |")

@printf "| sum(v) |"
for T in TT
    a = minimum(@benchmark sum(v) setup=(v=rand($T,n)) evals=1)
    b = minimum(@benchmark @fastmath(sum(v)) setup=(v=rand($T,n)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| sum(v; init=T(0)) |"
for T in TT
    a = minimum(@benchmark sum(v; init=i) setup=(v=rand($T,n); i=rand($T)) evals=1)
    b = minimum(@benchmark @fastmath(sum(v; init=i)) setup=(v=rand($T,n); i=rand($T)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| prod(v) |"
for T in TT
    a = minimum(@benchmark prod(v) setup=(v=rand($T,n)) evals=1)
    b = minimum(@benchmark @fastmath(prod(v)) setup=(v=rand($T,n)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| prod(v; init=T(1)) |"
for T in TT
    a = minimum(@benchmark prod(v; init=i) setup=(v=rand($T,n); i=rand($T)) evals=1)
    b = minimum(@benchmark @fastmath(prod(v; init=i)) setup=(v=rand($T,n); i=rand($T)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema(v) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    a = minimum(@benchmark extrema(v) setup=(v=rand($T,n)) evals=1)
    b = minimum(@benchmark @fastmath(extrema(v)) setup=(v=rand($T,n)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema(v; init=(T(1),T(0))) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    a = minimum(@benchmark extrema(v; init=i) setup=(v=rand($T,n); i=(rand($T),rand($T))) evals=1)
    b = minimum(@benchmark @fastmath(extrema(v; init=i)) setup=(v=rand($T,n); i=(rand($T),rand($T))) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema!(rc, a) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    a = minimum(@benchmark extrema!(ra, A) setup=(A=rand($T,n,n); ra=fill(($T(0),$T(0)), n)) evals=1)
    b = minimum(@benchmark @fastmath(extrema!(rb, A)) setup=(A=rand($T,n,n); rb=fill(($T(0),$T(0)), n)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

@printf "| extrema!(rr, a) |"
for T in TT
    if T == Float16 ; print(" N/A |") ; continue end
    a = minimum(@benchmark extrema!(ra, A) setup=(A=rand($T,n,n); ra=fill(($T(0),$T(0)), 1, n)) evals=1)
    b = minimum(@benchmark @fastmath(extrema!(rb, A)) setup=(A=rand($T,n,n); rb=fill(($T(0),$T(0)), 1, n)) evals=1)
    @printf " %.1f |" (ratio(a.time, b.time))
end
@printf "\n"

This is the result:

| Function | Float16 | Float32 | Float64 |
| --- | --- | --- | --- |
| `sum(v)` | 1.0 | 1.0 | 1.0 |
| `sum(v; init=T(0))` | 25.3 | 11.0 | 9.3 |
| `prod(v)` | 1.0 | 1.0 | 1.0 |
| `prod(v; init=T(1))` | 26.9 | 9.8 | 8.3 |
| `extrema(v)` | N/A | 60.6 | 32.9 |
| `extrema(v; init=(T(1),T(0)))` | N/A | 42.8 | 33.1 |
| `extrema!(rc, a)` | N/A | 1.1 | 1.1 |
| `extrema!(rr, a)` | N/A | 82.7 | 31.0 |

My machine is

Julia Version 1.10.0-DEV.1348
Commit f8f236b4bc (2023-05-21 21:16 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × AMD Ryzen 9 7900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 34 on 24 virtual cores
Environment:
  JULIA_PKG_USE_CLI_GIT = true

I especially want to note that I get a smaller speedup than you on prod(v; init=T(1)) (and indeed, smaller than I'd expect from a proper SIMD implementation). Another thing to keep in mind is that rand(Float64) doesn't produce values from the entire range of floating point values, but only from [0, 1), so there's no chance at all of floating point precision being lost (which definitely must be taken into consideration when talking about general sum/prod and needs to be tested).
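
(One way, not used in the benchmarks above, to generate test data that spans many orders of magnitude, so that precision differences between the fastmath and regular reductions would actually become observable:)

```julia
s = rand([-1.0, 1.0], 1000)                                 # random signs
v = s .* ldexp.(rand(Float64, 1000), rand(-300:300, 1000))  # random magnitudes
```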

Seelengrab avatar May 22 '23 06:05 Seelengrab

Thanks for the detailed explanations! I didn't worry about floating point precision because my understanding was (and is) that the current implementations of sum and prod also add up elements in a way that only depends on the length of the input, not on the actual values. Whatever order is chosen, there will always be some input on which it performs poorly.

I especially want to note that I get worse speedup than you on prod(v; init=T(1)) (and indeed, worse than I'd expect of a proper SIMD implementation).

The speedup for prod(v; init=T(1)) makes it as fast as prod(v) on my machine. Do you think one can get faster than that?

Anyway, how shall we move forward? If you don't want @fastmath for sum and prod, then we can either forget about the PR or restrict it to extrema and extrema!.

matthias314 avatar May 22 '23 18:05 matthias314

the current implementations of sum and prod add up elements in a way that only depends on the length of the input, not on the actual values.

What does this mean? That sum([1.0, 2.0]) == sum([3.0, 4.0])?

giordano avatar May 22 '23 18:05 giordano

I mean that sum([x1,x2,x3]) boils down to a fixed evaluation order like (x1+x2)+x3. If I know the evaluation order, I can cook up a bad input. Changing the evaluation to x1+(x2+x3) simply permutes what the bad inputs are. Given that sum(v; kw...) essentially calls reduce(+, v; kw...), I don't expect there to be any (potentially existing) clever evaluation in our code that minimizes the total number of bad inputs. Please correct me if I'm wrong.

matthias314 avatar May 22 '23 18:05 matthias314

The speedup for prod(v; init=T(1)) makes it as fast as prod(v) on my machine. Do you think one can get faster than that?

I'm not at the computer I measured the previous results on right now, so I'll have to defer until then. I don't know whether prod(v) already hits peak FLOPS, but all else being equal, I would have expected to see at least a bigger speedup for Float32 compared to Float16 and Float64, since the AVX-512 instructions cleanly double in capacity when you halve the bit width of the floating point type.

Given that sum(v; kw...) essentially calls reduce(+, v; kw...), I don't expect there to be any (potentially existing) clever evaluation in our code that minimizes the total number of bad inputs. Please correct me if I'm wrong.

I don't think we currently do anything like Kahan summation by default, mostly for performance reasons, but at the same time we also don't guarantee that we won't. Additionally, we don't guarantee not to thread/parallelize such a summation by default, which would change the association.
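
(For reference, a minimal sketch of the Kahan-style compensated summation mentioned above; this is not Base's implementation, just an illustration of what a "cleverer" evaluation could look like.)

```julia
function kahan_sum(v)
    s = zero(eltype(v))   # running sum
    c = zero(eltype(v))   # running compensation for lost low-order bits
    for x in v
        y = x - c
        t = s + y
        c = (t - s) - y   # recover what was rounded away in s + y
        s = t
    end
    return s
end

v = [1.0; fill(1e-16, 10^6)]   # exact sum is 1.0 + 1e-10
foldl(+, v)                    # 1.0 -- each tiny term is rounded away
kahan_sum(v)                   # ≈ 1.0000000001
```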

Moreover, by using different blocking strategies, you can VASTLY improve the performance of even slightly more complicated functions than just + (or even that one) due to caching effects, better register/value reuse, fewer cache evictions, etc. As such, while I'm not in principle opposed to @fastmath sum, I don't think it's necessarily a good/easy win to strive for - if we want a vectorized regular sum, @fastmath won't help with that, and recommending @fastmath instead of fixing the issue with init seems backwards, seeing as we otherwise discourage its use unless it's known to be appropriate (which most naive use really isn't). If you truly want to squeeze out performance, there's just no way around doing something smarter than a NaN/Inf-ignoring, reassociating reduce(+, ...) (especially on multidimensional data).

Seelengrab avatar May 22 '23 21:05 Seelengrab

A @fastmath version is IMO not a solution to the non-@fastmath version being slower than necessary

I don't see it, at least in this version, but shouldn't all calculations be done in Float64?

I mean, for the lower-precision types like Float32: converted to Float64, summed, then converted back to the original type?

I believe Float64 is mandated by IEEE; would all CPU hardware have as many such float units? With SIMD, you work on registers of a certain size and can sum as many elements as fit, but those registers hold fewer Float64 values, so maybe this is actually slower even on CPUs.
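
(For completeness: accumulating a lower-precision array in Float64 is already possible today via the `sum(f, itr)` method, independently of this PR.)

```julia
v32 = rand(Float32, 1000)
s64 = sum(Float64, v32)   # convert each element on the fly, accumulate in Float64
s32 = Float32(s64)        # round back to the original precision if desired
```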

Since this is quite hardware dependent

Yes, at least on GPUs, where I'm more worried: you do not have as many Float64 units (or sometimes none at all - is that still a problem? historically you didn't always have them). Julia doesn't strictly support GPUs, so it's possibly not a problem, though CUDA 10.1 is listed as Tier 1. I think that means the package will not be broken. Maybe such a change would break it, or make it slower (not on CUDA 10.1?), but it could at least be updated to work.

I was looking at the discourse post:

Julia currently uses a pairwise summation algorithm, which is much better than naive left-to-right reduction while having very nearly the same speed. There is also a sum_kbn function

I.e., you would get more accuracy out of the "pairwise summation algorithm" for sum, never rounded. Is it always allowed to do better? I mean, faster is OK, but for sum, is a more accurate, non-bit-identical result OK?
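
(A rough sketch, not Base's actual code, of the pairwise summation idea the quoted Discourse post refers to: recursively split the range in half and add the two partial sums, falling back to a plain loop for small blocks. Its worst-case rounding error grows like O(log n) instead of the O(n) of a purely sequential loop.)

```julia
function pairwise_sum(v::AbstractVector, lo::Int = firstindex(v), hi::Int = lastindex(v))
    if hi - lo < 128                    # small block: plain sequential loop
        s = v[lo]
        for i in lo+1:hi
            s += v[i]
        end
        return s
    else                                # split the range and recurse
        mid = (lo + hi) >>> 1
        return pairwise_sum(v, lo, mid) + pairwise_sum(v, mid + 1, hi)
    end
end
```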

PallHaraldsson avatar May 23 '23 10:05 PallHaraldsson

I'd rather not incentivize people to use more @fastmath.

sum is not documented to use any particular association order, and in the huge majority of applications the user likely doesn't care. Further, ?reduce (whose family I consider sum to be a part of) explicitly documents that "The associativity of the reduction is implementation dependent." If someone insists on a particular evaluation order, they should use something explicit like foldl(+,x) instead. So I would argue that we currently have some flexibility here.

I think that very few users would care about @fastmath sum if we could ensure that sum (and mapreduce, more broadly) used the reassoc fastmath flag on the reduction. Most of the speedup (and almost all of the mostly-safe speedup) comes from SIMD enabled by reassoc. The remaining flags are prone to causing trouble (e.g., #49387). See also #49890.

I'm not opposed to @fastmath being able to penetrate mapreduce and friends, but we should be sure that reassociation is permitted even without it (#48129). That way people won't be so tempted to apply @fastmath. Hopefully, that means people will do it less unless they really think it's safe and important due to other operations within the reduction.
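
(A loop-level sketch of the "reassociation only" idea: `@simd` grants the compiler permission to reorder this reduction, and hence vectorize it, without the other @fastmath assumptions such as "no NaN/Inf". This is an illustration, not what Base's mapreduce currently does.)

```julia
function simd_sum(v::AbstractVector{T}) where {T<:AbstractFloat}
    s = zero(T)
    @inbounds @simd for i in eachindex(v)   # @simd allows reassociating s += v[i]
        s += v[i]
    end
    return s
end
```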

mikmoore avatar May 23 '23 17:05 mikmoore

Bump: this was approved, so should it be merged? I didn't read the possible objections carefully. Was this forgotten?

PallHaraldsson avatar Nov 25 '23 17:11 PallHaraldsson

This was bumped last November, bumping again.

I also see no clear objection. The last comment by @mikmoore seems critical, but I am not sure whether it is meant as an actual objection to this PR. Perhaps @mikmoore would like to clarify?

fingolfin avatar Feb 13 '24 23:02 fingolfin

On review, I would say that I am against @fastmath for sum/prod but accepting of extrema. Discussion follows:

There isn't a performance reason for @fastmath to be applicable to sum/prod (with identity as the predicate -- it could accelerate some actual function but that's a different question). Any speedup there is purely due to reassociation (or UB, if the compiler knows some inputs are invalid and decides to eliminate whole operations). Reassociation is already semantically permitted (and used when init isn't specified, which is why that case gets no speedup), the issue is just that the implementation with init fails to use it. Fixing init is the solution, not @fastmath.

extrema does have a reasonable performance reason for @fastmath (permitting it to mis-handle NaN and the difference between -0.0 and +0.0 can make it much faster on most architectures due to the available instructions). Much of the speedup suggested here would be recovered by #45581, but that PR is not quite finished and I haven't touched it in over a year now so it's hard to say this should wait indefinitely on that. @fastmath would still be faster than that PR could ever achieve (thanks to the aforementioned mis-handling). Also, (it appears?) we already have @fastmath minimum/maximum, so this would be consistent with those.
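
(Illustration, not from the PR, of the special cases that make the regular extrema/minimum slower: Base must propagate NaN and distinguish -0.0 from +0.0, which typical hardware min/max instructions do not handle the same way.)

```julia
minimum([0.0, -0.0])   # -0.0 -- Base distinguishes signed zeros
minimum([1.0, NaN])    # NaN  -- Base propagates NaN
# A fastmath version may ignore both cases, which is what lets it map directly
# onto the vectorized hardware min/max instructions.
```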

mikmoore avatar Feb 14 '24 00:02 mikmoore