feature request: reduction isomorphisms to factor expensive operations out of the loop
It might be a bit too early to be submitting feature requests, but I recently started toying with LoopModels and thought it would not be a bad idea to copy over some of the LoopVectorization.jl feature requests that are important to me. Hopefully, once this project is mature enough, I will be able to contribute actual PRs.
Namely, the transformation of mapreduce(fₐ, opₐ, arr) into fᵦ(reduce(opᵦ, arr)) can be extremely useful for performance: it takes the elementwise function fₐ and factors it out of the loop (possibly with some transformation applied to it). Some examples:
- In classical error-correcting codes and quantum Clifford circuits, it is about factoring count_ones out over the xor operator (a sketch of this case follows the list).
- In probability theory/statistics, it is about using log(x*y) = log(x) + log(y).
- It also seems useful in graph theory: GraphBLAS replaces the arithmetic semiring (+, *) with an arbitrary one, such as (max, +).
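To make the first case concrete, here is a hedged sketch of the requested rewrite for the parity-of-popcount reduction (my own illustration, not an existing LoopModels transform), with fₐ = count_ones, opₐ = +, opᵦ = ⊻, and fᵦ folding the final count_ones together with the mod-2:

```julia
# Parity of the total popcount: count_ones distributes over xor modulo 2,
# so the per-element count_ones can be hoisted out of the reduction.
naive_parity(arr)    = mapreduce(count_ones, +, arr) % 2  # popcount inside the loop
factored_parity(arr) = count_ones(reduce(⊻, arr)) % 2     # one popcount, outside the loop
```

Both agree on any nonempty integer array, e.g. `arr = rand(UInt64, 10^6); naive_parity(arr) == factored_parity(arr)` returns `true`.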
References:
- https://github.com/JuliaSIMD/LoopVectorization.jl/issues/291
- https://github.com/JuliaLinearAlgebra/Octavian.jl/issues/102
- https://discourse.julialang.org/t/how-to-do-simd-code-with-wide-register-accumulators-simd-vs-loopvectorization-jl-vs-simd-jl/63322
I agree that it's an important optimization for us to implement, and it also isn't one that a user writing the code can easily implement themselves. Taking this example:
> In probability theory/statistics, it is about using log(x*y) = log(x) + log(y).
The transform log(x) + log(y) -> log(x*y) will generally lead to a loss of accuracy when calculating the product; if the vector is long, the product is likely to underflow to 0 or overflow to +/-Inf. That is why people normally recommend calculating things like log-determinants of triangular matrices by summing the logs of the diagonal elements, rather than taking the log of their product.
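A quick demonstration of the failure mode (my own example):

```julia
julia> x = fill(0.5, 2000);

julia> log(prod(x))   # 0.5^2000 underflows to 0.0
-Inf

julia> sum(log, x) ≈ 2000 * log(0.5)   # summing the logs stays finite
true
```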
However, log(x) is generally computed via the range reduction log(x) = log(x/2^k) + k*log(2), where k is chosen to place x/2^k within the interval [0.75, 1.5). With the reduced values restricted to this narrow range around 1, it becomes very unlikely that a running product of them reaches 0 or Inf.
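Julia's built-in significand/exponent illustrate the same decomposition, although they reduce to the interval [1, 2) rather than [0.75, 1.5):

```julia
julia> x = 12.345;

julia> k, m = exponent(x), significand(x)   # x == m * 2^k, with m ∈ [1, 2)
(3, 1.543125)

julia> m * 2.0^k == x
true
```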
```julia
julia> log(floatmax(Float64))/log(1.5) # how many in a row
1750.5395623438897

julia> log(floatmin(Float64))/log(0.75)
2462.4280981255797
```
This means it would take about 1750 consecutive factors of 1.5 (or 2462 factors of 0.75) before the running product runs into floatmax/floatmin. It's still possible, but it would take a crazy (improbable) distribution of numbers for every one of them to sit at the same extreme; with an even mix, we're basically safe.
This means that we just need to sum up the ks and take the product of the x/2^k values, applying the log once at the end. Calculating k and x/2^k is extremely cheap; AVX512 even has specialized instructions for it:
https://github.com/JuliaSIMD/SLEEFPirates.jl/blob/e8316a63b1c59f3a80ba69ae0574f8e8d1c65039/src/log.jl#L284-L285
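To spell that out, here is a minimal scalar sketch of the factored reduction (my own illustration, not the LoopModels/SLEEFPirates implementation; it uses Julia's built-in [1, 2) reduction and renormalizes every iteration for simplicity, even though the analysis above shows that could be done far more rarely):

```julia
# sum(log, xs) with the expensive log hoisted out of the loop.
# Assumes positive, finite inputs (exponent(0.0) throws, and log needs x > 0).
function sum_log_factored(xs::AbstractVector{Float64})
    k = 0      # running sum of binary exponents
    m = 1.0    # running product of reduced mantissas
    for x in xs
        k += exponent(x)      # x == significand(x) * 2^exponent(x)
        m *= significand(x)   # significand(x) ∈ [1, 2)
        k += exponent(m)      # renormalize so m stays in [1, 2)
        m = significand(m)
    end
    return log(m) + k * log(2.0)   # a single log, outside the loop
end
```

On random positive inputs this matches the straightforward version, e.g. `xs = rand(10^6) .+ 0.5; sum_log_factored(xs) ≈ sum(log, xs)` yields `true`.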
So this still gives us the benefit of hoisting almost the entirety of the expensive log calculation out of the loop, with negligible loss of accuracy and without the 0s/Infs that the naive product would normally produce.
Related (on the topic of reduction transforms): https://github.com/JuliaSIMD/LoopVectorization.jl/issues/370. It should be possible to transform such a loop nest into multiple, less deeply nested loops, achieving an algorithmic improvement in runtime; a toy sketch of that kind of rewrite follows.
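For illustration, here is a toy instance of that kind of rewrite (my own example, not necessarily the one discussed in the linked issue): a doubly nested reduction over a separable product collapses into two independent single loops, turning O(n²) work into O(n).

```julia
# Naive O(n^2): the reduction runs over the full loop nest.
nested_sum(a, b) = sum(a[i] * b[j] for i in eachindex(a), j in eachindex(b))

# Split O(n): sum_{i,j} a[i]*b[j] == (sum_i a[i]) * (sum_j b[j]),
# so the nest factors into two independent single-loop reductions.
split_sum(a, b) = sum(a) * sum(b)
```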