Kristoffer Carlsson
Updated. Some benchmarks made on https://github.com/JuliaLang/julia/pull/29258:

```jl
using BenchmarkTools
using StaticArrays

x = rand(MMatrix{8,8});
s = rand(SMatrix{8,8});
```

Before

```jl
julia> @btime map!(x -> x*2, $x, $s);
  11.056 ns (0...
```
Arguably the old methods should be left in and we should `@static if` based on `VERSION`... Thoughts?
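As a rough sketch of what that could look like (the version bound and method bodies below are placeholders, not the actual StaticArrays definitions):

```jl
using StaticArrays

# Hypothetical sketch of keeping the old method behind a VERSION check.
# The bound v"1.1.0-DEV" and the bodies are illustrative placeholders.
@static if VERSION >= v"1.1.0-DEV"
    # On newer Julia the generic machinery is already fast, so just use it.
    double_all(x::StaticArray) = map(v -> 2v, x)
else
    # On older Julia keep the previous hand-written definition around.
    double_all(x::StaticArray) = 2 .* x
end
```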
```jl
using BenchmarkTools
using StaticArrays
using LinearAlgebra

for siz in (1,2,3,4,8)
    println("size = $siz x $siz")
    # Refs to avoid inlining into benchmark loop
    s = Ref(rand(SMatrix{siz, siz}))
    @btime sum(abs2,...
```
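The rest of that loop is cut off above; purely as a sketch of the pattern (the workloads below are stand-ins, since the original expressions are truncated), the size sweep with the `Ref` trick looks something like this:

```jl
using BenchmarkTools
using StaticArrays
using LinearAlgebra

# Sketch of the size-sweep pattern; sum(abs2, ...) and the matmuls are stand-in
# workloads, not necessarily the ones from the truncated benchmark above.
for siz in (1, 2, 3, 4, 8)
    println("size = $siz x $siz")
    # Wrap the inputs in Ref so their values are not treated as compile-time
    # constants and inlined into the benchmark loop.
    s = Ref(rand(SMatrix{siz, siz}))
    m = Ref(rand(MMatrix{siz, siz}))
    @btime sum(abs2, $s[])   # reduction over an SMatrix
    @btime $s[] * $s[]       # SMatrix * SMatrix
    @btime $m[] * $m[]       # MMatrix * MMatrix
end
```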
Removed the `map!` loop since it seems LLVM decided not to unroll it, even though doing so would perhaps have been advantageous.
> The escape analysis and codegen already work perfectly well, I’d just want conversions between mutable and immutable tuples to be no-ops when appropriate (as you want to store your...
Note that it is possible to write something like

```jl
function matmul(a::SMatrix{I, J}, b::SMatrix{J, K}) where {I, J, K}
    c = zero(MMatrix{I, K})
    @inbounds for k in 1:K, j in...
```
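The snippet is truncated above; a completed version of the idea might look roughly like this. The loop body, the use of `muladd`, and the final conversion back to `SMatrix` are my reconstruction, not the original code:

```jl
using StaticArrays

# Reconstructed sketch: accumulate into a mutable MMatrix and convert back to an
# SMatrix at the end, relying on the compiler to elide the mutable intermediate.
function matmul(a::SMatrix{I, J}, b::SMatrix{J, K}) where {I, J, K}
    c = zero(MMatrix{I, K})
    @inbounds for k in 1:K, j in 1:J, i in 1:I
        c[i, k] = muladd(a[i, j], b[j, k], c[i, k])
    end
    return SMatrix{I, K}(c)
end
```

Whether this ends up allocation-free then comes down to the escape-analysis / conversion question in the quoted comment above.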
Not sure what the best thing to do is. FWIW, this just seems to be the SLP vectorizer doing a bad job of vectorizing the code; it would be interesting to write the...
Seems clang kinda barfs on e.g. a 3x3 pattern as well: https://godbolt.org/z/TGCusg. It is interesting to note that when the sizes correspond to the width of the SIMD registers the...
I always saw not using `muladd` in StaticArrays as just a missed optimization. Are you saying it is an intended choice? Seems quite out of spirit with other choices made...
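For reference, `muladd(a, b, c)` computes `a*b + c` but allows the compiler to fuse it into a single fma instruction where the hardware supports it; a tiny illustration of the kind of inner kernel this affects (the function names are made up, not StaticArrays code):

```jl
# Illustrative 3-element dot-product kernels; names are hypothetical.
dot_plain(a, b)  = a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
dot_muladd(a, b) = muladd(a[3], b[3], muladd(a[2], b[2], a[1]*b[1]))

dot_plain((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))   # 32.0
dot_muladd((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))  # 32.0, but may compile to fma
```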
Apparently, LLVM has matrix multiplication intrinsics now: https://llvm.org/docs/LangRef.html#llvm-matrix-multiply-intrinsic