Chris Elrod

840 comments by Chris Elrod

I would like to add support for this as well by defining a pullback.

I fixed the bug, but it still has trouble with 8 nested loops:
```julia
julia> using LoopVectorization

julia> Q = rand(3,3);

julia> C = rand(3,3,3,3);

julia> Crot = zeros(3,3,3,3);

julia> ...
```

I can reproduce the wrong answer, but it seems to be a problem in Tullio:
```julia
julia> function rot2_tullio!(Crot, Q, C)
           @tullio avx=false mid[m,n,k,l] := Q[o,k] * Q[p,l] * C[m,n,o,p]
           ...
```

Isn't it, for all in one nested loop (unrolling inside like I did above doesn't really help runtime, just compile time): `5 * 3 ^ 8 = 32805` floating point...
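The count above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes each innermost iteration of the single 8-deep loop nest performs four multiplications and one addition, which is an assumption read off the formula rather than something stated in the comment. All variable names are illustrative.

```julia
# Flop count for one 8-deep loop nest where every index runs over 1:3.
n = 3
iterations = n^8            # one innermost iteration per index combination
flops_per_iteration = 5     # assumption: 4 multiplications + 1 addition
total_flops = flops_per_iteration * iterations
println(total_flops)
```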

I think LoopVectorization is making a bad decision above (the decision looks bad to me, anyway). I'm looking into why now. EDIT: Never mind, the decision makes sense. =/

LoopVectorization completely unrolls it and uses SIMD, but it seems that is slower. LoopVectorization's assembly features a huge number of address calculations. SIMD-ing the first multiplication isn't especially efficient either,...

> BTW, am I right to think that the transformation here from the naive loops to the staged algorithm is way out of scope for these automatic transformations? It would...
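To make the "naive loops" versus "staged algorithm" distinction concrete, here is a minimal plain-Julia sketch (no LoopVectorization or Tullio). The function names `rot_naive`, `contract_first`, and `rot_staged` are hypothetical, and the index convention `Q[m,i]` (summing over the first index of `Q`) is an assumption chosen to match the `Q[o,k]` pattern in the Tullio snippet; the thread's actual code may differ.

```julia
# Naive version: one 8-deep loop nest, 4 free indices and 4 summed indices.
function rot_naive(Q, C)
    n = size(Q, 1)
    Crot = zeros(eltype(C), n, n, n, n)
    for l in 1:n, k in 1:n, j in 1:n, i in 1:n
        acc = zero(eltype(C))
        for p in 1:n, o in 1:n, nn in 1:n, m in 1:n
            acc += Q[m,i] * Q[nn,j] * Q[o,k] * Q[p,l] * C[m,nn,o,p]
        end
        Crot[i,j,k,l] = acc
    end
    return Crot
end

# One stage: contract only the first index of A with Q (a 5-deep loop nest).
function contract_first(Q, A)
    n = size(Q, 1)
    B = zeros(eltype(A), n, n, n, n)
    for l in 1:n, k in 1:n, j in 1:n, i in 1:n, m in 1:n
        B[i,j,k,l] += Q[m,i] * A[m,j,k,l]
    end
    return B
end

# Staged version: contract one index at a time, cycling the dimensions so the
# next index to contract is always first. After four rounds the dimensions
# are back in their original order.
function rot_staged(Q, C)
    A = C
    for _ in 1:4
        A = permutedims(contract_first(Q, A), (2, 3, 4, 1))
    end
    return A
end

Q = rand(3, 3)
C = rand(3, 3, 3, 3)
println(rot_naive(Q, C) ≈ rot_staged(Q, C))
```

The staged form replaces one loop nest with `3^8` innermost iterations by four nests with `3^5` iterations each, which is the kind of algebraic restructuring the quoted question is asking about.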

As of LoopVectorization 0.12.40, compile times should be much better here.

It would be nice to clean the code up and set up automated performance testing so I don't accidentally cause regressions. Any chance you can support a way of defining GFLOPS...

[Here](https://github.com/chriselrod/LoopVectorization.jl/blob/master/benchmarks/driver.jl#L68) is an example, although that function should be called `gflop_gemm`. I defined gflops manually for each, so doing what I wanted would just require either passing an anonymous function...
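As a rough illustration of defining GFLOPS per benchmark by passing in a function, the sketch below uses the standard `2*M*K*N` flop count for a dense matrix multiply. The name `gflop_gemm` follows the comment's suggestion, but its signature here, as well as the `measure_gflops` helper, are assumptions made for illustration and are not taken from the linked `driver.jl`.

```julia
# Billions of floating point operations for C = A * B with A of size M×K and
# B of size K×N: each of the M*N outputs needs K multiplies and K adds.
gflop_gemm(M, K, N) = 2e-9 * M * K * N

# Hypothetical driver: time `f(args...)` and convert to GFLOPS using a
# user-supplied flop-count function evaluated on `sizes`.
function measure_gflops(f, gflop_count, sizes, args...)
    f(args...)                      # warm-up call so compilation is not timed
    seconds = @elapsed f(args...)
    return gflop_count(sizes...) / seconds
end

M, K, N = 200, 200, 200
A = rand(M, K)
B = rand(K, N)
rate = measure_gflops(*, gflop_gemm, (M, K, N), A, B)
println("approx. GFLOPS: ", rate)
```

A single `@elapsed` measurement is noisy; for regression tracking, a benchmarking package with repeated samples would be a more reliable timer.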