Andres Nowak

Results: 25 comments of Andres Nowak

Not really, at least when testing with the 15m and 110m models I didn't see a difference, or the difference was too little; probably the matmul operation just takes the...

Hmm, interesting, but shouldn't vectorized already be faster with 1000 values? Or is a strided load and store not as efficient because it has to access separate parts in...
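
A rough way to see the strided-access cost outside of Mojo is a NumPy sketch (illustrative only; the sizes and the stride of 8 are arbitrary choices, not values from the thread):

```python
# Illustrative Python/NumPy analogue, not the Mojo code under discussion:
# sum the same number of float32 elements once from contiguous memory and
# once through a strided view. The strided read touches scattered cache
# lines, which is one reason a vectorized strided load/store can lose most
# of its advantage over a plain loop.
import numpy as np
import timeit

n = 1_000_000
stride = 8
contiguous = np.ones(n, dtype=np.float32)
strided = np.ones(n * stride, dtype=np.float32)[::stride]  # same element count, stride 8

t_contig = timeit.timeit(lambda: contiguous.sum(), number=200)
t_strided = timeit.timeit(lambda: strided.sum(), number=200)
print(f"contiguous: {t_contig:.4f}s  strided: {t_strided:.4f}s")
```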

Regarding the parallelize part, why would it be faster to remove it? Isn't parallelize supposedly using a cached runtime, so it shouldn't be creating threads each time it is called...
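
If parallelize does reuse a cached runtime, as this comment supposes, the cost difference it is asking about looks roughly like the sketch below. This is a Python analogue only: `ThreadPoolExecutor` stands in for the worker pool, and the work sizes are arbitrary; it is not Mojo's actual parallelize machinery.

```python
# Illustrative sketch: reusing one worker pool across calls vs. spinning up
# a new pool on every call (the per-call thread-creation cost the comment
# is asking about).
from concurrent.futures import ThreadPoolExecutor
import timeit

def work(i: int) -> int:
    return i * i

reused_pool = ThreadPoolExecutor(max_workers=8)

def with_reused_pool():
    # workers already exist; only the tasks are submitted
    list(reused_pool.map(work, range(64)))

def with_fresh_pool():
    # threads are created and torn down on every call
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(work, range(64)))

print("reused pool:", timeit.timeit(with_reused_pool, number=500))
print("fresh pool: ", timeit.timeit(with_fresh_pool, number=500))
reused_pool.shutdown()
```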

Doing some tests on a 3600x @mikowals, I only saw a difference in the 15m model when comparing rope with parallelization and without it; for 110m and tiny_llama I didn't...

Maybe I didn't understand correctly: in these benchmarks you are comparing the original implementation against the rope implementation with SIMD and without parallelization, no? What I was comparing was rope...

@mikowals Running lamatune and your isolated benchmarks, I have reached the same conclusion as you: simple vanilla for loops are faster than parallelize, vectorize, and vectorize + parallelize, for...

Hmm, but vectorize also uses for loops, no? So it should get the same optimizations, no? Or is it not applying the same optimizations?
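
Conceptually that is the case: a vectorize-style helper is itself a loop over SIMD-width chunks plus a scalar remainder, so a plain loop that the compiler auto-vectorizes can end up with essentially the same shape. A rough Python model of that loop structure (a conceptual sketch, not Mojo's actual implementation; `vectorize_like` and `body` are made-up names):

```python
# Conceptual model of what a vectorize-style helper generates: process
# `size` in chunks of `simd_width`, then handle leftovers one at a time.
def vectorize_like(body, simd_width: int, size: int) -> None:
    i = 0
    # main loop: one call per full SIMD-width chunk
    while i + simd_width <= size:
        body(simd_width, i)
        i += simd_width
    # remainder loop: scalar tail for the leftover elements
    while i < size:
        body(1, i)
        i += 1

# Example: accumulate a sum chunk by chunk.
data = [float(x) for x in range(10)]
total = 0.0

def body(width: int, offset: int) -> None:
    global total
    total += sum(data[offset:offset + width])

vectorize_like(body, simd_width=4, size=len(data))
print(total)  # 45.0
```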

On my machine, using **4 x simdwidthof** gives a **simd_width value of 32**, so I don't think the speedup is related to the first entry of nelts_list (32), but it...
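
The numbers work out if the machine exposes 256-bit AVX2 vector registers (an assumption here, though it is consistent with the 3600x mentioned earlier in the thread): simdwidthof for float32 would then be 8, and 4 × 8 = 32.

```python
# Worked arithmetic behind the comment, assuming 256-bit AVX2 registers.
register_bits = 256                           # AVX2 vector register width
float32_bits = 32
simd_width = register_bits // float32_bits    # simdwidthof[DType.float32]() -> 8
print(simd_width)                             # 8
print(4 * simd_width)                         # 32, i.e. "4 x simdwidthof" = 32
```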

No, I also change that value: if I change the nelts list to 64, I set the stack size to 64 as well, and the same code that works on the AMD...