Andres Nowak

Results: 25 comments of Andres Nowak

Not really, at least when testing with the 15m and 110m models I didn't see a difference, or the difference was too little; probably the matmul operation just takes the...

Hmm, interesting, but shouldn't vectorized already be faster with 1000 values? Or is a strided load and store not as efficient because it has to access separate parts in...
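
A rough way to see the strided-access cost outside of Mojo is a NumPy sketch (illustrative only; the sizes and the stride of 8 are arbitrary choices, not values from the thread):

```python
# Illustrative Python/NumPy analogue, not the Mojo code under discussion:
# sum the same number of float32 elements once from contiguous memory and
# once through a strided view. The strided read touches scattered cache
# lines, which is one reason a vectorized strided load/store can lose most
# of its advantage over a plain loop.
import numpy as np
import timeit

n = 1_000_000
stride = 8
contiguous = np.ones(n, dtype=np.float32)
strided = np.ones(n * stride, dtype=np.float32)[::stride]  # same element count, stride 8

t_contig = timeit.timeit(lambda: contiguous.sum(), number=200)
t_strided = timeit.timeit(lambda: strided.sum(), number=200)
print(f"contiguous: {t_contig:.4f}s  strided: {t_strided:.4f}s")
```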

Regarding the parallelize part, why would it be faster to remove it? Isn't parallelize supposedly using a cached runtime, so it shouldn't be creating threads each time it is called...
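
If parallelize does reuse a cached runtime, as this comment supposes, the cost difference it is asking about looks roughly like the sketch below. This is a Python analogue only: `ThreadPoolExecutor` stands in for the worker pool, and the work sizes are arbitrary; it is not Mojo's actual parallelize machinery.

```python
# Illustrative sketch: reusing one worker pool across calls vs. spinning up
# a new pool on every call (the per-call thread-creation cost the comment
# is asking about).
from concurrent.futures import ThreadPoolExecutor
import timeit

def work(i: int) -> int:
    return i * i

reused_pool = ThreadPoolExecutor(max_workers=8)

def with_reused_pool():
    # workers already exist; only the tasks are submitted
    list(reused_pool.map(work, range(64)))

def with_fresh_pool():
    # threads are created and torn down on every call
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(work, range(64)))

print("reused pool:", timeit.timeit(with_reused_pool, number=500))
print("fresh pool: ", timeit.timeit(with_fresh_pool, number=500))
reused_pool.shutdown()
```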

Doing some tests on a 3600x @mikowals, I only saw a difference in the 15m model when comparing rope with parallelization and without it; for 110m and tiny_llama I didn't...

Maybe I didn't understand correctly: in these benchmarks you are comparing the original implementation against the rope implementation with SIMD and without parallelization, no? What I was comparing was rope...

@mikowals Running lamatune and your isolated benchmarks, I have reached the same conclusion as you: simple vanilla for loops are faster than parallelize, vectorize, and vectorize + parallelize, for...

Hmm, but vectorize also uses for loops, no? So it should get the same optimizations, no? Or is it not applying the same optimizations?
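
Conceptually that is the case: a vectorize-style helper is itself a loop over SIMD-width chunks plus a scalar remainder, so a plain loop that the compiler auto-vectorizes can end up with essentially the same shape. A rough Python model of that loop structure (a conceptual sketch, not Mojo's actual implementation; `vectorize_like` and `body` are made-up names):

```python
# Conceptual model of what a vectorize-style helper generates: process
# `size` in chunks of `simd_width`, then handle leftovers one at a time.
def vectorize_like(body, simd_width: int, size: int) -> None:
    i = 0
    # main loop: one call per full SIMD-width chunk
    while i + simd_width <= size:
        body(simd_width, i)
        i += simd_width
    # remainder loop: scalar tail for the leftover elements
    while i < size:
        body(1, i)
        i += 1

# Example: accumulate a sum chunk by chunk.
data = [float(x) for x in range(10)]
total = 0.0

def body(width: int, offset: int) -> None:
    global total
    total += sum(data[offset:offset + width])

vectorize_like(body, simd_width=4, size=len(data))
print(total)  # 45.0
```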

On my machine, using **4 x simdwidthof** gives a **simd_width value of 32**, so I don't think the speedup is related to the first entry of nelts_list (32), but it...
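
The numbers work out if the machine exposes 256-bit AVX2 vector registers (an assumption here, though it is consistent with the 3600x mentioned earlier in the thread): simdwidthof for float32 would then be 8, and 4 × 8 = 32.

```python
# Worked arithmetic behind the comment, assuming 256-bit AVX2 registers.
register_bits = 256                           # AVX2 vector register width
float32_bits = 32
simd_width = register_bits // float32_bits    # simdwidthof[DType.float32]() -> 8
print(simd_width)                             # 8
print(4 * simd_width)                         # 32, i.e. "4 x simdwidthof" = 32
```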

No, I also change that value: if I change the nelts list to 64, I set the stack size to 64 as well, and the same code that works on the AMD...