How to fuse multiple Tullio statements?

Open D3MZ opened this issue 5 months ago • 1 comments

First of all, I love Tullio. It is very magical. I’ve reduced my codebase by 3×, it works on both CPU and GPU, and it also made it faster!

However, I’m still struggling to wrap my head around Tullio’s multi-line syntax.

How does one fuse multiple Tullio statements? This matters when writing for a GPU, because every Tullio line is a separate call issued from the CPU (a separate kernel launch).

For example, the code below calculates the means and standard deviations as two separate calls over the same window w, but ideally this could be done entirely in a single call and output a matrix:

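using CUDA, KernelAbstractions, Tullio  # load the GPU packages so @tullio can generate GPU kernels
T = CUDA.rand(1000)
w = 100
# Windowed mean: the range of i is inferred from T[i + k - 1]
means(T, w) = @tullio μ[i] := T[i + k - 1] / w (k in 1:w)
# Windowed standard deviation: `sqrt <|` applies sqrt after the sum over k
stds(T, w, μ) = @tullio σ[i] := sqrt <| (T[i + k - 1] - μ[i])^2 / w (k in 1:w)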

D3MZ avatar Jun 26 '25 16:06 D3MZ

Thanks, I'm glad it's useful! And sorry I missed this earlier.

The short answer is that you don't. @tullio is really not very smart: the macro always makes exactly one set of nested loops. It's just smart enough to pass those to KernelAbstractions.jl to run on a GPU, with no optimisations. It may be worth looking at @macroexpand1 @tullio μ[i] := T[i + k - 1] / w (k in 1:w) grad=false to see what it's doing.
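As a rough illustration of "one set of nested loops", the mean statement corresponds to something like the plain-Julia function below. This is a hand-written sketch, not the macro's actual expansion; the name `means_loops` is made up for this example, and it runs on CPU arrays only.

```julia
# Hand-written equivalent of: @tullio μ[i] := T[i + k - 1] / w (k in 1:w)
# Illustrative only; `means_loops` is a hypothetical name, not generated code.
function means_loops(T::AbstractVector, w::Integer)
    n = length(T) - w + 1              # range of i implied by T[i + k - 1]
    μ = zeros(float(eltype(T)), n)
    for i in 1:n                       # outer loop over the output index
        acc = zero(eltype(μ))
        for k in 1:w                   # the one innermost loop: the reduction index
            acc += T[i + k - 1] / w
        end
        μ[i] = acc
    end
    return μ
end
```

For instance, `means_loops([1.0, 2.0, 3.0, 4.0], 2)` averages the windows `[1, 2]`, `[2, 3]` and `[3, 4]`.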

For this example, you need to finish looping over all k to have μ[1], and only then can you start the second loop over k to find σ[1]. Tullio explicitly doesn't allow this: there cannot be two innermost loops.
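To make the dependency concrete, here is what a hand-fused version would have to look like in plain Julia: two inner loops over k for each i, the second of which needs the finished mean. This is only a sketch of the loop structure (CPU only, with the hypothetical name `window_stats_two_pass`), and it is exactly the shape a single @tullio call cannot express.

```julia
# Hypothetical hand-fused version: one outer loop, two sequential inner loops.
# Returns an n×2 matrix: column 1 holds the means, column 2 the standard deviations.
function window_stats_two_pass(T::AbstractVector, w::Integer)
    n = length(T) - w + 1
    out = zeros(float(eltype(T)), n, 2)
    for i in 1:n
        μ = zero(eltype(out))
        for k in 1:w                   # first inner loop: mean of window i
            μ += T[i + k - 1] / w
        end
        s = zero(eltype(out))
        for k in 1:w                   # second inner loop: can only start once μ is complete
            s += (T[i + k - 1] - μ)^2 / w
        end
        out[i, 1] = μ
        out[i, 2] = sqrt(s)            # population standard deviation (divides by w)
    end
    return out
end
```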

Maybe you know this, but for the standard (un-windowed) mean and standard deviation, there are one-pass algorithms, which accumulate both as they go along. IIRC you have to be a little careful about accumulating errors, but it can be done. I'm sure something like that could be done for the windowed case too.
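One possible shape for the windowed case is sketched below, using Welford's update inside each window so that the mean and the sum of squared deviations are accumulated in a single loop over k. This is plain Julia, not Tullio syntax (the update is not a simple sum, so it does not fit a single @tullio reduction), it is CPU only, and the name `window_stats_welford` is an assumption made for this example.

```julia
# Hypothetical one-pass version: Welford's algorithm applied to each window.
# Returns an n×2 matrix: column 1 holds the means, column 2 the standard deviations.
function window_stats_welford(T::AbstractVector, w::Integer)
    n = length(T) - w + 1
    F = float(eltype(T))
    out = zeros(F, n, 2)
    for i in 1:n
        μ = zero(F)                    # running mean of the window so far
        m2 = zero(F)                   # running sum of squared deviations
        for k in 1:w                   # single inner loop over the window
            x = T[i + k - 1]
            δ = x - μ
            μ += δ / k
            m2 += δ * (x - μ)          # uses the already-updated mean
        end
        out[i, 1] = μ
        out[i, 2] = sqrt(m2 / w)       # population standard deviation, matching the / w above
    end
    return out
end
```

Each window is still recomputed from scratch here; the sketch only removes the second inner loop, it does not reuse work between overlapping windows.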

mcabbott avatar Jul 30 '25 02:07 mcabbott