LoopVectorization.jl
Turbo precondition checks
The readme warns:
We expect that any time you use the @turbo macro with a given block of code that you:
- Are not indexing an array out of bounds. @turbo does not perform any bounds checking.
- Are not iterating over an empty collection. Iterating over an empty loop such as for i ∈ eachindex(Float64[]) is undefined behavior, and will likely result in out of bounds memory accesses. Ensure that loops behave correctly.
- Are not relying on a specific execution order. @turbo can and will re-order operations and loops inside its scope, so the correctness cannot depend on a particular order. You cannot implement cumsum with @turbo.
- Are not using multiple loops at the same level in nested loops.
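For testing, the first two preconditions (in-bounds indexing and a non-empty iteration range) can be checked by hand before entering the loop. Here is a minimal sketch in plain Julia. `check_loop_preconditions` is a hypothetical helper, not part of LoopVectorization.jl, and it assumes every listed array is indexed directly by the loop variable.

```julia
# Hypothetical debugging helper; not part of LoopVectorization.jl.
# Assumes each array in `arrays` is indexed as A[i] for every i in `inds`.
function check_loop_preconditions(inds::AbstractUnitRange, arrays::AbstractVector...)
    # An empty iteration range is one of the documented failure modes.
    isempty(inds) && throw(ArgumentError("loop range is empty"))
    for A in arrays
        # Base.checkbounds throws a BoundsError if any index in `inds` is invalid for A.
        checkbounds(A, inds)
    end
    return true
end

x = rand(8)
y = similar(x)
check_loop_preconditions(eachindex(x), x, y)  # passes; the loop can follow
```

Calling the helper only in test code keeps the hot path free of the extra work.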
The docs also warn:
Broadcasting an Array A when size(A,1) == 1 is NOT SUPPORTED, unless this is known at compile time (e.g., broadcasting a transposed vector is fine).
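While debugging, this condition can be tested by hand before a broadcast. The sketch below is plain Julia under stated assumptions: `runtime_singleton_dim1` and `any_runtime_singleton` are hypothetical names, and treating `Adjoint`/`Transpose` of a vector as the "known at compile time" case is my reading of the warning, not something the docs spell out.

```julia
using LinearAlgebra: Adjoint, Transpose

# Row vectors whose leading dimension is 1 by construction, visible in the type.
const StaticRow = Union{Adjoint{<:Any,<:AbstractVector},Transpose{<:Any,<:AbstractVector}}

# Hypothetical helper: true when an array's first dimension is 1 only at runtime.
runtime_singleton_dim1(A::AbstractArray) =
    !(A isa StaticRow) && ndims(A) >= 1 && size(A, 1) == 1
runtime_singleton_dim1(_) = false  # scalars and other non-array arguments

# True if any broadcast argument hits the unsupported case.
any_runtime_singleton(args...) = any(runtime_singleton_dim1, args)

A = rand(4, 3)
any_runtime_singleton(A, rand(1, 3))  # true: size(B, 1) == 1 is only known at runtime
any_runtime_singleton(A, rand(3)')    # false: transposed vector, known from the type
```

When the check returns true, an ordinary Base broadcast such as `A .+ B` is a safe, if slower, alternative.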
I assume @turbo doesn't check these preconditions at runtime for performance reasons. During testing and debugging, though, I'd like to check them. Is it possible to enable these checks in a certain mode?
I count five checks. Are some of them more costly than others?
So, the five checks are...
1. Bounds checks. Seems reasonable.
2. Already supported. Use @turbo check_empty=true for .... The overhead for this should be extremely small/hard to measure.
3. Execution order. This is harder. There are two components. The first is compile-time determinable, like x[i] = y[i] + x[i-1]. This is a work in progress, and will come eventually. The other component is only determinable locally at runtime, for example x[i] = y[i] + z[i], where x = @view(z[2:end]). This would require emitting a bunch of alias checks, making sure that the pointer to x is far enough away from the pointers to y and z.
4. The simplest approach here is just checking the sizes in front, and then calling a fallback routine that's slower but capable of handling it. I have an idea for making the fallback reasonably fast, but even a slow solution like the one FastBroadcast.jl uses would be a good improvement over crashing and should be easy to implement.
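The runtime alias check described for the execution-order case can be illustrated in plain Julia. This is only a sketch, not how LoopVectorization.jl does or would implement it. `memory_overlaps` is a hypothetical helper that assumes contiguous arrays with isbits element types. It only tests whether the two byte ranges intersect, whereas a real check would also have to account for how far apart the pointers need to be.

```julia
# Hypothetical helper; not part of LoopVectorization.jl.
# Assumes both arrays are contiguous in memory (e.g. a Vector or a unit-range view)
# and have isbits element types, so byte-range arithmetic is meaningful.
function memory_overlaps(a::AbstractVector{T}, b::AbstractVector{S}) where {T,S}
    (isempty(a) || isempty(b)) && return false
    GC.@preserve a b begin
        a_lo = UInt(pointer(a))
        a_hi = a_lo + length(a) * sizeof(T)  # one past the last byte of a
        b_lo = UInt(pointer(b))
        b_hi = b_lo + length(b) * sizeof(S)  # one past the last byte of b
        return a_lo < b_hi && b_lo < a_hi    # half-open byte ranges intersect
    end
end

z = collect(1.0:10.0)
x = @view z[2:end]     # shares memory with z, shifted by one element
y = ones(length(x))    # separate allocation

memory_overlaps(x, z)  # true
memory_overlaps(x, y)  # false
```

A test suite could assert that the destination of a loop does not overlap its inputs before running the vectorized version.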
If you or someone else are willing to work on any of these, I'd be happy to provide instructions/guidance and answer any questions.
I think (5.) is "Broadcasting an Array A when size(A,1) == 1".
I wonder if it would make sense for these checks to be toggled globally rather than at each location. Or if a @safeturbo would make sense. (Or even @unsafeturbo, with @turbo being the safe one.)
I think (5.) is "Broadcasting an Array A when size(A,1) == 1".
That was 4. above. In that case, what's 4?
Maybe module-specific toggles, but I'm not a fan of global ones, where code at one location can change the behavior of code somewhere else. I agree that a single argument (with another macro as an alias) that turns all the runtime checks on would make sense.
The remaining one is, "Are not using multiple loops at the same level in nested loops."