LoopVectorization.jl

Turbo precondition checks

Open jtrakk opened this issue 4 years ago • 4 comments

The readme warns

We expect that any time you use the @turbo macro with a given block of code that you:

  1. Are not indexing an array out of bounds. @turbo does not perform any bounds checking.
  2. Are not iterating over an empty collection. Iterating over an empty loop such as for i ∈ eachindex(Float64[]) is undefined behavior, and will likely result in the out of bounds memory accesses. Ensure that loops behave correctly.
  3. Are not relying on a specific execution order. @turbo can and will re-order operations and loops inside its scope, so the correctness cannot depend on a particular order. You cannot implement cumsum with @turbo.
  4. Are not using multiple loops at the same level in nested loops.

The docs also warn

Broadcasting an Array A when size(A,1) == 1 is NOT SUPPORTED, unless this is known at compile time (e.g., broadcasting a transposed vector is fine).

I assume @turbo can't check these preconditions at runtime for performance reasons. During testing and debugging, though, I want to check them. Is it possible to enable these checks in a certain mode?

I count five checks. Are some of them more costly than others?

jtrakk avatar Jun 19 '21 21:06 jtrakk
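The two cheapest preconditions in the question above (in-bounds indexing and a non-empty loop) can be checked by hand in front of a loop. The following is a minimal sketch in plain Julia, assuming a single contiguous index range; `check_loop_preconditions` and `checked_sum` are hypothetical names for illustration and are not part of LoopVectorization.jl, and the `@inbounds` loop is only a stand-in for a `@turbo` loop.

```julia
# Hypothetical helper (not a LoopVectorization.jl API): validate a loop range
# against the arrays it indexes before an unchecked loop runs.
function check_loop_preconditions(range::AbstractUnitRange, arrays::AbstractArray...)
    # Precondition: the loop must not be empty.
    isempty(range) && throw(ArgumentError("loop range is empty"))
    # Precondition: every index in the range must be valid for every array.
    # `checkbounds` accepts a range, so this is one call per array.
    for A in arrays
        checkbounds(A, range)
    end
    return nothing
end

function checked_sum(x::AbstractVector{Float64})
    r = eachindex(x)
    check_loop_preconditions(r, x)
    s = 0.0
    @inbounds for i in r   # stand-in for an unchecked `@turbo` loop
        s += x[i]
    end
    return s
end
```

Both checks run once before the loop rather than once per iteration, which is why they would be expected to cost little compared with the loop itself.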

So, the five checks are...

  1. Bounds checks. Seems reasonable.
  2. Already supported. Use @turbo check_empty=true for .... The overhead for this should be extremely small and hard to measure.
  3. Execution order. This is harder, and it has two components. The first is determinable at compile time, like x[i] = y[i] + x[i-1]. This is a work in progress, and will come eventually. The second is determinable only at runtime, for example x[i] = y[i] + z[i], where x = @view(z[2:end]). This would require emitting a bunch of alias checks to make sure that the pointer to x is far enough away from the pointers to y and z.
  4. The simplest approach here is to check the sizes up front and then call a fallback routine that's slower but capable of handling it. I have an idea for making the fallback reasonably fast, but even a slow solution like the one FastBroadcast.jl uses would be a good improvement over crashing and should be easy to implement.
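The runtime alias check described in item 3 could look roughly like the sketch below. It is written in plain Julia and is deliberately more conservative than "far enough away": it reports any overlap between the memory of two unit-stride vectors. `byte_range`, `may_alias`, and `safe_to_reorder` are hypothetical names for illustration, not functions from LoopVectorization.jl.

```julia
# Byte interval [lo, hi) occupied by a contiguous vector of bits-type elements.
function byte_range(x::StridedVector{T}) where {T}
    isbitstype(T) || throw(ArgumentError("element type must be a bits type"))
    stride(x, 1) == 1 || throw(ArgumentError("this sketch handles unit-stride vectors only"))
    lo = UInt(pointer(x))
    return lo, lo + UInt(length(x) * sizeof(T))
end

# Conservative test: true whenever the two byte intervals overlap at all.
function may_alias(x::StridedVector, y::StridedVector)
    xlo, xhi = byte_range(x)
    ylo, yhi = byte_range(y)
    return xlo < yhi && ylo < xhi
end

# For x[i] = y[i] + z[i], reordering is only considered safe here when the
# destination shares no memory with either source.
safe_to_reorder(x, y, z) = !may_alias(x, y) && !may_alias(x, z)

z = collect(1.0:6.0)
x = @view z[2:end]      # the aliasing case from the example above
y = ones(5)
safe_to_reorder(x, y, z)   # false, because x overlaps z
```

A check like this needs one comparison per pair of arrays, so its cost grows with the number of arrays in the loop rather than with the loop length.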

If you or someone else are willing to work on any of these, I'd be happy to provide instructions/guidance and answer any questions.

chriselrod avatar Jun 19 '21 21:06 chriselrod
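The "check the sizes up front, then call a fallback" idea from item 4 can be sketched as follows. This is only an illustration in plain Julia: `broadcast_path` and `guarded_add!` are hypothetical names, the fallback is Base broadcasting rather than anything FastBroadcast.jl provides, and the explicit loop is a stand-in for the fast vectorized path.

```julia
# Decide which path to take: fall back when an argument has a first dimension
# of length 1 that would have to be expanded to match the destination.
function broadcast_path(dest::AbstractMatrix, args::AbstractMatrix...)
    needs_expansion = size(dest, 1) != 1 && any(A -> size(A, 1) == 1, args)
    return needs_expansion ? :fallback : :fast
end

function guarded_add!(dest::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    if broadcast_path(dest, A, B) === :fallback
        # Slower but safe: Base broadcasting handles size-1 dimensions.
        dest .= A .+ B
    else
        # Stand-in for the fast path; this sketch only supports equal shapes here.
        axes(A) == axes(B) == axes(dest) ||
            throw(DimensionMismatch("fast path in this sketch needs equal shapes"))
        @inbounds for I in eachindex(dest, A, B)
            dest[I] = A[I] + B[I]
        end
    end
    return dest
end
```

The size test runs once per call, so the fast path pays only a few integer comparisons for the guarantee that the unsupported case never reaches it.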

I think (5.) is "Broadcasting an Array A when size(A,1) == 1".

I wonder if it would make sense for these checks to be toggled globally rather than at each location. Or if a @safeturbo would make sense. (Or even @unsafeturbo, with @turbo being the safe one.)

jtrakk avatar Jun 19 '21 22:06 jtrakk

> I think (5.) is "Broadcasting an Array A when size(A,1) == 1".

That was 4. above. In that case, what's 4?

Maybe module-specific toggles, but I'm not a fan of global ones, where code at one location can change the behavior of code somewhere else. I agree that a single argument (with another macro as an alias) that turns all the runtime checks on would make sense.

chriselrod avatar Jun 19 '21 23:06 chriselrod
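The "single argument, with another macro as an alias" design suggested above might be shaped like the sketch below. Everything here is hypothetical: `@guarded`, `@safeguarded`, and the `checks` option are invented for illustration and do not exist in LoopVectorization.jl. The unchecked branch uses `@inbounds` as a stand-in for `@turbo`, and the only extra check shown is the empty-range test.

```julia
# Build the expression for a loop, with or without runtime checks.
function guarded_expr(loop, checks::Bool)
    (loop isa Expr && loop.head === :for) || error("expected a `for` loop")
    iter = loop.args[1]
    (iter isa Expr && iter.head === :(=)) ||
        error("this sketch handles a single loop variable only")
    range = iter.args[2]
    if checks
        # Checked mode: reject empty ranges and leave bounds checks enabled.
        # Note that the range expression is evaluated twice in this sketch.
        return quote
            isempty($range) && throw(ArgumentError("loop range is empty"))
            $loop
        end
    else
        # Unchecked mode: stand-in for the fast path.
        return :(@inbounds $loop)
    end
end

# `@guarded checks=true for ... end` turns every check on at this call site.
macro guarded(args...)
    isempty(args) && error("expected a `for` loop")
    checks = false
    for opt in args[1:end-1]
        if opt isa Expr && opt.head === :(=) && opt.args[1] === :checks && opt.args[2] isa Bool
            checks = opt.args[2]
        else
            error("unsupported option: $opt")
        end
    end
    return esc(guarded_expr(args[end], checks))
end

# Alias macro that always enables the checks.
macro safeguarded(loop)
    return esc(guarded_expr(loop, true))
end

function sum_fast(x)
    s = zero(eltype(x))
    @guarded for i in eachindex(x)
        s += x[i]
    end
    return s
end

function sum_checked(x)
    s = zero(eltype(x))
    @safeguarded for i in eachindex(x)
        s += x[i]
    end
    return s
end
```

Because the option is resolved at macro expansion time, each call site chooses its own mode and no global state is involved, which matches the concern about code in one location changing behavior somewhere else.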

The remaining one is, "Are not using multiple loops at the same level in nested loops."

jtrakk avatar Jun 20 '21 00:06 jtrakk