
@turbo not doing LoopVectorization on Float64

Open Lincoln-Hannah opened this issue 10 months ago • 9 comments

It seems to fail for Float64 and AbstractFloat but works for Real and more general argument types. Also, should it be able to work on structs, as in the X1 example at the bottom?

X    = randn(50_000_000)

float64(   x::Float64       ) = sin(x)
abs_float( x::AbstractFloat ) = sin(x)
real(      x::Real          ) = sin(x)
any(       x::Any           ) = sin(x)

@turbo float64.(X)      # `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead
@turbo abs_float.(X)    # `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead
@turbo real.(X)         # works
@turbo any.(X)          # works

struct X1; x::Float64 end
fx1((;x)::X1) = sin(x)
x1 = repeat([X1(1.0)],1000_000)

@turbo fx1.( x1 )       # `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead

Lincoln-Hannah avatar Jan 06 '25 22:01 Lincoln-Hannah

LoopVectorization works by passing AbstractSIMD objects to your functions. If your signature only accepts ::Float64 or ::AbstractFloat, that will throw a MethodError. Hence, check_args fails.

This is a limitation of how the package is implemented.
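
A minimal sketch of what this means in practice (illustrative function names; it assumes VectorizationBase.Vec can be built directly from scalars):

using LoopVectorization
const VB = LoopVectorization.VectorizationBase

v = VB.Vec(1.0, 2.0)         # a 2-wide SIMD vector of Float64

takes_real(    x::Real    ) = sin(x)
takes_float64( x::Float64 ) = sin(x)

takes_real(v)        # works: Vec{2,Float64} <: Real
takes_float64(v)     # MethodError: a Vec is not a Float64, so check_args fails and @turbo falls back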

chriselrod avatar Jan 06 '25 23:01 chriselrod

Is there a way to explicitly define a function for a Float64 SIMD object? I tried the below, but it doesn't work.

f( x ::LoopVectorization.VectorizationBase.AbstractSIMD{8, Float64} ) = sin(x)
vmap( f, X )

I'd like to run vmap on structs as in the example.

Since Julia is so much about structs and performance, I would have thought this would be supported.

Lincoln-Hannah avatar Jan 06 '25 23:01 Lincoln-Hannah

Ah, yeah, I was lazy in the implementation: it just checks whether the function works for Vec{2,Int}, not the type it will actually be used with. https://github.com/JuliaSIMD/LoopVectorization.jl/blob/cdd9ef1c0bb42a25e11b770d4a85c1324d9d2b7a/src/condense_loopset.jl#L997

Most of the functions are either fully generic, or only work with integers, so AFAIK no one noticed. This does, however, prevent you from being able to add restrictions like

@inline f(x::LoopVectorization.VectorizationBase.AbstractSIMD{N, Float64}) where {N} = sin(x)

Any particular motivation for not supporting Int? You may as well add an Int implementation that calls f(float(x)), or something like that.
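
For example, something along these lines (an untested sketch; it assumes float promotes an integer SIMD vector to a Float64 vector, as it does for scalar integers):

using LoopVectorization
const AbstractSIMD = LoopVectorization.VectorizationBase.AbstractSIMD

@inline f(x::AbstractSIMD{N,Float64}) where {N} = sin(x)
# integer SIMD inputs (e.g. the Vec{2,Int} probe used by check_args) get promoted to Float64
@inline f(x::AbstractSIMD{N,<:Integer}) where {N} = f(float(x))
@inline f(x::Real) = sin(float(x))    # scalar fallback for the serial path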

Alternatively, make a PR that runs a naive type inference pass (avoiding base Julia's type inference, which we can do because LV's IR is not generic) on LV's internal IR to infer the correct types, and use that in check_args. This seemed like more effort than it was worth to me.

I have not had much time to work on performance lately, but my future work has all moved to LoopModels (which will not have any of these problems, due to working at a lower level).

chriselrod avatar Jan 07 '25 03:01 chriselrod

I was trying to understand the basics. Ultimately I'd like to use vmap on StructArrays or Vectors of structs, like I can on the GPU. In the example below, running on AMDGPU gives a 100x speedup, but vmap doesn't give any; I assume it's not using SIMD.

using AMDGPU, LoopVectorization, StructArrays

struct X10
    a::Float64
    b::Float64
end

f((;a,b)::X10) = 2sin(a)*cos(a) + 3sin(b)*cos(b) + sin(a*b)*cos(a/b)

A   = randn(10^7)
B   = randn(10^7)
rA  = ROCArray(A)
rB  = ROCArray(B)

x10  = StructArray{X10}(a=A,b=B)
rx10 = StructArray{X10}(a=rA,b=rB);

@time f.(x10)           # .4   seconds
@time vmap( f, x10 )    # .4   seconds 
@time f.(rx10)          # .004 seconds

Lincoln-Hannah avatar Jan 07 '25 03:01 Lincoln-Hannah

A more general use case: a portfolio of financial products, with thousands of trades and dozens of product types. Each product type has a struct and a value function that takes the struct as input. All the trades of a product type could be stored as a StructArray, and for each product type I'd like to run the valuations vectorized. GPU is fast but has other disadvantages, so I was hoping vmap or @tturbo might be a good substitute.

Lincoln-Hannah avatar Jan 07 '25 03:01 Lincoln-Hannah

You could get it to work with some manual effort.

Or, you can make a PR to add explicit StructArrays support by wrapping vmap. Basically, you'd decompose the StructArray into a set of vectors and call a wrapper function that rebuilds a struct, but with SIMD types as its fields. It might have to be a NamedTuple, because it can't be an X10 that only accepts scalar fields. This would still require a reasonably generic function: it would have to accept the NamedTuple in place of an X10, or, if X10 is parameterized to accept either scalars or SIMD types, it would have to accept both parameterizations (and the wrapping function would have to do the right thing).
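
As a rough sketch of the decomposition, using a positional kernel rather than rebuilding a NamedTuple (vmap_fields and g are illustrative names, not an existing API):

using LoopVectorization, StructArrays

# generic kernel over the field values; works for Float64 scalars and for SIMD vectors alike
g(a, b) = 2sin(a)*cos(a) + 3sin(b)*cos(b) + sin(a*b)*cos(a/b)

# decompose a StructArray into its field vectors and vmap across them
vmap_fields(f, sa::StructArray) = vmap(f, StructArrays.components(sa)...)

out = vmap_fields(g, x10)    # x10 as defined above; same result as f.(x10), but SIMD-vectorized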

Note that I no longer use Julia, nor do I maintain LoopVectorization or the JuliaSIMD ecosystem, so I'm unlikely to make any of these changes myself. I can, however, answer questions or describe approaches, so that those who do have time and stand to benefit can take over.

chriselrod avatar Jan 07 '25 17:01 chriselrod

Any help is appreciated.

If I understand correctly, you're saying any solution would involve rewriting the structs and function inputs. Funny, since it works seamlessly with GPU arrays; I guess the SIMD hardware is more restrictive. Sorry to hear you're leaving Julia.

Lincoln-Hannah avatar Jan 08 '25 03:01 Lincoln-Hannah

It's not about the hardware, but that LoopVectorization.jl's implementation is bad/limited.

This is the kind of thing LoopModels would fix.

chriselrod avatar Jan 08 '25 10:01 chriselrod

I hope they build it (I don't think I'm smart enough to help there) :)

Lincoln-Hannah avatar Jan 08 '25 21:01 Lincoln-Hannah