LoopVectorization.jl
@turbo not doing LoopVectorization on Float64
It seems to fail for ::Float64 and ::AbstractFloat argument types but works for ::Real and more general ones. Also, should it be able to work on structs, as in the X1 example at the bottom?
using LoopVectorization

X = randn(50_000_000)
float64(x::Float64) = sin(x)
abs_float(x::AbstractFloat) = sin(x)
real(x::Real) = sin(x)
any(x::Any) = sin(x)
@turbo float64.(X)   # warning: `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead
@turbo abs_float.(X) # same fallback warning
@turbo real.(X)      # works
@turbo any.(X)       # works
struct X1; x::Float64 end
fx1((;x)::X1) = sin(x)
x1 = repeat([X1(1.0)], 1_000_000)
@turbo fx1.(x1)      # same fallback warning: `LoopVectorization.check_args` failed
It works by passing AbstractSIMD objects to your functions.
If a signature only accepts ::Float64 or ::AbstractFloat, calling it with such an object throws a MethodError, and hence check_args fails.
This is a limitation of how the package is implemented.
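For illustration, a minimal sketch of what happens (the vector width of 2 is arbitrary; @turbo picks the width for your hardware):

using VectorizationBase: Vec

g(x::Float64) = sin(x)
v = Vec(1.0, 2.0)   # the kind of SIMD value @turbo substitutes for x
g(v)                # MethodError: no method matching g(::Vec{2, Float64})

h(x) = 2x           # a generic signature accepts the SIMD value
h(v)                # works, returning a Vec{2, Float64}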
Is there a way to explicitly define a function for a Float64 SIMD object? I tried the below, but it doesn't work.
f(x::LoopVectorization.VectorizationBase.AbstractSIMD{8, Float64}) = sin(x)
vmap(f, X)
I'd like to run vmap on structs, as in the example.
Since Julia is so focused on structs and performance, I would have thought this would be supported.
Ah, yeah, I was lazy in the implementation and just checked whether it works for Vec{2,Int}, not the type it will actually be called with.
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/cdd9ef1c0bb42a25e11b770d4a85c1324d9d2b7a/src/condense_loopset.jl#L997
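Roughly, that check probes whether the function can be called with a Vec{2,Int} argument. A paraphrase of the idea (not the actual source; probe_works is a hypothetical name):

using VectorizationBase: Vec

# if inference of f(::Vec{2,Int}) yields Union{}, check_args rejects f
# and @turbo falls back to the @inbounds @fastmath loop
probe_works(f) = Base.promote_op(f, Vec{2,Int}) !== Union{}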
Most of the functions are either fully generic or only work with integers, so AFAIK no one noticed. This does, however, prevent you from adding restrictions like
@inline f(x::LoopVectorization.VectorizationBase.AbstractSIMD{N, Float64}) where {N} = sin(x)
Any particular motivation for not supporting Int? You may as well add an Int implementation that calls f(float(x)), or something like that.
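For example, a sketch along those lines (it assumes float is defined for integer SIMD vectors, as it is for scalars; adding the Integer method would also let the Vec{2,Int} probe succeed):

const VB = LoopVectorization.VectorizationBase
@inline f(x::VB.AbstractSIMD{N, Float64}) where {N} = sin(x)
@inline f(x::VB.AbstractSIMD{N, <:Integer}) where {N} = f(float(x))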
Alternatively, make a PR that runs a naive type inference pass (avoiding Base's type inference, which we can do because we aren't being generic) on LV's internal IR to infer the correct types, and use those in check_args.
That seemed like more effort than it was worth to me.
I have not had much time to work on performance lately, and my future work has all moved to LoopModels (which will not have any of these problems, because it works at a lower level).
I was trying to understand the basics. Ultimately, I'd like to use vmap on StructArrays or Vectors of structs, like I can on a GPU.
In the example below, running on AMDGPU gives a 100x speedup, but vmap gives none; I assume it's not using SIMD.
using AMDGPU, LoopVectorization, StructArrays

struct X10
    a::Float64
    b::Float64
end
f((;a, b)::X10) = 2sin(a)*cos(a) + 3sin(b)*cos(b) + sin(a*b)*cos(a/b)
A = randn(10^7)
B = randn(10^7)
rA = ROCArray(A)
rB = ROCArray(B)
x10 = StructArray{X10}(a=A, b=B)
rx10 = StructArray{X10}(a=rA, b=rB);
@time f.(x10)       # 0.4 seconds
@time vmap(f, x10)  # 0.4 seconds, no speedup
@time f.(rx10)      # 0.004 seconds
A more general use case: a portfolio of financial products, with thousands of trades and dozens of product types. Each product type has a struct and a valuation function that takes the struct as input, and all the trades of a product type could be stored in a StructArray. For each product type, I'd like to run the valuations vectorized. GPU is fast but has other disadvantages, so I was hoping vmap or @tturbo might be a good substitute.
You could get it to work with some manual effort.
Or, you could make a PR that adds explicit StructArrays support by wrapping vmap.
Basically, you'd decompose the StructArray into its component vectors and call a wrapper function that rebuilds the struct, but with SIMD types as its fields.
It may have to be a NamedTuple, since it can't be an X10, which only accepts scalar fields.
This still requires a reasonably generic function: it has to accept the NamedTuple in place of an X10, or, if X10 is parameterized to accept either scalars or SIMD types, it has to accept both parameterizations (and the wrapping function has to do the right thing). A sketch follows.
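A minimal sketch of such a wrapper (vmap_structs is a hypothetical name, not an existing API; it assumes StructArrays.components for the decomposition):

using LoopVectorization, StructArrays

function vmap_structs(g, sa::StructArray)
    cols = StructArrays.components(sa)  # NamedTuple of the field vectors
    ks = keys(cols)
    # rebuild each element as a NamedTuple of (possibly SIMD) field values
    vmap((vals...) -> g(NamedTuple{ks}(vals)), values(cols)...)
end

# g must accept the NamedTuple in place of an X10:
g((;a, b)) = 2sin(a)*cos(a) + 3sin(b)*cos(b) + sin(a*b)*cos(a/b)
vmap_structs(g, x10)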
Note that I no longer use Julia, nor do I maintain LoopVectorization or the JuliaSIMD ecosystem, so I'm unlikely to make any of these changes myself. I can, however, answer questions or describe approaches, so that those who do have time and stand to benefit can take over.
Any help is appreciated.
If I understand correctly, you're saying any solution would involve rewriting the structs and function inputs. Funny, since it works seamlessly with GPU arrays; I guess the SIMD hardware is more restrictive. Sorry to hear you're leaving Julia.
It's not about the hardware, but about LoopVectorization.jl's implementation being bad/limited.
This is the sort of thing LoopModels would fix.
I hope they build it (I don't think I'm smart enough to help there) :)