
Scalar Calculation?

Open aminya opened this issue 5 years ago • 2 comments

Currently the library only supports calculations on an Array, and it returns an Array.

It may be worthwhile to define scalar methods too.

julia> IVM.sin(1.1)
ERROR: MethodError: no method matching sin(::Float64)
You may have intended to import Base.sin
Closest candidates are:
  sin(::Array{Float32,N} where N) at C:\Users\yahyaaba\.julia\packages\IntelVectorMath\Gb348\src\setup.jl:72
  sin(::Array{Float64,N} where N) at C:\Users\yahyaaba\.julia\packages\IntelVectorMath\Gb348\src\setup.jl:72
Stacktrace:
 [1] top-level scope at none:0

This way we would use Intel only to compute a single scalar value, which (if possible) lets us fuse for-loops with broadcasted functions and rely on Julia's @avx or @simd features for parallelization instead.
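A minimal sketch of what that fusion could look like. `scalar_sin` is a hypothetical stand-in for a scalar IVM method (here it simply calls `Base.sin`), and `fused!` and `looped!` are illustrative names, not part of the package. Whether the `@simd` loop is actually vectorized depends on the compiler being able to inline and vectorize the scalar function.

```julia
# Hypothetical stand-in for a scalar IVM method; here it just calls Base.sin.
scalar_sin(x::Float64) = sin(x)

# Fused broadcast: a single pass over `a` and `b`,
# with no temporary array allocated for the sine values.
function fused!(out, a, b)
    out .= a .* scalar_sin.(b) .+ 1.0
    return out
end

# The same computation as an explicit loop annotated with @simd.
function looped!(out, a, b)
    @inbounds @simd for i in eachindex(out, a, b)
        out[i] = a[i] * scalar_sin(b[i]) + 1.0
    end
    return out
end

a = collect(range(0.0, 1.0; length = 8))
b = collect(range(1.0, 2.0; length = 8))

# Both variants compute the same values.
fused!(similar(a), a, b) ≈ looped!(similar(a), a, b)
```

With an Array-only method, the sine would have to be computed into its own array first, so it could not be fused into the surrounding expression like this.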

We should check whether Intel provides a scalar API. If it only provides a vector API, and the function call uses the vector processing unit of the CPU, we cannot parallelize the function. That would be like vectorizing an already vectorized function (with a size of 1), which has no effect.
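To make the concern concrete, here is a sketch of the fallback we would be left with if only a vector API exists. `vector_sin` is a hypothetical stand-in for a vector-only kernel (implemented here with `Base.sin`), and `scalar_via_vector` is an illustrative name, not an existing function.

```julia
# Hypothetical stand-in for a vector-only kernel: it only accepts an Array.
vector_sin(x::Array{Float64}) = map(sin, x)

# Scalar fallback: wrap the value in a 1-element array,
# call the vector kernel, then unwrap the result.
scalar_via_vector(x::Float64) = vector_sin([x])[1]

# Gives the same value as the scalar function, but goes through
# a temporary array on every call.
scalar_via_vector(1.1) == sin(1.1)
```

A wrapper like this makes `IVM.sin(1.1)` callable, but each call still goes through the vector path with a length of 1, so it would not give an outer @simd loop anything to vectorize.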

Related to https://github.com/JuliaMath/IntelVectorMath.jl/issues/43, which can help to implement the 3rd macro.

This can also solve https://github.com/JuliaMath/IntelVectorMath.jl/issues/22, by using Intel only for the scalar call and providing SVML-like behavior using @avx or @simd.

Places to look into:

  • https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-intrinsics-for-short-vector-math-library-operations
  • https://software.intel.com/sites/landingpage/IntrinsicsGuide/#
  • http://openpowerfoundation.org/wp-content/uploads/resources/Vector-Intrinsics-4/content/sec_packed_vs_scalar_intrinsics.html
  • https://stackoverflow.com/questions/37290544/is-there-an-intrinsic-instruction-for-resulti-ak-sinbk-ci-dk

aminya, Mar 04 '20 00:03

Intriguing. I suppose a few tests would be necessary to see if the speed is comparable with base.

Crown421, Mar 04 '20 01:03

> Intriguing. I suppose a few tests would be necessary to see if the speed is comparable with base.

We should check whether Intel provides a scalar API. If it only provides a vector API, and the function call uses the vector processing unit of the CPU, we cannot parallelize the function. That would be like vectorizing an already vectorized function (with a size of 1), which has no effect.

aminya, Mar 04 '20 01:03