Vectorized interface

Open ivan-pi opened this issue 2 years ago • 1 comments

It could be nice to support a vector-instruction-friendly interface. There is an open issue for this in https://github.com/scipy/scipy/issues/7242. The purpose is to solve a family of problems:

$$f(x_k,p_k) = 0, \qquad k = 1, ..., N$$

Ideally, the $p_k$ are somehow related in their physical origin (e.g. a spatial field), so the convergence behavior locally in terms of $k$ will be similar. For root-finding on multi-dimensional parameter grids, e.g. $p_{j,k}$, one can use the Fortran pointer-remapping feature to make the problems 1-D.

I did some benchmarking of zeroin recently (discussion here). The code runs in scalar mode for the most part. The complex control flow prevents vectorization (view on godbolt; code taken from netlib):

$ gfortran-13 -c -O3 -march=haswell -fopt-info-missed zeroin.f
zeroin.f:124:72: missed:   not inlinable: zeroin/0 -> __builtin_copysign/1, function body not available
zeroin.f:60:72: missed: couldn't vectorize loop
zeroin.f:60:72: missed: not vectorized: control flow in loop.
zeroin.f:62:10: missed: couldn't vectorize loop
zeroin.f:62:10: missed: not vectorized: control flow in loop.
zeroin.f:125:72: missed: statement clobbers memory: fb_113 = f_86(D) (&b);
zeroin.f:53:72: missed: statement clobbers memory: fa_88 = f_86(D) (&a);
zeroin.f:54:72: missed: statement clobbers memory: fb_90 = f_86(D) (&b);
zeroin.f:125:72: missed: statement clobbers memory: fb_113 = f_86(D) (&b);

One may need to write the implementation in terms of "chunks" with reductions, which would then map to SIMD registers.

I'm not sure how the vector callback interface would look like, and what is the natural way to pass the parameters in a way that makes it SIMD-friendly:

integer, parameter :: vlen = 4  ! or 8 (depending on real kind and instruction set)

! Explicit length
function vf(x,k)
   real, intent(in) :: x
   integer, intent(in) :: k
   real :: vf(vlen)

   ! using loop
   !$omp simd
   do j = 1, vlen
      vf(j) = ... ! F(x,p(k + j)), p imported from host scope
   end do

   ! or array expressions
   vf = ... ! F(x,p(k:k+vlen)), p imported from host scope
end function

! Scalar callback vectorized using OpenMP
function f(x,k)
!$omp declare simd uniform(x) linear(k: 1)
  real, intent(in) :: x
  integer, intent(in) :: k   ! index i needed to index into parameter array
  real :: f
  f = ... ! F(x,p(k)), p imported from host scope
end function

I suppose you could also do a callback of the form:

subroutine root_callback(x,lb,ub,y)
real, intent(in) :: x
integer, intent(in) :: lb, ub  ! indexes needed to reference into parameter array
real, intent(out) :: y(lb:ub)

! user writes the loop
!$omp simd
do k = lb, ub
  y(k) = ... ! F(x,p(k)), p imported from host scope
end do

! or using array expressions
associate (p => params(lb:ub))
y = ... ! F(x,p), params imported from host scope
end associate

end subroutine

This way the SIMD length could be left to the program logic, and it would make it easier to handle peel/remainder loops. A more Fortranic way would be to use do concurrent (assuming compilers will do the right thing).

Sep 04 '23 10:09 ivan-pi

I've made a small example on Godbolt, using the function:

$$f(x,p_k) = x \sin(x) - p_k$$

The assembly for the scalar function is:

test_mp_f_:
        push      rbx                                           #24.10
        sub       rsp, 16                                       #24.10
        mov       rbx, rsi                                      #24.10
        vmovss    xmm0, DWORD PTR [rdi]                         #29.3
        vmovss    DWORD PTR [rsp], xmm0                         #29.3[spill]
        call      sinf                                          #29.9
        movsxd    rax, DWORD PTR [rbx]                          #29.8
        vmovss    xmm1, DWORD PTR [rsp]                         #30.1[spill]
        vfmsub213ss xmm1, xmm0, DWORD PTR [-4+test_mp_p_+rax*4] #30.1
        vmovaps   xmm0, xmm1                                    #30.1
        add       rsp, 16                                       #30.1
        pop       rbx                                           #30.1
        ret

It is 13 instructions to process one element.

The assembly of the vector function with explicit length is:

test_mp_vf_:
        push      rbp                                           #33.10
        mov       rbp, rsp                                      #33.10
        and       rsp, -32                                      #33.10
        push      r12                                           #33.10
        push      rbx                                           #33.10
        sub       rsp, 16                                       #33.10
        vmovss    xmm0, DWORD PTR [rsi]                         #40.17
        vbroadcastss xmm1, xmm0                                 #40.7
        vmovups   XMMWORD PTR [rsp], xmm1                       #40.7[spill]
        mov       rbx, QWORD PTR [rdi]                          #40.7
        movsxd    r12, DWORD PTR [rdx]                          #40.16
        call      sinf                                          #40.17
        vbroadcastss xmm2, xmm0                                 #40.17
        vmovups   xmm1, XMMWORD PTR [rsp]                       #40.7[spill]
        vfmsub213ps xmm2, xmm1, XMMWORD PTR [test_mp_p_+r12*4]  #40.7
        vmovups   XMMWORD PTR [rbx], xmm2                       #40.7
        add       rsp, 16                                       #42.1
        pop       rbx                                           #42.1
        pop       r12                                           #42.1
        mov       rsp, rbp                                      #42.1
        pop       rbp                                           #42.1
        ret

Notably, it uses the vfmsub213ps (fused multiply-subtract, packed single). It is 22 instructions, but it updates 4 elements simultaneously. So we have reduced the instruction count per function evaluations to 42 % of scalar mode, and as long as the 4 values show similar convergence toward the root. If we use AVX (ymm registers), the instruction count is reduced to 21 % versus scalar mode, and if AVX-512 (zmm registers) is used, to 11 % percent. The throughput is of course ideally increased by factors of 4, 8, 16.

Sep 04 '23 12:09 ivan-pi