Vectorized interface
It could be nice to support a vector-instruction-friendly interface. There is an open issue for this in https://github.com/scipy/scipy/issues/7242. The purpose is to solve a family of problems:
$$f(x_k,p_k) = 0, \qquad k = 1, ..., N$$
Ideally, the $p_k$ are somehow related in their physical origin (e.g. a spatial field), so the convergence behavior locally in terms of $k$ will be similar. For root-finding on multi-dimensional parameter grids, e.g. $p_{j,k}$, one can use the Fortran pointer-remapping feature to make the problems 1-D.
I did some benchmarking of zeroin recently (discussion here). The code runs in scalar mode for the most part. The complex control flow prevents vectorization (view on godbolt; code taken from netlib):
$ gfortran-13 -c -O3 -march=haswell -fopt-info-missed zeroin.f
zeroin.f:124:72: missed: not inlinable: zeroin/0 -> __builtin_copysign/1, function body not available
zeroin.f:60:72: missed: couldn't vectorize loop
zeroin.f:60:72: missed: not vectorized: control flow in loop.
zeroin.f:62:10: missed: couldn't vectorize loop
zeroin.f:62:10: missed: not vectorized: control flow in loop.
zeroin.f:125:72: missed: statement clobbers memory: fb_113 = f_86(D) (&b);
zeroin.f:53:72: missed: statement clobbers memory: fa_88 = f_86(D) (&a);
zeroin.f:54:72: missed: statement clobbers memory: fb_90 = f_86(D) (&b);
zeroin.f:125:72: missed: statement clobbers memory: fb_113 = f_86(D) (&b);
One may need to write the implementation in terms of "chunks" with reductions, which would then map to SIMD registers.
I'm not sure how the vector callback interface would look like, and what is the natural way to pass the parameters in a way that makes it SIMD-friendly:
integer, parameter :: vlen = 4 ! or 8 (depending on real kind and instruction set)
! Explicit length
function vf(x,k)
real, intent(in) :: x
integer, intent(in) :: k
real :: vf(vlen)
! using loop
!$omp simd
do j = 1, vlen
vf(j) = ... ! F(x,p(k + j)), p imported from host scope
end do
! or array expressions
vf = ... ! F(x,p(k:k+vlen)), p imported from host scope
end function
! Scalar callback vectorized using OpenMP
function f(x,k)
!$omp declare simd uniform(x) linear(k: 1)
real, intent(in) :: x
integer, intent(in) :: k ! index i needed to index into parameter array
real :: f
f = ... ! F(x,p(k)), p imported from host scope
end function
I suppose you could also do a callback of the form:
subroutine root_callback(x,lb,ub,y)
real, intent(in) :: x
integer, intent(in) :: lb, ub ! indexes needed to reference into parameter array
real, intent(out) :: y(lb:ub)
! user writes the loop
!$omp simd
do k = lb, ub
y(k) = ... ! F(x,p(k)), p imported from host scope
end do
! or using array expressions
associate (p => params(lb:ub))
y = ... ! F(x,p), params imported from host scope
end associate
end subroutine
This way the SIMD length could be left to the program logic, and it would make it easier to handle peel/remainder loops. A more Fortranic way would be to use do concurrent (assuming compilers will do the right thing).
I've made a small example on Godbolt, using the function:
$$f(x,p_k) = x \sin(x) - p_k$$
The assembly for the scalar function is:
test_mp_f_:
push rbx #24.10
sub rsp, 16 #24.10
mov rbx, rsi #24.10
vmovss xmm0, DWORD PTR [rdi] #29.3
vmovss DWORD PTR [rsp], xmm0 #29.3[spill]
call sinf #29.9
movsxd rax, DWORD PTR [rbx] #29.8
vmovss xmm1, DWORD PTR [rsp] #30.1[spill]
vfmsub213ss xmm1, xmm0, DWORD PTR [-4+test_mp_p_+rax*4] #30.1
vmovaps xmm0, xmm1 #30.1
add rsp, 16 #30.1
pop rbx #30.1
ret
It is 13 instructions to process one element.
The assembly of the vector function with explicit length is:
test_mp_vf_:
push rbp #33.10
mov rbp, rsp #33.10
and rsp, -32 #33.10
push r12 #33.10
push rbx #33.10
sub rsp, 16 #33.10
vmovss xmm0, DWORD PTR [rsi] #40.17
vbroadcastss xmm1, xmm0 #40.7
vmovups XMMWORD PTR [rsp], xmm1 #40.7[spill]
mov rbx, QWORD PTR [rdi] #40.7
movsxd r12, DWORD PTR [rdx] #40.16
call sinf #40.17
vbroadcastss xmm2, xmm0 #40.17
vmovups xmm1, XMMWORD PTR [rsp] #40.7[spill]
vfmsub213ps xmm2, xmm1, XMMWORD PTR [test_mp_p_+r12*4] #40.7
vmovups XMMWORD PTR [rbx], xmm2 #40.7
add rsp, 16 #42.1
pop rbx #42.1
pop r12 #42.1
mov rsp, rbp #42.1
pop rbp #42.1
ret
Notably, it uses the vfmsub213ps (fused multiply-subtract, packed single). It is 22 instructions, but it updates 4 elements simultaneously. So we have reduced the instruction count per function evaluations to 42 % of scalar mode, and as long as the 4 values show similar convergence toward the root. If we use AVX (ymm registers), the instruction count is reduced to 21 % versus scalar mode, and if AVX-512 (zmm registers) is used, to 11 % percent. The throughput is of course ideally increased by factors of 4, 8, 16.