suggestions for performance
I'm using this package in a very performance-critical setting. I managed to make inference quite a bit faster with a few changes (some of them are specific to my model). Below is the inference routine for a "flat" feed-forward model that takes a 1D array of inputs. I also experimented with feeding in multiple samples at a time (a 2D array), which replaces the matrix-vector multiplications with matrix-matrix ones. This may be faster on some platforms/models (for me it was a wash), but in that case make sure to replace the matmul call with SGEMM/DGEMM.
subroutine output_flatmodel_opt_sig(self, x, neurons, output)
  ! Use forward propagation to compute the output of the network.
  ! For computational efficiency, the following changes are implemented:
  ! 1) the output array is allocated outside of the procedure,
  ! 2) an explicit-shape intermediate array is used, which assumes the number of neurons is the same in all hidden layers,
  ! 3) the activation function is replaced with a subroutine that modifies its argument in place (sigmoid), and the activation of the final layer is removed (a linear activation is a redundant 1:1 copy),
  ! 4) matmul is replaced by a custom function which is often faster than matmul for matrix-vector multiplication,
  ! 5) the weights have been pre-transposed in the load routine (class variable w_transposed).
  ! This procedure was much faster than the original when using gfortran -O3 -march=native or ifort -O3.
  ! For lower optimization levels the custom function (4) may be SLOWER.
  class(network_type), intent(in) :: self
  integer, intent(in) :: neurons
  real(wp), dimension(:), intent(in) :: x
  real(wp), dimension(:), intent(out) :: output
  ! Local variables
  real(wp), dimension(neurons) :: a
  integer :: n

  associate(layers => self % layers)
    a = matvecmul(layers(1) % w_transposed, x, neurons, size(x)) + layers(2) % b
    ! sigmoid activation: using an "inout" subroutine to avoid an array copy
    call sigmoid_subroutine(a)
    ! INTERMEDIATE LAYERS
    do n = 3, size(layers)-1
      a = matvecmul(layers(n-1) % w_transposed, a, neurons, neurons) + layers(n) % b
      call sigmoid_subroutine(a)
    end do
    ! LAST LAYER (LINEAR ACTIVATION = do nothing, just add the biases)
    output = matvecmul(layers(n-1) % w_transposed, a, size(output), neurons) + layers(n) % b
  end associate
end subroutine
function matvecmul(matA, vecB, nrow, ncol)
  ! Simple matrix-vector product that accumulates the columns of matA scaled by
  ! the elements of vecB; with pre-transposed weights this vectorizes well.
  implicit none
  integer, intent(in) :: nrow, ncol
  real(wp), intent(in) :: matA(nrow,ncol)
  real(wp), intent(in) :: vecB(ncol)
  integer :: j
  real(wp) :: matvecmul(nrow)

  matvecmul = 0.0_wp
  do j = 1, ncol
    matvecmul = matvecmul + matA(:,j) * vecB(j)
  end do
end function matvecmul
subroutine sigmoid_subroutine(x)
  real(wp), dimension(:), intent(inout) :: x
  x = 1.0_wp / (1.0_wp + exp(-x))
end subroutine
Thanks Peter, it all makes sense, considering that equal-size hidden layers allow fewer allocations and copies to be made. What kind of performance improvement are you getting? I'm interested in both GNU and Intel compilers.
gfortran -ffree-line-length-none -std=f2008 -pg -march=native -O3 --fast-math
Elapsed time on output: 0.939000010
Elapsed time on output_opt_sig: 0.680999994
Elapsed time on output_flatmodel_opt_sig: 0.669000030
gfortran -ffree-line-length-none -m64 -std=f2008 -march=native -O3 --fast-math -funroll-loops -ftree-loop-linear -fprefetch-loop-arrays
Elapsed time on output: 0.732999980
Elapsed time on output_opt_sig: 0.486000001
Elapsed time on output_flatmodel_opt_sig: 0.485000014
This is on gfortran-7, an Intel processor, and a model with (50,50) hidden neurons, 21 inputs, and 256 outputs. I don't have the Intel compilers on this computer, but I do at home; I can get back to you.
subroutine output_opt_sig(self, x, output)
  class(network_type), intent(in) :: self
  real(wp), dimension(:), intent(in) :: x
  real(wp), dimension(:), intent(out) :: output
  ! Local variables
  real(wp), allocatable :: a(:)
  integer, dimension(2) :: matsize
  integer :: n

  associate(layers => self % layers)
    matsize = shape(layers(1) % w_transposed)
    a = matvecmul(layers(1) % w_transposed, x, matsize(1), matsize(2)) + layers(2) % b
    ! sigmoid activation: using an "inout" subroutine to avoid an array copy
    call sigmoid_subroutine(a)
    ! INTERMEDIATE LAYERS
    do n = 3, size(layers)-1
      matsize = shape(layers(n-1) % w_transposed)
      a = matvecmul(layers(n-1) % w_transposed, a, matsize(1), matsize(2)) + layers(n) % b
      call sigmoid_subroutine(a)
    end do
    ! LAST LAYER (LINEAR ACTIVATION = do nothing, just add the biases)
    matsize = shape(layers(n-1) % w_transposed)
    output = matvecmul(layers(n-1) % w_transposed, a, matsize(1), matsize(2)) + layers(n) % b
  end associate
end subroutine
On ifort the differences are much bigger:
ifort -O3 -mkl
Elapsed time on output: 0.8096000
Elapsed time on output_opt_sig: 0.1968000
Elapsed time on output_flatmodel_opt_sig: 0.1787000
Elapsed time on output_sgemm_sig: 0.1423000
Here the last routine processes all the input data at once, as mentioned above (for the other routines I used a loop over samples).
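For clarity, here is a rough sketch of how the two approaches were driven in the timings above (net, x_all, and y_all are placeholder names, not from the package); the batched routine itself follows below.

! Per-sample loop: one matrix-vector inference per column of the input.
do isample = 1, num_sample
  call output_opt_sig(net, x_all(:,isample), y_all(:,isample))
end do

! Batched call: a single routine call using matrix-matrix products (SGEMM) internally.
call output_sgemm_sig(net, x_all, num_sample, y_all)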
subroutine output_sgemm_sig(self, x, num_sample, output)
  ! Use this routine with a 2D input data array to process all samples simultaneously in a feed-forward network.
  ! Replace the sigmoid activation subroutine with whatever activation you're using.
  ! Author: Peter Ukkonen
  ! sgemm = single precision (wp = sp)
  class(network_type), intent(in) :: self
  real(wp), dimension(:,:), intent(in) :: x ! (features, num_sample)
  real(wp), dimension(:,:), intent(out) :: output ! (outputs, num_sample)
  integer, intent(in) :: num_sample
  ! Local variables
  real(wp), allocatable :: a(:,:), a_next(:,:)
  real(wp) :: alpha, beta
  integer, dimension(2) :: matsize
  integer :: n, isample

  alpha = 1.0_wp
  beta = 0.0_wp
  output = 0.0_wp
  associate(layers => self % layers)
    matsize = shape(layers(1) % w_transposed)
    allocate(a(matsize(1), num_sample))
    call sgemm("N", "N", matsize(1), num_sample, matsize(2), alpha, &
               layers(1) % w_transposed, matsize(1), x, matsize(2), beta, a, matsize(1))
    ! a = a + spread(layers(2) % b, 2, num_sample)
    do isample = 1, num_sample
      a(:,isample) = a(:,isample) + layers(2) % b
    end do
    call sigmoid_subroutine_mat(a)
    ! INTERMEDIATE LAYERS
    do n = 3, size(layers)-1
      matsize = shape(layers(n-1) % w_transposed)
      allocate(a_next(matsize(1), num_sample))
      call sgemm("N", "N", matsize(1), num_sample, matsize(2), alpha, &
                 layers(n-1) % w_transposed, matsize(1), a, matsize(2), beta, a_next, matsize(1))
      deallocate(a)
      ! a_next = a_next + spread(layers(n) % b, 2, num_sample)
      do isample = 1, num_sample
        a_next(:,isample) = a_next(:,isample) + layers(n) % b
      end do
      call sigmoid_subroutine_mat(a_next)
      a = a_next
      deallocate(a_next)
    end do
    ! LAST LAYER (LINEAR ACTIVATION = just add the biases)
    matsize = shape(layers(n-1) % w_transposed)
    call sgemm("N", "N", matsize(1), num_sample, matsize(2), alpha, &
               layers(n-1) % w_transposed, matsize(1), a, matsize(2), beta, output, matsize(1))
    ! output = output + spread(layers(n) % b, 2, num_sample)
    do isample = 1, num_sample
      output(:,isample) = output(:,isample) + layers(n) % b
    end do
  end associate
end subroutine
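sigmoid_subroutine_mat is referenced above but not shown in the thread; a minimal 2D counterpart of sigmoid_subroutine would look something like this:

subroutine sigmoid_subroutine_mat(x)
  ! In-place sigmoid for a (neurons, num_sample) activation matrix
  real(wp), dimension(:,:), intent(inout) :: x
  x = 1.0_wp / (1.0_wp + exp(-x))
end subroutine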
Thanks. Do you know if any of these could be implemented in the existing class that doesn't assume equal sizes of hidden layers?
The only opportunity I see is to allocate the activation arrays once rather than re-allocate on assignment in every iteration, in which case the subroutine to calculate activations in-place may be effective.
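As a rough illustration of that idea (this sketch is not from the thread; output_preallocated and the work buffer are hypothetical): with unequal hidden layer sizes, one work array sized to the widest layer can be allocated once by the caller and then updated in place through array sections.

subroutine output_preallocated(self, x, output, work)
  ! Hypothetical variant: "work" is allocated once by the caller with
  ! size >= the widest layer, so no (re)allocation happens per call.
  class(network_type), intent(in) :: self
  real(wp), intent(in) :: x(:)
  real(wp), intent(out) :: output(:)
  real(wp), intent(inout) :: work(:)
  integer :: n, nprev, ncur

  associate(layers => self % layers)
    nprev = size(x)
    ncur = size(layers(2) % b)
    work(1:ncur) = matvecmul(layers(1) % w_transposed, x, ncur, nprev) + layers(2) % b
    call sigmoid_subroutine(work(1:ncur))
    do n = 3, size(layers)-1
      nprev = ncur
      ncur = size(layers(n) % b)
      work(1:ncur) = matvecmul(layers(n-1) % w_transposed, work(1:nprev), ncur, nprev) + layers(n) % b
      call sigmoid_subroutine(work(1:ncur))
    end do
    output = matvecmul(layers(n-1) % w_transposed, work(1:ncur), size(output), ncur) + layers(n) % b
  end associate
end subroutine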
Btw, the 10-output-batch branch introduces a function to compute the output per batch (contributed by @jvdp1), although it's just a wrapper around the output for a single set of inputs, so it doesn't have any of the performance improvements you propose. The output_batch method could be a place to introduce these, if any are feasible. I will take some time soon to clean up and merge the new methods into master.
output_opt_sig and output_sgemm_sig do not assume equal-sized hidden layers (I tested that they work). As you can see, the former is almost as fast as the flat-model one. Processing the inputs as matrices is fastest on ifort. I believe the most significant optimizations for me were the use of matvecmul instead of matmul, the removal of the output activation, and the use of subroutines instead of functions to avoid an array copy. I think you could implement these changes quite easily. I believe the weights need to be pre-transposed for matvecmul to be faster than matmul; a sketch of that step follows below.
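As a minimal sketch of that pre-transposition step (w_transposed is not part of the upstream library; here it is assumed to be an extra allocatable 2D component of the layer type, and each layer is assumed to store its weights in a component w), filled once after loading:

subroutine pretranspose_weights(self)
  ! Store the transpose of each layer's weight matrix once, so matvecmul can
  ! stream through contiguous columns during inference.
  class(network_type), intent(inout) :: self
  integer :: n
  do n = 1, size(self % layers)
    if (allocated(self % layers(n) % w)) then
      ! assumes w_transposed is an allocatable component (allocated on assignment)
      self % layers(n) % w_transposed = transpose(self % layers(n) % w)
    end if
  end do
end subroutine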
The only opportunity I see is to allocate the activation arrays once rather than re-allocate on assignment in every iteration, in which case the subroutine to calculate activations in-place may be effective.
That might be a good idea, although based on the previous results the allocation overhead isn't that big. I'm by no means an expert on Fortran, by the way; this has been an interesting learning experience!
output_opt_sig and output_sgemm_sig do not assume equal-sized hidden layers (I tested that they work)
OK, great, thanks, I overlooked that. In general, I don't want to break the high-level functional API. However, I'm all for tweaking the internal implementation if the performance benefits are clear. You've given clear ideas and examples of how to do this. I will play with it and let you know.
You're welcome! I have now implemented these changes in a more general manner here: https://github.com/peterukk/rte-rrtmgp/blob/master/neural/
Most notably, the activation functions have been replaced with subroutines, and these procedures have a 2D array variant which is used in, e.g., output_sgemm.
FYI, I have now tested using elemental subroutines as activation functions. This is considerably faster (and works with both 1D and 2D arrays), but unfortunately procedure pointers cannot be associated with elemental procedures :/ It's disappointing that object-oriented programming leads to a performance loss in this instance.
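For reference, a minimal sketch of an elemental activation subroutine (the name is arbitrary):

elemental subroutine sigmoid_elemental(x)
  ! An elemental subroutine applies element-wise to scalars and to arrays of
  ! any rank, so one definition covers both the 1D and 2D code paths. The
  ! downside noted above: it cannot be the target of a procedure pointer.
  real(wp), intent(inout) :: x
  x = 1.0_wp / (1.0_wp + exp(-x))
end subroutine

! Usage: the same call works for the 1D activation vector and for the
! 2D activation matrix in output_sgemm_sig, e.g.  call sigmoid_elemental(a)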