suggestions for performance
I'm using this package in a very performance-critical setting. I managed to make inference quite a bit faster with a few changes (some of them are specific to my model). Below is the inference routine for a "flat" feed-forward model that takes a 1D array of inputs. I also experimented with feeding in multiple samples at a time (a 2D array), which replaces the matrix-vector multiplications with matrix-matrix ones. This may be faster on some platforms/models (for me it was a wash), but in that case make sure to replace the matmul call with SGEMM/DGEMM.
subroutine output_flatmodel_opt_sig(self, x, neurons, output)
  ! Use forward propagation to compute the output of the network.
  ! For computational efficiency, the following changes are implemented:
  ! 1) the output array is allocated outside of the procedure,
  ! 2) an explicit-shape intermediate array is used, which assumes the number of neurons is the same in all hidden layers,
  ! 3) the activation function is replaced with a subroutine that modifies its argument in place (sigmoid), and the activation of the final layer is removed (a linear activation is a redundant 1:1 copy),
  ! 4) matmul is replaced by a custom function which is often faster than matmul for matrix-vector multiplication,
  ! 5) the weights have been pre-transposed in the load routine (class variable w_transposed).
  ! This procedure was much faster than the original when using gfortran -O3 -march=native or ifort -O3.
  ! For lower optimization levels the custom function (4) may be SLOWER.
  class(network_type), intent(in) :: self
  integer, intent(in) :: neurons
  real(wp), dimension(:), intent(in) :: x
  real(wp), dimension(:), intent(out) :: output
  ! Local variables
  real(wp), dimension(neurons) :: a
  integer :: n

  associate(layers => self % layers)
    a = matvecmul(layers(1) % w_transposed, x, neurons, size(x)) + layers(2) % b
    ! sigmoid activation: using an "inout" subroutine to avoid an array copy
    call sigmoid_subroutine(a)
    ! INTERMEDIATE LAYERS
    do n = 3, size(layers)-1
      a = matvecmul(layers(n-1) % w_transposed, a, neurons, neurons) + layers(n) % b
      call sigmoid_subroutine(a)
    end do
    ! LAST LAYER (LINEAR ACTIVATION = do nothing, just add the biases)
    output = matvecmul(layers(n-1) % w_transposed, a, size(output), neurons) + layers(n) % b
  end associate
end subroutine
function matvecmul(matA, vecB, nrow, ncol)
  ! Simple matrix-vector product that accumulates the columns of matA scaled by
  ! the elements of vecB; with pre-transposed weights this vectorizes well.
  implicit none
  integer, intent(in) :: nrow, ncol
  real(wp), intent(in) :: matA(nrow,ncol)
  real(wp), intent(in) :: vecB(ncol)
  integer :: j
  real(wp) :: matvecmul(nrow)

  matvecmul = 0.0_wp
  do j = 1, ncol
    matvecmul = matvecmul + matA(:,j) * vecB(j)
  end do
end function matvecmul
subroutine sigmoid_subroutine(x)
  real(wp), dimension(:), intent(inout) :: x
  x = 1.0_wp / (1.0_wp + exp(-x))
end subroutine
Thanks Peter, it all makes sense, considering that equal-size hidden layers allow fewer allocations and copies to be made. What kind of performance improvement are you getting? I'm interested in both GNU and Intel compilers.
gfortran -ffree-line-length-none -std=f2008 -pg -march=native -O3 --fast-math
Elapsed time on output: 0.939000010
Elapsed time on output_opt_sig: 0.680999994
Elapsed time on output_flatmodel_opt_sig: 0.669000030
gfortran -ffree-line-length-none -m64 -std=f2008 -march=native -O3 --fast-math -funroll-loops -ftree-loop-linear -fprefetch-loop-arrays
Elapsed time on output: 0.732999980
Elapsed time on output_opt_sig: 0.486000001
Elapsed time on output_flatmodel_opt_sig: 0.485000014
This is on gfortran-7, an Intel processor, and a model with (50,50) hidden neurons, 21 inputs, and 256 outputs. I don't have the Intel compilers on this computer, but I do at home; I can get back to you.
subroutine output_opt_sig(self, x, output)
  class(network_type), intent(in) :: self
  real(wp), dimension(:), intent(in) :: x
  real(wp), dimension(:), intent(out) :: output
  ! Local variables
  real(wp), allocatable :: a(:)
  integer, dimension(2) :: matsize
  integer :: n

  associate(layers => self % layers)
    matsize = shape(layers(1) % w_transposed)
    a = matvecmul(layers(1) % w_transposed, x, matsize(1), matsize(2)) + layers(2) % b
    ! sigmoid activation: using an "inout" subroutine to avoid an array copy
    call sigmoid_subroutine(a)
    ! INTERMEDIATE LAYERS
    do n = 3, size(layers)-1
      matsize = shape(layers(n-1) % w_transposed)
      a = matvecmul(layers(n-1) % w_transposed, a, matsize(1), matsize(2)) + layers(n) % b
      call sigmoid_subroutine(a)
    end do
    ! LAST LAYER (LINEAR ACTIVATION = do nothing, just add the biases)
    matsize = shape(layers(n-1) % w_transposed)
    output = matvecmul(layers(n-1) % w_transposed, a, matsize(1), matsize(2)) + layers(n) % b
  end associate
end subroutine
On ifort the differences are much bigger:
ifort -O3 -mkl
Elapsed time on output: 0.8096000
Elapsed time on output_opt_sig: 0.1968000
Elapsed time on output_flatmodel_opt_sig: 0.1787000
Elapsed time on output_sgemm_sig: 0.1423000
Here the last routine processes all the input data at once, as mentioned above (for the other routines I used a loop over samples).
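For clarity, here is a rough sketch of how the two approaches were driven in the timings above (net, x_all, and y_all are placeholder names, not from the package); the batched routine itself follows below.

! Per-sample loop: one matrix-vector inference per column of the input.
do isample = 1, num_sample
  call output_opt_sig(net, x_all(:,isample), y_all(:,isample))
end do

! Batched call: a single routine call using matrix-matrix products (SGEMM) internally.
call output_sgemm_sig(net, x_all, num_sample, y_all)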
subroutine output_sgemm_sig(self, x, num_sample, output)
  ! Use this routine with a 2D input data array to process all samples simultaneously in a feed-forward network.
  ! Replace the sigmoid activation subroutine with whatever activation you're using.
  ! Author: Peter Ukkonen
  ! sgemm = single precision (wp = sp)
  class(network_type), intent(in) :: self
  real(wp), dimension(:,:), intent(in) :: x ! (features, num_sample)
  real(wp), dimension(:,:), intent(out) :: output ! (outputs, num_sample)
  integer, intent(in) :: num_sample
  ! Local variables
  real(wp), allocatable :: a(:,:), a_next(:,:)
  real(wp) :: alpha, beta
  integer, dimension(2) :: matsize
  integer :: n, isample

  alpha = 1.0_wp
  beta = 0.0_wp
  output = 0.0_wp
  associate(layers => self % layers)
    matsize = shape(layers(1) % w_transposed)
    allocate(a(matsize(1), num_sample))
    call sgemm("N", "N", matsize(1), num_sample, matsize(2), alpha, &
               layers(1) % w_transposed, matsize(1), x, matsize(2), beta, a, matsize(1))
    ! a = a + spread(layers(2) % b, 2, num_sample)
    do isample = 1, num_sample
      a(:,isample) = a(:,isample) + layers(2) % b
    end do
    call sigmoid_subroutine_mat(a)
    ! INTERMEDIATE LAYERS
    do n = 3, size(layers)-1
      matsize = shape(layers(n-1) % w_transposed)
      allocate(a_next(matsize(1), num_sample))
      call sgemm("N", "N", matsize(1), num_sample, matsize(2), alpha, &
                 layers(n-1) % w_transposed, matsize(1), a, matsize(2), beta, a_next, matsize(1))
      deallocate(a)
      ! a_next = a_next + spread(layers(n) % b, 2, num_sample)
      do isample = 1, num_sample
        a_next(:,isample) = a_next(:,isample) + layers(n) % b
      end do
      call sigmoid_subroutine_mat(a_next)
      a = a_next
      deallocate(a_next)
    end do
    ! LAST LAYER (LINEAR ACTIVATION = just add the biases)
    matsize = shape(layers(n-1) % w_transposed)
    call sgemm("N", "N", matsize(1), num_sample, matsize(2), alpha, &
               layers(n-1) % w_transposed, matsize(1), a, matsize(2), beta, output, matsize(1))
    ! output = output + spread(layers(n) % b, 2, num_sample)
    do isample = 1, num_sample
      output(:,isample) = output(:,isample) + layers(n) % b
    end do
  end associate
end subroutine
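sigmoid_subroutine_mat is referenced above but not shown in the thread; a minimal 2D counterpart of sigmoid_subroutine would look something like this:

subroutine sigmoid_subroutine_mat(x)
  ! In-place sigmoid for a (neurons, num_sample) activation matrix
  real(wp), dimension(:,:), intent(inout) :: x
  x = 1.0_wp / (1.0_wp + exp(-x))
end subroutine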
Thanks. Do you know if any of these could be implemented in the existing class that doesn't assume equal sizes of hidden layers?
The only opportunity I see is to allocate the activation arrays once rather than re-allocate on assignment in every iteration, in which case the subroutine to calculate activations in-place may be effective.
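As a rough illustration of that idea (this sketch is not from the thread; output_preallocated and the work buffer are hypothetical): with unequal hidden layer sizes, one work array sized to the widest layer can be allocated once by the caller and then updated in place through array sections.

subroutine output_preallocated(self, x, output, work)
  ! Hypothetical variant: "work" is allocated once by the caller with
  ! size >= the widest layer, so no (re)allocation happens per call.
  class(network_type), intent(in) :: self
  real(wp), intent(in) :: x(:)
  real(wp), intent(out) :: output(:)
  real(wp), intent(inout) :: work(:)
  integer :: n, nprev, ncur

  associate(layers => self % layers)
    nprev = size(x)
    ncur = size(layers(2) % b)
    work(1:ncur) = matvecmul(layers(1) % w_transposed, x, ncur, nprev) + layers(2) % b
    call sigmoid_subroutine(work(1:ncur))
    do n = 3, size(layers)-1
      nprev = ncur
      ncur = size(layers(n) % b)
      work(1:ncur) = matvecmul(layers(n-1) % w_transposed, work(1:nprev), ncur, nprev) + layers(n) % b
      call sigmoid_subroutine(work(1:ncur))
    end do
    output = matvecmul(layers(n-1) % w_transposed, work(1:ncur), size(output), ncur) + layers(n) % b
  end associate
end subroutine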
Btw, the 10-output-batch branch introduces a function to compute the output per batch (contributed by @jvdp1), although it's just a wrapper around the output for a single set of inputs, so it doesn't have any of the performance improvements you propose. The output_batch method could be a place to introduce these, if any are feasible. I will take some time soon to clean up and merge the new methods into master.
output_opt_sig and output_sgemm_sig do not assume equal-sized hidden layers (I tested that they work). As you can see, the former is almost as fast as the flat-model one. Processing the inputs as matrices is fastest on ifort. I believe the most significant optimizations for me were the use of matvecmul instead of matmul, the removal of the output activation, and the use of subroutines instead of functions to avoid an array copy. I think you could implement these changes quite easily. I believe the weights need to be pre-transposed for matvecmul to be faster than matmul; a sketch of that step follows below.
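As a minimal sketch of that pre-transposition step (w_transposed is not part of the upstream library; here it is assumed to be an extra allocatable 2D component of the layer type, and each layer is assumed to store its weights in a component w), filled once after loading:

subroutine pretranspose_weights(self)
  ! Store the transpose of each layer's weight matrix once, so matvecmul can
  ! stream through contiguous columns during inference.
  class(network_type), intent(inout) :: self
  integer :: n
  do n = 1, size(self % layers)
    if (allocated(self % layers(n) % w)) then
      ! assumes w_transposed is an allocatable component (allocated on assignment)
      self % layers(n) % w_transposed = transpose(self % layers(n) % w)
    end if
  end do
end subroutine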
The only opportunity I see is to allocate the activation arrays once rather than re-allocate on assignment in every iteration, in which case the subroutine to calculate activations in-place may be effective.
That might be a good idea, although based on the previous results the allocation overhead isn't that big. I'm by no means an expert on Fortran, by the way; this has been an interesting learning experience!
output_opt_sig and output_sgemm_sig do not assume equal-sized hidden layers (I tested that they work)
OK, great, thanks, I overlooked that. In general, I don't want to break the high-level functional API. However, I'm all for tweaking the internal implementation if the performance benefits are clear. You've given clear ideas and examples of how to do this. I will play with it and let you know.
You're welcome! I have now implemented these changes in a more general manner here: https://github.com/peterukk/rte-rrtmgp/blob/master/neural/
Most notably, the activation functions have been replaced with subroutines, and these procedures have a 2D array variant which is used in, e.g., output_sgemm.
FYI, I have now tested using elemental subroutines as activation functions. This is considerably faster (and works with both 1D and 2D arrays), but unfortunately procedure pointers cannot be associated with elemental procedures :/ It's disappointing that object-oriented programming leads to a performance loss in this instance.
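For reference, a minimal sketch of an elemental activation subroutine (the name is arbitrary):

elemental subroutine sigmoid_elemental(x)
  ! An elemental subroutine applies element-wise to scalars and to arrays of
  ! any rank, so one definition covers both the 1D and 2D code paths. The
  ! downside noted above: it cannot be the target of a procedure pointer.
  real(wp), intent(inout) :: x
  x = 1.0_wp / (1.0_wp + exp(-x))
end subroutine

! Usage: the same call works for the 1D activation vector and for the
! 2D activation matrix in output_sgemm_sig, e.g.  call sigmoid_elemental(a)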