
AArch64: significantly improve formatted input performance by using optimized libc functions on ARM64

Open pawosm-arm opened this issue 6 years ago • 3 comments

Our experiments showed that using memset and memcpy instead of explicit while loops gives a dramatic speedup on formatted input (at least on AArch64 hosts).
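To illustrate the kind of change described, here is a minimal sketch. It is not the actual flang runtime code: the function names `fill_item_loop` and `fill_item_libc` and the blank-padding scenario are assumptions made for illustration only. Both functions copy a record into a fixed-length character buffer and pad the remainder with blanks; the first uses explicit while loops, the second uses `memcpy` and `memset`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the flang sources): copy up to dst_len
   bytes of a record into dst and blank-pad the rest, byte by byte. */
static void fill_item_loop(char *dst, size_t dst_len,
                           const char *src, size_t src_len)
{
    assert(dst != NULL && src != NULL);
    size_t n = src_len < dst_len ? src_len : dst_len;
    size_t i = 0;
    while (i < n) {          /* copy the record contents */
        dst[i] = src[i];
        i++;
    }
    while (i < dst_len) {    /* pad the remainder with blanks */
        dst[i] = ' ';
        i++;
    }
}

/* Same behavior, but delegating the bulk work to libc, which may use
   wide loads/stores tuned for the host CPU. */
static void fill_item_libc(char *dst, size_t dst_len,
                           const char *src, size_t src_len)
{
    assert(dst != NULL && src != NULL);
    size_t n = src_len < dst_len ? src_len : dst_len;
    memcpy(dst, src, n);
    memset(dst + n, ' ', dst_len - n);
}
```

Both versions produce identical buffers; any difference is purely in speed, and that depends on how well the libc routines are tuned for the host.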

pawosm-arm, Dec 19 '18 22:12

Do you have test cases that show the performance improvements? Why do you think this change should be architecture dependent?

sscalpone, Dec 21 '18 01:12

The test case I was using is the following:

! *********************************************************
        program main

        implicit none
!       ---------------------------------------------------
        character(len=500) :: cart
        real(kind=8) :: t1,t2

!       ---------------------------------------------------
        open(unit=9,status='old',file='my_file.txt')
        open(unit=10,file='my_new_file.txt')
        call cpu_time(t1)
        do
!               read each line
                read(9,fmt='(A)') cart
!       ************************************************************
!       ************************************************************
!                       convert  process
!       ************************************************************
!       ************************************************************

                if(cart(1:4)=='/end') then
                        write(10,*) 'this is the end!'
                        exit
                else
                        write(10,*) cart
                endif
        enddo
        call cpu_time(t2)
        close(unit=9)
        close(unit=10)

        print*,' write and read :',t2-t1




!       ---------------------------------------------------

        end program main
! *********************************************************

my_file.txt.gz

Compiled with gfortran, it gives much better timing results than when compiled with flang. My patch dramatically improves the timing of the flang-compiled program.

pawosm-arm, Dec 21 '18 15:12

Regarding architecture dependency: the string.h functions in glibc were carefully optimized for AArch64, and this can be observed in the results of the above test case. I can't guarantee the same for other architectures, nor can I guarantee that replacing a local loop with a function call will never cause a performance drop on any of them.
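One way such an architecture-dependent choice could be expressed is sketched below. This is an assumption for illustration, not the patch itself: `blank_fill` is a hypothetical helper, and the only real identifier relied on is the `__aarch64__` macro that GCC and Clang predefine when targeting AArch64.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fill a buffer with blanks, using libc only where
   it is known to be well tuned and keeping the local loop elsewhere. */
static void blank_fill(char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
#if defined(__aarch64__)
    /* AArch64: rely on the optimized glibc memset. */
    memset(buf, ' ', len);
#else
    /* Other targets: keep the original explicit loop until measured. */
    size_t i = 0;
    while (i < len) {
        buf[i] = ' ';
        i++;
    }
#endif
}
```

The result is the same on every target; the guard only limits the behavior change to the architecture where the speedup was actually measured.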

pawosm-arm, Dec 21 '18 15:12