flang
AArch64: significantly improve formatted input performance by using optimized libc functions on ARM64
Our experiments showed that using memset and memcpy instead of explicit while loops gives a dramatic speedup on formatted input (at least on AArch64 hosts).
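To make the kind of change concrete, here is a minimal sketch in C. It is not the actual flang runtime code: `fill_field_loop` and `fill_field_libc` are hypothetical names, and the blank-padded fixed-width field is an assumption chosen to resemble what a formatted `(A)` read into a character variable has to do. The first function uses explicit while loops, the second does the same work with memcpy and memset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not flang runtime code): copy a record of src_len
 * bytes into a fixed-width field and blank-pad the remainder, one byte
 * at a time with explicit while loops. */
void fill_field_loop(char *dst, size_t dst_len, const char *src, size_t src_len)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);
    if (src_len > dst_len)
        src_len = dst_len;              /* truncate over-long records */
    while (i < src_len) {               /* copy the record bytes */
        dst[i] = src[i];
        i++;
    }
    while (i < dst_len) {               /* pad the rest with blanks */
        dst[i] = ' ';
        i++;
    }
}

/* Same behavior, expressed with libc calls so that the platform's
 * optimized memcpy/memset implementations do the work. */
void fill_field_libc(char *dst, size_t dst_len, const char *src, size_t src_len)
{
    assert(dst != NULL && src != NULL);
    if (src_len > dst_len)
        src_len = dst_len;              /* truncate over-long records */
    memcpy(dst, src, src_len);                      /* copy the record bytes */
    memset(dst + src_len, ' ', dst_len - src_len);  /* pad with blanks */
}
```

Both functions are meant to produce identical buffers; any speed difference would come only from how the copy and fill are carried out, which depends on the libc and the host.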
Do you have test cases that show the performance improvements? Why do you think this change should be architecture dependent?
The test case I used is the following:
! *********************************************************
program main
  implicit none
  ! ---------------------------------------------------
  character(len=500) :: cart
  real(kind=8) :: t1,t2
  ! ---------------------------------------------------
  open(unit=9,status='old',file='my_file.txt')
  open(unit=10,file='my_new_file.txt')
  call cpu_time(t1)
  do
    ! read each line
    read(9,fmt='(A)') cart
    ! ************************************************************
    ! ************************************************************
    ! convert process
    ! ************************************************************
    ! ************************************************************
    if(cart(1:4)=='/end') then
      write(10,*) 'this is the end!'
      exit
    else
      write(10,*) cart
    endif
  enddo
  call cpu_time(t2)
  close(unit=9)
  close(unit=10)
  print*,' write and read :',t2-t1
  ! ---------------------------------------------------
end program main
! *********************************************************
Compiled with gfortran, it gives much better timing results than when compiled with flang. My patch dramatically improves the timing of the flang-compiled program.
Regarding the architecture dependency: the string.h functions in glibc were carefully optimized for AArch64, and this can be observed in the results of the test case above. I can't guarantee the same for other architectures, nor can I guarantee that replacing a local loop with a function call never causes a performance drop on any architecture.
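An architecture-dependent switch of this kind could look roughly like the sketch below. This is only an illustration under assumptions, not the code from the patch: `blank_fill` is a hypothetical helper, and the only real identifier relied on is the `__aarch64__` macro that GCC and Clang predefine when targeting AArch64.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not flang runtime code): fill n bytes with blanks.
 * On AArch64 builds the libc memset is used; every other target keeps
 * the original explicit loop, so its behavior and timing are unchanged. */
void blank_fill(char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
#if defined(__aarch64__)
    if (n > 0)
        memset(buf, ' ', n);    /* libc path, AArch64 only */
#else
    size_t i = 0;
    while (i < n) {             /* original loop on all other targets */
        buf[i] = ' ';
        i++;
    }
#endif
}
```

Guarding the change this way limits it to the platform where it was measured, at the cost of keeping two code paths.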