OSX: performance issues with Clang + duplicate symbols with g++-13

Open alaingdl opened this issue 2 years ago • 6 comments

OK, the performance issues within FOR loops, first detected on Mac M2/M3, are in fact also present on x86_64

  • Linux U22 my old laptop / gcc time_test4 : 1.06796=Total Time

The OSX versions here were compiled with the script, and OpenMP is declared as ON. Tests 4, 5, 16 and 25 are all bad, and test 2 has also regressed since clang 17 :(

  • OSX gdl-1.0.2git230313 : clang 15.0.7_1 time_test4 : 21.8063=Total Time

  • OSX gdl-1.0.2git230420 : clang 16.0.1 time_test4 : 21.9463=Total Time (case 2 : 0.206096 Foreach, 6000000 elements)

  • OSX gdl-1.0.3git231123CMake: clang 17.0.4 time_test4 : 70.3176=Total Time (case 2 : 49.0947 Foreach, 6000000 elements)

  • OSX gdl-1.0.4git240222CMake: clang 17.0.6_1 time_test4 : 69.5246=Total Time

Unfortunately, I cannot finish the compilation with GCC 13 because of duplicate symbols

CC=/usr/local/bin/gcc-13 CXX=/usr/local/bin/g++-13 cmake .. -DREADLINE=no -DHDF=OFF -DHDF5=OFF -DPYTHON=off -DGRAPHICSMAGICK=off -DMAGICK=OFF -DWXWIDGETS=off -DQHULL=off

[...]  // the first ones 

[ 15%] Linking CXX executable gdl
duplicate symbol '__ZTS5Data_I10SpDComplexE' in:
    CMakeFiles/gdl.dir/datatypes.cpp.o
    CMakeFiles/gdl.dir/basic_op.cpp.o
duplicate symbol '__ZTI5Data_I10SpDComplexE' in:
    CMakeFiles/gdl.dir/datatypes.cpp.o
    CMakeFiles/gdl.dir/basic_op.cpp.o


[...] // the last ones

duplicate symbol '__ZTS5Data_I9SpDLong64E' in:
    CMakeFiles/gdl.dir/datatypes.cpp.o
    CMakeFiles/gdl.dir/ofmt.cpp.o
duplicate symbol '__ZTI5Data_I9SpDLong64E' in:
    CMakeFiles/gdl.dir/datatypes.cpp.o
    CMakeFiles/gdl.dir/ofmt.cpp.o
ld: 252 duplicate symbols for architecture x86_64
collect2: error: ld returned 1 exit status

datatypes.cpp.o is always involved ...
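
To read the linker output above, it can help to demangle the symbol names. The following is a minimal sketch, not part of GDL: `demangle_macho` is a hypothetical helper name, and it assumes a toolchain that provides `abi::__cxa_demangle` in `<cxxabi.h>` (GCC and Clang both do). Mach-O prefixes every symbol with one extra underscore, which is stripped before demangling.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <cxxabi.h>

// Hypothetical helper: demangle a symbol as printed by the macOS linker.
// Mach-O prepends one underscore, so "__ZTS..." is really "_ZTS...".
std::string demangle_macho(const std::string& sym) {
    assert(!sym.empty());  // precondition: a non-empty symbol name
    std::string name = sym;
    if (name.rfind("__Z", 0) == 0) {
        name.erase(0, 1);  // drop the extra Mach-O underscore
    }
    int status = 0;
    char* out = abi::__cxa_demangle(name.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return sym;  // not a mangled C++ name: return the input unchanged
    }
    std::string result(out);
    std::free(out);  // __cxa_demangle allocates the result with malloc
    return result;
}
```

Applied to the symbols in the log, this shows that the duplicates are the typeinfo name (`_ZTS`) and typeinfo (`_ZTI`) objects of `Data_<...>` template specialisations such as `Data_<SpDComplex>` and `Data_<SpDLong64>`, emitted both in datatypes.cpp.o and in the other object file.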

alaingdl avatar Feb 22 '24 13:02 alaingdl

I confirm that -fsanitize=address makes gdl 100 times faster for code related to memory transfer (copy from variable to variable) on a Mac mini with M1. The code to be tested is simple:

GDL> tic & for i=1L,600000 do a=1 & toc
% Time elapsed : 4.5299740 seconds.

which takes 0.057317972 seconds on my Intel Linux laptop (gcc compiler, no Eigen). As

GDL> tic & for i=1L,600000 do a=a & toc
% Time elapsed : 0.019397974 seconds.

is internally optimised to do nothing (a=a !!!), 0.019397974 seconds measures the empty loop speed, which is OK.

This restricts the area of the problem to a very tiny number of code lines, essentially what happens in "a=1".
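
The comparison above (a loop with work versus a loop that does nothing) can be sketched outside GDL as well. This is only an illustration of the measuring idea; `time_loop` and `per_iteration_cost` are hypothetical names, and an optimising compiler may remove the empty loop entirely, which is fine for this purpose.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Run body() n times and return the elapsed wall-clock time in seconds.
template <typename F>
double time_loop(std::int64_t n, F&& body) {
    const auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < n; ++i) {
        body();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Estimate the cost of one call to work() by subtracting the time of an
// empty loop of the same length (the analogue of "a=a" versus "a=1").
template <typename F>
double per_iteration_cost(std::int64_t n, F&& work) {
    assert(n > 0);  // avoid dividing by zero
    const double empty = time_loop(n, [] {});
    const double loaded = time_loop(n, work);
    return (loaded - empty) / static_cast<double>(n);
}
```

With n = 600000, a large gap between the two timings points at the body of the loop rather than at the loop machinery.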

GillesDuvert avatar Mar 02 '24 12:03 GillesDuvert

@alaingdl the multiply defined symbols have already been encountered (#677, #734), and should indeed be avoided. However, there have always been compiler options to circumvent that problem, which arises only on a limited number of platforms.

GillesDuvert avatar Mar 02 '24 12:03 GillesDuvert

CULPRIT FOUND!!!

On OSX, for obscure historical reasons, and given that the system defines HAVE_MALLOC_ZONE_STATISTICS and HAVE_MALLOC_MALLOC_H, the innermost code for destruction of variables would call the obscure UpdateCurrent() function to report precise memory usage. The loss of time is tremendous, and would have shown up in a profiler as an enormous number of calls to strange functions like malloc_zone_statistics() etc.

Making UpdateCurrent() just return solves the speed problem: time_test4 drops to 1 sec.
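
A minimal sketch of the pattern described above, not the actual GDL code: `MemTracker`, `TrackedValue` and `run_loop` are hypothetical names. The only real API used is `malloc_zone_statistics()` from `<malloc/malloc.h>`, which exists on macOS only, so on other systems the sketch merely counts how often the allocator would have been queried.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__APPLE__)
#include <malloc/malloc.h>
#endif

// Hypothetical stand-in for the memory bookkeeping.
struct MemTracker {
    std::size_t current = 0;    // last sampled number of bytes in use
    std::uint64_t samples = 0;  // how many times the allocator was queried
    bool enabled = true;        // false models "UpdateCurrent() just returns"

    void update_current() {
        if (!enabled) return;   // the cheap path
        ++samples;
#if defined(__APPLE__)
        malloc_statistics_t st;
        malloc_zone_statistics(nullptr, &st);  // nullptr: sum over all zones
        current = st.size_in_use;
#endif
    }
};

// Hypothetical variable type whose destructor samples memory usage.
struct TrackedValue {
    MemTracker& tracker;
    int* payload;
    explicit TrackedValue(MemTracker& t) : tracker(t), payload(new int(1)) {}
    ~TrackedValue() {
        delete payload;
        tracker.update_current();  // one allocator query per destroyed value
    }
    TrackedValue(const TrackedValue&) = delete;
    TrackedValue& operator=(const TrackedValue&) = delete;
};

// Mimics "for i=1L,n do a=1": each iteration creates and destroys a value.
std::uint64_t run_loop(MemTracker& tracker, long n) {
    assert(n >= 0);
    for (long i = 0; i < n; ++i) {
        TrackedValue a(tracker);
    }
    return tracker.samples;
}
```

With `enabled = true`, a loop of 600000 iterations triggers 600000 allocator queries; with `enabled = false` it triggers none, which is the effect of making the update function return immediately.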

GillesDuvert avatar Mar 02 '24 18:03 GillesDuvert

Just committed the one-liner that is supposed to do wonders.

GillesDuvert avatar Mar 02 '24 18:03 GillesDuvert

@GillesDuvert : brilliant ! Thanks

Tested on an Intel OSX machine, using the script ...

GDL> time_test4
[...]
      1.10098=Total Time,      0.021701576=Geometric mean,      25 tests.

GDL> TEST_LOOPS
% Time elapsed : 0.0098431110 seconds.
% Time elapsed : 0.010197878 seconds.
% Time elapsed : 0.0053970814 seconds.
% Time elapsed : 0.0092120171 seconds.

alaingdl avatar Mar 02 '24 20:03 alaingdl

Congrats!!!!

brandy125 avatar Mar 02 '24 20:03 brandy125

#1776

GillesDuvert avatar Mar 09 '24 10:03 GillesDuvert