
test_elmhes and test_formats fail on non-x86_64

Open opoplawski opened this issue 1 year ago • 14 comments

Working on updating the Fedora package to 1.0.5 and getting:

        Start  82: test_elmhes.pro
82: Test command: /builddir/build/BUILD/gdl-v1.0.5/build/src/gdl "-quiet" "-e" "if execute('test_elmhes') ne 1 then exit, status=1"
82: Working Directory: /builddir/build/BUILD/gdl-v1.0.5/build/testsuite
82: Environment variables: 
82:  LC_COLLATE=C
82:  GDL_PATH=/builddir/build/BUILD/gdl-v1.0.5/testsuite/:/builddir/build/BUILD/gdl-v1.0.5/src/pro/
82:  GDL_STARTUP=
82:  IDL_STARTUP=
82: Test timeout computed to be: 3600
82: % Compiled module: TEST_ELMHES.
82: % Compiled module: ERRORS_ADD.
82: % TEST_ELMHES: Error on operation : bad result elmhes
82: % TEST_ELMHES: Error on operation : bad result elmhes,/no_balance
82: % TEST_ELMHES: Error on operation : bad result elmhes,/column
82: % Compiled module: BANNER_FOR_TESTSUITE.
82: % Compiled module: GDL_IDL_FL.
82: % TEST_ELMHES: ===================================================
82: % TEST_ELMHES: =                                                 =
82: % TEST_ELMHES: =  3 errors encountered during TEST_ELMHES tests  =
82: % TEST_ELMHES: =                                                 =
82: % TEST_ELMHES: ===================================================
 82/212 Test  #82: test_elmhes.pro ....................***Failed    0.17 sec
        Start 100: test_formats.pro
100: Test command: /builddir/build/BUILD/gdl-v1.0.5/build/src/gdl "-quiet" "-e" "if execute('test_formats') ne 1 then exit, status=1"
100: Working Directory: /builddir/build/BUILD/gdl-v1.0.5/build/testsuite
100: Environment variables: 
100:  LC_COLLATE=C
100:  GDL_PATH=/builddir/build/BUILD/gdl-v1.0.5/testsuite/:/builddir/build/BUILD/gdl-v1.0.5/src/pro/
100:  GDL_STARTUP=
100:  IDL_STARTUP=
100: Test timeout computed to be: 3600
100: % Compiled module: TEST_FORMATS.
100: % Compiled module: GDL_IDL_FL.
100: % GDL_IDL_FL: Detected Software : GDL
100: % When using the RAN1 mode, be sure to keep the RAN1 and dSFMT seed arrays in separate variables.
100: multiple reference file <<formats.GDL>> found ! First used !!
100: /builddir/build/BUILD/gdl-v1.0.5/build/testsuite/formats.GDL
100: /builddir/build/BUILD/gdl-v1.0.5/testsuite/formats.GDL
100: Files to be compared : formats.IDL, formats.GDL
100: % Compiled module: BANNER_FOR_TESTSUITE.
100: % TEST_FORMATS: =======================================================
100: % TEST_FORMATS: =                                                     =
100: % TEST_FORMATS: =  1595 errors encountered during TEST_FORMATS tests  =
100: % TEST_FORMATS: =                                                     =
100: % TEST_FORMATS: =======================================================
100/212 Test #100: test_formats.pro ...................***Failed    0.65 sec

opoplawski, May 21 '24 19:05

Thanks @opoplawski

Looking at the code of test_elmhes.pro, and given the way the tests are done internally, I think these 2 failures (test_elmhes & test_formats) are both related to an issue in formats :(

I have no way to test on my side on a recent Fedora, and I have no problem on Debian, Ubuntu & OSX !

Which version of the compiler do you have?

thanks

alaingdl, May 22 '24 08:05

This is with gcc 14.1.1. But it's also failing on EL9 with 11.4.1. You can check recent build logs here: https://koji.fedoraproject.org/koji/packageinfo?packageID=1830

opoplawski, May 22 '24 13:05

I'm pretty sure formats won't be OK on non-64-bit machines, so some tests based on formatted string comparison won't work either. The thing is, nobody in the team knows what GDL should produce on 32-bit machines! I would suggest skipping these tests on 32-bit machines, as their failure does not mean that GDL does not work, and waiting for a user to report a specific issue on a 32-bit machine.

GillesDuvert, May 22 '24 15:05

These are all 64 bit architectures - aarch64, ppc64le, s390x

opoplawski, May 23 '24 02:05

> These are all 64 bit architectures - aarch64, ppc64le, s390x

@opoplawski sorry, but your issue refers to "non-x86_64" architectures. My comment above holds: better to remove these tests from the list of tests when building on "non-x86_64" architectures, as they are meaningless there.

GillesDuvert, May 23 '24 08:05

I was just responding to your comment about 32-bits. But if the tests only apply to x86_64 that's fine. Although it would be nice if the tests could deselect themselves on non-x86_64. Anyway, I'm excluding them now.

opoplawski, May 23 '24 13:05

Thanks @opoplawski, but I feel there is a misunderstanding: according to the internet, s390x is a 32-bit machine whereas aarch64 is not. While I expect trouble on 32-bit machines, since we have no such machine with a working IDL at our disposal to cross-check, there should be no problem on 64-bit little- or big-endian IEEE 754 architectures. So your report of a test failure is important in this case.

GillesDuvert, May 23 '24 15:05

s390x is definitely a 64 bit architecture: https://developer.fedoraproject.org/deployment/secondary_architectures/s390.html. s390 is 31/32 bit hybrid. I'll reopen then I guess. Let me know what other information would be helpful for tracking this down.

opoplawski, May 23 '24 23:05

@opoplawski, do I understand correctly that the tests pass OK on Fedora arm64 builds? In #1788, we are introducing Apple Silicon builds to CI, but the PR is blocked by two tests failing: test_byte_conversion.pro and test_bytscl.pro; if that is the case, it then seems to be an Apple compiler issue?

slayoo, May 24 '24 06:05

To go further, one needs at least to know what fails. With 1595 errors in test_formats, I guess every format is wrong. The test procedure creates a file "formats.GDL". @opoplawski could you send it? For Apple Silicon, I have access to an M1, I just need to find the time.

GillesDuvert, May 24 '24 09:05

OK, I just compiled current git version on a new M2 machine (OSX) and I have the same issues : test_elmhes.pro and test_formats.pro (I will look at test_formats later !)

On x86 processor, IDL & GDL give (first test) :

P               DOUBLE    =   -2.8958759e-07
PT              STRING    = '-00.00000029'
ST              STRING    = '101.32080078'
T               FLOAT     =       101.321
GDL> print, b
     0.500000      11.4800      5.50000      5.00000
      6.25000      30.2200      20.7500      14.5000
     0.680000      3.02080      1.28000      1.28000
     0.360000     0.500000      0.00000      0.00000

But on M2:

P               DOUBLE    =        0.0000000
PT              STRING    = '000.00000000'
ST              STRING    = '101.32079315'
T               FLOAT     =       101.321

GDL> print, b
     0.500000      11.4800      5.50000      5.00000
      6.25000      30.2200      20.7500      14.5000
     0.680000      3.02080      1.28000      1.28000
     0.360000     0.500000      0.00000      0.00000

So from my point of view this is just numerical rounding, and the test should be rewritten to take a tolerance (EPS) into account.
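
The two ST strings printed above can be checked to be neighbouring single-precision values, which supports the rounding explanation. This is a minimal Python sketch, assuming ST was formatted from a 32-bit float (T is listed as FLOAT); the helper name `float32_bits` is made up for this illustration and is not part of GDL or its test suite.

```python
import struct

def float32_bits(text):
    """Round a decimal string to the nearest float32 and return its raw bit pattern."""
    return struct.unpack("<I", struct.pack("<f", float(text)))[0]

x86_bits = float32_bits("101.32080078")  # ST as printed on x86
m2_bits = float32_bits("101.32079315")   # ST as printed on M2

# For positive floats with the same exponent, the difference between the bit
# patterns counts how many representable float32 steps separate the values.
print(x86_bits - m2_bits)
```

A difference of one step means the two machines disagree only in the last bit of the float32 mantissa, so an exact string comparison of the formatted output is too strict for this value.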

alaingdl, May 29 '24 21:05

Certainly. The cumulative rounding errors make our results differ between machines and, most of all, differ from IDL, which does not use the same algorithms. The difficulty is fixing a safe error margin, as precision can well drop to 10^-3 for floats.

GillesDuvert, Jun 04 '24 13:06

I updated test_elmhes.pro in PR #1840 with a numerical tolerance of 1e-5. For me it is close.
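
As an illustration of what a 1e-5 tolerance accepts, here is a small Python sketch applied to the x86 and M2 values quoted earlier in this thread. The helper `same_within_tol` is hypothetical and is not the code that went into test_elmhes.pro; it only assumes an absolute tolerance.

```python
import math

TOL = 1e-5  # tolerance value quoted in this comment

def same_within_tol(got, expected, tol=TOL):
    """Absolute-tolerance comparison (hypothetical helper, not from the GDL test suite)."""
    return math.isclose(got, expected, rel_tol=0.0, abs_tol=tol)

# (x86 value, M2 value) pairs copied from the outputs earlier in this thread.
pairs = {
    "P": (-2.8958759e-07, 0.0),
    "ST": (float("101.32080078"), float("101.32079315")),
}

results = {name: same_within_tol(a, b) for name, (a, b) in pairs.items()}
print(results)
```

Both pairs fall inside the margin, whereas comparing the formatted strings would report them as different. An absolute margin is adequate only while the values stay around this magnitude; much larger values would probably need a relative tolerance instead.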

Concerning test_formats.pro, from what I see in the outputs, we do have a big/little endian problem ... It is a serious issue. The good news is that I now have permanent access to an M2 OSX machine (very fast feed). But I have no time now, and I do not feel competent on that. Maybe a simple flag could solve most of the problems. I hope @GillesDuvert will have time for that since he previously improved formats ...

alaingdl, Jun 04 '24 14:06

The only differences are on unsigned 32 and bits ints and +/-NaN and +INF. I would not say it is an endianness problem.

GillesDuvert, Jun 04 '24 15:06

See #1949: some machines (ARM64) do not convert to unsigned ints the way Intel does. The NaN and INF issues in test_formats come from the fact that these floating-point pseudo-values are converted to unsigned integers (rather than bit fields?) before printing bits (to print we use the C and C++ standards). #1949 should have suppressed the difference in float-to-unsigned-int conversion between IA64 and ARM64 (and others, probably). In other words: apart from some NaN and Inf 'printing' problems, which are no longer tested in test_formats, there should be no difference anymore.
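
To make the float-to-unsigned difference concrete: in C and C++, converting a floating-point value whose truncated value does not fit the unsigned type is undefined behaviour, so platforms are free to disagree. The Python sketch below only models two behaviours that are commonly observed (wrapping the truncated value modulo 2^32, and saturating to the valid range with NaN mapped to 0). Which behaviour a given compiler and CPU actually produce is an assumption here, the function names are invented, and none of this is taken from #1949.

```python
import math

UINT32_MOD = 1 << 32

def to_uint32_wrapping(x):
    """Model A: truncate toward zero, then keep the low 32 bits. Finite inputs only."""
    if not math.isfinite(x):
        raise ValueError("NaN and Inf are not modelled in the wrapping variant")
    return int(x) % UINT32_MOD

def to_uint32_saturating(x):
    """Model B: clamp to [0, 2**32 - 1]; NaN is mapped to 0."""
    if math.isnan(x) or x <= 0.0:
        return 0
    if x >= UINT32_MOD - 1:
        return UINT32_MOD - 1
    return int(x)

for value in (42.9, -1.0, 5e9):
    print(value, to_uint32_wrapping(value), to_uint32_saturating(value))
```

In-range values such as 42.9 give the same result in both models. The models diverge only for negative or too-large inputs, which fits the observation that the differences were confined to unsigned integer formats and to NaN and Inf.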

GillesDuvert, Dec 16 '24 18:12

Closing with the above explanation. Dear Orion, you can open a new issue if there is another 'portability' problem.

GillesDuvert, Dec 16 '24 18:12