gdl icon indicating copy to clipboard operation
gdl copied to clipboard

`byte_conversion`, `bytscl` and `formats` tests fail on Apple Silicon CI

Open slayoo opened this issue 1 year ago • 2 comments

@gnudatalanguage/gdldevs, anyone having access to an Apple Silicon machine, help welcome! Thanks

Example output (https://github.com/gnudatalanguage/gdl/actions/runs/10431443505/job/28891119074)

66/214 Test  #66: test_byte_conversion.pro ...........***Failed    0.65 sec
...
% TEST_BYTE_BASIC_32768: Error on operation : input as Long for :       -32765
% TEST_BYTE_BASIC_32768: Error on operation : input as Long for :       -32766
% TEST_BYTE_BASIC_32768: Error on operation : input as Long for :       -32767
% TEST_BYTE_BASIC_32768: Error on operation : input as Long for :       -32769
% TEST_BYTE_BASIC_32768: Error on operation : input as Long for :       -32770
...
 67/214 Test  #67: test_bytscl.pro ....................***Failed    0.69 sec
...
% TEST_BYTSCL_TOP: Error on operation : Pb with top=     -128, min/max :        0,        0
...
102/214 Test #102: test_formats.pro ...................***Failed    1.13 sec
...
% When using the RAN1 mode, be sure to keep the RAN1 and dSFMT seed arrays in separate variables.
multiple reference file <<formats.GDL>> found ! First used !!
/Users/runner/work/gdl/build/testsuite/formats.GDL
/Users/runner/work/gdl/gdl/testsuite/formats.GDL
Files to be compared : formats.IDL, formats.GDL
...
% TEST_FORMATS: =  1595 errors encountered during TEST_FORMATS tests  =
...
The following tests FAILED:
	 66 - test_byte_conversion.pro (Failed)
	 67 - test_bytscl.pro (Failed)
	102 - test_formats.pro (Failed)

slayoo avatar Aug 17 '24 11:08 slayoo

OK, but too much troubles on OSX now, I need some quiet time to investigate :( (short answer : when compilating without the script, only test_formats is broken --but I experienced glitches with the tests, I don't know why-- when compilating with the script, maybe due to wxwidget, everything is broken)

alaingdl avatar Aug 19 '24 09:08 alaingdl

see #1881

GillesDuvert avatar Sep 13 '24 15:09 GillesDuvert

as of now, we have the following failures on CI:

  • macos-14 (i.e., Apple Silicon): test_hdf5 & test_formats
  • macos-12 (i.e., Intel): test_hdf5

The test_hdf failure is:

112/214 Test #112: test_hdf5.pro ......................***Failed    0.47 sec
% Compiled module: TEST_HDF5.
% Compiled module: FILE_SEARCH_FOR_TESTSUITE.
% Compiled module: BANNER_FOR_TESTSUITE.
% Compiled module: GDL_IDL_FL.
% TEST_HDF5_BYTE_ARRAY: 
  NO errors encountered during TEST_HDF5_BYTE_ARRAY tests  
% TEST_HDF5_STRING:   NO errors encountered during TEST_HDF5_STRING tests  
h5dump error: unable to open attribute "attr-02"
% TEST_HDF5_ATTR:   3 errors encountered during TEST_HDF5_ATTR tests  
% TEST_HDF5_DATA:   NO errors encountered during TEST_HDF5_DATA tests  
% H5G_CREATE: Bad value
% TEST_HDF5_OBJ_INFO:   NO errors encountered during TEST_HDF5_OBJ_INFO tests  
% TEST_HDF5_COMP:   NO errors encountered during TEST_HDF5_COMP tests  
% H5G_CREATE: Bad value
% TEST_HDF5: =======================================================
% TEST_HDF5: =                                                     =
% TEST_HDF5: =  3.00000 errors encountered during TEST_HDF5 tests  =
% TEST_HDF5: =                                                     =
% TEST_HDF5: =======================================================

slayoo avatar Nov 05 '24 23:11 slayoo

I also noticed this we need to fix it and release v1.1.1 asap

pjb7687 avatar Nov 06 '24 01:11 pjb7687

Duplicate of #1883 and could be related with: #1899?

slayoo avatar Nov 06 '24 09:11 slayoo

@slayoo #1904 shows NO ERRORS on macos-12 (when did you see an error ?)

So on osx 14 = ARM64 = Apple Silicon = M1/M2/M3 the culprit could be that the HDF5 library has the same problem as GDL, i.e., #1881

I have no currently a M1 machine at hand to handle #1881 but it would serve nothing if the HDF5 library is rotten there...

GillesDuvert avatar Nov 06 '24 10:11 GillesDuvert

@slayoo https://github.com/gnudatalanguage/gdl/pull/1904 shows NO ERRORS on macos-12 (when did you see an error ?)

Yesterday's macos-12 build for the master branch has test_hdf5 failed: https://github.com/gnudatalanguage/gdl/actions/runs/11681438801/job/32526479047

% TEST_HDF5_BYTE_ARRAY: 
  NO errors encountered during TEST_HDF5_BYTE_ARRAY tests  
% TEST_HDF5_STRING:   NO errors encountered during TEST_HDF5_STRING tests  
h5dump error: unable to open attribute "attr-02"
% TEST_HDF5_ATTR:   3 errors encountered during TEST_HDF5_ATTR tests  
% TEST_HDF5_DATA:   NO errors encountered during TEST_HDF5_DATA tests  
% H5G_CREATE: Bad value

slayoo avatar Nov 06 '24 17:11 slayoo

I get the test_hdf5.pro error , too on a local build in Arch. It turns out to be an error in wait() within spawn. I made a small change in basic_pro.cpp:

diff --git a/src/basic_pro.cpp b/src/basic_pro.cpp
index 87c0e005..33a5c28f 100644
--- a/src/basic_pro.cpp
+++ b/src/basic_pro.cpp
@@ -2142,7 +2142,9 @@ static DWORD launch_cmd(BOOL hide, BOOL nowait,
 
         // wait until child terminates
         int status;
-        pid_t wpid = wait(&status);
+        if (wait(&status) == -1) {
+          Warning(DString("SPAWN: Error waiting for child process: ") + strerror(errno));
+        }
 
         if (exit_statusKeyword)
           e->SetKW(exit_statusIx, new DLongGDL(status >> 8));

to track down whats happening. At least in my local build, wait() returns -1 and sets errno to ECHILD, this is what this small change reveals.

Since test_hdf5.pro uses spawn, and status does not get set to the return value of the spawned command due to the error in wait(), the test fails on my maschine.

However, I could not yet figure why I get ECHILD in the first place.

Maybe someone else has a clue, I'll try to dig into this a bit more, too.

Best Jan

BTW: In https://github.com/gnudatalanguage/gdl/pull/1913 build on MacOS 14 fails with this error, too: https://github.com/gnudatalanguage/gdl/actions/runs/11942416687/job/33289421103?pr=1913

jkohnert avatar Nov 20 '24 23:11 jkohnert

As I mentioned in #1910 we do have a trouble in SPAWN for wait()

I am not sure it is the best way to just make a small patch without trying to do a good management of wait() and SPAWN ... But I am too bad in C to understand https://man7.org/linux/man-pages/man2/waitpid.2.html

I also check that this is the only occurrence of wait()

alaingdl avatar Nov 21 '24 07:11 alaingdl

I tried waitpid(pid, &status, 0) while debugging, this doesn't change anything... There must be something else broken. :sob:

jkohnert avatar Nov 21 '24 11:11 jkohnert

@alaingdl Got the reason: The culprit is setting SIGCHLD to SIG_IGN (gdl.cpp, line 519). This was introduced in #1870 (at least according to the commit message). However, in doing so, a wait() call will result in waiting for all children to finish and then return -1 (as documented in https://man7.org/linux/man-pages/man2/wait.2.html). Commenting this line restores correct behaviour for wait() at least on my local maschine.

We might have another problem for the other failing tests on Apple, though.

Best, Jan

jkohnert avatar Nov 23 '24 23:11 jkohnert

@jkohnert bravo for the find! When I made #1870 I was not aware SIG_IGN would influence SPAWN (silly, in retrospect, IDL_IDLBridge being a more refined method of gdl respawning itself). Not clear if it will be safe to locally change SIGCHLD handling during the SPAWN command, though. TBC.

GillesDuvert avatar Nov 25 '24 16:11 GillesDuvert

closing as we have 2 different issues there. One solved (spwan). The other not easily solvable (arm64 different math). Opening another issue.

GillesDuvert avatar Nov 28 '24 10:11 GillesDuvert