
Github workflow: Segmentation fault

Open wo80 opened this issue 2 years ago • 20 comments

I was skimming the Checks tab of my recent pull request and saw a couple of Segmentation fault (core dumped) messages in the Tests section. The error is also present in all other pull requests running the workflow.

The first, obvious problem: a test that is producing an error should make the workflow check fail.

But I can also reproduce this error, for example with dlinsolx on Windows:

./dlinsolx -l 100000000 < ../../EXAMPLE/g20.rua

While this fails on Windows, on Arch Linux the above command succeeds, but the test command fails with a segmentation fault:

./d_test -t "SP" -s 5 -l 100000000 -f ../../EXAMPLE/g20.rua

Can anybody reproduce this?

wo80 avatar Aug 04 '23 14:08 wo80

I looked into the configuration of the CMake tests, and besides being overly complex, they also seem to be fundamentally flawed (meaning they aren't testing anything).

First I added a simple check to runtest.cmake:

# execute the test command that was added earlier.
execute_process( COMMAND "${TEST}" 
  OUTPUT_FILE "${OUTPUT}"
  RESULT_VARIABLE RET )

if(NOT RET EQUAL 0)
  message("Error: ${RET}")
endif()

[...]

which prints Error: permission denied. This is because in TESTING/CMakeLists.txt the command set(TEST_LOC ${CMAKE_CURRENT_BINARY_DIR}) sets TEST_LOC to a directory, and so

add_test( ${testName}_SP  "${CMAKE_COMMAND}"
  -DTEST=${TEST_LOC} -t "SP" -s ${s} -l ${l} -f ${TEST_INPUT}
  [...]

will try to execute the directory rather than the actual test executable inside it. Simplifying add_test to

add_test(
  NAME ${testName}_SP
  COMMAND ${target} -t "SP" -s ${s} -l ${l} -f "${TEST_INPUT}")

then reveals the segfault:

Test project /projects/superlu/build/Testing
      Start  1: s_test_9_2_0_LA
 1/24 Test  #1: s_test_9_2_0_LA ..................   Passed    0.02 sec
      Start  2: s_test_9_2_10000000_LA
 2/24 Test  #2: s_test_9_2_10000000_LA ...........   Passed    0.02 sec
      Start  3: s_test_19_2_0_LA
 3/24 Test  #3: s_test_19_2_0_LA .................   Passed    0.03 sec
      Start  4: s_test_19_2_10000000_LA
 4/24 Test  #4: s_test_19_2_10000000_LA ..........   Passed    0.03 sec
      Start  5: s_test_2_0_SP
 5/24 Test  #5: s_test_2_0_SP ....................   Passed    0.06 sec
      Start  6: s_test_2_10000000_SP
 6/24 Test  #6: s_test_2_10000000_SP .............   Passed    0.07 sec
      Start  7: d_test_9_2_0_LA
 7/24 Test  #7: d_test_9_2_0_LA ..................   Passed    0.02 sec
      Start  8: d_test_9_2_10000000_LA
 8/24 Test  #8: d_test_9_2_10000000_LA ...........   Passed    0.02 sec
      Start  9: d_test_19_2_0_LA
 9/24 Test  #9: d_test_19_2_0_LA .................   Passed    0.03 sec
      Start 10: d_test_19_2_10000000_LA
10/24 Test #10: d_test_19_2_10000000_LA ..........   Passed    0.03 sec
      Start 11: d_test_2_0_SP
11/24 Test #11: d_test_2_0_SP ....................   Passed    0.06 sec
      Start 12: d_test_2_10000000_SP
12/24 Test #12: d_test_2_10000000_SP .............***Exception: SegFault  0.01 sec
      Start 13: c_test_9_2_0_LA
13/24 Test #13: c_test_9_2_0_LA ..................   Passed    0.02 sec
      Start 14: c_test_9_2_10000000_LA
14/24 Test #14: c_test_9_2_10000000_LA ...........   Passed    0.03 sec
      Start 15: c_test_19_2_0_LA
15/24 Test #15: c_test_19_2_0_LA .................   Passed    0.06 sec
      Start 16: c_test_19_2_10000000_LA
16/24 Test #16: c_test_19_2_10000000_LA ..........   Passed    0.06 sec
      Start 17: c_test_2_0_SP
17/24 Test #17: c_test_2_0_SP ....................   Passed    0.12 sec
      Start 18: c_test_2_10000000_SP
18/24 Test #18: c_test_2_10000000_SP .............***Exception: SegFault  0.01 sec
      Start 19: z_test_9_2_0_LA
19/24 Test #19: z_test_9_2_0_LA ..................   Passed    0.03 sec
      Start 20: z_test_9_2_10000000_LA
20/24 Test #20: z_test_9_2_10000000_LA ...........   Passed    0.03 sec
      Start 21: z_test_19_2_0_LA
21/24 Test #21: z_test_19_2_0_LA .................   Passed    0.07 sec
      Start 22: z_test_19_2_10000000_LA
22/24 Test #22: z_test_19_2_10000000_LA ..........   Passed    0.07 sec
      Start 23: z_test_2_0_SP
23/24 Test #23: z_test_2_0_SP ....................   Passed    0.15 sec
      Start 24: z_test_2_10000000_SP
24/24 Test #24: z_test_2_10000000_SP .............   Passed    0.16 sec

92% tests passed, 2 tests failed out of 24

Total Test time (real) =   1.23 sec

The following tests FAILED:
         12 - d_test_2_10000000_SP (SEGFAULT)
         18 - c_test_2_10000000_SP (SEGFAULT)
Errors while running CTest

wo80 avatar Aug 05 '23 20:08 wo80

Two more observations on the actual error:

The problem occurs in both debug and release mode, and it doesn't seem to behave deterministically. Most of the time I get the segfault, but sometimes the tests finish and produce garbage solutions:

    [...]
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(1)=  6.2187e+09
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(2)=  3.0712e+10
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(4)=  3.8735e+09
    dgssvx:fact=   3, trans=   0, equed=B, n=400, imat=0, test(1)=  1.9755e+14
    dgssvx:fact=   3, trans=   0, equed=B, n=400, imat=0, test(2)=  5.4097e+13
    dgssvx:fact=   3, trans=   0, equed=B, n=400, imat=0, test(4)=  2.2462e+13
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(1)=  6.2187e+09
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(2)=  3.0712e+10
    dgssvx:fact=   3, trans=   1, equed=B, n=400, imat=0, test(4)=  3.8735e+09
DGE driver: 92 out of 144 tests failed to pass the threshold

EDIT: To make ctest recognize this as a test failure, the drivers (cdrive.c etc.) should not return 0, but

    return nfail == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
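
As a minimal sketch of that change: the helper name exit_status_from_failures is made up for illustration and is not part of SuperLU, and nfail stands for the failure counter the drivers already print. The mapping from failure count to exit status could look like this:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not in SuperLU): turn a driver's failure count
 * into a process exit status that ctest can act on. */
static int exit_status_from_failures(int nfail)
{
    assert(nfail >= 0); /* a failure count can never be negative */
    return nfail == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

A driver's main would then end with return exit_status_from_failures(nfail); so that ctest sees a non-zero status whenever a threshold check fails.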

wo80 avatar Aug 05 '23 20:08 wo80

I think it would be best to open 3 separate issues:

  1. The Github workflow should fail when a test fails (shouldn't matter if an actual test condition fails or a segfault occurs)
  2. The CMake test setup needs to be fixed (addressed in PR #112 )
  3. The actual cause of the segfault needs to be investigated

wo80 avatar Aug 05 '23 21:08 wo80

Here's what valgrind has to say about it:

/projects/superlu/build/TESTING$ valgrind ./d_test -t "SP" -s 5 -l 5000000 -f ../../EXAMPLE/g20.rua
==11462== Memcheck, a memory error detector
==11462== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==11462== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==11462== Command: ./d_test -t SP -s 5 -l 5000000 -f ../../EXAMPLE/g20.rua
==11462== 
.. test sparse matrix in file: ../../EXAMPLE/g20.rua
g20, symm permuted by SYMMMD                                            SYM     
==11462== Conditional jump or move depends on uninitialised value(s)
==11462==    at 0x12BCC2: relax_snode (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11C973: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Conditional jump or move depends on uninitialised value(s)
==11462==    at 0x116E87: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Use of uninitialised value of size 8
==11462==    at 0x116E6C: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Use of uninitialised value of size 8
==11462==    at 0x116E73: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Invalid read of size 1
==11462==    at 0x116E6C: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462==  Address 0x4f2503f is 1 bytes before a block of size 5,000,000 alloc'd
==11462==    at 0x48407B4: malloc (vg_replace_malloc.c:381)
==11462==    by 0x116CAD: superlu_malloc (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x1094F3: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== Invalid write of size 1
==11462==    at 0x116E73: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462==  Address 0x4f2503f is 1 bytes before a block of size 5,000,000 alloc'd
==11462==    at 0x48407B4: malloc (vg_replace_malloc.c:381)
==11462==    by 0x116CAD: superlu_malloc (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x1094F3: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== 
==11462== More than 10000000 total errors detected.  I'm not reporting any more.
==11462== Final error counts will be inaccurate.  Go fix your program!
==11462== Rerun with --error-limit=no to disable this cutoff.  Note
==11462== that errors may occur in your program without prior warning from
==11462== Valgrind, because errors are no longer being displayed.
==11462== 
==11462== 
==11462== Process terminating with default action of signal 11 (SIGSEGV)
==11462==  Bad permissions for mapped region at address 0x4B13FFF
==11462==    at 0x116E73: user_bcopy (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12541C: dexpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x124EC2: dLUMemXpand (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x12300E: dcolumn_bmod (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x11CFA0: dgstrf (in /projects/superlu/build/TESTING/d_test)
==11462==    by 0x10A1B2: main (in /projects/superlu/build/TESTING/d_test)
==11462== 
==11462== HEAP SUMMARY:
==11462==     in use at exit: 5,169,304 bytes in 37 blocks
==11462==   total heap usage: 46 allocs, 9 frees, 5,185,088 bytes allocated
==11462== 
==11462== LEAK SUMMARY:
==11462==    definitely lost: 0 bytes in 0 blocks
==11462==    indirectly lost: 0 bytes in 0 blocks
==11462==      possibly lost: 0 bytes in 0 blocks
==11462==    still reachable: 5,169,304 bytes in 37 blocks
==11462==         suppressed: 0 bytes in 0 blocks
==11462== Rerun with --leak-check=full to see details of leaked memory
==11462== 
==11462== Use --track-origins=yes to see where uninitialised values come from
==11462== For lists of detected and suppressed errors, rerun with: -s
==11462== ERROR SUMMARY: 10000000 errors from 6 contexts (suppressed: 0 from 0)

wo80 avatar Aug 08 '23 11:08 wo80

So, the relevant part is

Invalid write of size 1
   at 0x116E73: user_bcopy
   by 0x12541C: dexpand
   by 0x124EC2: dLUMemXpand
   by 0x12300E: dcolumn_bmod
   by 0x11CFA0: dgstrf
   by 0x10A1B2: main
 Address 0x4f2503f is 1 bytes before a block of size 5,000,000 alloc'd

wo80 avatar Aug 08 '23 11:08 wo80

So I think I tracked down the origin of the problem. Not sure what the correct fix would be, though.

In dmemory.c, function dexpand (https://github.com/xiaoyeli/superlu/blob/29ea08a6deb67efc3be92068e60cb3605ef3f1fc/SRC/dmemory.c#L573-L577), expanders[type + 1] is not initialized at the time of the call (where type = LUSUP).

That is due to https://github.com/xiaoyeli/superlu/blob/29ea08a6deb67efc3be92068e60cb3605ef3f1fc/SRC/superlu_enum_consts.h#L37-L38 and dLUMemInit only initializing the first four positions of expanders https://github.com/xiaoyeli/superlu/blob/29ea08a6deb67efc3be92068e60cb3605ef3f1fc/SRC/dmemory.c#L243-L250

EDIT: I only debugged d_test. The same will most likely be the cause in c, s and z versions of the code.

wo80 avatar Aug 11 '23 05:08 wo80

I was wondering about the test type != USUB in https://github.com/xiaoyeli/superlu/blob/29ea08a6deb67efc3be92068e60cb3605ef3f1fc/SRC/dmemory.c#L573-L577

Maybe the whole problem originates in a change of MemType made in https://github.com/xiaoyeli/superlu/commit/52fc55d0397e382f46bdc4fb77445d0e2f4181ea#diff-4964d63c55baaf45c54e2d8b0485e230848b1a9da1d7c9fa40bdc3f77442c08d

Before that change the order was

typedef enum {LUSUP, UCOL, LSUB, USUB, LLVL, ULVL}              MemType;

so testing for USUB would have been correct. But the order changed to

typedef enum {USUB, LSUB, UCOL, LUSUP, LLVL, ULVL, NO_MEMTYPE}  MemType;

and maybe that was just missed in other places of the code, like dexpand.

So testing for type != LUSUP might be the correct fix. But I don't have enough insight into the SuperLU implementation details to be sure :-)
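
To illustrate why the reordering matters, here is a simplified model, not the actual dexpand code. The OLD_/NEW_ prefixes, N_EXPANDERS and the helper access_stays_in_range are invented for this sketch. The only facts taken from this thread are the two enum orders, the four initialized expanders slots and the expanders[type + 1] access.

```c
#include <assert.h>

/* Enum order before the change (names prefixed to let both coexist). */
typedef enum { OLD_LUSUP, OLD_UCOL, OLD_LSUB, OLD_USUB,
               OLD_LLVL, OLD_ULVL } OldMemType;

/* Enum order after the change. */
typedef enum { NEW_USUB, NEW_LSUB, NEW_UCOL, NEW_LUSUP,
               NEW_LLVL, NEW_ULVL, NEW_NO_MEMTYPE } NewMemType;

/* Assumption from the thread: only the first four slots are initialized. */
enum { N_EXPANDERS = 4 };

/* Models a guard of the form "if (type != guarded_type) use slot type + 1".
 * Returns 1 if the access is skipped or stays within the initialized
 * slots, and 0 if it would touch slot N_EXPANDERS or beyond. */
static int access_stays_in_range(int type, int guarded_type)
{
    assert(type >= 0 && type < N_EXPANDERS);
    if (type == guarded_type) {
        return 1; /* guard skips the expanders[type + 1] access */
    }
    return type + 1 < N_EXPANDERS;
}
```

With the old order and a guard on USUB (value 3), every type from 0 to 3 is safe. With the new order USUB is 0, so the guard lets LUSUP (value 3) through and slot 4 is used, which this model treats as uninitialized. Guarding on LUSUP instead keeps all four types in range.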

wo80 avatar Aug 11 '23 06:08 wo80

I just tested replacing type != USUB with type != LUSUP. Though this prevents the segfault, it does not prevent some of the tests from failing (assuming the return value fix for the test drivers mentioned in https://github.com/xiaoyeli/superlu/issues/108#issuecomment-1666602452 is applied).

The following tests FAILED:
	  6 - s_test_2_10000000_SP (Failed)
	 12 - d_test_2_10000000_SP (Failed)
	 18 - c_test_2_10000000_SP (Failed)
	 24 - z_test_2_10000000_SP (Failed)

I guess this is as far as I can go without digging into the memory management details of SuperLU.

wo80 avatar Aug 12 '23 09:08 wo80

I can reproduce the issue. Your analysis looks good, I am convinced that this was introduced by the commit you mentioned! Skimming through the commit, the changes to the enum are nowhere motivated and thus most probably wrong.

@xiaoyeli What do you think? Can we (partially) revert 52fc55d? Or do you know which pieces are needed from superlu_dist to fix these examples?

gruenich avatar Aug 12 '23 18:08 gruenich

Resolved it in Master.

xiaoyeli avatar Sep 11 '23 04:09 xiaoyeli

Resolved it in Master.

Alright. Now the remaining two issues mentioned above https://github.com/xiaoyeli/superlu/issues/108#issuecomment-1666605067 should be addressed.

Regarding the GitHub workflow: since the CMake build script works pretty well, I'd suggest installing cmake and then using it to build and test. Something along these lines (not tested):

    - uses: actions/checkout@v3

    - name: Configure
      run: cmake -B build      

    - name: Build
      run: cmake --build build --parallel

    - name: Test
      run: ctest --test-dir build --output-on-failure

wo80 avatar Sep 11 '23 08:09 wo80

Btw, if you look at the test output of the Github workflow, you still see a bunch of segfaults.

I strongly suggest that you fix the test setup, so the workflow reflects those problems.

wo80 avatar Sep 11 '23 09:09 wo80

I just tested the cmake workflow here https://github.com/wo80/superlu/commit/e494d2ac8c1bb17d475432273cdf8b60ba6f391a and all tests are passing.

@xiaoyeli Please let me know if you want me to merge this into #112

EDIT: I applied the change suggested in https://github.com/xiaoyeli/superlu/issues/108#issuecomment-1666602452 (see https://github.com/wo80/superlu/commit/53794fa76ae7f92c619f5b7940cc08ffa8daae1b) and this makes the tests fail. The segfault is also still present.

wo80 avatar Sep 11 '23 12:09 wo80

After merging the upstream changes, the segfault seems to be fixed. But now the "LA" d_tests fail with

Subprocess aborted***Exception:   0.15 sec
dgstrf info 1
dgstrf info 1
dgstrf info 19
double free or corruption (out)

see https://github.com/wo80/superlu/actions/runs/6146404541/job/16675708872

Failing tests:

 7/24 Test  #7: d_test_9_2_0_LA ..................Subprocess aborted***Exception
 8/24 Test  #8: d_test_19_2_0_LA .................Subprocess aborted***Exception
 9/24 Test  #9: d_test_2_0_SP ....................Passed
10/24 Test #10: d_test_9_2_10000000_LA ...........Subprocess aborted***Exception
11/24 Test #11: d_test_19_2_10000000_LA ..........Subprocess aborted***Exception
12/24 Test #12: d_test_2_10000000_SP .............Passed

Valgrind output:

valgrind --track-origins=yes --leak-check=full ./d_test -t "LA" -n 9 -s 2 -l 0
Memcheck, a memory error detector
Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
Command: ./d_test -t LA -n 9 -s 2 -l 0

dgstrf info 1
dgstrf info 1
dgstrf info 9
Invalid read of size 8
   at 0x10C6BA: dgst01
   by 0x10B4CA: main
 Address 0x5158f78 is 8 bytes before a block of size 72 alloc'd
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x126CBE: doubleCalloc
   by 0x10C19D: dgst01
   by 0x10B4CA: main

[...] more of those errors

dgstrf info 9
dgstrf info 5
Invalid read of size 4
   at 0x10C382: dgst01
   by 0x10B4CA: main
 Address 0x516bc5c is 4 bytes before a block of size 40 alloc'd
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x117FAE: int32Malloc
   by 0x1258B5: dLUMemInit
   by 0x11D857: dgstrf
   by 0x118D40: dgssv
   by 0x10B422: main

Invalid read of size 4
   at 0x10C3FA: dgst01
   by 0x10B4CA: main
 Address 0x516c8bc is 4 bytes before a block of size 648 alloc'd
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x1264D8: dexpand
   by 0x125A0D: dLUMemInit
   by 0x11D857: dgstrf
   by 0x118D40: dgssv
   by 0x10B422: main

[...] more of those errors

dgstrf info 5
All tests for DGE driver passed the threshold (  1158 tests run)

HEAP SUMMARY:
    in use at exit: 319,748 bytes in 578 blocks
  total heap usage: 23,849 allocs, 23,271 frees, 11,336,936 bytes allocated

2,688 (960 direct, 1,728 indirect) bytes in 24 blocks are definitely lost in loss record 15 of 23
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x118265: sp_preorder
   by 0x119862: dgssvx
   by 0x10B767: main

74,880 (1,408 direct, 73,472 indirect) bytes in 44 blocks are definitely lost in loss record 21 of 23
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x126E18: dCreate_CompCol_Matrix
   by 0x11E487: dgstrf
   by 0x1198E5: dgssvx
   by 0x10B767: main

81,216 (2,464 direct, 78,752 indirect) bytes in 44 blocks are definitely lost in loss record 22 of 23
   at 0x48407B4: malloc (vg_replace_malloc.c:381)
   by 0x117DAA: superlu_malloc
   by 0x127347: dCreate_SuperNode_Matrix
   by 0x11E44C: dgstrf
   by 0x1198E5: dgssvx
   by 0x10B767: main

LEAK SUMMARY:
   definitely lost: 4,832 bytes in 112 blocks
   indirectly lost: 153,952 bytes in 444 blocks
     possibly lost: 0 bytes in 0 blocks
   still reachable: 160,964 bytes in 22 blocks
        suppressed: 0 bytes in 0 blocks
Reachable blocks (those to which a pointer was found) are not shown.
To see them, rerun with: --leak-check=full --show-leak-kinds=all

For lists of detected and suppressed errors, rerun with: -s
ERROR SUMMARY: 3883 errors from 27 contexts (suppressed: 0 from 0)
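
A quick reading of the numbers in the "Invalid read" reports above, as a sketch: a read of size 8 located 8 bytes before a 72-byte block allocated by doubleCalloc looks like index -1 into an array of nine doubles (72 / 8 = 9, matching -n 9), and a read of size 4 located 4 bytes before an int32Malloc block looks like index -1 into an array of 4-byte integers. The helper name below is hypothetical. It only does this arithmetic and says nothing about the actual indexing code in dgst01.

```c
#include <assert.h>

/* Hypothetical helper: convert a byte offset relative to the start of an
 * allocated block into an element index, given the element size in bytes.
 * A negative result means the access lies before the block. */
static long index_from_byte_offset(long byte_offset, long elem_size)
{
    assert(elem_size > 0);
    assert(byte_offset % elem_size == 0); /* only aligned accesses */
    return byte_offset / elem_size;
}
```

Calling index_from_byte_offset(-8, 8) and index_from_byte_offset(-4, 4) both give -1, which would be consistent with an index that is one too small somewhere, for example an entry of -1 being used as an array index.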

wo80 avatar Sep 11 '23 15:09 wo80

I think I found the culprit: https://github.com/xiaoyeli/superlu/commit/cf93b7e131d379774a52b184e23548d84eb66e30 https://github.com/xiaoyeli/superlu/blob/90ee45dc836d8f4ff967cad4aa2821809b12fdc9/SRC/dpivotL.c#L134-L146

This was an external contribution merged two days ago, and it's a perfect demonstration of how important a functional CI test setup is. So I'll quote myself from https://github.com/xiaoyeli/superlu/pull/112:

I think it's important to have tests reflecting reality and I think that this should be merged rather sooner than later (even if the issue remains unresolved for now).

wo80 avatar Sep 11 '23 17:09 wo80

I rebased #112 and added the changes from my fix/github-workflow branch (now deleted).

wo80 avatar Sep 11 '23 17:09 wo80

I haven't addressed the above issue in dpivotL.c. I think it's better if you, @xiaoyeli, fix this in a single commit. Then the workflow shouldn't fail anymore.

EDIT: just for demonstration https://github.com/wo80/superlu/actions/runs/6149774863/job/16686331442

wo80 avatar Sep 11 '23 17:09 wo80

https://github.com/xiaoyeli/superlu/commit/f63265a50e6dec635c20f04f7b47e93b0a5c198b seems to fix the issue, the workflow tests are passing.

One question remaining, though:

Comparing dpivotL.c to the (c|s|z)pivotL.c variants, the code re-assigning perm_r[*pivrow] in the section labelled /* Test for singularity */ (see https://github.com/xiaoyeli/superlu/issues/108#issuecomment-1714265717 above) is disabled in the d variant, but not in the c|s|z variants. Which one is correct?

wo80 avatar Sep 12 '23 10:09 wo80

I think it would be best to open 3 separate issues:

  1. The Github workflow should fail when a test fails (shouldn't matter if an actual test condition fails or a segfault occurs)
  2. The CMake test setup needs to be fixed (addressed in PR https://github.com/xiaoyeli/superlu/pull/112 )
  3. The actual cause of the segfault needs to be investigated

  1. This is addressed in #131, with some of your changes and some additions of my own.
  2. Fixed by your commits, merged as #114.
  3. The segfault is also fixed; the checks for #131 are passing.

I would like to extend this list:

  4. We need a Windows runner. I created #132 for this, so as not to extend this thread any further.
  5. Your last question should be answered. @xiaoyeli, do you know the answer to this question?

Comparing dpivotL.c to the (c|s|z)pivotL.c variants, the code re-assigning perm_r[*pivrow] in the section labelled /* Test for singularity */ (see https://github.com/xiaoyeli/superlu/issues/108#issuecomment-1714265717 above) is disabled in the d variant, but not in the c|s|z variants. Which one is correct?

gruenich avatar Nov 25 '23 12:11 gruenich

I just pushed the fix to the following:

Comparing dpivotL.c to the (c|s|z)pivotL.c variants, the code re-assigning perm_r[*pivrow] in the section labelled /* Test for singularity */ (see https://github.com/xiaoyeli/superlu/issues/108#issuecomment-1714265717 above) is disabled in the d variant, but not in the c|s|z variants. Which one is correct?

xiaoyeli avatar Nov 26 '23 17:11 xiaoyeli

@xiaoyeli This can be closed now. Thanks for the fix!

gruenich avatar Jul 22 '24 20:07 gruenich