
GPU code for the indefinite matrices needs checking/fixing

Open venovako opened this issue 7 years ago • 1 comments

Please check that the indefinite GPU code works correctly.

There are indications of uninitialised memory usage (reported by cuda-memcheck's initcheck tool) in the GPU part of the code.

Also, some of the ssids_test tests fail when run against a GPU build of SSIDS:

Example 1 (ssids_test output during make check, when redirected to a file ssids_test.log):

================
Testing warnings
================

 * Testing warnings (columns)

 * Testing out of range above............ok
 *    checking answer....................ok
 * Testing out of range below............ok
 *    checking answer....................ok
 * Testing duplicates....................ok
 *    checking answer....................ok
 * Testing out of range and duplicates...ok
 *    checking answer....................ok
 * Testing missing diagonal entry (indef).....ok
 *    checking answer....................ok
 * Testing missing diagonal and out of range..ok
 *    checking answer....................ok
 * Testing arrays min size (zero diag)........ok
 *    checking answer....................ok
 * Testing missing diagonal and duplicate.....ok
 *    checking answer....................ok
 * Testing analyse with structurally singular.ok
 *    checking answer....................ok
 * Testing analyse with structurally singular.ok
 *    checking answer....................ok
 * Testing factor with singular matrix.......
 *    checking answer....................ok
 * Testing factor with match ord no scale....
 *    checking answer....................ok

 * Testing warnings (coord)

 * Testing out of range above............ok
 *    checking answer....................ok
 * Testing analyse struct singular and MC80..ok
 *    checking answer....................ok

======================
Testing errors:
======================

 * Testing bad arguments ssids_analyse (columns)
 * Testing n<0...............................ok
 * Testing ptr with zero component...........ok
 * Testing non-monotonic ptr.................ok
 * Testing all A%row oor.....................ok
 * Testing nemin oor.........................ok
 * Testing order absent......................ok
 * Testing order too short...................ok
 * Testing order out of range above..........ok
 * Testing order out of range below..........ok
 * Testing options%ordering out of range.....ok
 * Testing options%ordering oor..............ok
 * Testing val absent........................ok

 * Testing bad arguments ssids_analyse_coord (coordinate form)
 * Testing order out of range above..........ok
 * Testing order out of range below..........ok
 * Testing options%ordering oor..............ok
 * Testing val absent........................ok
 * Testing order absent......................ok
 * Testing n<0...............................ok
 * Testing ne < 0............................ok
 * Testing all oor...........................ok

 * Testing errors from ssids_factor
 * Testing after analyse error...............ok
 * Testing not calling analyse...............ok
 * Testing ptr absent........................ok
 * Testing row absent........................ok
 * Testing factor with singular matrix.......ok
 * Testing factor with singular matrix (MC64 scale).ok
 * Testing factor psdef with indef...........ok
 * Testing factor psdef with indef, large...ok
 * Testing u oor.............................ok
 * Testing options%scaling=3 no matching.....ok

 * Testing bad arguments ssids_solve
 * Testing solve after factor error..........ok
 * Testing solve out of sequence.............ok
 * Testing job out of range below............ok
 * Testing job out of range above............ok
 * Testing error in x (one rhs)..............ok
 * Testing error in lx.......................ok
 * Testing error in nrhs.....................ok

 * Testing bad arguments ssids_enquire_posdef
 * Testing call to enquire after error.......ok
 * Testing call to enquire out of seq........ok
 * Testing call to enquire_posdef with indef.ok

 * Testing bad arguments ssids_enquire_indef
 * Testing call to enquire after error.......ok
 * Testing call to enquire out of seq........ok
 * Testing call to enquire_indef with posdef.ok

 * Testing bad arguments ssids_alter
 * Testing call to alter after error.........ok
 * Testing call to alter out of seq..........ok
 * Testing call to alter_indef with posdef...ok

=====================
Testing special cases
=====================
 * Testing n = 0 (CSC)...................ok
 * Testing n = 0 (coord).................ok

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x3ad843265f in ???
#1  0x4766b4 in _ZN5spral5ssids3cpu19block_ldlt_internal11find_maxlocIdLi32EEEviPKT_iRS4_RiS8_
        at ./src/ssids/cpu/kernels/block_ldlt.hxx:205
#2  0x476ecc in _ZN5spral5ssids3cpu10block_ldltIdLi32EEEviPiPT_iS5_S5_bS4_S4_S3_
        at ./src/ssids/cpu/kernels/block_ldlt.hxx:299
#3  0x47a9c9 in _ZN5spral5ssids3cpu17ldlt_app_internal5BlockIdLi32ENS1_14BuddyAllocatorIiSaIdEEEE6factorINS4_IdS5_EEEEiiPiPdRKNS1_18cpu_factor_optionsERSt6vectorINS1_9WorkspaceESaISG_EERKT_
        at src/ssids/cpu/kernels/ldlt_app.cxx:1014
#4  0x47bd82 in _ZN5spral5ssids3cpu17ldlt_app_internal4LDLTIdLi32ENS2_10CopyBackupIdNS1_14BuddyAllocatorIdSaIdEEEEELb0ELb0ES7_E24run_elim_pivoted_notasksEiiPiPdiSB_RNS2_10ColumnDataIdNS5_IiS6_EEEERS8_RKNS1_18cpu_factor_optionsEidSB_iRSt6vectorINS1_9WorkspaceESaISL_EERKS7_i
        at src/ssids/cpu/kernels/ldlt_app.cxx:1613
#5  0x47958f in _ZN5spral5ssids3cpu17ldlt_app_internal4LDLTIdLi32ENS2_10CopyBackupIdNS1_14BuddyAllocatorIdSaIdEEEEELb0ELb0ES7_E6factorEiiPiPdiSB_RS8_RKNS1_18cpu_factor_optionsENS1_11PivotMethodEidSB_iRSt6vectorINS1_9WorkspaceESaISI_EERKS7_
        at src/ssids/cpu/kernels/ldlt_app.cxx:2367
#6  0x47a47b in _ZN5spral5ssids3cpu17ldlt_app_internal5BlockIdLi32ENS1_14BuddyAllocatorIiSaIdEEEE6factorINS4_IdS5_EEEEiiPiPdRKNS1_18cpu_factor_optionsERSt6vectorINS1_9WorkspaceESaISG_EERKT_
        at src/ssids/cpu/kernels/ldlt_app.cxx:981
#7  0x47e0bc in _ZN5spral5ssids3cpu17ldlt_app_internal4LDLTIdLi32ENS2_10CopyBackupIdNS1_14BuddyAllocatorIdSaIdEEEEELb1ELb0ES7_E16run_elim_pivotedEiiPiPdiSB_RNS2_10ColumnDataIdNS5_IiS6_EEEERS8_RKNS1_18cpu_factor_optionsEidSB_iRSt6vectorINS1_9WorkspaceESaISL_EERKS7_i._omp_fn.13
        at src/ssids/cpu/kernels/ldlt_app.cxx:1323
#8  0x2b91fa9f7469 in gomp_barrier_handle_tasks
        at ../.././libgomp/task.c:1271
#9  0x2b91faa0009b in gomp_team_barrier_wait_end
        at ../.././libgomp/config/linux/bar.c:116
#10  0x2b91fa9fd949 in gomp_thread_start
        at ../.././libgomp/team.c:121
#11  0x3ad9007aa0 in ???
#12  0x3ad84e8aac in ???
#13  0xffffffffffffffff in ???
FAIL ssids_test (exit status: 139)

Example 2 (ssids_test output when run directly, with output to the console):

================
Testing warnings
================

 * Testing warnings (columns)

 * Testing out of range above............ok
 *    checking answer....................ok
 * Testing out of range below............ok
 *    checking answer....................ok
 * Testing duplicates....................ok
 *    checking answer....................ok
 * Testing out of range and duplicates...ok
 *    checking answer....................ok
 * Testing missing diagonal entry (indef).....ok
 *    checking answer....................ok
 * Testing missing diagonal and out of range..ok
 *    checking answer....................ok
 * Testing arrays min size (zero diag)........ok
 *    checking answer....................ok
 * Testing missing diagonal and duplicate.....ok
 *    checking answer....................ok
 * Testing analyse with structurally singular.ok
 *    checking answer....................ok
 * Testing analyse with structurally singular.ok
 *    checking answer....................ok
 * Testing factor with singular matrix.......
 *    checking answer....................ok
 * Testing factor with match ord no scale....
 *    checking answer....................ok

 * Testing warnings (coord)

 * Testing out of range above............ok
 *    checking answer....................ok
 * Testing analyse struct singular and MC80..ok
 *    checking answer....................ok

======================
Testing errors:
======================

 * Testing bad arguments ssids_analyse (columns)
 * Testing n<0...............................ok
 * Testing ptr with zero component...........ok
 * Testing non-monotonic ptr.................ok
 * Testing all A%row oor.....................ok
 * Testing nemin oor.........................ok
 * Testing order absent......................ok
 * Testing order too short...................ok
 * Testing order out of range above..........ok
 * Testing order out of range below..........ok
 * Testing options%ordering out of range.....ok
 * Testing options%ordering oor..............ok
 * Testing val absent........................ok

 * Testing bad arguments ssids_analyse_coord (coordinate form)
 * Testing order out of range above..........ok
 * Testing order out of range below..........ok
 * Testing options%ordering oor..............ok
 * Testing val absent........................ok
 * Testing order absent......................ok
 * Testing n<0...............................ok
 * Testing ne < 0............................ok
 * Testing all oor...........................ok

 * Testing errors from ssids_factor
 * Testing after analyse error...............ok
 * Testing not calling analyse...............ok
 * Testing ptr absent........................ok
 * Testing row absent........................ok
 * Testing factor with singular matrix.......ok
 * Testing factor with singular matrix (MC64 scale).ok
 * Testing factor psdef with indef...........ok
 * Testing factor psdef with indef, large...ok
 * Testing u oor.............................ok
 * Testing options%scaling=3 no matching.....ok

 * Testing bad arguments ssids_solve
 * Testing solve after factor error..........ok
 * Testing solve out of sequence.............ok
 * Testing job out of range below............ok
 * Testing job out of range above............ok
 * Testing error in x (one rhs)..............ok
 * Testing error in lx.......................ok
 * Testing error in nrhs.....................ok

 * Testing bad arguments ssids_enquire_posdef
 * Testing call to enquire after error.......ok
 * Testing call to enquire out of seq........ok
 * Testing call to enquire_posdef with indef.ok

 * Testing bad arguments ssids_enquire_indef
 * Testing call to enquire after error.......ok
 * Testing call to enquire out of seq........ok
 * Testing call to enquire_indef with posdef.ok

 * Testing bad arguments ssids_alter
 * Testing call to alter after error.........ok
 * Testing call to alter out of seq..........ok
 * Testing call to alter_indef with posdef...ok

=====================
Testing special cases
=====================
 * Testing n = 0 (CSC)...................ok
 * Testing n = 0 (coord).................ok
 * Testing zero pivot code ..............ok
 *    checking answer....................ok
 * Testing zero pivot code (block).......ok
 *    checking answer....................ok
 * Testing zero pivot code (column)......ok
 *    checking answer....................ok
 * Testing n>1e5, ne<3.0*n, order=1......ok
 * Testing n>1e5, ne>3.0*n, order=1......ok
 * Testing n>1e5, ne>3.0*n, order=2......ok
 * Testing n<1e5,oxo,m1<1.8*m2,order=1...ok
 *    checking answer....................ok
 * Testing n<1e5,oxo,m1>1.8*m2,order=1...ok
 *    checking answer....................ok
 * Testing n<1e5,oxo,m1>1.8*m2,order=2...ok
 *    checking answer....................ok
 * Testing n=500, posdef, BBD............ok
 *    checking answer....................ok

=======================
Testing random matrices
=======================
 - no.   1 n =     1 nza =       1... num_flops:   0.0 ok...
 - no.   2 n =     2 nza =       2... num_flops:   0.0 ok...
 - no.   3 n =     3 nza =       4... num_flops:   0.0 ok...
 + no.   4 n =     4 nza =       8... num_flops:   0.0 ok...
 - no.   5 n =     5 nza =       6... num_flops:   0.0 ok...
 + no.   6 n =     6 nza =      11... num_flops:   0.0 ok...
 - no.   7 n =     7 nza =      21... num_flops:   0.0 ok...
 + no.   8 n =     8 nza =      29... num_flops:   0.0 ok...
 - no.   9 n =     9 nza =      23... num_flops:   0.0 ok...
 + no.  10 n =    10 nza =      14... num_flops:   0.0 ok...
 - no.  11 n =    11 nza =      48... num_flops:   0.0 ok...
 + no.  12 n =    12 nza =      66... num_flops:   0.0 ok...
 - no.  13 n =    13 nza =      48... num_flops:   0.0 ok...
 + no.  14 n =    14 nza =      25... num_flops:   0.0 ok...
 + no.  15 n =    15 nza =     106... num_flops:   0.0 ok...
 + no.  16 n =    16 nza =     103... num_flops:   0.0 ok...
 + no.  17 n =    17 nza =     128... num_flops:   0.0 ok...
 - no.  18 n =    18 nza =     101... num_flops:   0.0 ok...
 - no.  19 n =    19 nza =      57... num_flops:   0.0 ok...
 - no.  20 n =    20 nza =     127... num_flops:   0.0  f+s fail residual 1d =          NaN
 - no.  21 n =   119 nza =    3699... num_flops:   0.5 ok...
 + no.  22 n =   123 nza =     346... num_flops:   0.0 ok...
 + no.  23 n =   757 nza =   53188... num_flops: 124.2 ok...
 - no.  24 n =   695 nza =  209078... num_flops: 111.6 ok...
 - no.  25 n =     2 nza =       2... num_flops:   0.0 ok...
 - no.  26 n =   933 nza =  185132...
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x3ad843265f in ???
#1  0x4766b4 in _ZN5spral5ssids3cpu19block_ldlt_internal11find_maxlocIdLi32EEEviPKT_iRS4_RiS8_
        at ./src/ssids/cpu/kernels/block_ldlt.hxx:205
#2  0x476ecc in _ZN5spral5ssids3cpu10block_ldltIdLi32EEEviPiPT_iS5_S5_bS4_S4_S3_
        at ./src/ssids/cpu/kernels/block_ldlt.hxx:299
#3  0x47a9c9 in _ZN5spral5ssids3cpu17ldlt_app_internal5BlockIdLi32ENS1_14BuddyAllocatorIiSaIdEEEE6factorINS4_IdS5_EEEEiiPiPdRKNS1_18cpu_factor_optionsERSt6vectorINS1_9WorkspaceESaISG_EERKT_
        at src/ssids/cpu/kernels/ldlt_app.cxx:1014
#4  0x47bd82 in _ZN5spral5ssids3cpu17ldlt_app_internal4LDLTIdLi32ENS2_10CopyBackupIdNS1_14BuddyAllocatorIdSaIdEEEEELb0ELb0ES7_E24run_elim_pivoted_notasksEiiPiPdiSB_RNS2_10ColumnDataIdNS5_IiS6_EEEERS8_RKNS1_18cpu_factor_optionsEidSB_iRSt6vectorINS1_9WorkspaceESaISL_EERKS7_i
        at src/ssids/cpu/kernels/ldlt_app.cxx:1613
#5  0x47958f in _ZN5spral5ssids3cpu17ldlt_app_internal4LDLTIdLi32ENS2_10CopyBackupIdNS1_14BuddyAllocatorIdSaIdEEEEELb0ELb0ES7_E6factorEiiPiPdiSB_RS8_RKNS1_18cpu_factor_optionsENS1_11PivotMethodEidSB_iRSt6vectorINS1_9WorkspaceESaISI_EERKS7_
        at src/ssids/cpu/kernels/ldlt_app.cxx:2367
#6  0x47a47b in _ZN5spral5ssids3cpu17ldlt_app_internal5BlockIdLi32ENS1_14BuddyAllocatorIiSaIdEEEE6factorINS4_IdS5_EEEEiiPiPdRKNS1_18cpu_factor_optionsERSt6vectorINS1_9WorkspaceESaISG_EERKT_
        at src/ssids/cpu/kernels/ldlt_app.cxx:981
#7  0x47e0bc in _ZN5spral5ssids3cpu17ldlt_app_internal4LDLTIdLi32ENS2_10CopyBackupIdNS1_14BuddyAllocatorIdSaIdEEEEELb1ELb0ES7_E16run_elim_pivotedEiiPiPdiSB_RNS2_10ColumnDataIdNS5_IiS6_EEEERS8_RKNS1_18cpu_factor_optionsEidSB_iRSt6vectorINS1_9WorkspaceESaISL_EERKS7_i._omp_fn.13
        at src/ssids/cpu/kernels/ldlt_app.cxx:1323
#8  0x2b5f2b799469 in gomp_barrier_handle_tasks
        at ../.././libgomp/task.c:1271
#9  0x2b5f2b7a209b in gomp_team_barrier_wait_end
        at ../.././libgomp/config/linux/bar.c:116
#10  0x2b5f2b79f949 in gomp_thread_start
        at ../.././libgomp/team.c:121
#11  0x3ad9007aa0 in ???
#12  0x3ad84e8aac in ???
#13  0xffffffffffffffff in ???
Segmentation fault (core dumped)

Both examples were run with all explicit CUDA memory allocations being followed by cudaMemset to all-bits-one (-1), on emerald-devel (3 K20m GPUs).

Note that even setting the GPU memory after each allocation is not enough to make the errors reproducible: from one run to the next, NaNs sometimes appear and sometimes do not.
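
The all-bits-one fill is a convenient poison value for floating-point data because that bit pattern decodes to NaN in IEEE 754 double precision, so a value that is read before it is written tends to surface as a NaN, like the residual in test no. 20. The following is a minimal host-side sketch of that idea in C++, using std::memset in place of cudaMemset. The helper names make_poisoned_buffer and count_nan are hypothetical and are not part of SPRAL.

```cpp
#include <cmath>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not part of SPRAL): allocate n doubles and fill them
// with the all-bits-one byte pattern (0xFF), mirroring on the host what a
// cudaMemset with value 0xFF would write into device memory.
std::vector<double> make_poisoned_buffer(std::size_t n) {
    std::vector<double> buf(n);
    if (n > 0) {
        std::memset(buf.data(), 0xFF, n * sizeof(double));
    }
    return buf;
}

// Hypothetical helper: count entries that are NaN, i.e. entries that still
// hold the poison pattern (or that became NaN through arithmetic on it).
std::size_t count_nan(const std::vector<double>& buf) {
    std::size_t count = 0;
    for (double v : buf) {
        if (std::isnan(v)) {
            ++count;
        }
    }
    return count;
}
```

For example, make_poisoned_buffer(4) yields four NaN entries, and count_nan drops to three once one entry is overwritten with a finite value. On the device, the analogous fill would be a cudaMemset with byte value 0xFF issued immediately after each cudaMalloc.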

venovako May 31 '17 17:05

GPU support requires hwloc to be installed with CUDA support; otherwise, intermittent segfaults such as the above occur.

jfowkes Sep 20 '21 11:09

Closing this issue, as the intermittent segfaults on GPU seem to be associated with an incorrect installation of hwloc. I have tested the latest SSIDS on an NVIDIA A100 with METIS4, METIS5 and 64-bit METIS5 and am unable to reproduce any segfaults.

If anyone encounters this again, please create a new issue with details of the segfault.

jfowkes Sep 22 '23 14:09