superlu_dist
issue with XSDK_INDEX_SIZE=64/_LONGINT with latest (maint)
Sherry,
With the latest superlu_dist (i.e., latest 'maint'), I'm seeing the following issue with the superlu_dist PETSc example.
This issue is caused by 07205cb4705af6040cd3c570b5563bae3e953e5f
Could you take a look at this?
cc: @BarrySmith
thanks,
balay@asterix /home/balay/petsc/src/ksp/ksp/examples/tutorials (master=)
$ mpiexec -n 2 valgrind --tool=memcheck -q ./ex52 -use_superlu_lu
==22675== Invalid write of size 8
==22675== at 0x6B0B6A8: pdgstrf (dlook_ahead_update.c:80)
==22675== by 0x6AEB8B4: pdgssvx (pdgssvx.c:1124)
==22675== by 0x58C24EF: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==22675== by 0x5355489: MatLUFactorNumeric (matrix.c:3099)
==22675== by 0x5E618E3: PCSetUp_LU (lu.c:139)
==22675== by 0x5F480E2: PCSetUp (precon.c:968)
==22675== by 0x608B587: KSPSetUp (itfunc.c:393)
==22675== by 0x403614: main (ex52.c:316)
==22675== Address 0xa606c48 is 0 bytes after a block of size 8 alloc'd
==22675== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==22675== by 0x6ABC99D: superlu_malloc_dist (memory.c:118)
==22675== by 0x6AE6F8D: doubleMalloc_dist (dmemory_dist.c:155)
==22675== by 0x6B08EF9: pdgstrf (pdgstrf.c:832)
==22675== by 0x6AEB8B4: pdgssvx (pdgssvx.c:1124)
==22675== by 0x58C24EF: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==22675== by 0x5355489: MatLUFactorNumeric (matrix.c:3099)
==22675== by 0x5E618E3: PCSetUp_LU (lu.c:139)
==22675== by 0x5F480E2: PCSetUp (precon.c:968)
==22675== by 0x608B587: KSPSetUp (itfunc.c:393)
==22675== by 0x403614: main (ex52.c:316)
==22675==
==22675== Invalid read of size 8
==22675== at 0x7582EB0: dgemm_ (in /usr/lib64/libblas.so.3.6.1)
==22675== by 0x6B0B975: pdgstrf (dlook_ahead_update.c:139)
==22675== by 0x6AEB8B4: pdgssvx (pdgssvx.c:1124)
==22675== by 0x58C24EF: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==22675== by 0x5355489: MatLUFactorNumeric (matrix.c:3099)
==22675== by 0x5E618E3: PCSetUp_LU (lu.c:139)
==22675== by 0x5F480E2: PCSetUp (precon.c:968)
==22675== by 0x608B587: KSPSetUp (itfunc.c:393)
==22675== by 0x403614: main (ex52.c:316)
==22675== Address 0xa606c48 is 0 bytes after a block of size 8 alloc'd
==22675== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==22675== by 0x6ABC99D: superlu_malloc_dist (memory.c:118)
==22675== by 0x6AE6F8D: doubleMalloc_dist (dmemory_dist.c:155)
==22675== by 0x6B08EF9: pdgstrf (pdgstrf.c:832)
==22675== by 0x6AEB8B4: pdgssvx (pdgssvx.c:1124)
==22675== by 0x58C24EF: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==22675== by 0x5355489: MatLUFactorNumeric (matrix.c:3099)
==22675== by 0x5E618E3: PCSetUp_LU (lu.c:139)
==22675== by 0x5F480E2: PCSetUp (precon.c:968)
==22675== by 0x608B587: KSPSetUp (itfunc.c:393)
==22675== by 0x403614: main (ex52.c:316)
==22675==
==22676== Invalid write of size 8
==22676== at 0x6B0B6A8: pdgstrf (dlook_ahead_update.c:80)
==22676== by 0x6AEB8B4: pdgssvx (pdgssvx.c:1124)
==22676== by 0x58C24EF: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==22676== by 0x5355489: MatLUFactorNumeric (matrix.c:3099)
==22676== by 0x5E618E3: PCSetUp_LU (lu.c:139)
==22676== by 0x5F480E2: PCSetUp (precon.c:968)
==22676== by 0x608B587: KSPSetUp (itfunc.c:393)
==22676== by 0x403614: main (ex52.c:316)
==22676== Address 0xa600d98 is 0 bytes after a block of size 8 alloc'd
==22676== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==22676== by 0x6ABC99D: superlu_malloc_dist (memory.c:118)
==22676== by 0x6AE6F8D: doubleMalloc_dist (dmemory_dist.c:155)
==22676== by 0x6B08EF9: pdgstrf (pdgstrf.c:832)
==22676== by 0x6AEB8B4: pdgssvx (pdgssvx.c:1124)
==22676== by 0x58C24EF: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==22676== by 0x5355489: MatLUFactorNumeric (matrix.c:3099)
==22676== by 0x5E618E3: PCSetUp_LU (lu.c:139)
==22676== by 0x5F480E2: PCSetUp (precon.c:968)
==22676== by 0x608B587: KSPSetUp (itfunc.c:393)
==22676== by 0x403614: main (ex52.c:316)
==22676==
==22676== Invalid read of size 8
==22676== at 0x7582EB0: dgemm_ (in /usr/lib64/libblas.so.3.6.1)
==22676== by 0x6B0B975: pdgstrf (dlook_ahead_update.c:139)
==22676== by 0x6AEB8B4: pdgssvx (pdgssvx.c:1124)
==22676== by 0x58C24EF: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==22676== by 0x5355489: MatLUFactorNumeric (matrix.c:3099)
==22676== by 0x5E618E3: PCSetUp_LU (lu.c:139)
==22676== by 0x5F480E2: PCSetUp (precon.c:968)
==22676== by 0x608B587: KSPSetUp (itfunc.c:393)
==22676== by 0x403614: main (ex52.c:316)
==22676== Address 0xa600d98 is 0 bytes after a block of size 8 alloc'd
==22676== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==22676== by 0x6ABC99D: superlu_malloc_dist (memory.c:118)
==22676== by 0x6AE6F8D: doubleMalloc_dist (dmemory_dist.c:155)
==22676== by 0x6B08EF9: pdgstrf (pdgstrf.c:832)
==22676== by 0x6AEB8B4: pdgssvx (pdgssvx.c:1124)
==22676== by 0x58C24EF: MatLUFactorNumeric_SuperLU_DIST (superlu_dist.c:427)
==22676== by 0x5355489: MatLUFactorNumeric (matrix.c:3099)
==22676== by 0x5E618E3: PCSetUp_LU (lu.c:139)
==22676== by 0x5F480E2: PCSetUp (precon.c:968)
==22676== by 0x608B587: KSPSetUp (itfunc.c:393)
==22676== by 0x403614: main (ex52.c:316)
==22676==
Norm of error 2.62798 iterations 1
Satish, What does the following mean?
==22675== Address 0xa606c48 is 0 bytes after a block of size 8 alloc'd
Sherry
It basically means an out-of-bounds access of that block of memory. But it's a bit puzzling to me.
Running without valgrind - I get:
$ mpiexec -n 2 ./ex52 -use_superlu_lu
Malloc fails for dgemm u buff U at line 833 in file /home/balay/petsc/arch-idx64-slu-d/externalpackages/git.superlu_dist/SRC/pdgstrf.c
Malloc fails for dgemm u buff U at line 833 in file /home/balay/petsc/arch-idx64-slu-d/externalpackages/git.superlu_dist/SRC/pdgstrf.c
Running in a debugger with a breakpoint at pdgstrf.c:833, I see that bigu_size in the code below is corrupted.
pdgstrf.c:832
if ( !(bigU = doubleMalloc_dist(bigu_size)) )
The following change appears to fix the problem
diff --git a/SRC/util.c b/SRC/util.c
index 7531b74..3a5f511 100644
--- a/SRC/util.c
+++ b/SRC/util.c
@@ -1155,8 +1155,8 @@ int_t estimate_bigu_size(int_t nsupers,
int_t* xsup = Glu_persist->xsup;
- int ncols = 0; /* Count local number of nonzero columns */
- int ldu = 0; /* Count local max. size of nonzero columns */
+ int_t ncols = 0; /* Count local number of nonzero columns */
+ int_t ldu = 0; /* Count local max. size of nonzero columns */
/*initilize perm_u*/
for (int i = 0; i < nsupers; ++i) perm_u[i] = i;
I have pushed the fix to git.
By the way, what debugger do you use on linux using mpich?
Sherry
On Oct 20, 2016, at 6:14 PM, xiaoyeli [email protected] wrote:
Satish, What does the following mean?
==22675== Address 0xa606c48 is 0 bytes after a block of size 8 alloc'd
It is reading immediately after the end of what you allocated. Usually this means that either you did not allocate enough space or the code, by mistake, is accessing one more entry in an array (or pointer) than it is supposed to.
So if you had
double *a, x;
a = malloc(100*sizeof(double)); /* valid entries are a[0] .. a[99] */
x = a[100]; /* one entry past the end */
you would get this message.
PETSc has parallel debugger support, which works well on Linux [it should also work on Mac if XQuartz is installed]
mpiexec -n 2 ./ex1 -start_in_debugger
This runs each proc in a separate gdb session [in xterms]
You should be able to do similar stuff with mpich
https://wiki.mpich.org/mpich/index.php/Frequently_Asked_Questions#Q:Can_I_use.22ddd.22_or_.22gdb.22_to_debug_my_MPI_application.3F
Hi Satish, Thanks for pin-pointing this! I forgot to test the 64-bit integer support before release. Now, github is down. Once it's back, I will push the correction.
Sherry
Satish, What does this mean?
==22675== Address 0xa606c48 is 0 bytes after a block of size 8 alloc'd
Sherry