SparseArrays.jl icon indicating copy to clipboard operation
SparseArrays.jl copied to clipboard

UMFPACK, CHOLMOD, and SPQR configuration audit

Open Gnimuc opened this issue 3 years ago • 13 comments

We should audit the UMFPACK build and configuration to ensure those issues are not causing https://github.com/JuliaSparse/SparseArrays.jl/issues/147

https://github.com/DrTimothyAldenDavis/SuiteSparse/blob/master/SuiteSparse_config/SuiteSparse_config.mk#L272

-D'LONGBLAS=long' or -DLONGBLAS='long long' defines the integers used by LAPACK and the BLAS (defaults to 'int')

In the build script, we have:

https://github.com/JuliaPackaging/Yggdrasil/blob/cc75959bcf85b016625d9e528db277fd4d3b32d9/S/SuiteSparse/SuiteSparse/build_tarballs.jl#L37

In Clang.jl generator, we have:

https://github.com/JuliaSparse/SparseArrays.jl/blob/797406962fcf277615cbfc210491766657125cdd/gen/generator.jl#L48

We need to double-check these flags and make sure they are correctly and effectively configured.

[Updated by @ViralBShah to use Tim Davis' repo]

Gnimuc avatar Jul 13 '22 14:07 Gnimuc

Not sure whether this is an issue but we should be using https://github.com/DrTimothyAldenDavis/SuiteSparse now. I believe that's what we use in BB. That other repo is quite old.

rayegun avatar Jul 13 '22 16:07 rayegun

I had recently audited these flags, and concluded that they were correctly configured for win64. I believe if we got this wrong, we would have been seeing SuiteSparse crashes all over.

The flags in the generator should be consistent with the SuiteSparse build flags, and the wrappers - and IMO, they are all good.

ViralBShah avatar Jul 13 '22 18:07 ViralBShah

These comments in SuiteSparse_config.h are also relevant.

 * SuiteSparse_config.h provides the definition of the long integer.  On most
 * systems, a C program can be compiled in LP64 mode, in which long's and
 * pointers are both 64-bits, and int's are 32-bits.  Windows 64, however, uses
 * the LLP64 model, in which int's and long's are 32-bits, and long long's and
 * pointers are 64-bits.
 *
 * SuiteSparse packages that include long integer versions are
 * intended for the LP64 mode.  However, as a workaround for Windows 64
 * (and perhaps other systems), the long integer can be redefined.
 *
 * If _WIN64 is defined, then the __int64 type is used instead of long.
 *
 * The long integer can also be defined at compile time.  For example, this
 * could be added to SuiteSparse_config.mk:
 *
 * CFLAGS = -O -D'SuiteSparse_long=long long' \
 *  -D'SuiteSparse_long_max=9223372036854775801' -D'SuiteSparse_long_idd="lld"'
 *
 * This file defines SuiteSparse_long as either long (on all but _WIN64) or
 * __int64 on Windows 64.  The intent is that a SuiteSparse_long is always a
 * 64-bit integer in a 64-bit code.  ptrdiff_t might be a better choice than
 * long; it is always the same size as a pointer.

ViralBShah avatar Jul 13 '22 19:07 ViralBShah

Our default BLAS on 64-bit is ILP64, which is why we use -DLONGBLAS='long long' on all 64-bit systems, and SuiteSparse_long also gets correctly picked up on Win64 as long long.

ViralBShah avatar Jul 13 '22 19:07 ViralBShah

I tried looking for places where SuiteSparse might expect long and receive long long on Windows or vice versa, but everything looked pretty consistent. Someone should look at it with fresh eyes.

ViralBShah avatar Jul 13 '22 19:07 ViralBShah

One other thought - is it possible that we are using freed memory somewhere - like something is getting finalized or GC'ed or manually freed before it ought to?

ViralBShah avatar Jul 13 '22 19:07 ViralBShah

I thought that was most likely. But I'm not sure why GC would be doing this only on one plat. Switching to Refs should fix that also(?).

rayegun avatar Jul 13 '22 19:07 rayegun

Aren't things like Numeric Refs to pointers that have the data? So even if the Ref is safe, the thing it points to could be freed (in theory). I do think it is unlikely, since such things would have surely resulted in many complaints.

ViralBShah avatar Jul 13 '22 19:07 ViralBShah

Who's the expert on differences between Windows and other plats?

rayegun avatar Jul 13 '22 20:07 rayegun

Who's the expert on differences between Windows and other plats?

@vtjnash @staticfloat @giordano @keno

ViralBShah avatar Jul 13 '22 20:07 ViralBShah

The recent spate of failures have been on CHOLMOD as well.

Example: https://build.julialang.org/#/builders/63/builds/7320/steps/5/logs/stdio

ViralBShah avatar Aug 03 '22 14:08 ViralBShah

This is being worked on.

Keno avatar Aug 03 '22 18:08 Keno

The last I spoke with @vtjnash before @keno's comment here, was that it looks like the crash is due to SPQR which seems to be smashing the stack.

ViralBShah avatar Aug 10 '22 01:08 ViralBShah