SparseArrays.jl
SparseArrays.jl copied to clipboard
UMFPACK, CHOLMOD, and SPQR configuration audit
We should audit the UMFPACK build and configuration to ensure those issues are not causing https://github.com/JuliaSparse/SparseArrays.jl/issues/147
https://github.com/DrTimothyAldenDavis/SuiteSparse/blob/master/SuiteSparse_config/SuiteSparse_config.mk#L272
-D'LONGBLAS=long' or -DLONGBLAS='long long' defines the integers used by LAPACK and the BLAS (defaults to 'int')
In the build script, we have:
https://github.com/JuliaPackaging/Yggdrasil/blob/cc75959bcf85b016625d9e528db277fd4d3b32d9/S/SuiteSparse/SuiteSparse/build_tarballs.jl#L37
In Clang.jl generator, we have:
https://github.com/JuliaSparse/SparseArrays.jl/blob/797406962fcf277615cbfc210491766657125cdd/gen/generator.jl#L48
We need to double-check these flags and make sure they are correctly and effectively configured.
[Updated by @ViralBShah to use Tim Davis' repo]
Not sure whether this is an issue but we should be using https://github.com/DrTimothyAldenDavis/SuiteSparse now. I believe that's what we use in BB. That other repo is quite old.
I had recently audited these flags, and concluded that they were correctly configured for win64. I believe if we got this wrong, we would have been seeing SuiteSparse crashes all over.
The flags in the generator should be consistent with the SuiteSparse build flags, and the wrappers - and IMO, they are all good.
These comments in SuiteSparse_config.h are also relevant.
* SuiteSparse_config.h provides the definition of the long integer. On most
* systems, a C program can be compiled in LP64 mode, in which long's and
* pointers are both 64-bits, and int's are 32-bits. Windows 64, however, uses
* the LLP64 model, in which int's and long's are 32-bits, and long long's and
* pointers are 64-bits.
*
* SuiteSparse packages that include long integer versions are
* intended for the LP64 mode. However, as a workaround for Windows 64
* (and perhaps other systems), the long integer can be redefined.
*
* If _WIN64 is defined, then the __int64 type is used instead of long.
*
* The long integer can also be defined at compile time. For example, this
* could be added to SuiteSparse_config.mk:
*
* CFLAGS = -O -D'SuiteSparse_long=long long' \
* -D'SuiteSparse_long_max=9223372036854775801' -D'SuiteSparse_long_idd="lld"'
*
* This file defines SuiteSparse_long as either long (on all but _WIN64) or
* __int64 on Windows 64. The intent is that a SuiteSparse_long is always a
* 64-bit integer in a 64-bit code. ptrdiff_t might be a better choice than
* long; it is always the same size as a pointer.
Our default BLAS on 64-bit is ILP64, which is why we use -DLONGBLAS='long long' on all 64-bit systems, and SuiteSparse_long also gets correctly picked up on Win64 as long long.
I tried looking for places where SuiteSparse might expect long and receive long long on Windows or vice versa, but everything looked pretty consistent. Someone should look at it with fresh eyes.
One other thought - is it possible that we are using freed memory somewhere - like something is getting finalized or GC'ed or manually freed before it ought to?
I thought that was most likely. But I'm not sure why GC would be doing this only on one plat. Switching to Refs should fix that also(?).
Aren't things like Numeric Refs to pointers that have the data? So even if the Ref is safe, the thing it points to could be freed (in theory). I do think it is unlikely, since such things would have surely resulted in many complaints.
Who's the expert on differences between Windows and other plats?
Who's the expert on differences between Windows and other plats?
@vtjnash @staticfloat @giordano @keno
The recent spate of failures have been on CHOLMOD as well.
Example: https://build.julialang.org/#/builders/63/builds/7320/steps/5/logs/stdio
This is being worked on.
The last I spoke with @vtjnash before @keno's comment here, was that it looks like the crash is due to SPQR which seems to be smashing the stack.