CI: Stack smash in SuiteSparse

Open Keno opened this issue 3 years ago • 6 comments

We frequently see the win64 builder crash in SuiteSparse. There's some discussion here: https://github.com/JuliaSparse/SparseArrays.jl/issues/147, but I figured I'd open a new issue with some investigation results.

I was able to reproduce this locally in a VM with 32GiB of memory, but not in one with 16GiB, which suggests that this may depend on the GC interval, or at the very least on test order. I did eventually manage to catch this in the debugger, but by all appearances the stack was smashed. As a result, I would not put too much credence in any of the stack traces produced by CI.

Keno avatar Jul 31 '22 07:07 Keno

Next attempt: What happens if we build SuiteSparse with -fstack-protector?

SuiteSparse.v5.10.1.x86_64-w64-mingw32.tar.gz

Keno avatar Jul 31 '22 07:07 Keno

Stack protector was a bust. I'm pursuing two options in parallel now:

  1. Try the Windows version of rr (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/time-travel-debugging-overview) - unfortunately it's much slower than rr.
  2. Try building with msan.

Keno avatar Jul 31 '22 09:07 Keno

cc @Wimmerer

ViralBShah avatar Aug 04 '22 13:08 ViralBShah

I haven't seen this, but I typically don't do any testing on Windows at all for SuiteSparse (UMFPACK, KLU, CHOLMOD, GraphBLAS, etc). What packages in SuiteSparse cause this?

DrTimothyAldenDavis avatar Aug 04 '22 17:08 DrTimothyAldenDavis

The ones tested as part of Julia (SPQR, UMFPACK, CHOLMOD) have all had some sort of random CI errors recently.

Hopefully Keno finds something solid. If you've never gotten a report of something like this, the cause is likely to be the way we wrap or build SuiteSparse.

rayegun avatar Aug 04 '22 17:08 rayegun

I have a handle on this. Will post results in a day or two.

Keno avatar Aug 04 '22 19:08 Keno

I am closing this, since it seems to have been resolved, but we should open a new issue (or reopen this one) if necessary.

ViralBShah avatar Sep 05 '22 21:09 ViralBShah

It didn't get resolved; it just went away when we put it back into the sysimg. But yeah, we can close the issue, since it's not an active problem.

Keno avatar Sep 05 '22 22:09 Keno