ForTrilinos

Implicit linear solver fails with Intel compiler

Open aprokop opened this issue 6 years ago • 7 comments

@sethrj @tjfulle

Nate from LANL discovered this. I can reproduce it on condo with Intel 18; GCC is fine.

Backtrace:

(gdb) bt
Program received signal SIGSEGV, Segmentation fault.
fortpetra::c_f_pointer_fortpetraoperator (clswrap=..., fptr=0x2ae360058b4810) at /home/xap/code/trilinos-fortrilinos/packages/ForTrilinos/src/tpetra/src/fortpetra.f90:6443
6443      fptr => handle%data
#0  fortpetra::c_f_pointer_fortpetraoperator (clswrap=..., fptr=0x2ae360058b4810) at /home/xap/code/trilinos-fortrilinos/packages/ForTrilinos/src/tpetra/src/fortpetra.f90:6443
#1  0x00002aaaaef4ccfc in fortpetra::swigd_fortpetraoperator_getdomainmap (fresult=..., fself=...) at /home/xap/code/trilinos-fortrilinos/packages/ForTrilinos/src/tpetra/src/fortpetra.f90:6497
#2  0x00002aaaaf031714 in ForTpetraOperator::getDomainMap (this=0x2aaaaf2ea8a8 <fortpetra_mp_c_f_pointer_fortpetraoperator_$HANDLE.0.137>) at /home/xap/code/trilinos-fortrilinos/packages/ForTrilinos/src/tpetra/src/fortpetraFORTRAN_wrap.cxx:746
#3  0x00002aaaaaed2a52 in ForTrilinos::TrilinosSolver::setup_solver (this=0x2aaaaf2ea8a0 <fortpetra_mp_c_f_pointer_fortpetraoperator_$FSELF_PTR.0.137>, paramList=...) at /home/xap/code/trilinos-fortrilinos/packages/ForTrilinos/src/simple/src/solver_handle.cpp:62
#4  0x00002aaaaaec697b in _wrap_TrilinosSolver_setup_solver (farg1=0x2aaaaf2ea8a0 <fortpetra_mp_c_f_pointer_fortpetraoperator_$FSELF_PTR.0.137>, farg2=0x2aaaaf2ea8a8 <fortpetra_mp_c_f_pointer_fortpetraoperator_$HANDLE.0.137>) at /home/xap/code/trilinos-fortrilinos/packages/ForTrilinos/src/simple/src/fortrilinosFORTRAN_wrap.cxx:737
#5  0x00002aaaaaec5dc2 in fortrilinos::swigf_trilinossolver_setup_solver (self=0x2ae360058b4810, paramlist=0x2ae360058b4810) at /home/xap/code/trilinos-fortrilinos/packages/ForTrilinos/src/simple/src/fortrilinos.f90:331
#6  0x0000000000412a24 in main () at /home/xap/code/trilinos-fortrilinos/packages/ForTrilinos/src/simple/test/test_simple_solver_handle.f90:317
#7  0x00000000004108de in main ()

aprokop avatar Aug 04 '18 02:08 aprokop

@sethrj Do you have a simple ioc example to test with Intel (outside of ForTrilinos)?

aprokop avatar Aug 04 '18 02:08 aprokop

Yes, if you look at Examples/fortran/director inside the "callback" branch, that should be what you need.

sethrj avatar Aug 06 '18 18:08 sethrj

OK... in the director example, I try the following with gcc/6.4.0:

swig -fortran -c++ director.i
g++ -c director.cxx director_wrap.cxx
ar rvs director.a director.o director_wrap.o
gfortran -c director.f90
gfortran runme.f90 director.o director.a -lstdc++

This compiles fine, and when run produces the following output:

[sn-fey2] director - ./a.out
 test_subclass
 Transformed: 'whee'
 Transformed: [whee]
 test_transform
 Transformed: "whiskey", and "tango", and "foxtrot", and "sierra", and "juliet"
 Joined with commas: "whiskey", "tango", "foxtrot", "sierra", "juliet"
 test_actual
 Transformed: 'whiskey', and 'tango', and 'foxtrot', and 'sierra', and 'juliet'
 Joined with commas: 'whiskey', 'tango', 'foxtrot', 'sierra', 'juliet'
 Joined with default: 'whiskey', 'tango', 'foxtrot', 'sierra', 'juliet'
 Joined with commas: [whiskey], [tango], [foxtrot], [sierra], [juliet]
 Joined with default: [whiskey]><[tango]><[foxtrot]><[sierra]><[juliet]
 Transformed: "whiskey", and "tango", and "foxtrot", and "sierra", and "juliet"
 Transformed: !whiskey!, and !tango!, and !foxtrot!, and !sierra!, and !juliet!
 Joined with commas: !whiskey!, !tango!, !foxtrot!, !sierra!, !juliet!

I then blow away the .o, .mod, and .a files and try the following with intel/18.0.2

icpc -c director.cxx director_wrap.cxx
ar rvs director.a director.o director_wrap.o
ifort -c director.f90
ifort runme.f90 director.o director.a -lstdc++

I get the following error:

runme.f90(75): error #8212: Omitted field is not initialized. Field initialization missing:   [SWIGDATA]
  allocate(join, source=SingleJoiner())
^
compilation aborted for runme.f90 (code 1)

mattbement avatar Aug 30 '18 04:08 mattbement

So... putting in something like the following lets me get past the compile errors.

  type(SingleJoiner) :: sj
  type(BracketJoiner) :: bj
  ! NOTE: because we're not calling any C functions here, we don't actually
  ! have to call init_FortranJoiner
  write(*,*) "test_subclass"
  allocate(join, source=sj)

However, when I run the resulting executable, I get a segfault:

[sn-fey2] director - ./a.out
 test_subclass
 Transformed: 'whee'
 Transformed: [whee]
 test_transform
 Transformed: "whiskey", and "tango", and "foxtrot", and "sierra", and "juliet"
 Joined with commas: "whiskey", "tango", "foxtrot", "sierra", "juliet"
 test_actual
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
a.out              000000000041CF4D  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AB3F1D645E0  Unknown               Unknown  Unknown
a.out              0000000000409A46  Unknown               Unknown  Unknown
a.out              00000000004100E8  Unknown               Unknown  Unknown
a.out              00000000004104BC  Unknown               Unknown  Unknown
a.out              000000000040B7AB  Unknown               Unknown  Unknown
a.out              0000000000407F7A  Unknown               Unknown  Unknown
a.out              0000000000404F25  Unknown               Unknown  Unknown
a.out              00000000004046B2  Unknown               Unknown  Unknown
a.out              0000000000403AEE  Unknown               Unknown  Unknown
libc-2.17.so       00002AB3F1F92C05  __libc_start_main     Unknown  Unknown
a.out              00000000004039E9  Unknown               Unknown  Unknown

Then, for completeness, I go back and build it all again with GCC to make sure I didn't biff something as I was editing runme.f90, and it runs just fine.

mattbement avatar Aug 30 '18 05:08 mattbement

Sorry in advance for a long post. The segfault is happening in c_f_pointer_Joiner. I can run both the gcc and intel versions in totalview to see what's going on. Here's a gcc screenshot: [gcc], and here's the intel screenshot: [intel].

Note the difference in the representation of clswrap. The intel version seems to be creating a struct out of ptr, where the gcc version doesn't. I think a result of this is that fself_ptr is nonsense in the intel version, which then causes a segfault at line 695. I could use some help interpreting the significance of this, maybe from @sethrj?
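One way to rule out a plain layout mismatch on the C++ side, before blaming the debugger view of clswrap, is to check the wrapper struct's layout at compile time. The sketch below is only an illustration: `ClassWrapper` and its members `ptr` and `mem` are hypothetical stand-ins named after the fields mentioned in this thread, and the real SWIG-generated struct may differ.

```cpp
#include <cassert>      // for callers that want to assert on the result
#include <cstddef>      // offsetof
#include <type_traits>  // std::is_standard_layout

// Hypothetical stand-in for the class wrapper passed between C++ and
// Fortran; the actual generated definition may have other members.
struct ClassWrapper {
    void* ptr;  // address of the wrapped C++ object
    int mem;    // ownership flag
};

// A bind(C) derived type on the Fortran side can only mirror a
// standard-layout struct, so check that first.
static_assert(std::is_standard_layout<ClassWrapper>::value,
              "wrapper must be standard layout to be C-interoperable");

// The pointer is expected to be the first member, with the flag
// immediately after it.
static_assert(offsetof(ClassWrapper, ptr) == 0,
              "ptr is expected at offset 0");

// Runtime version of the same checks, convenient inside a test driver.
bool layout_matches_expectation() {
    return offsetof(ClassWrapper, ptr) == 0 &&
           offsetof(ClassWrapper, mem) == sizeof(void*) &&
           sizeof(ClassWrapper) >= sizeof(void*) + sizeof(int);
}
```

If these checks pass with both g++ and icpc, the C++ view of the struct is the same under both compilers, which would point the investigation at the Fortran side or at the call itself.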

mattbement avatar Aug 31 '18 19:08 mattbement

Minimized the previous comment, as I think it's been overtaken by newer information. In a nutshell, I think there's an intel compiler bug, though I could benefit from another pair of eyes to confirm. Look at what goes into the swigd_Joiner_transform call in FortranJoiner::transform (in director_wrap.cxx, where the arguments are (&self, &arg1); see the screenshot: [callstack]) and compare it to what actually arrives in swigd_Joiner_transform (in director.f90, where the arguments are farg1 and farg2). You see the following: [intel2]

Note that the two receiving arguments are pointing at the second calling argument. The pointers are pointing to the same memory, and in the case of farg1, the value of farg1%mem has taken the value &arg1->size.
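The expectation being violated can be written down as a small check. The sketch below is a C++-only illustration, not the actual generated code: `ClassWrapper`, `ArrayWrapper`, and `callee_transform` are hypothetical stand-ins (in the real example the callee is the Fortran procedure swigd_Joiner_transform). It records the addresses the callee receives and confirms that they match what the caller passed and are distinct from each other.

```cpp
#include <cassert>  // for callers that want to assert on the result
#include <cstddef>  // std::size_t

// Hypothetical stand-ins for the two wrapper structs; the real
// SWIG-generated layouts may differ.
struct ClassWrapper { void* ptr; int mem; };
struct ArrayWrapper { void* data; std::size_t size; };

// Addresses observed inside the callee.
static const void* seen_farg1 = nullptr;
static const void* seen_farg2 = nullptr;

// Stand-in for the callee. In the director example this role is played
// by a Fortran bind(C) procedure taking farg1 and farg2.
extern "C" void callee_transform(ClassWrapper* farg1, ArrayWrapper* farg2) {
    seen_farg1 = farg1;
    seen_farg2 = farg2;
}

// Mimics the call site: pass (&self, &arg1) and verify that each
// argument arrives at its own address rather than aliasing the other.
bool arguments_arrive_intact() {
    int object = 0;
    char text[] = "whee";
    ClassWrapper self{&object, 0};
    ArrayWrapper arg1{text, sizeof(text) - 1};

    callee_transform(&self, &arg1);

    return seen_farg1 == &self &&
           seen_farg2 == &arg1 &&
           seen_farg1 != seen_farg2;
}
```

Calling `arguments_arrive_intact()` from a small driver returns true when the call behaves as intended. The behavior described above, where both receiving arguments point at the second calling argument, corresponds to the first and third comparisons failing. Logging the same addresses on both sides of the real C++/Fortran boundary would give evidence that does not depend on how the debugger displays the variables.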

I just tried this in Intel 2019.beta and the problem is still there.

mattbement avatar Aug 31 '18 21:08 mattbement

Ugh. As a general rule of thumb in my experience, "seems like a compiler bug" usually means "I'm depending on undefined behavior being consistent"...

...but given that the gfortran compiler actually had an acknowledged bug there that we found and fixed, you could be right.

But looking again, are you sure that at the breakpoint you're using, the variables have been initialized? It looks like they both might be filled with bogus values to me.

I'll be back in the office on Tuesday; perhaps we could discuss then?

sethrj avatar Sep 01 '18 03:09 sethrj