charm Hang in tests/charm++/chkpt when using -tracemode perfReport

Original issue: https://charm.cs.illinois.edu/redmine/issues/1510

Besides this one test, running make test OPTS="-tracemode perfReport" passes all tests.

../../../bin/testrun  ./hello +p4 +restart log  ++local
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 0.005 seconds.
Charm++> Running in non-SMP mode: numPes 4
Converse/Charm++ Commit ID: v6.8.0-beta1-46-g80fa50245
Trace: traceroot: ./hello
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Received 1 arguments: { |./hello| }
Main's MigCtor. a=987(0x1011404fc), b[0]=654(0x101140500), b[1]=321, old PE number 4
Main's PUPer. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
[3] data on Group 3
[1] data on Group 1
[2] data on Group 2
[3] data on NOdeGroup 3
[1] data on NOdeGroup 1
CHello's PUPer. step=3.
[0] data on Group 0
[2] data on NOdeGroup 2
[0] data on NOdeGroup 0
[0]CkRestartMain done. sending out callback.
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 4 done
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 5 done
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 6 done
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 7 done
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 8 done
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 9 done

Apr 17 '17 19:04 stwhite91

Original date: 2017-04-18 21:29:36

I think we can reasonably defer this, unless we think it's indicative of some broader deficiency that users will encounter.

Apr 25 '19 02:04 PhilMiller

Original date: 2017-04-20 19:47:19

Trying this on linux now (instead of darwin) it works, so we should be safe to add "-tracemode perfReport" to netlrts-linux or something on autobuild.

Apr 25 '19 02:04 stwhite91

Original date: 2017-09-20 17:57:54

I was able to reproduce this on Linux, it looks like it's because it the PICS code tries to exit prematurely, which closes an output file that it keeps trying to use:

Fatal error: glibc detected an invalid stdio handle

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig`entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
58	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

Apr 25 '19 02:04 rbuch

Original date: 2017-10-03 17:09:10

Taking a second look at this, are we particularly concerned with tracing support across checkpoint/restart? Do we actually expect resulting performance analysis to be meaningful?

I'm thinking the right resolution may actually be to simply throw an error on attempts to restart from a checkpoint when some tracing module is active.

Alternately, without having looked at the code or run the test myself, is the underlying cause just one of a file not being re-opened after restart?

Apr 25 '19 02:04 PhilMiller

Original date: 2017-10-03 17:40:01

OK, I'm running this now, and actually seeing a crash in the exit path in the first run of ./hello:

(gdb) bt
#0  0x00007ffff6f9c2d5 in _IO_vfprintf_internal (s=0xb05200, format=0x6e83a5 "NEWITER %d %d %d %llu %d\n", ap=ap`entry=0x7fffffffd8a8) at vfprintf.c:1317
#1  0x00007ffff6fa47f7 in __fprintf (stream=<optimized out>, format=<optimized out>) at fprintf.c:32
#2  0x00000000006a3935 in TraceAutoPerfBOC::globalPerfAnalyze (this=0xb02770, msg=0xb04f00) at picsautoperf.C:444
#3  0x00000000006afc15 in CkIndex_TraceAutoPerfBOC::_call_globalPerfAnalyze_CkReductionMsg (impl_msg=0xb04f00, impl_obj_void=0xb02770) at TraceAutoPerf.def.h:2996
#4  0x00000000005822e6 in CkDeliverMessageFree (epIdx=166, msg=0xb04f00, obj=0xb02770) at ck.C:597
#5  0x0000000000582432 in _invokeEntryNoTrace (epIdx=166, env=0xb04eb0, obj=0xb02770) at ck.C:641
#6  0x0000000000582542 in _invokeEntry (epIdx=166, env=0xb04eb0, obj=0xb02770) at ck.C:652
#7  0x0000000000583a15 in _deliverForBocMsg (ck=0xa08620, epIdx=166, env=0xb04eb0, obj=0xb02770) at ck.C:1083
#8  0x00000000005852d9 in CkSendMsgBranchInline (eIdx=166, msg=0xb04f00, destPE=0, gID=..., opts=5) at ck.C:1844
#9  0x0000000000585341 in CkSendMsgBranch (eIdx=166, msg=0xb04f00, pe=0, gID=..., opts=5) at ck.C:1856
#10 0x00000000006a8356 in CProxyElement_TraceAutoPerfBOC::globalPerfAnalyze (this=0x7fffffffdc90, impl_msg=0xb04f00) at TraceAutoPerf.def.h:1201
#11 0x00000000006a26a4 in TraceAutoPerfBOC::gatherSummary (this=0xb02770, msg=0xb04f00) at picsautoperf.C:165
#12 0x00000000006a359d in TraceAutoPerfBOC::getPerfData (this=0xb02770, reductionPE=0, cb=...) at picsautoperf.C:399
#13 0x00000000006af42a in CkIndex_TraceAutoPerfBOC::_call_getPerfData_marshall15 (impl_msg=0xb02ff0, impl_obj_void=0xb02770) at TraceAutoPerf.def.h:2892
#14 0x00000000005822e6 in CkDeliverMessageFree (epIdx=164, msg=0xb02ff0, obj=0xb02770) at ck.C:597
#15 0x0000000000582432 in _invokeEntryNoTrace (epIdx=164, env=0xb02fa0, obj=0xb02770) at ck.C:641
#16 0x0000000000582542 in _invokeEntry (epIdx=164, env=0xb02fa0, obj=0xb02770) at ck.C:652
#17 0x0000000000583a15 in _deliverForBocMsg (ck=0xa08620, epIdx=164, env=0xb02fa0, obj=0xb02770) at ck.C:1083
#18 0x0000000000583b1e in _processForBocMsg (ck=0xa08620, env=0xb02fa0) at ck.C:1101
#19 0x0000000000584093 in _processHandler (converseMsg=0xb02fa0, ck=0xa08620) at ck.C:1263
#20 0x000000000065fd11 in CmiHandleMessage (msg=0xb02fa0) at convcore.c:1678
#21 0x00000000006600a1 in CsdScheduleForever () at convcore.c:1915
#22 0x000000000065ffe7 in CsdScheduler (maxmsgs=-1) at convcore.c:1851
#23 0x0000000000574b07 in CkExit () at init.C:950
#24 0x0000000000598dfe in CkCallback::send (this=0xb059f8, msg=0xb059e0) at ckcallback.C:224
#25 0x00000000005f6546 in CkReductionMgr::finishReduction (this=0xb00570) at ckreduction.C:766
#26 0x00000000005f621c in CkReductionMgr::addContribution (this=0xb00570, m=0xb03d50) at ckreduction.C:676
#27 0x00000000005f59f1 in CkReductionMgr::contribute (this=0xb00570, ci=0xb01fc0, m=0xb03d50) at ckreduction.C:444
#28 0x00000000005e3c51 in ArrayElement::contribute (this=0xb01f40, dataSize=4, data=0xb01fe8, type=CkReduction::max_int, cb=..., userFlag=65535) at ckarray.C:396
#29 0x0000000000570d74 in Hello::SayHi (this=0xb01f40) at hello.C:96
#30 0x000000000056cb2a in CkIndex_Hello::_call_SayHi_void (impl_msg=0xb02ff0, impl_obj_void=0xb01f40) at hello.def.h:430
#31 0x00000000005823dc in CkDeliverMessageReadonly (epIdx=140, msg=0xb039b0, obj=0xb01f40) at ck.C:631
#32 0x00000000005a2867 in CkLocRec::invokeEntry (this=0xb02000, obj=0xb01f40, msg=0xb039b0, epIdx=140, doFree=false) at cklocation.C:2030
#33 0x0000000000587d82 in CkMigratable::ckInvokeEntry (this=0xb01f40, epIdx=140, msg=0xb039b0, doFree=false) at ckmigratable.h:72
#34 0x00000000005e7c09 in CkArrayBroadcaster::deliver (this=0xb008c0, bcast=0xb039b0, el=0xb01f40, doFree=false) at ckarray.C:1315
#35 0x00000000005e84f3 in CkArray::recvBroadcast (this=0xb00570, m=0xb039b0) at ckarray.C:1543
#36 0x00000000005ec00d in CkIndex_CkArray::_call_recvBroadcast_CkMessage (impl_msg=0xb039b0, impl_obj_void=0xb00570) at CkArray.def.h:1151
#37 0x00000000005822e6 in CkDeliverMessageFree (epIdx=67, msg=0xb039b0, obj=0xb00570) at ck.C:597
#38 0x0000000000582432 in _invokeEntryNoTrace (epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:641
#39 0x0000000000582615 in _invokeEntry (epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:659
#40 0x0000000000583a15 in _deliverForBocMsg (ck=0xa08620, epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:1083
#41 0x0000000000583b1e in _processForBocMsg (ck=0xa08620, env=0xb03960) at ck.C:1101
#42 0x0000000000584093 in _processHandler (converseMsg=0xb03960, ck=0xa08620) at ck.C:1263
#43 0x000000000065fd11 in CmiHandleMessage (msg=0xb03960) at convcore.c:1678
#44 0x00000000006600a1 in CsdScheduleForever () at convcore.c:1915
#45 0x000000000065ffe7 in CsdScheduler (maxmsgs=-1) at convcore.c:1851
#46 0x000000000065851d in ConverseRunPE (everReturn=0) at machine-common-core.c:1298
#47 0x00000000006581c7 in ConverseInit (argc=1, argv=0x7fffffffea98, fn=0x574e41 <_initCharm(int, char**)>, usched=0, initret=0) at machine-common-core.c:1199
#48 0x0000000000573113 in main (argc=1, argv=0x7fffffffea98) at main.C:18

CkpvAccess(fpSummary) is previously closed here:

(gdb) bt
#0  _IO_new_fclose (fp=0xb05200) at iofclose.c:39
#1  0x00000000006a51e2 in traceAutoPerfExitFunction () at picsautoperf.C:730
#2  0x0000000000573ec9 in _exitHandler (env=0xb02fa0) at init.C:560
#3  0x000000000065fd11 in CmiHandleMessage (msg=0xb02fa0) at convcore.c:1678
#4  0x00000000006600a1 in CsdScheduleForever () at convcore.c:1915
#5  0x000000000065ffe7 in CsdScheduler (maxmsgs=-1) at convcore.c:1851
#6  0x0000000000574b07 in CkExit () at init.C:950
#7  0x0000000000598dfe in CkCallback::send (this=0xb059f8, msg=0xb059e0) at ckcallback.C:224
#8  0x00000000005f6546 in CkReductionMgr::finishReduction (this=0xb00570) at ckreduction.C:766
#9  0x00000000005f621c in CkReductionMgr::addContribution (this=0xb00570, m=0xb03d50) at ckreduction.C:676
#10 0x00000000005f59f1 in CkReductionMgr::contribute (this=0xb00570, ci=0xb01fc0, m=0xb03d50) at ckreduction.C:444
#11 0x00000000005e3c51 in ArrayElement::contribute (this=0xb01f40, dataSize=4, data=0xb01fe8, type=CkReduction::max_int, cb=..., userFlag=65535) at ckarray.C:396
#12 0x0000000000570d74 in Hello::SayHi (this=0xb01f40) at hello.C:96
#13 0x000000000056cb2a in CkIndex_Hello::_call_SayHi_void (impl_msg=0xb02ff0, impl_obj_void=0xb01f40) at hello.def.h:430
#14 0x00000000005823dc in CkDeliverMessageReadonly (epIdx=140, msg=0xb039b0, obj=0xb01f40) at ck.C:631
#15 0x00000000005a2867 in CkLocRec::invokeEntry (this=0xb02000, obj=0xb01f40, msg=0xb039b0, epIdx=140, doFree=false) at cklocation.C:2030
#16 0x0000000000587d82 in CkMigratable::ckInvokeEntry (this=0xb01f40, epIdx=140, msg=0xb039b0, doFree=false) at ckmigratable.h:72
#17 0x00000000005e7c09 in CkArrayBroadcaster::deliver (this=0xb008c0, bcast=0xb039b0, el=0xb01f40, doFree=false) at ckarray.C:1315
#18 0x00000000005e84f3 in CkArray::recvBroadcast (this=0xb00570, m=0xb039b0) at ckarray.C:1543
#19 0x00000000005ec00d in CkIndex_CkArray::_call_recvBroadcast_CkMessage (impl_msg=0xb039b0, impl_obj_void=0xb00570) at CkArray.def.h:1151
#20 0x00000000005822e6 in CkDeliverMessageFree (epIdx=67, msg=0xb039b0, obj=0xb00570) at ck.C:597
#21 0x0000000000582432 in _invokeEntryNoTrace (epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:641
#22 0x0000000000582615 in _invokeEntry (epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:659
#23 0x0000000000583a15 in _deliverForBocMsg (ck=0xa08620, epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:1083
#24 0x0000000000583b1e in _processForBocMsg (ck=0xa08620, env=0xb03960) at ck.C:1101
#25 0x0000000000584093 in _processHandler (converseMsg=0xb03960, ck=0xa08620) at ck.C:1263
#26 0x000000000065fd11 in CmiHandleMessage (msg=0xb03960) at convcore.c:1678
#27 0x00000000006600a1 in CsdScheduleForever () at convcore.c:1915
#28 0x000000000065ffe7 in CsdScheduler (maxmsgs=-1) at convcore.c:1851
#29 0x000000000065851d in ConverseRunPE (everReturn=0) at machine-common-core.c:1298
#30 0x00000000006581c7 in ConverseInit (argc=1, argv=0x7fffffffea98, fn=0x574e41 <_initCharm(int, char**)>, usched=0, initret=0) at machine-common-core.c:1199
#31 0x0000000000573113 in main (argc=1, argv=0x7fffffffea98) at main.C:18

There's some screwiness in the control flow here, that maybe relied on some race condition to work at all initially:

extern "C" void traceAutoPerfExitFunction() {
  if (autoPerfProxy.ckGetGroupID().isZero()) {
    CkContinueExit();
    return;
  }

  /* Starts copying of data */
  if(user_call == 0){  // Do not call them by default if the user is calling them                                                                                                                                                  
    autoPerfProxy.endPhase();
    autoPerfProxy.endStepResumeCb(true, CkMyPe(), CkCallbackResumeThread());
  }

  CkpvAccess(isExit) = true;
  autoPerfProxy.getPerfData(0, CkCallback::ignore );

  if(CkpvAccess(fpSummary)!=NULL){
    fflush(CkpvAccess(fpSummary));
    fclose(CkpvAccess(fpSummary));
  }
}

Note the call to getPerfData, which calls globalPerfAnalyze, which writes to the FILE* in question.

I think the right thing here may be to just not explicitly close that file in the trace module's exit handler, since the underlying C library will take care of that when exit() gets called. I'll try that now.

Apr 25 '19 02:04 PhilMiller

Original date: 2017-10-03 18:00:51

~~https://charm.cs.illinois.edu/gerrit/3097~~ https://github.com/UIUC-PPL/charm/commit/525e9c253b51dcaaea839801085bd3554f1f6b6f tracemode perfReport: don't close file in race with code that will write to it ~~https://charm.cs.illinois.edu/gerrit/3098~~ https://github.com/UIUC-PPL/charm/commit/13d2713a9022c16b1512d2aa691a181ae6a062eb tracemode perfReport: register custom reduce in initnode function, as documented

These two fixes make it feasible to at least reproduce this issue as reported initially, including on multicore/smp builds.

I do still see a hang when attempting to restart from checkpoint, even on 1 PE checkpoint, 1 PE restart:

$ rm -rf log
phil`prowess:/scratch/phil/charm/multicore-linux-x86_64-v6.8.0-84-g329e0989f/tests/charm++/chkpt$ ./hello
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode:  1 threads
Converse/Charm++ Commit ID: v6.8.0-84-g329e0989f
Trace: traceroot: ./hello
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Running Hello on 1 processors for 8 elements
Charm++ - PICS > Enabled pics autoPerf ......
step 0 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 1 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 2 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
[0] Checkpoint starting in log
step 3 done
Main's PUPer. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
CHello's PUPer. step=3.
Checkpoint to disk finished in 0.463060s, sending out the cb...
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 4 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 5 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 6 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 7 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 8 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 9 done

PICS Data: PEs in group: 1
IDLE%: 0.00
OVERHEAD%: 0.00
UTIL%: 1.00
AVG_ENTRY_DURATION: 0.004332

PICS Data: PEs in group: 1
IDLE%: 0.00
OVERHEAD%: 0.00
UTIL%: 1.00
AVG_ENTRY_DURATION: 0.004332
[Partition 0][Node 0] End of program

$ ./hello +restart log
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode:  1 threads
Converse/Charm++ Commit ID: v6.8.0-84-g329e0989f
Trace: traceroot: ./hello
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Received 1 arguments: { |./hello| }
Main's MigCtor. a=987(0xa1e57c), b[0]=654(0xa1e580), b[1]=321, old PE number 1
Main's PUPer. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
CHello's PUPer. step=3.
[0] data on Group 0
[0] data on NOdeGroup 0
[0]CkRestartMain done. sending out callback.
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 4 done
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 5 done
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 6 done
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 7 done
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 8 done
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 9 done
^C

Apr 25 '19 02:04 PhilMiller

Original date: 2017-10-03 18:11:25

Hang happens in traceAutoPerfExitFunction, at autoPerfProxy.endStepResumeCb(true, CkMyPe(), CkCallbackResumeThread());

I'll see why that callback never gets triggered.

Apr 25 '19 02:04 PhilMiller

Original date: 2017-10-03 19:32:14

So it looks like after the restart, t->getTraceOn() returns false, and on multiple paths, this means that the code does nothing, and the callback simply never gets called. The control flow here is astoundingly convoluted, and I'm beyond my tolerance for wanting to dig through it.

Apr 25 '19 02:04 PhilMiller

charm charm copied to clipboard

Hang in tests/charm++/chkpt when using -tracemode perfReport

charm
charm copied to clipboard