
ScotchLB occasionally FPEs

Open moxcodes opened this issue 4 years ago • 3 comments

When using ScotchLB for our SpECTRE tests, we have found an occasional edge case in which a floating point exception is generated from somewhere within the Scotch internals. By performing a gdb backtrace, we found (perhaps unsurprisingly) that the FPE is generated within the SCOTCH_graphPart call in ScotchLB.C. Beyond that, I wasn't able to understand the Scotch code path well enough to grasp what part of the graph partition was failing.
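
For context on how the exception surfaces at all: our test harness enables floating point exception trapping, so a floating point division by zero raises SIGFPE rather than quietly producing inf, and gdb stops at the offending instruction. A minimal sketch of that kind of setup, assuming glibc's feenableexcept extension:

#include <cfenv>   // feenableexcept is a glibc extension exposed here
#include <cstdio>

int main() {
  feenableexcept(FE_DIVBYZERO | FE_INVALID);  // deliver these as SIGFPE

  volatile double zero = 0.0;
  volatile double result = 1.0 / zero;  // traps here instead of yielding inf
  std::printf("%f\n", result);          // never reached with trapping enabled
  return 0;
}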

We are using AtSync-based load balancing in these tests, and the error appears to be largely insensitive to the number of cores and number of elements we use. The only thing that reliably determines whether the FPE appears is the number of time steps we allow our code to run before it reaches a global synchronization point and calls AtSync().
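
For reference, our use of AtSync follows the standard Charm++ pattern; a minimal sketch with illustrative names (the matching .ci interface file and the actual SpECTRE evolution loop are omitted):

// Minimal sketch of the AtSync pattern (illustrative names only)
#include "element.decl.h"  // hypothetical charmxi-generated header

class Element : public CBase_Element {
 public:
  Element() {
    usesAtSync = true;  // opt in to AtSync-style load balancing
    thisProxy[thisIndex].take_step();
  }
  Element(CkMigrateMessage* /*msg*/) {}

  void take_step() {
    ++step_;
    if (step_ % steps_between_lb == 0) {
      AtSync();  // once every element arrives here, the LB (e.g. ScotchLB) runs
    } else {
      thisProxy[thisIndex].take_step();
    }
  }

  void ResumeFromSync() override {
    thisProxy[thisIndex].take_step();  // resume stepping after migration
  }

  void pup(PUP::er& p) override {
    CBase_Element::pup(p);
    p | step_;
  }

 private:
  static constexpr int steps_between_lb = 100;  // varying this cadence changes whether we see the FPE
  int step_ = 0;
};

#include "element.def.h"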

I've attached a patch I have used to generate tracing output for the graph that Charm is providing to Scotch for the graph partition, along with the corresponding output from one of the smaller runs that causes the FPE.

scotch_failures_20210611.txt scotch_lb_tracing_out_patch.txt

I attempted to reproduce the crash with the charm load-balancing tests, but was unable to trigger the FPE, even with extremely long durations between balancing operations. The charm test I tried printed exactly uniform load values in the tracing output for the graph passed to Scotch, so perhaps the fairly nonuniform load in our failing case is important?

Let me know if there are any other details or run attempts you'd like to see.

moxcodes · Jun 24 '21 00:06

Hmm, I tried to reproduce this in a little test program I made (https://github.com/rbuch/scotch-test) that reads in the data from the file you provided and runs the partitioner, but it runs to completion without any errors for me. I assumed 32 processors and a constant load for each object (since neither is specified in the file you provided), so perhaps one of those assumptions needs to be removed to trigger the FPE. I'll keep playing around with it for a bit.
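
For the record, the core of that test is just building the graph and calling the partitioner through Scotch's C API; here's a stripped-down sketch with a tiny hardcoded graph standing in for the parsed file (and 2 parts standing in for the 32 processors):

#include <cstdio>
#include <scotch.h>

int main() {
  SCOTCH_Num verttab[] = {0, 2, 4, 6, 8};           // compact adjacency index, baseval 0
  SCOTCH_Num edgetab[] = {1, 3, 0, 2, 1, 3, 0, 2};  // 4-cycle, both edge directions
  SCOTCH_Num velotab[] = {1, 1, 1, 1};              // constant per-vertex load
  SCOTCH_Num parttab[4];

  SCOTCH_Graph graph;
  SCOTCH_Strat strat;
  SCOTCH_graphInit(&graph);
  SCOTCH_graphBuild(&graph, 0, 4, verttab, nullptr, velotab, nullptr,
                    8, edgetab, nullptr);
  SCOTCH_stratInit(&strat);  // default partitioning strategy

  if (SCOTCH_graphPart(&graph, 2, &strat, parttab) != 0) {
    std::fprintf(stderr, "SCOTCH_graphPart failed\n");
    return 1;
  }
  for (int i = 0; i < 4; ++i)
    std::printf("vertex %d -> part %lld\n", i, (long long) parttab[i]);

  SCOTCH_stratExit(&strat);
  SCOTCH_graphExit(&graph);
  return 0;
}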

rbuch · Jun 24 '21 15:06

Hi @rbuch -- I've done a bit more digging into the problem. I think my previous tracing output didn't provide quite enough information to find the issue. I've updated the tracing output to also extract the velotab (which I understand to be the load measurements for the vertices in the graph), and I think the problem is that, depending on the load-balancing measurements, occasionally an element has an entry in velotab equal to 1.

I wouldn't have thought that would cause problems, but I traced the control flow through the backtrace:

#0  _SCOTCHbgraphInit2 (grafptr=0x7fffffffa290, domndist=1, domnwght0=1, domnwght1=2, 
    vfixload0=0, vfixload1=0) at bgraph.c:176
#1  0x00000000018e6405 in _SCOTCHbgraphInit (actgrafptr=actgrafptr@entry=0x7fffffffa290, 
    srcgrafptr=srcgrafptr@entry=0x7fffffffa220, archptr=0x7fffffffb2c8, 
    domnsubtab=domnsubtab@entry=0x7fffffffa1d0, vflowgttab=vflowgttab@entry=0x7fffffffa1c8)
    at bgraph.c:131
#2  0x00000000018e39bc in _SCOTCHkgraphMapRbBgraph (dataptr=dataptr@entry=0x7fffffffaa90, 
    actgrafptr=actgrafptr@entry=0x7fffffffa290, srcgrafptr=srcgrafptr@entry=0x7fffffffa220, 
    srcmappptr=srcmappptr@entry=0x7fffffffb320, domnsubtab=domnsubtab@entry=0x7fffffffa1d0, 
    vflowgttab=vflowgttab@entry=0x7fffffffa1c8) at kgraph_map_rb.c:582
#3  0x00000000018e5cdb in kgraphMapRbPart2 (dataptr=dataptr@entry=0x7fffffffaa90, 
    srcgrafptr=srcgrafptr@entry=0x7fffffffa460, srcparttax=0x7fffdcc20430 "", 
    indpartval=indpartval@entry=0 '\000', indvertnbr=<optimized out>, domnnum=13, vflonbr=0, 
    vflotab=0x0) at kgraph_map_rb_part.c:229
#4  0x00000000018e6045 in kgraphMapRbPart2 (dataptr=dataptr@entry=0x7fffffffaa90, 
    srcgrafptr=srcgrafptr@entry=0x7fffffffa6a0, srcparttax=<optimized out>, 
    indpartval=indpartval@entry=0 '\000', indvertnbr=<optimized out>, domnnum=11, vflonbr=0, 
    vflotab=0x0) at kgraph_map_rb_part.c:294
#5  0x00000000018e6045 in kgraphMapRbPart2 (dataptr=dataptr@entry=0x7fffffffaa90, 
    srcgrafptr=srcgrafptr@entry=0x7fffffffb260, srcparttax=<optimized out>, 
    indpartval=indpartval@entry=0 '\000', indvertnbr=<optimized out>, domnnum=1, vflonbr=0, 
    vflotab=0x0) at kgraph_map_rb_part.c:294
#6  0x00000000018e6045 in kgraphMapRbPart2 (dataptr=0x7fffffffaa90, 
    srcgrafptr=<optimized out>, srcparttax=srcparttax@entry=0x0, 
    indpartval=indpartval@entry=0 '\000', indvertnbr=<optimized out>, 
    domnnum=domnnum@entry=0, vflonbr=0, vflotab=0x0) at kgraph_map_rb_part.c:294
#7  0x00000000018e61c2 in _SCOTCHkgraphMapRbPart (dataptr=<optimized out>, 
    grafptr=<optimized out>, vflonbr=<optimized out>, vflotab=<optimized out>)
    at kgraph_map_rb_part.c:338
#8  0x00000000018e37b4 in _SCOTCHkgraphMapRb (grafptr=0x7fffffffb260, 
    paraptr=<optimized out>) at kgraph_map_rb.c:152
#9  0x00000000018d26bf in _SCOTCHkgraphMapSt (grafptr=grafptr@entry=0x7fffffffb260, 
    strat=0x7fffefe13790) at kgraph_map_st.c:386
#10 0x00000000018e2af4 in kgraphMapMl2 (grafptr=grafptr@entry=0x7fffffffb260, 
    paraptr=0x7fffefe13738) at kgraph_map_ml.c:267
#11 0x00000000018e2dee in _SCOTCHkgraphMapMl (grafptr=0x7fffffffb260, 
    paraptr=<optimized out>) at kgraph_map_ml.c:430
#12 0x00000000018d26bf in _SCOTCHkgraphMapSt (grafptr=0x7fffffffb260, strat=0x7fffefe13720)
    at kgraph_map_st.c:386
#13 0x00000000018d2689 in _SCOTCHkgraphMapSt (grafptr=0x7fffffffb260, strat=0x7fffefe14210)
    at kgraph_map_st.c:293
#14 0x00000000018d2689 in _SCOTCHkgraphMapSt (grafptr=grafptr@entry=0x7fffffffb260, 
    strat=strat@entry=0x7fffefe142f0) at kgraph_map_st.c:293
#15 0x00000000018cc20d in graphMapCompute2 (grafptr=grafptr@entry=0x7fffffffb560, 
    mappptr=mappptr@entry=0x7fffffffb470, mapoptr=mapoptr@entry=0x0, 
    emraval=emraval@entry=1, vmlotab=vmlotab@entry=0x0, vfixnbr=vfixnbr@entry=0, 
    straptr=straptr@entry=0x7fffffffb558) at library_graph_map.c:274
#16 0x00000000018cc2d7 in SCOTCH_graphMapCompute (grafptr=grafptr@entry=0x7fffffffb560, 
    mappptr=mappptr@entry=0x7fffffffb470, straptr=straptr@entry=0x7fffffffb558)
    at library_graph_map.c:297
#17 0x00000000018cc30b in SCOTCH_graphMap (grafptr=grafptr@entry=0x7fffffffb560, 
    archptr=archptr@entry=0x7fffffffb4b0, straptr=straptr@entry=0x7fffffffb558, 
    parttab=parttab@entry=0x7fffefe53340) at library_graph_map.c:391
#18 0x00000000018cc370 in SCOTCH_graphPart (grafptr=0x7fffffffb560, partnbr=31, 
    straptr=0x7fffffffb558, parttab=0x7fffefe53340) at library_graph_map.c:509
#19 0x0000000000fbfc3b in ScotchLB::work(BaseLB::LDStats*) ()
#20 0x00000000019be080 in CentralLB::Strategy(BaseLB::LDStats*) ()
#21 0x00000000019c6704 in CentralLB::LoadBalance() ()
#22 0x00000000019c699c in CkIndex_CentralLB::_call_LoadBalance_void(void*, void*) ()
#23 0x000000000190b5de in _processHandler(void*, CkCoreState*) ()
#24 0x0000000001a1d597 in CsdScheduleForever ()
#25 0x0000000001a1dc65 in CsdScheduler ()
#26 0x0000000001a1807e in ConverseRunPE(int) ()
#27 0x0000000001a1a288 in ConverseInit ()
#28 0x00000000018f7c27 in charm_main ()
#29 0x00007ffff0929555 in __libc_start_main () from /lib64/libc.so.6
#30 0x0000000000c30529 in _start ()

It appears that we're getting inputs to bgraphInit2 in bgraph.c (in Scotch) of the following (from tracing output I added to Scotch as well, though the same values are visible in the first frame of the backtrace):

--- Scotch Debug info ---
 domndist: 1
 domnwght0: 1
 domnwght1: 2
 vfixload0: 0
 vfixload1: 0
 grafptr->s.velosum: 1

and Scotch performs casts such that the load average is calculated as (from bgraph.c:159):

  grafptr->compload0avg  = (Gnum) (((double) (grafptr->s.velosum + vfixload0 + vfixload1) * (double) domnwght0) / (double) (domnwght0 + domnwght1)) - vfixload0;
  ...
  grafptr->bbalval       = (double) grafptr->compload0dlt / (double) grafptr->compload0avg;

which carries the implicit assumption that (grafptr->s.velosum + vfixload0 + vfixload1) * domnwght0 >= domnwght0 + domnwght1, so that the calculation doesn't truncate to 0 (I think Gnum is an integer type). With the inputs above, (1 * 1) / 3 truncates to 0, and the division by the truncated value in the second quoted line appears to be where the FPE is generated.
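
To make the failure mode concrete, here is a small standalone sketch of the same arithmetic using the traced values above (the nonzero compload0dlt is a made-up stand-in, and Gnum is assumed to be a 64-bit integer):

#include <cstdint>
#include <cstdio>

using Gnum = std::int64_t;  // assumed: Scotch's Gnum as a 64-bit integer

int main() {
  // Values from the debug output above
  const Gnum velosum = 1, vfixload0 = 0, vfixload1 = 0;
  const Gnum domnwght0 = 1, domnwght1 = 2;
  const Gnum compload0dlt = 1;  // made-up nonzero imbalance for illustration

  // Mirrors bgraph.c:159: (1 * 1) / 3 = 0.333..., truncated to 0 by the
  // cast back to Gnum
  const Gnum compload0avg =
      static_cast<Gnum>(static_cast<double>(velosum + vfixload0 + vfixload1)
                        * static_cast<double>(domnwght0)
                        / static_cast<double>(domnwght0 + domnwght1))
      - vfixload0;
  std::printf("compload0avg = %lld\n", static_cast<long long>(compload0avg));

  // With FE_DIVBYZERO trapping enabled this division raises SIGFPE;
  // otherwise it silently yields inf
  const double bbalval = static_cast<double>(compload0dlt)
                         / static_cast<double>(compload0avg);
  std::printf("bbalval = %f\n", bbalval);
  return 0;
}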

As far as I could tell, the call path in question seems to always produce vfixload0 and vfixload1 of 0 (also supported by other tracing output), so it seems like the per-vertex loads in velotab (and hence velosum) may need some nontrivial lower bound, but I'm not sure of the best way to solve the problem in charm.
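
Purely as a sketch of what I mean by a lower bound, and not a proposal for the actual fix, something like the following rescaling when ScotchLB builds velotab would keep the integer weights from collapsing to a tiny sum (all names here are hypothetical):

#include <cmath>
#include <cstdint>
#include <vector>

using ScotchNum = std::int64_t;  // assuming a 64-bit Scotch build

// Hypothetical helper: convert measured wall-clock loads into Scotch vertex
// weights whose sum comfortably exceeds the number of target parts, so the
// per-side load average computed in bgraphInit2 can't truncate to zero.
std::vector<ScotchNum> make_velotab(const std::vector<double>& loads,
                                    int partnbr) {
  double total = 0.0;
  for (const double load : loads) total += load;

  // Aim for an integer weight sum well above partnbr
  const double target = 1000.0 * partnbr;
  const double scale = total > 0.0 ? target / total : 1.0;

  std::vector<ScotchNum> velotab(loads.size());
  for (std::size_t i = 0; i < loads.size(); ++i) {
    const auto weight = static_cast<ScotchNum>(std::llround(loads[i] * scale));
    velotab[i] = weight > 0 ? weight : 1;  // clamp: zero weights still risk truncation
  }
  return velotab;
}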

I've attached a new version of the patch and output (.txt extension to make github happy): scotch_tracing_out_patch.txt failing_trace_3.txt

Let me know if that's still unhelpful or if there's any further information I can provide. Cheers!

moxcodes · Jul 08 '21 17:07

Okay, great, thanks for the deeper dive, this is very helpful. I'll take a look at reproducing this and seeing how we can protect against this situation arising from the Charm++ side.

rbuch · Jul 08 '21 18:07