
matchedNodeElmReader alloc crash

Open KennethEJansen opened this issue 4 years ago • 94 comments

Sorry if this has been covered before, but how big a mesh should I be able to stream in with this build setting?

kjansen@pfe26:~/SCOREC-core/buildMGEN_write3D> more doConfigure14_18
#!/bin/bash -ex

For Chef

cmake \
-DCMAKE_C_COMPILER=mpicc \
-DCMAKE_CXX_COMPILER=mpicxx \
-DSCOREC_CXX_WARNINGS=OFF \
-DSCOREC_CXX_OPTIMIZE=ON \
-DSCOREC_CXX_SYMBOLS=ON \
-DENABLE_ZOLTAN=ON \
-DENABLE_SIMMETRIX=ON \
-DPCU_COMPRESS=ON \
-DSIM_MPI="mpt" \
-DSIM_PARASOLID=ON \
-DMDS_SET_MAX=1024 \
-DMDS_ID_TYPE=long \
-DIS_TESTING=ON \
-DMESHES=/projects/tools/SCOREC-core/meshes \
-DCMAKE_INSTALL_PREFIX=$PWD/install \
../core

#-DMDS_ID_TYPE=int or long

I have progressively thrown more nodes at it (doubling compute nodes 4 times now even though the number of mesh nodes only doubled from a prior successful run) but keep getting this stack trace

MPT: #7  0x00002aaab019c61a in abort () from /lib64/libc.so.6
MPT: #8  0x000000000049109b in reel_fail (format=format@entry=0x49be6e "realloc(%p, %lu) failed") at /home5/kjansen/SCOREC-core/core/pcu/reel/reel.c:24
MPT: #9  0x0000000000490fc8 in noto_realloc (p=0x5a756a90, size=18446744056901788296) at /home5/kjansen/SCOREC-core/core/pcu/noto/noto_malloc.c:60
MPT: #10 0x0000000000490626 in pcu_push_buffer (b=0x3afd84b8, size=size@entry=18446744056901788288) at /home5/kjansen/SCOREC-core/core/pcu/pcu_buffer.c:37
MPT: #11 0x0000000000490917 in pcu_msg_pack (m=m@entry=0x6b79c0 <global_pmsg>, id=id@entry=639, size=size@entry=18446744056901788288) at /home5/kjansen/SCOREC-core/core/pcu/pcu_msg.c:133
MPT: #12 0x000000000048eb5f in PCU_Comm_Pack (to_rank=to_rank@entry=639, data=data@entry=0x2357cb30, size=18446744056901788288) at /home5/kjansen/SCOREC-core/core/pcu/pcu.c:141
MPT: #13 0x000000000047e733 in apf::setCoords (m=m@entry=0x3312d300, coords=0x2357cb30, nverts=3015292, globalToVert=...) at /home5/kjansen/SCOREC-core/core/apf/apfConstruct.cc:202
MPT: #14 0x000000000043722c in main (argc=, argv=) at /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:832

KennethEJansen avatar May 28 '21 11:05 KennethEJansen

I seem to be crashing when trying to stream in an 8,437,865,894-element, 1.9B-node mesh. It is a mix of wedges and tets. I was trying to stream it in on 160 Broadwell nodes, each running 4 processes. Since I told PBS this: #PBS -l select=160:ncpus=4:mpiprocs=4:model=bro, it should have given each process 192G/4 = 48G; that is 13.2M elements and 3M nodes per process, which is better than what I was last successful with (half the mesh size in both elements and nodes ran through on 40 nodes with 1 process per core). I guess this means it is either an index limit, a balloon of memory usage with more processes, or PBS not really doing what it should in distributing the processes (though I have sshed to a node and found the expected 4 processes running there).

KennethEJansen avatar May 28 '21 12:05 KennethEJansen

If I am reading the output right,

MPT: #13 0x000000000047e733 in apf::setCoords (m=m@entry=0x3312d300, coords=0x2357cb30, nverts=3015292, globalToVert=...)

is handling only about 3M verts (nverts=3015292), which I would not expect to be a problem in terms of memory usage, so I think I am hitting an index size issue.

KennethEJansen avatar May 28 '21 12:05 KennethEJansen

What branch/commit were these tests using?

cwsmith avatar May 28 '21 12:05 cwsmith

Discussion notes:

  • This size is concerning: MPT: #12 0x000000000048eb5f in PCU_Comm_Pack (to_rank=to_rank@entry=639, data=data@entry=0x2357cb30, size=18446744056901788288)
  • n appears to be blowing up here https://github.com/SCOREC/core/blob/c1d05c1a5336549bf6d85f8be8f5d88c373336cc/apf/apfConstruct.cc#L202
  • Gid is an int... https://github.com/SCOREC/core/blob/c1d05c1a5336549bf6d85f8be8f5d88c373336cc/apf/apfConstruct.cc#L11
  • GlobalToVert is an int map ... https://github.com/SCOREC/core/blob/c1d05c1a5336549bf6d85f8be8f5d88c373336cc/apf/apfConvert.h#L33
  • push changes that start conversion of int to long: https://github.com/SCOREC/core/commit/fe15d56fbf51487d0b50d23873e938079b234aa9#diff-8bc0b6c18834c6ae24f81d2049ae3ad8e7859cf8f8f14ad75bf793bf15b40aa2

cwsmith avatar May 28 '21 13:05 cwsmith

MGEN_write3D (https://github.com/SCOREC/core/commit/c1d05c1a5336549bf6d85f8be8f5d88c373336cc)

KennethEJansen avatar May 28 '21 13:05 KennethEJansen

Is there a quick way to convert from long to int? As you feared, there is propagation:

[ 38%] Building CXX object mds/CMakeFiles/mds.dir/apfBox.cc.o
[ 57%] Built target parma
/home5/kjansen/SCOREC-core/core/mds/apfMDS.cc: In function 'void apf::deriveMdlFromManifold(apf::Mesh2*, bool*, int, int (*)[5], apf::GlobalToVert&, std::map<int, apf::MeshEntity*>&)':
/home5/kjansen/SCOREC-core/core/mds/apfMDS.cc:1054:55: error: no matching function for call to 'apf::Mesh2::setIntTag(apf::MeshEntity*&, apf::MeshTag*&, const long int*)'
 mesh->setIntTag(vit->second, vIDTag, &(vit->first));
In file included from /home5/kjansen/SCOREC-core/core/apf/apfMesh2.h:14:0, from /home5/kjansen/SCOREC-core/core/mds/apfPM.h:14, from /home5/kjansen/SCOREC-core/core/mds/apfMDS.cc:15:
/home5/kjansen/SCOREC-core/core/apf/apfMesh.h:245:18: note: candidate: virtual void apf::Mesh::setIntTag(apf::MeshEntity*, apf::MeshTag*, const int*)
 virtual void setIntTag(MeshEntity* e, MeshTag* tag, int const* data) = 0;

But I guess I am hoping that, with a cast of some sort on the usage of GlobalToVert, the propagation can be stopped, at least for now, in routines I am hopefully not using.

KennethEJansen avatar May 28 '21 15:05 KennethEJansen

Note I also had to create a PCU_Max_Long which I hopefully replicated correctly from PCU_Max_Int

KennethEJansen avatar May 28 '21 15:05 KennethEJansen

I got around the casting issue above but am hitting new issues:

Scanning dependencies of target matchedNodeElmReader
[100%] Building CXX object test/CMakeFiles/matchedNodeElmReader.dir/matchedNodeElmReader.cc.o
In file included from /home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:6:0:
/home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h: In function '{anonymous}::Gid {anonymous}::getMax(const GlobalToVert&)':
/home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h:14:36: error: no matching function for call to 'max({anonymous}::Gid&, const long int&)'
 max = std::max(max, it->first);
/nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/bits/stl_algobase.h:219:5: note: candidate: template<class _Tp> constexpr const _Tp& std::max(const _Tp&, const _Tp&)
/nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/bits/stl_algobase.h:219:5: note: template argument deduction/substitution failed:
/home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h:14:36: note: deduced conflicting types for parameter 'const _Tp' ('int' and 'long int')
/nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/bits/stl_algobase.h:265:5: note: candidate: template<class _Tp, class _Compare> constexpr const _Tp& std::max(const _Tp&, const _Tp&, _Compare)
/nasa/pkgsrc/sles12/2016Q4/gcc6/include/c++/bits/stl_algobase.h:265:5: note: template argument deduction/substitution failed:
/home5/kjansen/SCOREC-core/core/apf/apfConvertTags.h:14:36: note: deduced conflicting types for parameter 'const _Tp' ('int' and 'long int')
/home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc: In function 'int main(int, char**)':
/home5/kjansen/SCOREC-core/core/test/matchedNodeElmReader.cc:825:73: error: cannot convert '{anonymous}::Gid* {aka int*}' to 'const Gid* {aka const long int*}' for argument '2' to 'void apf::construct(apf::Mesh2*, const Gid*, int, int, apf::GlobalToVert&)'
 apf::construct(mesh, m.elements, m.localNumElms, m.elementType, outMap);

KennethEJansen avatar May 28 '21 16:05 KennethEJansen

Where do you define the Gid typedef? The error makes it look like it's defined in an anonymous namespace.

jacobmerson avatar May 28 '21 17:05 jacobmerson

It is/was in apf here:

https://github.com/SCOREC/core/blob/fe15d56fbf51487d0b50d23873e938079b234aa9/apf/apfConvert.h#L32

I'm hacking at this now.

cwsmith avatar May 28 '21 18:05 cwsmith

Reproducer: It is not as small as I would like, but here is a path to the case on the viz nodes for you to grab (note it grabs inputs from one directory above in the run line, so it is safest to just grab the dir above where I am running it).

  1. Runs successfully if I roll back to where we were before we tried to make everything with Gid long (here is the hash): git checkout c1d05c1a5336549bf6d85f8be8f5d88c373336cc

  2. here is the case and the successful run

Case DIR /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-12-30/MGEN4mner_noIDX/mner

 mpirun -np 8 /projects/tools/SCOREC-core/buildLastWorking_mner/test/matchedNodeElmReader ../geom3D.cnn_data ../geom3D.crd ../geom3D.match ../geom3D.class ../geom3D.fathers2D NULL ../geom3DHead.cnn outModel.dmg outMesh/
numVerts 2560663
0 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 0 lastVtx 320082
1 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 320082 lastVtx 640164
2 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 640164 lastVtx 960246
7 readMatches numvtx 2560663 localnumvtx 320089 firstVtx 2240574 lastVtx 2560663
5 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1600410 lastVtx 1920492
6 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1920492 lastVtx 2240574
3 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 960246 lastVtx 1280328
4 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1280328 lastVtx 1600410
isMatched 1
seconds to create mesh 117.841
  - verifying tags: fathers2D
mesh verified in 47.882082 seconds
mesh outMesh/ written in 12.792607 seconds
writeVtuFile into buffers: 5.066701 seconds
writeVtuFile buffers to disk: 1.829939 seconds
vtk files rendered written in 7.314078 seconds

Note, though, that the convert takes 2 minutes (with -g).

  1. Our current code (run here with output) takes only 20 seconds to crash, as follows (in the same dir, same data):
mpirun -np 8 /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader ../geom3D.cnn_data ../geom3D.crd ../geom3D.match ../geom3D.class ../geom3D.fathers2D NULL ../geom3DHead.cnn outModel.dmg outMesh/

numVerts 2560663
0 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 0 lastVtx 320082
1 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 320082 lastVtx 640164
2 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 640164 lastVtx 960246
3 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 960246 lastVtx 1280328
4 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1280328 lastVtx 1600410
5 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1600410 lastVtx 1920492
6 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1920492 lastVtx 2240574
7 readMatches numvtx 2560663 localnumvtx 320089 firstVtx 2240574 lastVtx 2560663
isMatched 1
[viz002:25725] *** Process received signal ***
[viz002:25725] Signal: Segmentation fault (11)
[viz002:25725] Signal code: Address not mapped (1)
[viz002:25725] Failing at address: (nil)
[viz002:25725] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f70e3098890]
[viz002:25725] [ 1] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4ab948]
[viz002:25725] [ 2] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4abb2d]
[viz002:25725] [ 3] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4abc66]
[viz002:25725] [ 4] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4ac630]
[viz002:25725] [ 5] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(mds_create_entity+0xe1)[0x4accbc]
[viz002:25725] [ 6] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(mds_apf_create_entity+0x49)[0x4adaca]
[viz002:25725] [ 7] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf7MeshMDS13createEntity_EiPNS_11ModelEntityEPPNS_10MeshEntityE+0x13f)[0x49b579]
[viz002:25725] [ 8] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf5Mesh212createEntityEiPNS_11ModelEntityEPPNS_10MeshEntityE+0x44)[0x4d04bc]
[viz002:25725] [ 9] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf10makeOrFindEPNS_5Mesh2EPNS_11ModelEntityEiPPNS_10MeshEntityEPNS_13BuildCallbackEPb+0x6a)[0x4ceda0]
[viz002:25725] [10] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf14ElementBuilder5applyEiPPNS_10MeshEntityE+0x43)[0x4d0559]
[viz002:25725] [11] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf13ElementVertOp3runEiPPNS_10MeshEntityE+0x44)[0x4c8a2a]
[viz002:25725] [12] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4c84c5]
[viz002:25725] [13] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf13ElementVertOp7runDownEiPPNS_10MeshEntityES3_+0x3d)[0x4c89e3]
[viz002:25725] [14] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf13ElementVertOp3runEiPPNS_10MeshEntityE+0x2a)[0x4c8a10]
[viz002:25725] [15] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x4c8658]
[viz002:25725] [16] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf13ElementVertOp7runDownEiPPNS_10MeshEntityES3_+0x3d)[0x4c89e3]
[viz002:25725] [17] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf13ElementVertOp3runEiPPNS_10MeshEntityE+0x2a)[0x4c8a10]
[viz002:25725] [18] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf12buildElementEPNS_5Mesh2EPNS_11ModelEntityEiPPNS_10MeshEntityEPNS_13BuildCallbackE+0x48)[0x4cee21]
[viz002:25725] [19] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x501d44]
[viz002:25725] [20] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(_ZN3apf9constructEPNS_5Mesh2EPKliiRSt3mapIlPNS_10MeshEntityESt4lessIlESaISt4pairIS2_S6_EEE+0x54)[0x5024e6]
[viz002:25725] [21] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader(main+0x1ea)[0x48eb2c]
[viz002:25725] [22] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f70e2cffb45]
[viz002:25725] [23] /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader[0x48b488]
[viz002:25725] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 25725 on node viz002 exited on signal 11 (Segmentation fault).

edit: formatting

KennethEJansen avatar May 28 '21 21:05 KennethEJansen

I decided to try serial. The long-mod code makes it through construct in serial but then segfaults at line 138 of mds_apf.c, on return gmi_find(m->user_model, .... I made a debug executable rolled back to before we moved int to long.

What totalview showed me for the mesh (m) was pretty messed up, so it's not REALLY getting through construct cleanly. I have a new run stopping just after construct to check that, but it is probably time to build with the memory sanitizer.

OK, it got to my break just before delete [] m.elements on line 826 of matchedNodeElmReader.

m.elements is junk, as the first entry is 140007343915089 according to totalview.

I will move my search into where this is assigned to see what is going wrong.

KennethEJansen avatar May 29 '21 00:05 KennethEJansen

Looks like line 705 is a problem:

gmi_fscanf(f, 1, "%u", elmVtx+j);

does not put a long into elmVtx according to TotalView, which does seem to know that elmVtx is apf::Gid[6]. Here is a screenshot. I suppose there is a different format specifier to read long ints? I will try to dig and find it, but if someone can share it, I will see if that is the last problem.

Screen Shot 2021-05-28 at 6 19 37 PM

KennethEJansen avatar May 29 '21 00:05 KennethEJansen

Found it. Changed %u to %ld and it seems to be working on the small case with 8 processes.

Still crashing on NAS for the big case in the same place. What does C++ do when q = t/p and t is long, p is an int, and q is an int? Does it automatically cast? This is what we are doing in setCoords, where it is crashing.

KennethEJansen avatar May 29 '21 04:05 KennethEJansen

There is an implicit conversion happening so t is converted to an int and the division happens. This will obviously cause issues if t>max int (2^32/2-1 for 32 bit integer which is about 2 billion). If you are using gcc or clang you can use -Wconversion which will find this error. I'm not sure if there is any similar warning on XL.

See here for an example on compiler explorer.

EDIT: I was initially wrong. p is converted to long, the division happens in long, and then the implicit (narrowing) conversion to int happens during the assignment.

See cppreference section on arithmetic operator conversion.

This can be verified with the following code: static_assert(std::is_same_v<std::common_type_t<long,int>,long>);

jacobmerson avatar May 29 '21 05:05 jacobmerson

Thanks for the advice. Before I went to sleep last night (and before I saw this), I put the following conditional print statements into setCoords:

Gid max = getMax(globalToVert);
Gid total = max + 1;
int peers = PCU_Comm_Peers();
int quotient = total / peers;
int remainder = total % peers;
int mySize = quotient;
int self = PCU_Comm_Self();
if (self == (peers - 1))
  mySize += remainder;
int myOffset = self * quotient;

/* Force each peer to have exactly mySize verts. This means we
   might need to send and recv some coords */
double* c = new double[mySize*3];

int start = PCU_Exscan_Int(nverts);

PCU_Comm_Begin();
int to = std::min(peers - 1, start / quotient);
int n = std::min((to+1)*quotient-start, nverts);
if (n > 100000000) {
  lion_eprint(1, "setCoords int overflow of: self=%d,mySize=%d,total=%ld, n=%d,to=%d, quotient=%d, remainder=%d start=%d, peers=%d \n",self,mySize,total,n,to,quotient,remainder,start,peers);
  Gid peersG = PCU_Comm_Peers();
  Gid quotientG = total / peersG;
  Gid remainderG = total % peersG;
  lion_eprint(1, "setCoords Gid0test: self=%d,mySize=%d,total=%ld, quotientG=%ld, peers=%ld \n",self,mySize,total,quotientG,remainderG,peersG);
}

while (nverts > 0) {
  PCU_COMM_PACK(to, start);
  PCU_COMM_PACK(to, n);
  PCU_Comm_Pack(to, coords, n*3*sizeof(double));

They produced no output. Maybe my code was wrong, or perhaps lion_eprint is buffering??? But if it is not, then the theory that it is n which is blowing up is false. Is there a way to force lion_eprint to empty its buffer?

Note I suppose I can also dig back to past messages from Cameron to try an addr2line command to confirm what line number we are actually crashing on in setCoords. We also have cores if someone wants to tell me how to mine them for information.

I guess I can also try to use the hints from Jacob, but I will need to read up there too. NAS does not provide a very modern gcc (6.2 is what they provide and what I am using).

KennethEJansen avatar May 29 '21 12:05 KennethEJansen

addr2line is not helping (or I am not using it correctly)

MPT: #12 0x0000000000523ab2 in PCU_Comm_Pack ()
MPT: #13 0x0000000000508bd8 in apf::setCoords(apf::Mesh2*, double const*, int, std::map<long, apf::MeshEntity*, std::less<long>, std::allocator<std::pair<long const, apf::MeshEntity*> > >&) ()
MPT: #14 0x0000000000496fd4 in main ()
MPT: (gdb) A debugging session is active.

kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner> addr2line -e /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader 0x0000000000508bd8
??:?
kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner> addr2line -e /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader 0x0000000000523ab2
??:?

That said, setCoords calls PCU_Comm_Pack only twice: the place we have been staring at

PCU_Comm_Pack(to, coords, n*3*sizeof(double));

and PCU_Comm_Pack(to, &c[i*3], 3*sizeof(double));

Since that last one is only 3 doubles long, I don't see any way it could not be the first one, which means n MUST be blowing up, which I guess means lion_eprint is not emptying its buffer to help me debug this.

I am not sure what changed, but I am also no longer getting arguments (e.g., nverts=xxx) in the stack trace, viz.

MPT: #9  0x0000000000527ddd in noto_realloc ()
MPT: #10 0x0000000000526d11 in pcu_push_buffer ()
MPT: #11 0x0000000000527268 in pcu_msg_pack ()
MPT: #12 0x0000000000523ab2 in PCU_Comm_Pack ()
MPT: #13 0x0000000000508bd8 in apf::setCoords(apf::Mesh2*, double const*, int, std::map<long, apf::MeshEntity*, std::less<long>, std::allocator<std::pair<long const, apf::MeshEntity*> > >&) ()
MPT: #14 0x0000000000496fd4 in main ()
MPT: (gdb) A debugging session is active.

But I am still getting this as the first ERROR output:

PUMI error: realloc(0x2405ebce0, 18446744058928065000) failed
PUMI error: realloc(0x3ccbd4ed0, 18446744061822745920) failed
PUMI error: realloc(0x2d6178d60, 18446744065875299208) failed
PUMI error: realloc(0xeb3a6050, 18446744057770192632) failed
PUMI error: realloc(0x200168420, 18446744062401682104) failed
PUMI error: realloc(0x799ff4820, 18446744064717426840) failed
PUMI error: realloc(0x3d737ab50, 18446744061243809736) failed
PUMI error: realloc(0x21136aef0, 18446744069348916312) failed
PUMI error: realloc(0x2f8aa4e60, 18446744060664873552) failed
PUMI error: realloc(0x1c9a2c570, 18446744058349128816) failed
PUMI error: realloc(0x5a8591d80, 18446744064138490656) failed
PUMI error: realloc(0x2401c2f70, 18446744072243597232) failed
PUMI error: realloc(0x2d38cb810, 18446744063559554472) failed
PUMI error: realloc(0x3ccbc6d20, 18446744071085724864) failed
PUMI error: realloc(0x3d2f91fb0, 18446744062980618288) failed
PUMI error: realloc(0x226d3dd70, 18446744066454235392) failed
PUMI error: realloc(0x5bcd4f1c0, 18446744073401469600) failed
PUMI error: realloc(0x259cedc60, 18446744060085937368) failed
PUMI error: realloc(0x59fe5e550, 18446744071664661048) failed
PUMI error: realloc(0x3ccbc5da0, 18446744067033171576) failed
PUMI error: realloc(0x226bd3430, 18446744069927852496) failed
PUMI error: realloc(0x2d34b28d0, 18446744068191043944) failed
PUMI error: realloc(0x1a42608a0, 18446744068769980128) failed
PUMI error: realloc(0x3d0f92750, 18446744072822533416) failed
PUMI error: realloc(0x2d3603280, 18446744070506788680) failed
PUMI error: realloc(0x1b680a1c0, 18446744065296363024) failed
PUMI error: realloc(0x2d34a3810, 18446744067612107760) failed
PUMI error: realloc(0xeb376ba0, 18446744057191256448) failed
PUMI error: realloc(0x2e07a4db0, 18446744059507001184) failed

There are 29 of them. Not sure if that tells us anything, but at least not all 80 processes are reporting this error before crashing.

KennethEJansen avatar May 29 '21 13:05 KennethEJansen

Core file interrogation with gdb is not helping either.

kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner> gdb /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader core.19660
GNU gdb (GDB; SUSE Linux Enterprise 12) 8.3.1
Reading symbols from /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader...
BFD: warning: /nobackupp2/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner/core.19660 is truncated: expected core file size >= 35391074304, found: 32594722816

warning: core file may not match specified executable file.
[New LWP 19660]
Cannot access memory at address 0x2aaaaacce128
Cannot access memory at address 0x2aaaaacce120
Failed to read a valid object file image from memory.
Core was generated by `/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader geom3D.c'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00002aaab019b247 in ?? ()

KennethEJansen avatar May 29 '21 15:05 KennethEJansen

lion_eprint

Are you capturing stdout and stderr? lion_eprint writes to stderr; e stands for error.

The verbosity level set here

https://github.com/SCOREC/core/blob/6dd96dbb378cac7815d41da34f0ee23a983da4f9/test/matchedNodeElmReader.cc#L790

is correct for the added call(s) to lion_eprint; i.e., lion_verbosity_level >= the level set for the given print statement.

https://github.com/SCOREC/core/blob/4d659af3c5f01420013823e2a2b6a354e7d92deb/lion/lionPrint.c#L32

addr2line and gdb

The usage looks OK to me. As a sanity check, what is the output of the following command?

file /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader

cwsmith avatar May 29 '21 16:05 cwsmith

kjansen@pfe27:/nobackup/kjansen/SeparatedBump/DNS/ReL_2M/MGEN/mner> file /home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader
/home5/kjansen/SCOREC-core/buildMGEN_write3D/test/matchedNodeElmReader: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.0.0, not stripped

KennethEJansen avatar May 29 '21 16:05 KennethEJansen

Yes, I am looking in both stdout and stderr. Yes, line 790 has what you show so I would expect that my lion_eprint statements are not being stopped by verbosity level.

Does lion_eprint flush its buffer? If not, given that these are likely the lines written right before a crash, they may not get flushed before MPI kills the job?

Is there a function to flush the buffer?

KennethEJansen avatar May 29 '21 16:05 KennethEJansen

OK. The output lists 'not stripped' so there are debug symbols... GDB and addr2line should work. I guess MPT may be doing something unexpected.

lion_eprint writes to stderr which is not buffered on systems I'm familiar with.

https://github.com/SCOREC/core/blob/4d659af3c5f01420013823e2a2b6a354e7d92deb/lion/lionPrint.c#L52

cwsmith avatar May 29 '21 16:05 cwsmith

OK, then I guess we would have to conclude that n is NOT > 100M.

I have made a code that makes every variable in this routine Gid and am waiting for it to run. Here is the altered source. Note I have duplicated nverts to nvertsG and still use nverts in some function calls that I suspect expect an int and not a long, and am hoping that setting that equal to a Gid variable gets cast properly across the assignment, though I am still unsure about what C++ does here. I know what Fortran does, and this works there.

void setCoords(Mesh2* m, const double* coords, int nverts, GlobalToVert& globalToVert)
{
  Gid nvertsG = nverts;
  Gid max = getMax(globalToVert);
  Gid total = max + 1;
  Gid peers = PCU_Comm_Peers();
  Gid quotient = total / peers;
  Gid remainder = total % peers;
  Gid mySize = quotient;
  Gid self = PCU_Comm_Self();
  if (self == (peers - 1))
    mySize += remainder;
  Gid myOffset = self * quotient;

  /* Force each peer to have exactly mySize verts. This means we
     might need to send and recv some coords */
  double* c = new double[mySize*3];

  Gid start = PCU_Exscan_Int(nverts);

  PCU_Comm_Begin();
  Gid to = std::min(peers - 1, start / quotient);
  Gid n = std::min((to+1)*quotient-start, nvertsG);
  if (n > 100000000) {
    lion_eprint(1, "setCoords int overflow of: self=%d,mySize=%d,total=%ld, n=%d,to=%d, quotient=%d, remainder=%d start=%d, peers=%d \n",self,mySize,total,n,to,quotient,remainder,start,peers);
    Gid peersG = PCU_Comm_Peers();
    Gid quotientG = total / peersG;
    Gid remainderG = total % peersG;
    lion_eprint(1, "setCoords Gid0test: self=%d,mySize=%d,total=%ld, quotientG=%ld, peers=%ld \n",self,mySize,total,quotientG,remainderG,peersG);
  }

  while (nvertsG > 0) {
    PCU_COMM_PACK(to, start);
    PCU_COMM_PACK(to, n);
    PCU_Comm_Pack(to, coords, n*3*sizeof(double));
    nvertsG -= n;
    start += n;
    coords += n*3;
    to = std::min(peers - 1, to + 1);
    n = std::min(quotient, nvertsG);
  }
  PCU_Comm_Send();
  while (PCU_Comm_Receive()) {
    PCU_COMM_UNPACK(start);
    PCU_COMM_UNPACK(n);
    PCU_Comm_Unpack(&c[(start - myOffset) * 3], n*3*sizeof(double));
  }

  /* Tell all the owners of the coords what we need */
  typedef std::vector< std::vector<Gid> > TmpParts;
  TmpParts tmpParts(mySize);
  PCU_Comm_Begin();
  APF_CONST_ITERATE(GlobalToVert, globalToVert, it) {
    Gid gid = it->first;
    Gid to = std::min(peers - 1, gid / quotient);

I will push this code and pull it back to the viz nodes to see if it still works on the small case, as I am waiting in the queue anyway.

KennethEJansen avatar May 29 '21 17:05 KennethEJansen

Confirmed. Gid for all integers in setCoords works for the small problem on the viz nodes.

verify is not commented out, so the mesh created is at least correct to that measure.

I guess we will see if this resolves the int overflow issue in PCU_Comm_Pack

KennethEJansen avatar May 29 '21 17:05 KennethEJansen

While I had the code there, I also modified it to trigger my debug statements when n > 10000 instead of 100M. At least on the viz nodes, this did produce the expected output from lion_eprint. From this we may??? conclude that, at the time the conditional is evaluated, n is NOT overflowing its int, and thus n*3*sizeof(double) should not be the absurd number that PCU_Comm_Pack is trying to allocate??

KennethEJansen avatar May 29 '21 17:05 KennethEJansen

I'm just loosely tracking what you guys are doing, but for the assignment you are doing, the implicit conversion is functionally equivalent to the following.

int nverts;
long nvertsG = static_cast<long>(nverts);

This is safe since it is a widening conversion i.e., long has at least as many bits as int.

jacobmerson avatar May 29 '21 20:05 jacobmerson

So my lion_eprint produced 59 sensible statements (on 80 processes) and 21 core files that did not report their output. I think I need to throw a barrier before the call to PCU_Comm_Pack so that I can get all of the output. If anybody can save me searching the code for an example of how you folks do barriers, that would be great.

KennethEJansen avatar May 30 '21 20:05 KennethEJansen

I think you can do the following: https://github.com/SCOREC/core/blob/4d659af3c5f01420013823e2a2b6a354e7d92deb/pcu/PCU.h#L55

Alternatively, you can get the MPI Comm and do whatever you want... https://github.com/SCOREC/core/blob/4d659af3c5f01420013823e2a2b6a354e7d92deb/pcu/PCU.h#L121

jacobmerson avatar May 30 '21 20:05 jacobmerson

Yep. I just added this:

if (n > 1000) {
  Gid sizeToSend = n*3*sizeof(double);
  lion_eprint(1, "setCoords int overflow of: self=%ld,mySize=%ld,total=%ld, n=%ld,to=%ld, quotient=%ld, remainder=%ld start=%ld, peers=%ld, sizeToSend=%ld, nvertsG=%ld, nverts=%u \n",self,mySize,total,n,to,quotient,remainder,start,peers,sizeToSend,nvertsG,nverts);
  // Gid peersG = PCU_Comm_Peers();
  // Gid quotientG = total / peersG;
  // Gid remainderG = total % peersG;
  // lion_eprint(1, "setCoords Gid0test: self=%d,mySize=%d,total=%ld, quotientG=%ld, remainderG=%ld,peers=%ld \n",self,mySize,total,quotientG,remainderG,peersG);
}
PCU_Barrier();

while (nvertsG > 0) {
  PCU_COMM_PACK(to, start);
  PCU_COMM_PACK(to, n);
  PCU_Comm_Pack(to, coords, n*3*sizeof(double));

KennethEJansen avatar May 30 '21 21:05 KennethEJansen

That said, I don't see how our problem could be located here. Let's test this logic path:

  1. this code streams in N mesh nodes from a single coordinates file into 80 processes
  2. the logic is such that N/80 go onto each of the first 79 processes and process 80 gets that number plus the < 80 leftovers
  3. thus we should have 79 equal-sized node counts and one process with negligibly more when N is almost 2B
  4. so that is true of total nodes, but I guess what this routine is doing is figuring out how many are SHARED with a peer; this HAS to be less than or equal to ALL of the nodes on a given part and is usually about 1/10 of that for very heavy parts like this.

To help us, this is the output from one of the 59 "successful" processes that did not core dump:

```
setCoords int overflow of: self=26,mySize=15217533,total=1217402706, n=11955520,to=41, quotient=15217533, remainder=66 start=627180866, peers=80, sizeToSend=286932480, nvertsG=24122341, nverts=24122341
```

Decoding, I think this means (CWS can confirm): rank 26 thinks its size is 15.2M of a total of 1.2B????? Wait, why is this less than the 1.9xxB we see here?

```
numVerts 1929787324
0 readMatches numvtx 1929787324 localnumvtx 24122341 firstVtx 0 lastVtx 24122341
1 readMatches numvtx 1929787324 localnumvtx 24122341 firstVtx 24122341 lastVtx 48244682
2 readMatches numvtx 1929787324 localnumvtx 24122341 firstVtx 48244682 lastVtx 72367023
```

since there are 80 parts we expect 24M nodes per part, so I don't know where total=1.2B comes from. OK, I see in the code it comes from

```c++
void setCoords(Mesh2* m, const double* coords, int nverts,
    GlobalToVert& globalToVert)
{
  Gid nvertsG=nverts;
  Gid max = getMax(globalToVert);
```

But how can that be so far off (1.2B instead of 1.9B)?

nverts and nvertsG look OK in that they agree with what the match line reports, but I don't know what globalToVert is doing?

Double checking my small case, total makes sense there

```
(base) kjansen@viz002: /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-12-30/MGEN4mner_noIDX/mner $ mpirun -np 8 /projects/tools/SCOREC-core/buildDbg/test/matchedNodeElmReader ../geom3D.cnn_data ../geom3D.crd ../geom3D.match ../geom3D.class ../geom3D.fathers2D NULL ../geom3DHead.cnn outModel.dmg outMesh/
numVerts 2560663
0 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 0 lastVtx 320082
1 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 320082 lastVtx 640164
2 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 640164 lastVtx 960246
3 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 960246 lastVtx 1280328
4 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1280328 lastVtx 1600410
5 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1600410 lastVtx 1920492
6 readMatches numvtx 2560663 localnumvtx 320082 firstVtx 1920492 lastVtx 2240574
7 readMatches numvtx 2560663 localnumvtx 320089 firstVtx 2240574 lastVtx 2560663
isMatched 1
setCoords int overflow of: self=1,mySize=320082,total=2560663, n=320082,to=1, quotient=320082, remainder=7 start=320082, peers=8, sizeToSend=7681968, nvertsG=320082, nverts=320082
setCoords int overflow of: self=2,mySize=320082,total=2560663, n=320082,to=2, quotient=320082, remainder=7 start=640164, peers=8, sizeToSend=7681968, nvertsG=320082, nverts=320082
setCoords int overflow of: self=3,mySize=320082,total=2560663, n=320082,to=3, quotient=320082, remainder=7 start=960246, peers=8, sizeToSend=7681968, nvertsG=320082, nverts=320082
setCoords int overflow of: self=4,mySize=320082,total=2560663, n=320082,to=4, quotient=320082, remainder=7 start=1280328, peers=8, sizeToSend=7681968, nvertsG=320082, nverts=320082
setCoords int overflow of: self=5,mySize=320082,total=2560663, n=320082,to=5, quotient=320082, remainder=7 start=1600410, peers=8, sizeToSend=7681968, nvertsG=320082, nverts=320082
setCoords int overflow of: self=6,mySize=320082,total=2560663, n=320082,to=6, quotient=320082, remainder=7 start=1920492, peers=8, sizeToSend=7681968, nvertsG=320082, nverts=320082
setCoords int overflow of: self=0,mySize=320082,total=2560663, n=320082,to=0, quotient=320082, remainder=7 start=0, peers=8, sizeToSend=7681968, nvertsG=320082, nverts=320082
setCoords int overflow of: self=7,mySize=320089,total=2560663, n=320082,to=7, quotient=320082, remainder=7 start=2240574, peers=8, sizeToSend=7681968, nvertsG=320089, nverts=320089
```

So I guess this means we have been staring at the wrong chunk of code. globalToVert is messed up long before we get here, as the total derived from it SHOULD equal the total number of vertices in the serial mesh.

KennethEJansen avatar May 30 '21 21:05 KennethEJansen