
Draft: Memory Space Aware Packing


  • [x] Pack device-packable objects on device whenever possible, otherwise pack on host (dispatch sketched below this list)
  • [x] Cleanup packing traits
  • [x] Unit tests passing
  • [ ] Regression tests passing on quartz and lassen
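
A minimal sketch of the dispatch described in the first item, using hypothetical names (is_device_packable, packOnDevice, packOnHost, packWrapper) rather than the actual GEOS API; the "device" path is a plain memcpy here so the sketch runs anywhere:

#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical trait: for this sketch, any trivially copyable type counts as device-packable.
template< typename T >
struct is_device_packable : std::is_trivially_copyable< T > {};

// Stand-in for a real device-side pack (which would serialize from device
// memory, e.g. through a RAJA kernel); here it is just a host memcpy.
template< typename T >
std::size_t packOnDevice( char * & buffer, T const & value )
{
  std::memcpy( buffer, &value, sizeof( T ) );
  buffer += sizeof( T );
  return sizeof( T );
}

// Host fallback (for brevity this sketch also just memcpys; a real host path
// would handle non-trivial types differently).
template< typename T >
std::size_t packOnHost( char * & buffer, T const & value )
{
  std::memcpy( buffer, &value, sizeof( T ) );
  buffer += sizeof( T );
  return sizeof( T );
}

// The dispatch: pack on device whenever the type allows it and device packing
// was requested, otherwise pack on host.
template< typename T >
std::size_t packWrapper( char * & buffer, T const & value, bool const onDevice )
{
  if constexpr ( is_device_packable< T >::value )
  {
    if( onDevice )
    {
      return packOnDevice( buffer, value );
    }
  }
  return packOnHost( buffer, value );
}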

wrtobin avatar Apr 03 '23 20:04 wrtobin

~~I had an issue merging and testing this branch locally because of the integratedTests~~ I managed to merge and compile it.

Trying this on an example with #2403 (output scalar values) gives me memory corruption:

Rank 0: Writing out restart file at co2/co2_flux_3d_restart_000000000/rank_0000000.hdf5
double free or corruption (out)

Thread 1 "geosx" received signal SIGABRT, Aborted.
0x00007fffed2a08ec in ?? () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007fffed2a08ec in ?? () from /usr/lib/libc.so.6
#1  0x00007fffed251ea8 in raise () from /usr/lib/libc.so.6
#2  0x00007fffed23b53d in abort () from /usr/lib/libc.so.6
#3  0x00007fffed23c29e in ?? () from /usr/lib/libc.so.6
#4  0x00007fffed2aa657 in ?? () from /usr/lib/libc.so.6
#5  0x00007fffed2ac7b0 in ?? () from /usr/lib/libc.so.6
#6  0x00007fffed2aee63 in free () from /usr/lib/libc.so.6
#7  0x00007fffeef7b7a7 in mca_coll_base_comm_unselect () from /usr/lib/libmpi.so.40
#8  0x00007fffeef07dcf in ?? () from /usr/lib/libmpi.so.40
#9  0x00007fffeef044a1 in ompi_comm_free () from /usr/lib/libmpi.so.40
#10 0x00007fffeef3bb37 in PMPI_Comm_free () from /usr/lib/libmpi.so.40
#11 0x00007ffff3579fc5 in geos::MpiWrapper::commFree (comm=@0x555555e46a28: 0x555555e2fb50)
    at /GEOSX/src/coreComponents/common/MpiWrapper.cpp:70
#12 0x00007ffff46eca15 in geos::HDFHistoryIO::setupPartition (this=0x555555e46910, localIdxCount=1000)
    at /GEOSX/src/coreComponents/fileIO/timeHistory/HDFHistoryIO.cpp:174
#13 0x00007ffff46ed9b0 in geos::HDFHistoryIO::write (this=0x555555e46910)
    at /GEOSX/src/coreComponents/fileIO/timeHistory/HDFHistoryIO.cpp:263
#14 0x00007ffff45b77bd in geos::TimeHistoryOutput::execute (this=0x555555921560)
    at /GEOSX/src/coreComponents/fileIO/Outputs/TimeHistoryOutput.cpp:156
#15 0x00007ffff6149c89 in geos::EventBase::execute (this=0x5555558b0e20, time_n=0, dt=50000, cycleNumber=0, domain=...)
    at /GEOSX/src/coreComponents/events/EventBase.cpp:226
#16 0x00007ffff6150a01 in geos::EventManager::run (this=0x55555562a8b0, domain=...)
    at /GEOSX/src/coreComponents/events/EventManager.cpp:189
#17 0x00007ffff616832c in geos::ProblemManager::runSimulation (this=0x55555585c4a0)
    at /GEOSX/src/coreComponents/mainInterface/ProblemManager.cpp:958
#18 0x00007ffff615ea18 in geos::GeosxState::run (this=0x7fffffffcd80)
    at /GEOSX/src/coreComponents/mainInterface/GeosxState.cpp:207
#19 0x00005555555585b5 in main (argc=5, argv=0x7fffffffd5b8)
    at /GEOSX/src/main/main.cpp:45

untereiner avatar Apr 19 '23 12:04 untereiner

Yeah, I need to make some time to come back around on this and debug the runtime issues. It likely needs some traits cleanup of some sort. I have some local work toward that, but haven't had time to set aside since last week.

wrtobin avatar Apr 19 '23 14:04 wrtobin

Thanks for the work! I was thinking of trying it and was hoping for more relevant feedback...

untereiner avatar Apr 19 '23 15:04 untereiner

Based on the stack trace, my assumption is that a piece of data to be packed wasn't serialized correctly (possibly an incorrect version of the packing functions was selected due to SFINAE, or the data was erroneously judged to be in the wrong memory space, depending on the platform), and the pack probably reported either an erroneous size (on at least one process) or a zero size (on all processes).

When handing off to the HDF file writer we create sub-communicators across which the collective MPI-IO file operations take place (the pattern is sketched after this comment). It is possible that, given the incorrect sizing information, this communicator was ill-formed in some way, or that the flawed packing simply caused a segmentation violation during the file write and the ultimate failure only surfaced at the next free() operation, which is the comm free.

Without attaching a debugger I can't say with any precision; the above is just a best estimate based on the changes in the branch.

wrtobin avatar Apr 19 '23 15:04 wrtobin
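
For context, a minimal sketch of the sub-communicator pattern referred to above, written against plain MPI with a hypothetical writePartition() helper (it is not the actual GEOS HDFHistoryIO::setupPartition code). The point is that heap corruption from an earlier bad pack often only surfaces at a later free(), such as the one inside MPI_Comm_free in the backtrace.

#include <mpi.h>

// Ranks that own history data split off a sub-communicator for the collective
// file write; the communicator is freed when the partition is rebuilt.
void writePartition( MPI_Comm world, long localIdxCount )
{
  int rank = 0;
  MPI_Comm_rank( world, &rank );

  // Ranks with no local data opt out of the I/O communicator.
  int const color = ( localIdxCount > 0 ) ? 0 : MPI_UNDEFINED;

  MPI_Comm ioComm = MPI_COMM_NULL;
  MPI_Comm_split( world, color, rank, &ioComm );

  if( ioComm != MPI_COMM_NULL )
  {
    // ... collective MPI-IO / HDF5 writes would happen here ...

    // If an earlier pack overflowed a heap buffer, the damage is often only
    // detected here, when the allocator revisits its bookkeeping during free().
    MPI_Comm_free( &ioComm );
  }
}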

Sorry, poor English... I meant that I was hoping to provide you with more relevant feedback on the work you have done. Anyway, thanks for your comments; I think I understand the flow better now.

untereiner avatar Apr 19 '23 15:04 untereiner

Ah! Solid.

Nothing in this branch changes our ability to pack scalar quantities; we still likely need to add support for that operation to BufferOps.[hc]pp, since those types should only ever be collected from the host memory space.

wrtobin avatar Apr 19 '23 15:04 wrtobin

There is the trait is_noncontainer_type_packable for int or double types, which I guess targets the host, and Pack has a specialization for trivial types (a matching definition is sketched below this comment):

template< bool DO_PACKING, typename T >
typename std::enable_if< std::is_trivial< T >::value, localIndex >::type
Pack( buffer_unit_type * & buffer,
      T const & var );

Isn't this enough?

untereiner avatar Apr 20 '23 12:04 untereiner
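
For reference, a definition matching that declaration typically looks roughly like the following sketch; buffer_unit_type and localIndex are assumed here to be a byte-like type and a signed index type, and this is not the actual GEOS BufferOps implementation.

#include <cstddef>
#include <cstring>
#include <type_traits>

using buffer_unit_type = signed char;   // assumption for this sketch
using localIndex = std::ptrdiff_t;      // assumption for this sketch

// Trivial types are memcpy'd into the buffer; the return value is the number
// of buffer units consumed, which is what the sizing pass relies on. A wrong
// overload choice or a wrong size here propagates into the history file layout.
template< bool DO_PACKING, typename T >
typename std::enable_if< std::is_trivial< T >::value, localIndex >::type
Pack( buffer_unit_type * & buffer,
      T const & var )
{
  if( DO_PACKING )
  {
    std::memcpy( buffer, &var, sizeof( T ) );
    buffer += sizeof( T );
  }
  return sizeof( T );
}

Whether this host-side path is the one actually selected for scalar time-history collection, or whether another SFINAE guard routes scalars elsewhere, is essentially the question being asked here.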

@wrtobin I locally resolved the conflicts against the develop branch. Can I push the commits to this branch? I also worked on a trait to enable scalar packing.

untereiner avatar May 04 '23 12:05 untereiner

> Based on the stack trace, my assumption is that a piece of data to be packed wasn't serialized correctly (possibly an incorrect version of the packing functions was selected due to SFINAE, or the data was erroneously judged to be in the wrong memory space, depending on the platform), and the pack probably reported either an erroneous size (on at least one process) or a zero size (on all processes). [...]

I think your comment is right. The error is already present at commit #c71fb78af. But besides the bool argument and the new traits I don't see where an incorrect version of the packing could be called. Did you manage to work on it recently?

untereiner avatar May 09 '23 17:05 untereiner

Hi @wrtobin, do you think you could take a look at it?

untereiner avatar Jun 13 '23 13:06 untereiner

@untereiner You should be able to start trying to use this again. I fixed the bug and things seem to be working.

The primary issue was that we were checking the device-packability of the wrapper-contained type itself instead of the decltype( referenceAsView() ) type, since device packing operates on the type returned by that function in order to translate e.g. arrays to arrayViews automatically (the distinction is sketched below this comment).

There is one remaining bug I'm still trying to iron out; fixing it would let me drop the LvArray PR, which currently only exists to let us bypass a host packing that collects uninitialized memory for some reason. I'm going to try to fix that and remove the LvArray PR, but this branch should be testable now.

wrtobin avatar Jun 20 '23 21:06 wrtobin
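
To illustrate the distinction described above, here is a self-contained sketch with stand-in types; Array, ArrayView, is_device_packable, and Wrapper below are hypothetical stand-ins rather than the GEOS/LvArray declarations. The trait has to be evaluated on the view type the wrapper actually hands to the packing routines, not on the stored container type.

#include <cstddef>
#include <type_traits>
#include <utility>

// Stand-ins: an owning container and the non-owning view handed to packing code.
template< typename T > struct Array     { T * data; std::size_t size; };
template< typename T > struct ArrayView { T * data; std::size_t size; };

// Hypothetical trait: only views of trivially copyable data are device-packable.
template< typename T > struct is_device_packable                   : std::false_type {};
template< typename T > struct is_device_packable< ArrayView< T > > : std::is_trivially_copyable< T > {};

template< typename T >
struct Wrapper
{
  Array< T > m_data;

  // Packing operates on the view type, mirroring referenceAsView() in GEOS.
  ArrayView< T > referenceAsView() { return { m_data.data, m_data.size }; }
};

// The bug: querying the trait on the stored container type always says "no"...
static_assert( !is_device_packable< Array< double > >::value,
               "the container type is not device-packable" );

// ...while querying it on decltype( referenceAsView() ), the type the pack
// routines actually receive, gives the intended answer.
static_assert( is_device_packable< decltype( std::declval< Wrapper< double > & >().referenceAsView() ) >::value,
               "the view type is device-packable" );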

Thank you very much @wrtobin ! I'll try it!

untereiner avatar Jun 21 '23 08:06 untereiner

@untereiner There is still a bug with gathering host-side data: we refactored the device-packing functions to allow packing either just the underlying data (for time history) or both metadata and underlying data (for MPI buffer packing).

The host-side packing functions weren't refactored in the same way, so packing on host currently leaves the buffer corrupted/misaligned and the time-history file comes out as garbage (the layout mismatch is sketched below this comment).

I'm working on the minimal host-packing refactoring needed to get things working; I'm mostly done, but still debugging a couple of things.

wrtobin avatar Jun 22 '23 15:06 wrtobin
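
A minimal sketch of the layout mismatch being described; packDataOnly, packWithMetadata, and buffer_t are hypothetical names, not the GEOS BufferOps API. Time history wants only the raw values, MPI exchange wants a size header plus the values, and a producer/consumer pair that disagrees on which layout is in the buffer misreads everything after the first entry.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using buffer_t = std::vector< char >;

// Data-only layout: just the values (what the time-history path expects).
void packDataOnly( buffer_t & buf, double const * vals, std::int64_t n )
{
  std::size_t const offset = buf.size();
  std::size_t const bytes = static_cast< std::size_t >( n ) * sizeof( double );
  buf.resize( offset + bytes );
  std::memcpy( buf.data() + offset, vals, bytes );
}

// Metadata + data layout: a size header followed by the values (MPI exchange).
void packWithMetadata( buffer_t & buf, double const * vals, std::int64_t n )
{
  std::size_t const offset = buf.size();
  std::size_t const bytes = static_cast< std::size_t >( n ) * sizeof( double );
  buf.resize( offset + sizeof( n ) + bytes );
  std::memcpy( buf.data() + offset, &n, sizeof( n ) );
  std::memcpy( buf.data() + offset + sizeof( n ), vals, bytes );
}

// A reader expecting the data-only layout but handed the metadata+data layout
// (or vice versa) treats the 8-byte header as a value and shifts every
// subsequent read, which is the "garbage time-history file" symptom.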

Hi @wrtobin! I tested this branch together with mine, #2490, and it seems to work. I can output {max,min,average}pressure per time step from regionStatistics into an HDF5 file.

Do you have more work or bugs to iron out on this one?

untereiner avatar Jul 11 '23 13:07 untereiner

Just pushed a fix for the ctest failure. I'm going to run regression tests on quartz and lassen, and assuming they all pass I'll promote this for review / merge.

wrtobin avatar Jul 19 '23 18:07 wrtobin

Great! I wasn’t able to find this one

untereiner avatar Jul 19 '23 20:07 untereiner

Hi @wrtobin ! Are the regression tests passing ?

untereiner avatar Jul 26 '23 15:07 untereiner

Quartz was passing. I haven't been able to run on Lassen yet, and I'm at a hackathon working on kernel performance on AMD until Thursday. I should be able to focus on Lassen testing on Friday.

wrtobin avatar Jul 26 '23 15:07 wrtobin

Is Lassen passing @wrtobin ?

untereiner avatar Aug 04 '23 15:08 untereiner

No, but there are failing tests in develop on Lassen right now and the set of failures looks "similar", so I think this branch is probably fine. Still, I'm unwilling to merge a change to core infrastructure like this until our LC test platforms are all passing.

I'm hoping to take a look and try to resolve the issues in develop early next week, since we also have an OS upgrade coming for Lassen at the end of the month and we want the tests passing before the upgrade.

wrtobin avatar Aug 04 '23 15:08 wrtobin

Hello @wrtobin, do you think this could go in? It brings an important feature that would help us move forward on TimeHistory and on scalar data packing, which is critical to our end users.

TotoGaz avatar Sep 07 '23 22:09 TotoGaz

knock knock

paveltomin avatar Feb 27 '24 22:02 paveltomin

> knock knock

... knockin' on Heaven's door (?)

Hi @wrtobin! It's almost one year old! How do the regression tests look?

untereiner avatar Mar 06 '24 15:03 untereiner

There is a failure in two MPM tests. I anticipate it being relatively straightforward to track down tomorrow. Once that is done I will promote this from draft and place it in the queue.

Note there are still issues with our integrated tests on GPU platforms (unrelated to this PR) that prevent exhaustive vetting, but in the interest of utility we've chosen to merge this and resolve any potential (but unexpected) correctness bugs post-merge.

wrtobin avatar Apr 02 '24 23:04 wrtobin

Codecov Report

Attention: Patch coverage is 44.95413% with 240 lines in your changes missing coverage. Please review.

Project coverage is 53.25%. Comparing base (f05008b) to head (4311984).

Current head 4311984 differs from pull request most recent head 88935b9

Please upload reports for the commit 88935b9 to get more accurate results.

| Files | Patch % | Lines |
| --- | --- | --- |
| ...coreComponents/dataRepository/BufferOps_inline.hpp | 48.36% | 63 Missing |
| ...lvers/surfaceGeneration/ParallelTopologyChange.cpp | 0.00% | 27 Missing |
| src/coreComponents/dataRepository/Wrapper.hpp | 50.98% | 25 Missing |
| src/coreComponents/dataRepository/Group.cpp | 11.11% | 16 Missing |
| src/coreComponents/mesh/ParticleManager.cpp | 0.00% | 12 Missing |
| src/coreComponents/mesh/ObjectManagerBase.cpp | 62.06% | 11 Missing |
| ...coreComponents/mesh/EmbeddedSurfaceNodeManager.cpp | 0.00% | 9 Missing |
| ...hysicsSolvers/solidMechanics/SolidMechanicsMPM.cpp | 0.00% | 8 Missing |
| ...c/coreComponents/dataRepository/wrapperHelpers.hpp | 45.45% | 6 Missing |
| ...sics/SinglePhasePoromechanicsEmbeddedFractures.hpp | 0.00% | 6 Missing |

... and 29 more
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2369      +/-   ##
===========================================
- Coverage    55.77%   53.25%   -2.53%     
===========================================
  Files         1035      987      -48     
  Lines        87914    83540    -4374     
===========================================
- Hits         49035    44488    -4547     
- Misses       38879    39052     +173     


codecov[bot] avatar Apr 03 '24 00:04 codecov[bot]