Draft: Memory Space Aware Packing
- [x] Pack device-packable objects on device whenever possible, otherwise pack on host (see the sketch after this list)
- [x] Cleanup packing traits
- [x] Unit tests passing
- [ ] Regression tests passing on quartz and lassen
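In sketch form, the dispatch in the first item could look like the following. All identifiers here (`is_device_packable`, `pack`, the type aliases) are illustrative stand-ins, not the actual GEOS packing machinery: SFINAE selects a device path for types that opt in via a trait, and a host path otherwise.

```cpp
// Illustrative sketch only: every name below is a hypothetical
// stand-in for the real GEOS packing interfaces.
#include <cstddef>
#include <type_traits>

using buffer_unit_type = signed char;
using localIndex = std::ptrdiff_t;

// Trait a type specializes to opt in to device-side packing.
template< typename T >
struct is_device_packable : std::false_type {};

// Device path: in GEOS this would launch a kernel to serialize
// var directly from device memory into the buffer.
template< bool DO_PACKING, typename T >
std::enable_if_t< is_device_packable< T >::value, localIndex >
pack( buffer_unit_type * & buffer, T const & var );

// Host fallback: serialize var from host memory instead.
template< bool DO_PACKING, typename T >
std::enable_if_t< !is_device_packable< T >::value, localIndex >
pack( buffer_unit_type * & buffer, T const & var );
```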
~~I'm having trouble merging and testing this branch locally because of the integratedTests~~ I managed to merge and compile it.
Trying this on an example with #2403 (output scalar values) gives me memory corruption:
```
Rank 0: Writing out restart file at co2/co2_flux_3d_restart_000000000/rank_0000000.hdf5
double free or corruption (out)

Thread 1 "geosx" received signal SIGABRT, Aborted.
0x00007fffed2a08ec in ?? () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007fffed2a08ec in ?? () from /usr/lib/libc.so.6
#1  0x00007fffed251ea8 in raise () from /usr/lib/libc.so.6
#2  0x00007fffed23b53d in abort () from /usr/lib/libc.so.6
#3  0x00007fffed23c29e in ?? () from /usr/lib/libc.so.6
#4  0x00007fffed2aa657 in ?? () from /usr/lib/libc.so.6
#5  0x00007fffed2ac7b0 in ?? () from /usr/lib/libc.so.6
#6  0x00007fffed2aee63 in free () from /usr/lib/libc.so.6
#7  0x00007fffeef7b7a7 in mca_coll_base_comm_unselect () from /usr/lib/libmpi.so.40
#8  0x00007fffeef07dcf in ?? () from /usr/lib/libmpi.so.40
#9  0x00007fffeef044a1 in ompi_comm_free () from /usr/lib/libmpi.so.40
#10 0x00007fffeef3bb37 in PMPI_Comm_free () from /usr/lib/libmpi.so.40
#11 0x00007ffff3579fc5 in geos::MpiWrapper::commFree (comm=@0x555555e46a28: 0x555555e2fb50)
    at /GEOSX/src/coreComponents/common/MpiWrapper.cpp:70
#12 0x00007ffff46eca15 in geos::HDFHistoryIO::setupPartition (this=0x555555e46910, localIdxCount=1000)
    at /GEOSX/src/coreComponents/fileIO/timeHistory/HDFHistoryIO.cpp:174
#13 0x00007ffff46ed9b0 in geos::HDFHistoryIO::write (this=0x555555e46910)
    at /GEOSX/src/coreComponents/fileIO/timeHistory/HDFHistoryIO.cpp:263
#14 0x00007ffff45b77bd in geos::TimeHistoryOutput::execute (this=0x555555921560)
    at /GEOSX/src/coreComponents/fileIO/Outputs/TimeHistoryOutput.cpp:156
#15 0x00007ffff6149c89 in geos::EventBase::execute (this=0x5555558b0e20, time_n=0, dt=50000, cycleNumber=0, domain=...)
    at /GEOSX/src/coreComponents/events/EventBase.cpp:226
#16 0x00007ffff6150a01 in geos::EventManager::run (this=0x55555562a8b0, domain=...)
    at /GEOSX/src/coreComponents/events/EventManager.cpp:189
#17 0x00007ffff616832c in geos::ProblemManager::runSimulation (this=0x55555585c4a0)
    at /GEOSX/src/coreComponents/mainInterface/ProblemManager.cpp:958
#18 0x00007ffff615ea18 in geos::GeosxState::run (this=0x7fffffffcd80)
    at /GEOSX/src/coreComponents/mainInterface/GeosxState.cpp:207
#19 0x00005555555585b5 in main (argc=5, argv=0x7fffffffd5b8)
    at /GEOSX/src/main/main.cpp:45
```
Yeah, I need to make some time to come back around on this and debug the runtime issues. It likely needs some cleanup of the traits. I have some local work toward that, but haven't had time to set aside for it since last week.
Thanks for the work! I was trying it out and was hoping to provide more relevant feedback...
Based on the stack trace, my assumption is that a piece of data to be packed wasn't serialized correctly (possibly an incorrect version of the packing functions was selected via SFINAE, or the data was erroneously believed to be in the wrong memory space, depending on the platform), and the packing probably reported either an erroneous size (on at least one process) or a zero size (on all processes).
When going to the HDF file writer, we create subcommunicators for the MPIO file operations to take place collectively across. It is possible that, based on the incorrect sizing information, this communicator was ill-formed in some way; or a segmentation violation simply occurred during the file write due to the flawed packing, and the ultimate failure surfaced at the next free() operation, which happened to be the comm free.
Without attaching a debugger I can't say with any precision; the above is all just a best estimate based on the changes in the branch.
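For context on the `PMPI_Comm_free` frame in the trace, here is the general pattern only (not the GEOS code; `localIdxCount` and `ioComm` are simplified stand-ins): ranks that have data to write join a sub-communicator for the collective MPIO operations, and that communicator is freed afterwards. Once the heap has been corrupted by a flawed pack, the abort can surface inside exactly such a free.

```cpp
// Pattern sketch, not the GEOS implementation.
#include <mpi.h>

int main( int argc, char ** argv )
{
  MPI_Init( &argc, &argv );

  int rank = 0;
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );

  // Stand-in for the per-rank packed-data size; a wrong size here
  // changes which ranks join the I/O communicator.
  long const localIdxCount = ( rank % 2 == 0 ) ? 1000 : 0;

  // Ranks with data share color 0; the rest opt out of the subcomm.
  int const color = ( localIdxCount > 0 ) ? 0 : MPI_UNDEFINED;
  MPI_Comm ioComm = MPI_COMM_NULL;
  MPI_Comm_split( MPI_COMM_WORLD, color, rank, &ioComm );

  if( ioComm != MPI_COMM_NULL )
  {
    // ... collective HDF5/MPIO writes across ioComm would go here ...
    MPI_Comm_free( &ioComm );  // a corrupted heap can abort in this free
  }

  MPI_Finalize();
  return 0;
}
```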
Sorry for the poor English... I meant that I was hoping to provide more relevant feedback to you on the work you have done. Anyway, thanks for your comments; I think I understand the flow better now.
Ah! Solid.
Nothing in this branch changes our ability to pack scalar quantities; we still likely need to add support for that operation to BufferOps.[hc]pp, since those types should only ever be collected from the host memory space.
There is the trait `is_noncontainer_type_packable` for int or double types, which targets the host, I guess.
And `Pack` has a specialization for trivial types:
```cpp
template< bool DO_PACKING, typename T >
typename std::enable_if< std::is_trivial< T >::value, localIndex >::type
Pack( buffer_unit_type * & buffer,
      T const & var );
```
Isn't this enough?
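For what it's worth, a plausible body for that declaration (the actual GEOS definition may differ) just memcpy's the scalar and advances the buffer, with `DO_PACKING = false` used for size-only queries:

```cpp
// Hypothetical body -- check bufferOps in the repo for the real one.
// Requires <cstring>, <type_traits>, and the GEOS
// buffer_unit_type/localIndex aliases.
template< bool DO_PACKING, typename T >
typename std::enable_if< std::is_trivial< T >::value, localIndex >::type
Pack( buffer_unit_type * & buffer,
      T const & var )
{
  if( DO_PACKING )
  {
    std::memcpy( buffer, &var, sizeof( T ) );
    buffer += sizeof( T );
  }
  return sizeof( T );  // byte count, whether or not we actually packed
}
```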
@wrtobin I locally resolved the conflicts against the develop branch. Can I push the commits to this branch? I also worked on a trait to enable scalar packing.
I think your comment is right. The error is already present at commit c71fb78af. But besides the bool argument and the new traits, I don't see where an incorrect version of the packing could be called. Did you manage to work on it recently?
Hi @wrtobin, do you think you could take a look at it?
@untereiner You should be able to start trying to use this again. I fixed the bug and things seem to be working.
The primary issue was checking the device packability of the wrapper-contained type itself instead of the `decltype( referenceAsView() )` type, since device-packing operates on the type returned by that function in order to translate e.g. arrays to arrayViews automatically.
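To illustrate the distinction with simplified stand-ins for the real Wrapper/Array/ArrayView classes (the GEOS/LvArray types are far more involved): the trait must be evaluated on the type `referenceAsView()` returns, not on the stored type.

```cpp
// Simplified stand-ins, not the real GEOS/LvArray types.
#include <type_traits>
#include <utility>

struct ArrayView {};                                       // device-packable view
struct Array { ArrayView toView() const { return {}; } };  // owning container

template< typename T >
struct is_device_packable : std::false_type {};
template<>
struct is_device_packable< ArrayView > : std::true_type {};

template< typename T >
struct Wrapper
{
  T m_data;
  auto referenceAsView() const { return m_data.toView(); }
};

// Wrong: queries the wrapper-contained type itself.
static_assert( !is_device_packable< Array >::value,
               "the stored type is not what gets packed" );

// Right: queries the type the packer is actually handed.
using ViewType = decltype( std::declval< Wrapper< Array > const & >().referenceAsView() );
static_assert( is_device_packable< ViewType >::value,
               "the view type is what gets packed" );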
There is one bug I'm still trying to iron out; fixing it would let me remove the LvArray PR, which currently exists only to let us bypass a host-packing path that collects uninitialized memory for some reason. I'm going to try to fix that and drop the LvArray PR, but this branch should be testable now.
Thank you very much, @wrtobin! I'll try it!
@untereiner There is still a bug in gathering host-side data: we refactored the device-packing functions so we can pack either just the underlying data (for time history) or both metadata and underlying data (for MPI buffer packing).
The host-side packing functions weren't similarly refactored, so packing on host currently results in a corrupted/misaligned buffer, and the time-history file will be garbage.
I'm working on the minimal set of host-packing refactoring needed to get things working. I'm mostly done, but still debugging a couple of things. A sketch of the data-only vs. metadata-plus-data split follows below.
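In sketch form (illustrative names, not the actual BufferOps signatures): the metadata-inclusive entry point wraps the data-only one, and the host and device paths must emit the same layout or the reader walks the buffer misaligned, which is the host-side bug described above.

```cpp
// Illustrative sketch of the two packing entry points.
#include <cstddef>
#include <cstring>

using buffer_unit_type = signed char;
using localIndex = std::ptrdiff_t;

// Data-only: what time-history collection needs.
template< bool DO_PACKING >
localIndex packData( buffer_unit_type * & buffer,
                     double const * values, localIndex count )
{
  localIndex const bytes = count * sizeof( double );
  if( DO_PACKING )
  {
    std::memcpy( buffer, values, bytes );
    buffer += bytes;
  }
  return bytes;
}

// Metadata plus data: what MPI buffer exchange needs so the
// receiver can rebuild the array.
template< bool DO_PACKING >
localIndex pack( buffer_unit_type * & buffer,
                 double const * values, localIndex count )
{
  // Metadata first, then the payload; writer and reader must agree
  // on this layout or the buffer is traversed misaligned.
  localIndex bytes = 0;
  if( DO_PACKING )
  {
    std::memcpy( buffer, &count, sizeof( count ) );
    buffer += sizeof( count );
  }
  bytes += sizeof( count );
  bytes += packData< DO_PACKING >( buffer, values, count );
  return bytes;
}
```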
Hi @wrtobin! I tested this branch together with mine (#2490) and it seems to work. I can output the {max,min,average} pressure per time step from regionStatistics in an HDF5 file.
Do you have more work or bugs to iron out on this one?
Just pushed a fix for the ctest failure. I'm going to run regression tests on quartz and lassen and assuming they all pass I'll promote this for review / merge.
Great! I wasn’t able to find this one
Hi @wrtobin! Are the regression tests passing?
Quartz was passing. I haven't been able to run on Lassen yet, and I'm at a hackathon working on kernel performance on AMD until Thursday. I should be able to focus on Lassen testing on Friday.
Is Lassen passing, @wrtobin?
No, but there are failing tests in develop on Lassen right now, and the set of failures looks "similar", so I think this is probably fine; still, I'm unwilling to merge a change to core infrastructure like this until our LC test platforms are all passing.
I'm hoping to take a look and try to resolve the issues in develop early next week, as we also have an OS upgrade coming to Lassen at the end of the month and we want the tests passing before the upgrade.
Hello @wrtobin, do you think this could go in?
It brings an important feature that would help us move forward on TimeHistory, as well as scalar data packing, which is critical to our end users.
knock knock
knock knock
... knockin' on Heaven's door (?)
Hi @wrtobin! It's almost one year old! How do the regression tests look?
There is a failure in 2 mpm tests. I anticipate it being relatively straightforward to track down tomorrow. Once that is done I will promote this from draft and place it in the queue.
Note there are still issues with our integrated tests on GPU platforms (unrelated to this PR) that prevent exhaustive vetting, but in the interest of utility we've chosen to merge this and resolve any potential (but unexpected) correctness bugs post-merge.
Codecov Report
Attention: Patch coverage is 44.95413% with 240 lines in your changes missing coverage. Please review.
Project coverage is 53.25%. Comparing base (f05008b) to head (4311984).
:exclamation: Current head 4311984 differs from pull request most recent head 88935b9. Please upload reports for the commit 88935b9 to get more accurate results.
Additional details and impacted files
```diff
@@             Coverage Diff             @@
##           develop    #2369      +/-   ##
===========================================
- Coverage    55.77%    53.25%    -2.53%
===========================================
  Files         1035       987       -48
  Lines        87914     83540     -4374
===========================================
- Hits         49035     44488     -4547
- Misses       38879     39052      +173
```