bug: out of memory with "big I/O"
Originally by robl on 2014-10-01 20:04:45 -0500
[email protected] reports to me the following:
I finally got a test machine to try to get the bigio branch into shape to be merged into trunk, at least.
In HDF5, when I have processes accessing a big column in a dataset, I see an OOM failure. I traced it down to MPI_File_set_view().
I do not see a problem with big rows, but that is because the HDF5 hyperslab algorithm that transforms them into derived datatypes flattens out the 2nd dimension.
I replicated what the algorithm does in constructing the file type for a 2D dataset with 2 big columns, to trigger the OOM in set_view.
Running the program with 2 procs should replicate the problem.
mpirun -np 2 ./bigio_mpi test_file
This is with MPICH 3.1.2.
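(For reference, a minimal sketch of such a reproducer, reconstructed from the dataloop dump further down. The attached bigio_mpi.c is the authoritative test; everything here beyond the block count and element sizes is illustrative.)

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Datatype contig, elem, column;
    MPI_File fh;
    int rank;
    /* ~2^30 four-byte blocks, matching the dataloop dump below */
    const int nblocks = 1073741832;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* a 4-byte element with an 8-byte extent: reading one column of a
     * two-column dataset means skipping the other column's 4 bytes */
    MPI_Type_contiguous(4, MPI_BYTE, &contig);
    MPI_Type_create_resized(contig, 0, 8, &elem);
    /* vector(1, nblocks, 1) matches the "combiner: vector" in the dump:
     * one enormous column, >4 GiB in ~2^30 noncontiguous pieces */
    MPI_Type_vector(1, nblocks, 1, elem, &column);
    MPI_Type_commit(&column);

    MPI_File_open(MPI_COMM_WORLD, argv[1],
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    /* ROMIO flattens 'column' into per-block (offset, length) arrays
     * here, which is where the OOM occurs */
    MPI_File_set_view(fh, (MPI_Offset) rank * 4, MPI_BYTE, column,
                      "native", MPI_INFO_NULL);

    MPI_File_close(&fh);
    MPI_Type_free(&column);
    MPI_Type_free(&elem);
    MPI_Type_free(&contig);
    MPI_Finalize();
    return 0;
}
```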
Originally by robl on 2014-11-25 16:27:24 -0600
Attachment added: bigio_mpi.c (4.3 KiB)
Updated test case to clean up all resources. The last version had a too-small buffer.
Originally by robl on 2014-11-26 14:33:36 -0600
This is going to be a tough one for ROMIO: the flattened representation is larger than the type itself. To describe the flattened representation, ROMIO tries to allocate 4 billion bytes (twice):
```c
if (flat->count) {
    flat->blocklens = (ADIO_Offset *) ADIOI_Malloc(flat->count * sizeof(ADIO_Offset));
    flat->indices = (ADIO_Offset *) ADIOI_Malloc(flat->count * sizeof(ADIO_Offset));
}
```
This, on top of the 4-billion-byte allocation from the calling program, means we need a tremendous amount of memory to represent this type.
Here's the dataloop representation. Dataloops (and their ability to process types in a piecewise fashion) are the real answer to this workload.
```
# rank 1
# MPIU_Datatype_debug: MPI_Datatype = 0xcc000002 (derived)
# Size = 4294967328, Extent = 8589934656, LB = 0 (sticky), UB = 8589934656 (sticky), Extent = 8589934656, Element Size = 1 (MPI_BYTE), is not N contig
# Contents:
#   combiner: vector
#     vector ct = 1, blk = 1073741832, str = 1
#   combiner: resized
# Dataloop:
digraph 0xccda18 {
  {
    dl0 [shape = record, label = "contig |{ ct = 1073741832; el_sz = 4; el_ext = 8 }"];
    dl0 -> dl1;
    dl1 [shape = record, label = "blockindexed |{ ct = 1; blk = 1; disps = 4; el_sz = 4; el_ext = 4 }"];
    dl1 -> dl2;
    dl2 [shape = record, label = "contig |{ ct = 4; el_sz = 1; el_ext = 1 }"];
  }
}
```
Originally by robl on 2016-05-16 21:07:54 -0500
Attachment added: bigio_viewonly.c (2.9 KiB)
the bigio test stripped to only set the file view
The size of flat is proportional to the number of contiguous blocks. We should probably set a threshold and use some generic fallback for cases such as the one reported here; a rough sketch of what that guard could look like follows.
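(A hypothetical shape for that guard, wrapped around the allocation quoted above. The threshold constant and the use_piecewise_fallback flag are invented for illustration, not ROMIO code:)

```c
/* Hypothetical guard; names and threshold are invented for illustration */
#define ADIOI_FLATTEN_LIMIT (16 * 1024 * 1024)  /* max blocks to materialize */

if (flat->count > ADIOI_FLATTEN_LIMIT) {
    /* too many blocks to store: take a generic fallback path that
     * produces (offset, length) segments on demand instead */
    use_piecewise_fallback = 1;
} else if (flat->count) {
    flat->blocklens = (ADIO_Offset *) ADIOI_Malloc(flat->count * sizeof(ADIO_Offset));
    flat->indices = (ADIO_Offset *) ADIOI_Malloc(flat->count * sizeof(ADIO_Offset));
}
```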
In 2005 Rob Ross wrote the "mpitypes" code (also known as 'dataloop' code), which did a more clever job of maintaining datatype state and information. Instead of flattening the entire type, the segment-processing code could just return the "next" segments.
I'm not familiar with Yaksa, but surely it's more clever than the flattening code. Rewriting ROMIO to use Yaksa would be a pretty big undertaking.
We'd need a significantly more robust test suite before embarking on any kind of ROMIO reengineering. (That's been on my list for years, too.)
I don't know where we'd put the generic fallback: because MPI decouples the file view and the memory type, and because file views are tiled, it's hard to find a natural place to split this large noncontiguous type into smaller regions.
One peephole optimization for this specific case might be some kind of read-modify-write? I don't know where we'd put a temporary multi-gigabyte file, though.
There is a typerep API (for both dataloop and Yaksa) to get the list of noncontiguous segments in a range as an IOV array. The fallback I have in mind is to call this API on demand rather than statically storing the whole list. It is not efficient, especially in Yaksa, but as a fallback I think it is better than running out of memory. At least it becomes a Yaksa optimization issue.
Currently the typerep APIs are not exposed. We would need to expose them via an MPIX extension if we decide to do this.
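(A rough sketch of that on-demand shape, with stand-in names since the real typerep entry points aren't public; only the windowed-iteration structure is the point here:)

```c
/* Sketch only: type_next_segments() is a stand-in for the unexposed
 * typerep/Yaksa call that fills 'segs' with up to 'max' noncontiguous
 * segments of a datatype, starting at segment index 'from'. */
#include <stddef.h>

typedef struct { size_t offset; size_t len; } seg_t;

extern size_t type_next_segments(void *type, size_t from,
                                 seg_t *segs, size_t max);

void process_filetype(void *filetype,
                      void (*do_io)(size_t offset, size_t len))
{
    enum { CHUNK = 4096 };  /* bounded memory, no matter how big the type */
    seg_t segs[CHUNK];
    size_t from = 0, n;

    /* walk the type one window at a time instead of flattening it all */
    while ((n = type_next_segments(filetype, from, segs, CHUNK)) > 0) {
        for (size_t i = 0; i < n; i++)
            do_io(segs[i].offset, segs[i].len);
        from += n;
    }
}
```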
The datatype in this issue is a contig type with count = 1073741832. For the purposes of a file view, this is equivalent to a single instance of the inner type. I think this can be optimized.
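(A sketch of how that detection could look using the standard MPI_Type_get_envelope/MPI_Type_get_contents introspection. The collapse is valid for a file view because the view tiles the filetype end to end, so contig(count, inner) selects the same bytes as inner alone; the function name is invented:)

```c
#include <mpi.h>

/* If 't' is contig(count, inner), return the inner type: for a tiled
 * file view, contig(count, inner) selects the same bytes as 'inner'.
 * Sketch only, not ROMIO code; caller eventually frees '*inner'. */
int collapse_contig(MPI_Datatype t, MPI_Datatype *inner)
{
    int nints, naddrs, ntypes, combiner;
    MPI_Type_get_envelope(t, &nints, &naddrs, &ntypes, &combiner);
    if (combiner != MPI_COMBINER_CONTIGUOUS)
        return 0;

    int count;
    MPI_Aint addrs[1];  /* unused: the contiguous combiner has no addresses */
    MPI_Type_get_contents(t, 1, 0, 1, &count, addrs, inner);
    return 1;
}
```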
> In 2005 Rob Ross wrote the "mpitypes" code (also known as 'dataloop' code), which did a more clever job of maintaining datatype state and information. Instead of flattening the entire type, the segment-processing code could just return the "next" segments.
> I'm not familiar with Yaksa, but surely it's more clever than the flattening code. Rewriting ROMIO to use Yaksa would be a pretty big undertaking.
With the MPIX_Type_iov extension (#6139), it is not difficult to write dataloop-like code. However, all the current aggregation I/O algorithms are tied to the flatlist (the IOV segment list). It would be a major overhaul to change the code structure -- yes, a big undertaking.
In this particular example, the file type is essentially a huge "type_contig", which can be optimized into an equivalent single element. But potentially there are valid cases with huge, less regular file types that would make this worthwhile to fix. Then again, we haven't heard another complaint in the last 6 years.
Also, the bigio work that this issue originated from is less relevant now that MPI-4 has large-count APIs.
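(For context, a sketch assuming an MPI-4 library: the large-count `_c` constructors let an application describe such a pattern directly, without the derived-type stacking that bigio worked around. Note this is orthogonal to ROMIO's flattening; the internal representation problem above remains.)

```c
#include <mpi.h>

/* MPI-4 sketch: 4-byte blocks on an 8-byte stride, with a count that
 * no longer has to fit in an int. ROMIO still needs a scalable
 * internal representation for this, so this alone doesn't fix the OOM. */
void make_big_column(MPI_Count nblocks, MPI_Datatype *column)
{
    MPI_Type_create_hvector_c(nblocks, 4, 8, MPI_BYTE, column);
    MPI_Type_commit(column);
}
```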