bug: out of memory with "big I/O"
Originally by robl on 2014-10-01 20:04:45 -0500
[email protected] reports to me the following:
I finally got a test machine to try to get the bigio branch into shape to be merged into trunk, at least.
In HDF5, when I have processes accessing a big column in a dataset, I see an OOM failure. I traced it down to MPI_File_set_view().
I do not see a problem with big rows, but that is because the HDF5 hyperslab algorithm that transforms them into derived datatypes flattens out the 2nd dimension.
I replicated what the algorithm does in constructing the file type for a 2D dataset with 2 big columns, to trigger the OOM in set_view.
Running the program with 2 procs should replicate the problem.
mpirun -np 2 ./bigio_mpi test_file
This is with MPICH 3.1.2.
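(For reference, a minimal sketch of such a reproducer, reconstructed from the dataloop dump further down. The attached bigio_mpi.c is the authoritative test; everything here beyond the block count and element sizes is illustrative.)

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Datatype contig, elem, column;
    MPI_File fh;
    int rank;
    /* ~2^30 four-byte blocks, matching the dataloop dump below */
    const int nblocks = 1073741832;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* a 4-byte element with an 8-byte extent: reading one column of a
     * two-column dataset means skipping the other column's 4 bytes */
    MPI_Type_contiguous(4, MPI_BYTE, &contig);
    MPI_Type_create_resized(contig, 0, 8, &elem);
    /* vector(1, nblocks, 1) matches the "combiner: vector" in the dump:
     * one enormous column, >4 GiB in ~2^30 noncontiguous pieces */
    MPI_Type_vector(1, nblocks, 1, elem, &column);
    MPI_Type_commit(&column);

    MPI_File_open(MPI_COMM_WORLD, argv[1],
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    /* ROMIO flattens 'column' into per-block (offset, length) arrays
     * here, which is where the OOM occurs */
    MPI_File_set_view(fh, (MPI_Offset) rank * 4, MPI_BYTE, column,
                      "native", MPI_INFO_NULL);

    MPI_File_close(&fh);
    MPI_Type_free(&column);
    MPI_Type_free(&elem);
    MPI_Type_free(&contig);
    MPI_Finalize();
    return 0;
}
```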
Originally by robl on 2014-11-25 16:27:24 -0600
Attachment added: bigio_mpi.c (4.3 KiB)
Updated test case to clean up all resources. The last version had a too-small buffer.
Originally by robl on 2014-11-26 14:33:36 -0600
This is going to be a tough one for ROMIO: the flattened representation is larger than the type itself. To describe the flattened representation, ROMIO tries to allocate 4 billion bytes (twice):
```c
if (flat->count) {
    flat->blocklens = (ADIO_Offset *) ADIOI_Malloc(flat->count * sizeof(ADIO_Offset));
    flat->indices = (ADIO_Offset *) ADIOI_Malloc(flat->count * sizeof(ADIO_Offset));
}
```
This, on top of the 4-billion-byte allocation from the calling program, means we need a tremendous amount of memory to represent this type.
Here's the dataloop representation. Dataloops (and their ability to process types in a piecewise fashion) are the real answer to this workload.
```
# rank 1
# MPIU_Datatype_debug: MPI_Datatype = 0xcc000002 (derived)
# Size = 4294967328, Extent = 8589934656, LB = 0 (sticky), UB = 8589934656 (sticky), Extent = 8589934656, Element Size = 1 (MPI_BYTE), is not N contig
# Contents:
#   combiner: vector
#     vector ct = 1, blk = 1073741832, str = 1
#   combiner: resized
# Dataloop:
digraph 0xccda18 {
  {
    dl0 [shape = record, label = "contig |{ ct = 1073741832; el_sz = 4; el_ext = 8 }"];
    dl0 -> dl1;
    dl1 [shape = record, label = "blockindexed |{ ct = 1; blk = 1; disps = 4; el_sz = 4; el_ext = 4 }"];
    dl1 -> dl2;
    dl2 [shape = record, label = "contig |{ ct = 4; el_sz = 1; el_ext = 1 }"];
  }
}
```
Originally by robl on 2016-05-16 21:07:54 -0500
Attachment added: bigio_viewonly.c (2.9 KiB)
the bigio test stripped to only set the file view
The size of flat is proportional to the number of contiguous blocks. We should probably set a threshold and use some generic fallback for cases such as the one reported here; a rough sketch of what that guard could look like follows.
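(A hypothetical shape for that guard, wrapped around the allocation quoted above. The threshold constant and the use_piecewise_fallback flag are invented for illustration, not ROMIO code:)

```c
/* Hypothetical guard; names and threshold are invented for illustration */
#define ADIOI_FLATTEN_LIMIT (16 * 1024 * 1024)  /* max blocks to materialize */

if (flat->count > ADIOI_FLATTEN_LIMIT) {
    /* too many blocks to store: take a generic fallback path that
     * produces (offset, length) segments on demand instead */
    use_piecewise_fallback = 1;
} else if (flat->count) {
    flat->blocklens = (ADIO_Offset *) ADIOI_Malloc(flat->count * sizeof(ADIO_Offset));
    flat->indices = (ADIO_Offset *) ADIOI_Malloc(flat->count * sizeof(ADIO_Offset));
}
```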
In 2005 Rob Ross wrote the "mpitypes" code (also known as 'dataloop' code), which did a more clever job of maintaining datatype state and information. Instead of flattening the entire type, the segment-processing code could just return the "next" segments.
I'm not familiar with Yaksa, but surely it's more clever than the flattening code. Rewriting ROMIO to use Yaksa would be a pretty big undertaking.
We'd need a significantly more robust test suite before embarking on any kind of ROMIO reengineering. (That's been on my list for years, too.)
I don't know where we'd put the generic fallback: because MPI decouples the file view and the memory type, and because file views are tiled, it's hard to find a natural place to split this large noncontiguous type into smaller regions.
One peephole optimization for this specific case might be some kind of read-modify-write? I don't know where we'd put a temporary multi-gigabyte file, though.
There is a typerep API (for both dataloop and Yaksa) to get the list of noncontiguous segments in a range as an IOV array. The fallback I have in mind is to call this API on demand rather than statically storing the whole list. It is not efficient, especially in Yaksa, but as a fallback I think it is better than running out of memory. At least it becomes a Yaksa optimization issue.
Currently the typerep APIs are not exposed. We would need to expose them via an MPIX extension if we decide to do this.
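(A rough sketch of that on-demand shape, with stand-in names since the real typerep entry points aren't public; only the windowed-iteration structure is the point here:)

```c
/* Sketch only: type_next_segments() is a stand-in for the unexposed
 * typerep/Yaksa call that fills 'segs' with up to 'max' noncontiguous
 * segments of a datatype, starting at segment index 'from'. */
#include <stddef.h>

typedef struct { size_t offset; size_t len; } seg_t;

extern size_t type_next_segments(void *type, size_t from,
                                 seg_t *segs, size_t max);

void process_filetype(void *filetype,
                      void (*do_io)(size_t offset, size_t len))
{
    enum { CHUNK = 4096 };  /* bounded memory, no matter how big the type */
    seg_t segs[CHUNK];
    size_t from = 0, n;

    /* walk the type one window at a time instead of flattening it all */
    while ((n = type_next_segments(filetype, from, segs, CHUNK)) > 0) {
        for (size_t i = 0; i < n; i++)
            do_io(segs[i].offset, segs[i].len);
        from += n;
    }
}
```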
The datatype in this issue is a contig type with count = 1073741832. For the purposes of a file view, this is equivalent to a single instance of the inner type. I think this can be optimized.
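(A sketch of how that detection could look using the standard MPI_Type_get_envelope/MPI_Type_get_contents introspection. The collapse is valid for a file view because the view tiles the filetype end to end, so contig(count, inner) selects the same bytes as inner alone; the function name is invented:)

```c
#include <mpi.h>

/* If 't' is contig(count, inner), return the inner type: for a tiled
 * file view, contig(count, inner) selects the same bytes as 'inner'.
 * Sketch only, not ROMIO code; caller eventually frees '*inner'. */
int collapse_contig(MPI_Datatype t, MPI_Datatype *inner)
{
    int nints, naddrs, ntypes, combiner;
    MPI_Type_get_envelope(t, &nints, &naddrs, &ntypes, &combiner);
    if (combiner != MPI_COMBINER_CONTIGUOUS)
        return 0;

    int count;
    MPI_Aint addrs[1];  /* unused: the contiguous combiner has no addresses */
    MPI_Type_get_contents(t, 1, 0, 1, &count, addrs, inner);
    return 1;
}
```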
> In 2005 Rob Ross wrote the "mpitypes" code (also known as 'dataloop' code), which did a more clever job of maintaining datatype state and information. Instead of flattening the entire type, the segment-processing code could just return the "next" segments.
> I'm not familiar with Yaksa, but surely it's more clever than the flattening code. Rewriting ROMIO to use Yaksa would be a pretty big undertaking.
With the MPIX_Type_iov extension (#6139), it is not difficult to write dataloop-like code. However, all the current aggregation I/O algorithms are tied to the flatlist (the IOV segment list). It would be a major overhaul to change the code structure -- yes, a big undertaking.
In this particular example, the file type is essentially a huge "type_contig", which can be optimized into an equivalent single element. But potentially there are valid cases with huge, less regular file types that would make this worthwhile to fix. Then again, we haven't heard another complaint in the last 6 years.
Also, the bigio work that this issue originated from is less relevant now that MPI-4 has large-count APIs.
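(For context, a sketch assuming an MPI-4 library: the large-count `_c` constructors let an application describe such a pattern directly, without the derived-type stacking that bigio worked around. Note this is orthogonal to ROMIO's flattening; the internal representation problem above remains.)

```c
#include <mpi.h>

/* MPI-4 sketch: 4-byte blocks on an 8-byte stride, with a count that
 * no longer has to fit in an int. ROMIO still needs a scalable
 * internal representation for this, so this alone doesn't fix the OOM. */
void make_big_column(MPI_Count nblocks, MPI_Datatype *column)
{
    MPI_Type_create_hvector_c(nblocks, 4, 8, MPI_BYTE, column);
    MPI_Type_commit(column);
}
```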