MPI collective I/O and UnifyFS
With the collective write calls in MPI I/O, the MPI library may rearrange data among processes to write to the underlying file more efficiently, as is done in ROMIO's collective buffering. The user does not know which process actually writes to the file, even if they know which process provides the source data and file offset to be written.
An application may be written such that a given process writes twice to the same file offset using collective write calls. Since the same process writes to the same offset, the MPI standard does not require the application to call MPI_File_sync() between those writes. However, depending on the MPI implementation, those actual writes may happen from two different processes.
As an example taken from PnetCDF, it is common to set default values for variables in a file using fill calls and then later write actual data to those variables. The fill calls use collective I/O, whereas the later write call may not. In this case, two different processes can write to the same file offset, one process with the fill value, and a second process with the actual data. In UnifyFS, these two writes need to be separated with a sync-barrier-sync to establish an order between them.
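For concreteness, here is a rough PnetCDF-style sketch of that pattern with an explicit ncmpi_sync() inserted between the fill and the put. Names, sizes, and error handling are made up for illustration; per the PnetCDF documentation quoted later in this thread, ncmpi_sync() calls MPI_File_sync() and MPI_Barrier() internally.

```c
#include <mpi.h>
#include <pnetcdf.h>

/* Illustrative sketch only: fill values for "var" are written during
 * ncmpi_enddef(), and real data is written by a later collective put,
 * possibly from a different rank.  The ncmpi_sync() in between provides
 * the sync-barrier-sync ordering discussed above. */
void fill_then_put(MPI_Comm comm, const char *path)
{
    int ncid, dimid, varid, rank, nprocs, data;
    MPI_Offset start[1], count[1];

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    ncmpi_create(comm, path, NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs, &dimid);
    ncmpi_def_var(ncid, "var", NC_INT, 1, &dimid, &varid);
    ncmpi_def_var_fill(ncid, varid, 0, NULL);  /* enable fill with the default fill value */
    ncmpi_enddef(ncid);                        /* fill values are written here */

    /* workaround: order the fill writes before the conflicting puts */
    ncmpi_sync(ncid);

    start[0] = rank;
    count[0] = 1;
    data     = rank + 1;
    ncmpi_put_vara_int_all(ncid, varid, start, count, &data);

    ncmpi_close(ncid);
}
```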
It may be necessary to ask users to do at least one of the following:
- set UNIFYFS_CLIENT_WRITE_SYNC=1 if using collective write calls (one might still need a barrier after all syncs)
- call MPI_File_sync() + MPI_Barrier() after any collective write call
- disable ROMIO's collective buffering feature (see the sketch after this list)
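For the third option, here is a rough sketch of disabling ROMIO's collective buffering through MPI-IO hints. The romio_cb_write and romio_cb_read hint names are ROMIO-specific, and other MPI-IO implementations may ignore them.

```c
#include <mpi.h>

/* Sketch: open a file with ROMIO collective buffering disabled so that
 * each rank writes its own data directly rather than through aggregators. */
MPI_File open_without_cb(MPI_Comm comm, const char *path)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "disable");
    MPI_Info_set(info, "romio_cb_read",  "disable");

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
    MPI_Info_free(&info);
    return fh;
}
```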
Need to review the MPI standard:
- I don't recall off the top of my head what the standard says about MPI_File_sync in the case that the application knowingly writes to the same file offset from two different ranks using two collective write calls. Is MPI_File_sync needed in between or not?
- I'm pretty sure that MPI_File_sync is not required when the same process writes to the same offset in two different write calls.
Regardless, I suspect very few applications currently call MPI_File_sync in either situation. Even if the standard requires it, we need to call this out.
The UnifyFS-enabled ROMIO could sync extents and then call barrier on its collective write calls. This would ensure all writes are visible upon returning from the collective write.
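As a hypothetical sketch of that idea, here is the same sequence expressed as an application-level wrapper; this is not actual ROMIO code, just the sync-plus-barrier step applied around a collective write.

```c
#include <mpi.h>

/* Hypothetical wrapper, not ROMIO internals: perform a collective write,
 * then sync extents and barrier so the data is visible to all ranks
 * before returning. */
int write_at_all_synced(MPI_File fh, MPI_Offset offset, const void *buf,
                        int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Status status;
    int rc = MPI_File_write_at_all(fh, offset, buf, count, type, &status);
    if (rc != MPI_SUCCESS)
        return rc;
    MPI_File_sync(fh);   /* flush and sync extents */
    MPI_Barrier(comm);   /* all ranks have synced before anyone proceeds */
    return rc;
}
```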
I happen to have this information since my current paper discusses the MPI consistency model.
The MPI standard provides three levels of consistency:
- sequential consistency among all accesses using a single file handle (e.g., only one process accesses the file)
- sequential consistency among all accesses using file handles created from a single collective open with atomic mode enabled
- user-imposed consistency among accesses other than the above
So here we should only be worrying about the third case. In this case, MPI requires a sync-barrier-sync construct between the conflicting writes (from different processes). The construct can be one of the following (a sketch of the first one follows the list):
- MPI_File_sync --> MPI_File_sync
- MPI_File_sync --> MPI_File_open
- MPI_File_close --> MPI_File_sync
- MPI_File_close --> MPI_File_open
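For example, the first construct applied between two collective writes that touch the same offset from different ranks could look like the following sketch (ranks, offsets, and buffers are made up for illustration).

```c
#include <mpi.h>

/* Assumes fh was opened collectively on comm and comm has at least 2 ranks. */
void conflicting_collective_writes(MPI_File fh, MPI_Comm comm, int rank)
{
    MPI_Status status;
    int fill = 0, data = 1;

    /* first collective write: only rank 0 contributes data (think: a fill value) */
    MPI_File_write_at_all(fh, 0, &fill, rank == 0 ? 1 : 0, MPI_INT, &status);

    MPI_File_sync(fh);   /* sync    */
    MPI_Barrier(comm);   /* barrier */
    MPI_File_sync(fh);   /* sync    */

    /* second collective write to the same offset: rank 1 provides the real data */
    MPI_File_write_at_all(fh, 0, &data, rank == 1 ? 1 : 0, MPI_INT, &status);
}
```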
Thanks @wangvsa. So then the app should have the sync-barrier-sync for the first situation above (1), two different procs, but it's not required in (2), same proc. I'm guessing most apps don't have it in either case, and UnifyFS might actually need it for both to work properly.
The apps themselves rarely overwrite the same offset (they rarely perform two collective calls on the same range). It is more likely the high-level libraries that do this. E.g., HDF5 uses collective I/O to update its metadata. This is still not common, though; I have tested several apps using HDF5 and they don't seem to have a consistency issue. I remember I checked HDF5's source code a while ago, and it seems to have adequate MPI_File_sync calls.
Right, hopefully it's not too common, and based on your earlier research we have some confidence in that. A few of the PnetCDF tests I've been running do encounter this kind of condition.
The fill call here conflicts with the put (write) calls later in the program: https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/largefile/high_dim_var.c#L95
The test case reports data corruption under UnifyFS, because on read back, it finds the fill value rather than the expected data. When running with 2 processes, one process writes the fill data and the other writes the actual data.
The fill call here doesn't specify any kind of offset, so in this case, we could argue the PnetCDF user probably should call ncmpi_sync() between the fill call and the later write calls in order to be compliant with the MPI standard. Alternatively, the PnetCDF library itself could be modified to call MPI_File_sync() before it returns from the fill call so that the user doesn't have to worry about it. Subsequent writes might conflict, and it's hard for the PnetCDF user to know, since they often don't deal with file offsets directly.
However, this got me thinking about potential problems with MPI collective I/O more generally.
Edit: Actually, on closer inspection, only rank 0 issues put (write) calls in this particular test case. I think the actual problem is that ranks try to read from the file before any earlier writes have been sync'd. The file should have been closed or sync'd before trying to read back data, I think even by PnetCDF semantics. So perhaps this test case is not really valid.
A second example from PnetCDF is the ncmpi_enddef call here, which writes fill values to the file:
https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/testcases/tst_def_var_fill.c#L62
Later put calls conflict with that fill operation, and the test reports data corruption when using 2 ranks.
A workaround is to call ncmpi_sync() after the ncmpi_enddef() call and before the put calls.
While I'm at it, here are two other test cases I've found so far:
fill calls conflict with later puts: https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/testcases/ivarn.c#L211-L218
implicit fill during enddef and later explicit fill call conflict with later put calls: https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/nonblocking/mcoll_perf.c#L512 https://github.com/Parallel-NetCDF/PnetCDF/blob/e47596438326bfa7b9ed0b3857800d3a0d09ff1a/test/nonblocking/mcoll_perf.c#L521
According to the PnetCDF documentation, "PnetCDF follows the same parallel I/O data consistency as MPI-IO standard". If this is the case, they should either set atomic mode when opening an MPI file or insert enough sync-barrier-sync constructs. Otherwise, I would argue they have consistency issues in their implementation, not just invalid test cases.
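For reference, the atomic-mode option is a single call on the file handle; a minimal sketch with placeholder path and open flags:

```c
#include <mpi.h>

/* Placeholder path/flags: open the file collectively and enable MPI-IO
 * atomic mode, which gives sequential consistency for conflicting accesses
 * through handles from this collective open (at a performance cost). */
MPI_File open_with_atomic_mode(MPI_Comm comm, const char *path)
{
    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_atomicity(fh, 1);
    return fh;
}
```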
The default mode of PnetCDF intentionally does not call MPI_File_sync everywhere since it can be expensive and is not needed on all file systems. I think the NC_SHARE mode is meant to help force things, but it doesn't always work. PnetCDF notes that NC_SHARE calls MPI_File_sync in more cases, but the documentation is not clear about which cases are covered.
https://github.com/Parallel-NetCDF/PnetCDF/blob/master/doc/README.consistency.md#note-on-parallel-io-data-consistency
PnetCDF follows the same parallel I/O data consistency as MPI-IO standard. Refer the URL below for more information. http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node296.htm#Node296
Readers are also referred to the following paper. Rajeev Thakur, William Gropp, and Ewing Lusk, On Implementing MPI-IO Portably and with High Performance, in the Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems, pp. 23-32, May 1999.
If users would like PnetCDF to enforce a stronger consistency, they should add NC_SHARE flag when open/create the file. By doing so, PnetCDF adds MPI_File_sync() after each MPI I/O calls.
- For PnetCDF collective APIs, an MPI_Barrier() will also be called right after MPI_File_sync().
- For independent APIs, there is no need for calling MPI_Barrier(). Users are warned that the I/O performance when using NC_SHARE flag could become significantly slower than not using it.
If NC_SHARE is not set, then users are responsible for their desired data consistency. To enforce a stronger consistency, users can explicitly call ncmpi_sync(). In ncmpi_sync(), MPI_File_sync() and MPI_Barrier() are called.
I did find this in the release notes for v1.2.0:
https://parallel-netcdf.github.io/wiki/NewsArchive.html
- Data consistency control has been revised. A more strict consistency can be enforced by using NC_SHARE mode at the file open/create time. In this mode, the file header is synchronized to the file if its contents have changed. Such file synchronization of calling MPI_File_sync() happens in many places, including ncmpi_enddef(), ncmpi_redef(), all APIs that change global or variable attributes, dimensions, and number of records.
- As calling MPI_File_sync() is very expensive on many file systems, users can choose more relaxed data consistency, i.e. by not using NC_SHARE. In this case, file header is synchronized among all processes in memories. No MPI_File_sync() will be called if header contents have changed. MPI_File_sync() will only be called when switching data mode, i.e ncmpi_begin_indep_data() and ncmpi_end_indep_data().
Setting NC_SHARE helps in some of the test cases that are currently failing, but ivarn.c still fails with 2 ranks on one node, in this case due to the fill calls and subsequent put calls. It seems like it would be helpful to call MPI_File_sync after fill calls when NC_SHARE is set. I think that would fix the failing ivarn.c test case.
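For reference, requesting the stricter consistency mode is just a flag at create/open time; a minimal sketch with placeholder path and flags:

```c
#include <mpi.h>
#include <pnetcdf.h>

/* Placeholder path/flags: request PnetCDF's stricter consistency mode by
 * adding NC_SHARE when the file is created (or opened). */
int create_with_nc_share(MPI_Comm comm, const char *path, int *ncidp)
{
    return ncmpi_create(comm, path, NC_CLOBBER | NC_SHARE,
                        MPI_INFO_NULL, ncidp);
}
```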
This does not directly apply, but I'll just stash this URL about NC_SHARE and nc_sync() from NetCDF (not PnetCDF) for future reference.
https://docs.unidata.ucar.edu/netcdf-c/current/group__datasets.html#gaf2d184214ce7a55b0a50514749221245
I opened a PR for a discussion with the PnetCDF team about calling MPI_File_sync after fill calls when NC_SHARE is set.
https://github.com/Parallel-NetCDF/PnetCDF/pull/107
@adammoody I'm trying to reproduce these conflicts. Which system and MPI implementation were you using?
I did most of the work on quartz, which uses MVAPICH2 as a system MPI library. Actually, I was using a debug build of MVAPICH so that I could trace into the MPI code. I'll send you the steps in an email on how I set things up.
I just tried ivarn and tst_def_var_fill using Open MPI and MPICH. They don't show any conflict on my side; all I/O calls are done internally using MPI_File_write_at_all (eventually only rank 0 does the pwrite()).