MPI_File_read_at_all doesn't read all values when certain datatypes are used
Background information
While testing a new feature in HDF5, we noticed that Open MPI sometimes fails in a test that generates random I/O patterns in HDF5. I've created a minimal pure-MPI example that reproduces the failure.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Tested on 4.1.1, 4.1.2, 4.1.3, 4.1.4
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From source
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
N/A
Please describe the system on which you are running
Tested on multiple Linux machines, running on a single node.
Details of the problem
The test program here should be run with 3 ranks. It consists of two phases. First, all 3 ranks participate in a collective write, with rank 0 writing 5 integers to the file and ranks 1 and 2 writing nothing. Next, the file is closed and reopened, and all 3 ranks participate in a collective read. Rank 0 reads all 5 integers without constructing a datatype. Rank 2 reads nothing. Rank 1 constructs a datatype using MPI_Type_contiguous, MPI_Type_vector, MPI_Type_create_hindexed, and finally MPI_Type_create_resized to select the middle 3 integers (this series of calls mimics how HDF5 constructs a datatype for hyperslab selections). Ranks 1 and 2 see the expected data, but rank 0 does not read the last integer. The same test passes with MPICH.
I have verified that the file contains the correct data (you can check it with a hex editor). The test also fails if you run it once to create the file, then comment out the write section and run it again, which suggests the problem is on the read side. If the program is run with 2 ranks it passes, even though the rank excluded by doing this (rank 2) does nothing but participate in the collective calls. If rank 2 is modified to read the entire file like rank 0, then both ranks 0 and 2 fail to read the last integer.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <mpi.h>

#define TEST_ERROR                                   \
    do {                                             \
        printf("FAILED at line %d\n", __LINE__);     \
        abort();                                     \
    } while(0)

/* Writes data to a shared file, with the data in the file consisting of 5
 * contiguous integers, then reads it back collectively, and verifies that the
 * data read is correct */
/* Run with 3 ranks */
int main(int argc, char **argv) {
    MPI_Datatype elmt_type;
    MPI_Datatype new_type;
    MPI_Datatype tmp_type;
    MPI_File fh;
    MPI_Status mpi_stat;
    MPI_Count bytes_written;
    MPI_Count bytes_read;
    MPI_Count type_size;
    int buf[5];
    int rank;
    int count;
    int i;
    int ret_val = 0;

    MPI_Init(&argc, &argv);

    if(MPI_SUCCESS != MPI_Comm_rank(MPI_COMM_WORLD, &rank))
        TEST_ERROR;

    if(MPI_SUCCESS != MPI_File_open(MPI_COMM_WORLD, "topenmpi.mpio", MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh))
        TEST_ERROR;

    /* Create type for int */
    if (MPI_SUCCESS != MPI_Type_contiguous((int)sizeof(int), MPI_BYTE, &elmt_type))
        TEST_ERROR;

    /* Fill write buffer */
    for(i = 0; i < 5; i++)
        buf[i] = i + 1;

    /* Set up datatypes for write */
    if(rank == 0) {
        count = 5 * sizeof(int);
    }
    else {
        assert(rank == 1 || rank == 2);
        count = 0;
    }

    /* Perform collective write operation */
    if (MPI_SUCCESS != MPI_File_write_at_all(fh, 0, buf, count, MPI_BYTE, &mpi_stat))
        TEST_ERROR;

    /* Check the number of bytes written */
    if (MPI_SUCCESS != MPI_Get_elements_x(&mpi_stat, MPI_BYTE, &bytes_written))
        TEST_ERROR;
    if(bytes_written != count)
        TEST_ERROR;

    /*-------- CLOSE/REOPEN --------*/
    if(MPI_SUCCESS != MPI_File_close(&fh))
        TEST_ERROR;

    MPI_Barrier(MPI_COMM_WORLD);

    if(MPI_SUCCESS != MPI_File_open(MPI_COMM_WORLD, "topenmpi.mpio", MPI_MODE_RDWR, MPI_INFO_NULL, &fh))
        TEST_ERROR;

    /*-------- READ --------*/
    /* Set up datatypes for read */
    if(rank == 0) {
        new_type = MPI_BYTE;
        count = 5 * sizeof(int);
    }
    else if(rank == 1) {
        int block_len = 1;
        MPI_Aint start_disp = 1 * sizeof(int);

        /* Single block of length 3 (x sizeof(int)) */
        if(MPI_SUCCESS != MPI_Type_vector(1,          /* count */
                                          3,          /* blocklength */
                                          1,          /* stride */
                                          elmt_type,  /* old type */
                                          &tmp_type)) /* new type */
            TEST_ERROR;

        /* Offset the block to the correct location (1 int to the right) */
        if(MPI_SUCCESS != MPI_Type_create_hindexed(1, &block_len, &start_disp, tmp_type, &new_type))
            TEST_ERROR;
        MPI_Type_free(&tmp_type);
        tmp_type = new_type;

        /* Resize the type to cover the entire data block (5 ints) */
        if(MPI_SUCCESS != MPI_Type_create_resized(tmp_type, 0, 5 * sizeof(int), &new_type))
            TEST_ERROR;
        MPI_Type_free(&tmp_type);

        if (MPI_SUCCESS != MPI_Type_commit(&new_type))
            TEST_ERROR;

        count = 1;
    }
    else {
        new_type = MPI_BYTE;
        count = 0 * sizeof(int);
    }

    /* Set file view */
    if (MPI_SUCCESS != MPI_File_set_view(fh, 0, MPI_BYTE, new_type,
                                         "native", MPI_INFO_NULL))
        TEST_ERROR;

    /* Perform collective read operation */
    memset(buf, 0, sizeof(buf));
    if (MPI_SUCCESS != MPI_File_read_at_all(fh, 0, buf, count, new_type, &mpi_stat))
        TEST_ERROR;

    /* Check the number of bytes read */
    if (MPI_SUCCESS != MPI_Get_elements_x(&mpi_stat, new_type, &bytes_read))
        TEST_ERROR;
    if (MPI_SUCCESS != MPI_Type_size_x(new_type, &type_size))
        TEST_ERROR;
    if(bytes_read != count * type_size)
        TEST_ERROR;

    /* Free the derived datatype */
    if(rank == 1)
        MPI_Type_free(&new_type);

    /*-------- VERIFY --------*/
    if(rank == 0) {
        printf("rank 0 expected: {1, 2, 3, 4, 5} read: {%d, %d, %d, %d, %d}\n", buf[0], buf[1], buf[2], buf[3], buf[4]);
        if(buf[0] != 1 || buf[1] != 2 || buf[2] != 3 || buf[3] != 4 || buf[4] != 5) {
            printf("rank 0 verification FAILED\n");
            ret_val = 1;
        }
        else
            printf("rank 0 verification SUCCEEDED\n");
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if(rank == 1) {
        printf("rank 1 expected: {0, 2, 3, 4, 0} read: {%d, %d, %d, %d, %d}\n", buf[0], buf[1], buf[2], buf[3], buf[4]);
        if(buf[0] != 0 || buf[1] != 2 || buf[2] != 3 || buf[3] != 4 || buf[4] != 0) {
            printf("rank 1 verification FAILED\n");
            ret_val = 1;
        }
        else
            printf("rank 1 verification SUCCEEDED\n");
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if(rank == 2) {
        printf("rank 2 expected: {0, 0, 0, 0, 0} read: {%d, %d, %d, %d, %d}\n", buf[0], buf[1], buf[2], buf[3], buf[4]);
        if(buf[0] != 0 || buf[1] != 0 || buf[2] != 0 || buf[3] != 0 || buf[4] != 0) {
            printf("rank 2 verification FAILED\n");
            ret_val = 1;
        }
        else
            printf("rank 2 verification SUCCEEDED\n");
    }

    /*-------- CLOSE --------*/
    if(MPI_SUCCESS != MPI_File_close(&fh))
        TEST_ERROR;

    MPI_Type_free(&elmt_type);

    MPI_Finalize();

    return ret_val;
}
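For anyone trying the reproducer, a typical way to build and run it with 3 ranks looks like the following (the file name testcode.c is only an illustrative assumption):

mpicc -o testcode testcode.c
mpirun -n 3 ./testcode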
Forgot to add: if all ranks read the entire file, the program passes. This implies that the complicated datatype on rank 1 is interfering with I/O on the other ranks (but not on the rank that uses the complicated datatype).
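To make that variant concrete, here is a minimal sketch of the modification as I understand it, reusing fh, buf, mpi_stat, new_type, count and TEST_ERROR from the reproducer above: every rank sets up its read exactly as rank 0 does, so no derived datatype is created anywhere.

/* All ranks read the full 5 integers with a plain MPI_BYTE view;
 * this replaces the per-rank read setup in the reproducer above. */
new_type = MPI_BYTE;
count = 5 * sizeof(int);

if (MPI_SUCCESS != MPI_File_set_view(fh, 0, MPI_BYTE, new_type, "native", MPI_INFO_NULL))
    TEST_ERROR;
if (MPI_SUCCESS != MPI_File_read_at_all(fh, 0, buf, count, new_type, &mpi_stat))
    TEST_ERROR;
/* The rank 1 MPI_Type_free(&new_type) call must also be removed, since
 * new_type is now the predefined MPI_BYTE datatype. */

With this change every rank reads {1, 2, 3, 4, 5}, so the per-rank verification at the end of the reproducer would need to be adjusted accordingly.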
@edgargabriel Can you have a look?
@fortnern thanks for the report and the reproducer!
In the meantime, you can force the use of the ROMIO component instead of the (native) ompio one:
mpirun --mca io ^ompio -n 3 a.out
I will have a look, but it might take a couple of days until I get to it.
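(For completeness: to my knowledge the same component selection can also be made through Open MPI's MCA environment variables, which is handy when the test is launched by a harness that does not expose mpirun options:)

export OMPI_MCA_io="^ompio"
mpirun -n 3 a.out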
Test passes with ROMIO, looks like the problem is with OMPIO. Thanks!
@fortnern thank you for the bug report. I can confirm that it was a bug in ompio, introduced with the data sieving feature in the 4.1 release series. I will file a PR for the 4.1 and 5.0 releases once the code has been committed to the main repository. The 4.0 and earlier ompi releases did not have this feature.
Just for documentation purposes, it is possible to work around this issue in the 4.1 releases by setting the fbtl_posix_read_datasieving MCA parameter to 0, e.g.
mpirun --mca fbtl_posix_read_datasieving 0 -np 3 ./testcode
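As with other MCA parameters, the same setting should also be possible via the corresponding environment variable, and ompi_info can be used to confirm that the parameter exists in a given installation (a sketch, assuming a standard Open MPI install):

export OMPI_MCA_fbtl_posix_read_datasieving=0
ompi_info --all | grep datasieving    # check that the parameter is listed
mpirun -np 3 ./testcode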
@fortnern Looks like the fix was merged into the v4.1.x branch -- could you test the latest v4.1.x snapshot to see if the issue is fixed for you? https://www.open-mpi.org/nightly/v4.1.x/
Hi all, I just tested @fortnern's test program here, as well as some internal test programs, with the v4.1.x-202207280243-88bb627 snapshot. Everything runs successfully for me on a Slackware Linux system, whereas the same programs failed with my 4.1.1 through 4.1.4 installations. We're planning to do some more internal testing, but at first glance it seems @edgargabriel's PR fixed this for us. Thanks!
We've confirmed the test program works on the latest snapshot. Thanks! We'll test the HDF5 branch in question with it soon to verify.
The multi-dataset branch of HDF5 that was failing with 4.1.2 passes with the latest master. Thanks again!