Problem handling a 4GB file
@kathrynmohror , @adammoody : As I reported in the call today, I'm having trouble with one of the ROMIO tests, large_file.c, so I'm opening this issue to record the problem. The original code is available at
https://github.com/pmodels/mpich/blob/master/src/mpi/romio/test/large_file.c.in
This code works fine with one processor under plain MPI. It also works under Unify if the file is only 2GB, but it fails at 4GB. I'm not sure whether Unify can indeed handle files of 4GB or larger; would either of you know?
The code has the following structure:
for (i = 0; i < 128; i++) {
    . . .
    MPI_File_write(fh, ...);   /* write 32 MB */
}
MPI_File_get_size(fh, &size);
The "writes" seem to work fine, but after the records for those writes (completing 4GB) I see the following in the client log:
@ unifyfs_intercept_fd() [unifyfs.c:344] Changing fd from exposed 1024 to internal 0
@ invoke_client_sync_rpc() [margo_client.c:561] invoking the sync rpc function in client
@ invoke_client_sync_rpc() [margo_client.c:570] Got response ret=12
@ unifyfs_sync() [unifyfs-fixed.c:206] failed to flush write index to server for gfid=1953131809
Hence, the writes themselves succeed, but the MPI_File_get_size operation fails.
This is the test code:
#include "mpi.h"
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h> /* for sleep() */
#ifdef UNIFY
#include <unifyfs.h>
char *filename = "ufs:/unifyfs/datafile-u";
int ret;
#else
char *filename="datafile-m";
#endif
/* writes a file of size 4 Gbytes and reads it back.
should be run on one process only*/
#define SIZE (1048576*4) /* no. of long longs in each write/read */
#define NTIMES 128 /* no. of writes/reads */
int main(int argc, char **argv)
{
MPI_File fh;
MPI_Status status;
MPI_Offset size;
long long *buf, i;
int j, myrank, nranks, len, flag, err;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &nranks);
#ifdef UNIFY
ret = unifyfs_mount("/unifyfs", myrank, nranks, 0);
if (ret) {
printf("[%d] unifyfs_mount failed (return = %d)\n", myrank, ret);
MPI_Abort(MPI_COMM_WORLD, 1);
}
#endif
if (nranks != 1) {
fprintf(stderr, "Run this program on one process only\n");
MPI_Abort(MPI_COMM_WORLD, 1);
}
fprintf(stderr,
"This program creates an 4 Gbyte file. Don't run it if you don't have that much disk space!\n");
buf = (long long *) malloc(SIZE * sizeof(long long));
if (!buf) {
fprintf(stderr, "not enough memory to allocate buffer\n");
MPI_Abort(MPI_COMM_WORLD, 1);
}
MPI_File_open(MPI_COMM_SELF, filename,
MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
for (i = 0; i < NTIMES; i++) {
for (j = 0; j < SIZE; j++)
buf[j] = i * SIZE + j;
err = MPI_File_write(fh, buf, SIZE, MPI_DOUBLE, &status);
/* MPI_DOUBLE because not all MPI implementations define
* MPI_LONG_LONG_INT, even though the C compiler supports long long. */
if (err != MPI_SUCCESS) {
fprintf(stderr, "MPI_File_write returned error\n");
MPI_Abort(MPI_COMM_WORLD, 1);
}
}
sleep(2);
MPI_File_get_size(fh, &size);
fprintf(stderr, "file size = %lld bytes\n", size);
MPI_File_seek(fh, 0, MPI_SEEK_SET);
for (j = 0; j < SIZE; j++)
buf[j] = -1;
flag = 0;
for (i = 0; i < NTIMES; i++) {
err = MPI_File_read(fh, buf, SIZE, MPI_DOUBLE, &status);
/* MPI_DOUBLE because not all MPI implementations define
* MPI_LONG_LONG_INT, even though the C compiler supports long long. */
if (err != MPI_SUCCESS) {
fprintf(stderr, "MPI_File_real returned error\n");
MPI_Abort(MPI_COMM_WORLD, 1);
}
for (j = 0; j < SIZE; j++)
if (buf[j] != i * SIZE + j) {
fprintf(stderr, "error: buf %d is %lld, should be %lld \n", j, buf[j],
i * SIZE + j);
flag = 1;
}
}
if (!flag)
fprintf(stderr, "Data read back is correct\n");
MPI_File_close(&fh);
free(buf);
#ifdef UNIFY
unifyfs_unmount();
#endif
MPI_Finalize();
return 0;
}
@clmendes What client configuration settings are you using for this test? Specifically, what are the UNIFYFS_LOGIO_SPILL_SIZE and UNIFYFS_LOGIO_SHMEM_SIZE values?
I ask because the default values (1GB and 256MB) will not accommodate a 4GB file.
@MichaelBrim I am testing it with these settings in my job script:
[logio]
export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 \* 1048576)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 4300 \* 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 \* 1048576)
export UNIFYFS_LOGIO_SPILL_DIR=$MY_NODE_LOCAL_STORAGE_PATH
So, trying to get a bit more than 4GB in the shmem size, and no spill file.
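For reference, a rough capacity check. It assumes (my assumption, not verified against the UnifyFS internals) that the usable log capacity is approximately UNIFYFS_LOGIO_SHMEM_SIZE plus UNIFYFS_LOGIO_SPILL_SIZE:

#include <stdio.h>

/* Rough check only: assumes usable log capacity ~= shmem size + spill size. */
int main(void)
{
    const long long MiB = 1048576LL;
    long long file_size   = 128LL * 32 * MiB;         /* 128 writes of 32 MiB = 4 GiB */
    long long default_cap = 256 * MiB + 1024 * MiB;   /* default shmem + spill */
    long long job_cap     = 4300 * MiB;               /* shmem from the job script, no spill */

    printf("file size   = %lld bytes\n", file_size);
    printf("default cap = %lld bytes (%s)\n", default_cap,
           default_cap >= file_size ? "fits" : "does not fit");
    printf("job cap     = %lld bytes (%s)\n", job_cap,
           job_cap >= file_size ? "fits" : "does not fit");
    return 0;
}

So the data itself should fit with these settings, which is consistent with the writes succeeding.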
@clmendes , do you see errors reported in the server log?
I see the client is reporting it got a return code of 12 from the server:
@ invoke_client_sync_rpc() [margo_client.c:561] invoking the sync rpc function in client
@ invoke_client_sync_rpc() [margo_client.c:570] Got response ret=12
@ unifyfs_sync() [unifyfs-fixed.c:206] failed to flush write index to server for gfid=1953131809
That maps to ENOMEM:
/usr/include/asm-generic/errno-base.h
#define ENOMEM 12 /* Out of memory */
I'm guessing a likely problem is that we may have exhausted the limit on the slice count when creating key/value pairs to insert into MDHIM, perhaps here? https://github.com/LLNL/UnifyFS/blob/6b422c3954cb988007293cf863f2b974d601cd5c/server/src/unifyfs_request_manager.c#L1523
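If that guess is right, the numbers are suggestive: a single 4 GiB extent split into 1 MiB slices yields 4096 key/value pairs. A minimal sketch of that slice count (illustrative only; the actual splitting logic is in the server code linked above):

#include <stdio.h>

/* Illustrative: how many slice-sized keys an extent of `length` bytes
 * starting at `offset` would generate, for a given slice size. */
static long long slice_count(long long offset, long long length, long long slice_size)
{
    long long first = offset / slice_size;
    long long last  = (offset + length - 1) / slice_size;
    return last - first + 1;
}

int main(void)
{
    long long four_gib = 4LL * 1024 * 1024 * 1024;
    /* one 4 GiB extent, 1 MiB slices -> 4096 keys */
    printf("keys for a 4 GiB extent with 1 MiB slices: %lld\n",
           slice_count(0, four_gib, 1048576LL));
    return 0;
}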
@adammoody I do not see a clear error message in the server log. This is what I see near the end of that log, just around the time when the code exits (notice the jump in the timestamps):
. . .
2020-08-10T14:26:38 tid=61805 @ rm_cmd_exit() [unifyfs_request_manager.c:1460] unlocking RM[1511587981:2] state
2020-08-10T14:26:38 tid=62167 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2050] RM[1511587981:2] got work
2020-08-10T14:26:38 tid=62167 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2054] unlocking RM[1511587981:2] state
2020-08-10T14:26:38 tid=62167 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2068] request manager thread exiting
2020-08-10T14:26:38 tid=61805 @ rm_cmd_exit() [unifyfs_request_manager.c:1439] locking RM[1511587981:3] state
2020-08-10T14:26:38 tid=61805 @ rm_cmd_exit() [unifyfs_request_manager.c:1460] unlocking RM[1511587981:3] state
2020-08-10T14:26:38 tid=62168 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2050] RM[1511587981:3] got work
2020-08-10T14:26:38 tid=62168 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2054] unlocking RM[1511587981:3] state
2020-08-10T14:26:38 tid=62168 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2068] request manager thread exiting
2020-08-10T14:26:40 tid=61793 @ exit_request() [unifyfs_server.c:152] exit requested
2020-08-10T14:26:40 tid=61793 @ main() [unifyfs_server.c:392] starting service shutdown
. . .
@adammoody , as suggested by @kathrynmohror in our latest call, I tested a slightly modified version of this example using two ranks on two processors (in two distinct nodes, hence two Unify servers), such that each rank executes every other write operation. To do that, I replaced the original MPI_File_write with MPI_File_write_at, using an explicit offset so that each block lands in the same place in the file and the contents are unchanged. After the execution, I compared the resulting file byte-by-byte with the 4GB file produced with plain MPI, and the contents matched precisely.
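Roughly, the write loop of the test above would be replaced with something like the sketch below (illustrative; the actual TEST4 code may differ, the file has to be opened on MPI_COMM_WORLD so both ranks share it, and the single-rank check has to be removed):

/* Each rank writes blocks myrank, myrank+2, myrank+4, ... at explicit
 * offsets, so the file contents match the single-rank version exactly. */
MPI_Offset offset;
for (i = myrank; i < NTIMES; i += nranks) {
    for (j = 0; j < SIZE; j++)
        buf[j] = i * SIZE + j;
    offset = (MPI_Offset) i * SIZE * sizeof(long long);
    err = MPI_File_write_at(fh, offset, buf, SIZE, MPI_DOUBLE, &status);
    if (err != MPI_SUCCESS) {
        fprintf(stderr, "MPI_File_write_at returned error\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}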
This modified code version worked fine: it produced a 4GB file that was correctly staged out, and from the client log we can see that the two clients took turns in writing to the file.
Thus, the problem seems to exist when one server has to handle the 4GB file alone. (Adam: this is in my subdir TEST4)
I also had another version, under Unify on one processor, that uses fopen/fwrite/fclose instead of MPI-IO (subdir TEST3). This also fails when we try to stage out the 4GB file!
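For reference, a stdio-based variant could look roughly like the sketch below (illustrative only; the actual TEST3 code may differ, and the MPI/unifyfs_mount setup from the MPI-IO version is omitted):

#include <stdio.h>
#include <stdlib.h>

#define SIZE   (1048576 * 4)   /* long longs per write */
#define NTIMES 128             /* 128 * 32 MiB = 4 GiB */

int main(void)
{
    long long *buf = malloc(SIZE * sizeof(long long));
    FILE *fp = fopen("/unifyfs/datafile-u", "w");
    if (!buf || !fp) {
        perror("setup");
        return 1;
    }
    for (long long i = 0; i < NTIMES; i++) {
        for (long long j = 0; j < SIZE; j++)
            buf[j] = i * SIZE + j;
        if (fwrite(buf, sizeof(long long), SIZE, fp) != SIZE) {
            perror("fwrite");
            return 1;
        }
    }
    fclose(fp);
    free(buf);
    return 0;
}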
As verified by @adammoody during the debug session at today's call, the error was caused by the server being unable to handle a file with a single 4GB extent split into slices of size 1MiB. I then added the following configuration setting in the submission script:
UNIFYFS_META_RANGE_SIZE=4000000
The default value for this variable is 1MiB. With this new setting above, the program now works fine, and the file can be retrieved without any problem. Also, the resulting file has contents that match what is generated by plain MPI, without Unify.
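Plugging the new value into the same slice-count arithmetic sketched earlier (illustrative only), the number of keys for a single 4 GiB extent drops from 4096 to about 1074:

#include <stdio.h>

int main(void)
{
    long long four_gib = 4LL * 1024 * 1024 * 1024;
    /* keys ~= ceiling(extent length / slice size) for an extent at offset 0 */
    printf("1 MiB slices       : %lld keys\n", (four_gib + 1048576 - 1) / 1048576);  /* 4096 */
    printf("4000000-byte slices: %lld keys\n", (four_gib + 4000000 - 1) / 4000000);  /* 1074 */
    return 0;
}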
This past week I finally had a chance to run the large_file test program again on one processor of the LASSEN system (IBM), and it worked just fine, even without any extra settings for UNIFYFS_META_RANGE_SIZE. I used a previous Unify build that I had done a while ago, based on PR#619 and the non-optimized Argobots build.
The Unify-based test is in the shared area /usr/workspace/scr/mendes3/LASSEN/ROMIO-TESTS/UNIFY/Large_file/ (hopefully with group permissions set correctly), but I removed the resulting data file (ofile) because it was big, 4GB. The contents of that data file did match what was produced by a non-Unify execution of the same program, though. The stderr from this execution is in the file 'err'; besides the Unify log entries, it contains these lines that seem to indicate a correct execution:
This program creates an 4 Gbyte file. Don't run it if you don't have that much disk space!
file size = 4294967296 bytes
Data read back is correct