
Problem to handle 4GB file

clmendes opened this issue on Aug 3, 2020 · 7 comments

@kathrynmohror, @adammoody: As I reported in the call today, I'm having trouble with one of the ROMIO tests, large_file.c, so I'm opening this issue to record the problem. The original code is available at

https://github.com/pmodels/mpich/blob/master/src/mpi/romio/test/large_file.c.in

This code works fine with one processor under plain MPI. It also works under Unify when the file is only 2GB, but it fails at 4GB. I'm not sure whether Unify can indeed handle files of size 4GB or larger; would any of you know?

The code has the following structure:

for (i = 0; i < 128; i++) {
    . . .
    MPI_File_write(fh, ...);    /* write 32 MB */
}
MPI_File_get_size(fh, &size);

The "writes" seem to work fine, but after the records for those writes (completing 4GB) I see the following in the client log:

@ unifyfs_intercept_fd() [unifyfs.c:344] Changing fd from exposed 1024 to internal 0
@ invoke_client_sync_rpc() [margo_client.c:561] invoking the sync rpc function in client
@ invoke_client_sync_rpc() [margo_client.c:570] Got response ret=12
@ unifyfs_sync() [unifyfs-fixed.c:206] failed to flush write index to server for gfid=1953131809

Hence, the writes themselves succeed, but the MPI_File_get_size operation fails: judging by the log, it triggers a sync of the client's write index to the server, and it is that flush that returns the error.

This is the test code:

#include "mpi.h"
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>   /* for sleep() */
#ifdef UNIFY
#include <unifyfs.h>
char *filename="ufs:/unifyfs/datafile-u" ;
int ret;
#else
char *filename="datafile-m";
#endif

/* writes a file of size 4 Gbytes and reads it back.
   should be run on one process only*/

#define SIZE (1048576*4)  /* no. of long longs in each write/read */
#define NTIMES 128      /* no. of writes/reads */

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Status status;
    MPI_Offset size;
    long long *buf, i;
    int j, myrank, nranks, len, flag, err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
#ifdef UNIFY
    ret = unifyfs_mount("/unifyfs", myrank, nranks, 0);
    if (ret) {
        printf("[%d] unifyfs_mount failed (return = %d)\n", myrank, ret);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
#endif

    if (nranks != 1) {
        fprintf(stderr, "Run this program on one process only\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    fprintf(stderr,
            "This program creates an 4 Gbyte file. Don't run it if you don't have that much disk space!\n");

    buf = (long long *) malloc(SIZE * sizeof(long long));
    if (!buf) {
        fprintf(stderr, "not enough memory to allocate buffer\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_File_open(MPI_COMM_SELF, filename,
                            MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    for (i = 0; i < NTIMES; i++) {
        for (j = 0; j < SIZE; j++)
            buf[j] = i * SIZE + j;

        err = MPI_File_write(fh, buf, SIZE, MPI_DOUBLE, &status);
        /* MPI_DOUBLE because not all MPI implementations define
         * MPI_LONG_LONG_INT, even though the C compiler supports long long. */
        if (err != MPI_SUCCESS) {
            fprintf(stderr, "MPI_File_write returned error\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
    }

    sleep(2);
    MPI_File_get_size(fh, &size);
    fprintf(stderr, "file size = %lld bytes\n", size);

    MPI_File_seek(fh, 0, MPI_SEEK_SET);

    for (j = 0; j < SIZE; j++)
        buf[j] = -1;

    flag = 0;
    for (i = 0; i < NTIMES; i++) {
        err = MPI_File_read(fh, buf, SIZE, MPI_DOUBLE, &status);
        /* MPI_DOUBLE because not all MPI implementations define
         * MPI_LONG_LONG_INT, even though the C compiler supports long long. */
        if (err != MPI_SUCCESS) {
            fprintf(stderr, "MPI_File_real returned error\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        for (j = 0; j < SIZE; j++)
            if (buf[j] != i * SIZE + j) {
                fprintf(stderr, "error: buf %d is %lld, should be %lld \n", j, buf[j],
                        i * SIZE + j);
                flag = 1;
            }
    }

    if (!flag)
        fprintf(stderr, "Data read back is correct\n");
    MPI_File_close(&fh);

    free(buf);
#ifdef UNIFY
    unifyfs_unmount();
#endif
    MPI_Finalize();
    return 0;
}
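
For reference, the Unify build needs the UnifyFS client library at link time; following the UnifyFS documentation for the gotcha-based client, the link line looks roughly like this ($UNIFYFS_ROOT here is a placeholder for the install prefix):

mpicc -DUNIFY -I$UNIFYFS_ROOT/include large_file.c -o large_file -L$UNIFYFS_ROOT/lib -lunifyfs_gotcha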

clmendes avatar Aug 03 '20 23:08 clmendes

@clmendes What client configuration settings are you using for this test? Specifically, what are the UNIFYFS_LOGIO_SPILL_SIZE and UNIFYFS_LOGIO_SHMEM_SIZE values?

I ask because the default values (1GB and 256MB) will not accommodate a 4GB file.
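
(Those defaults give at most 1 GiB + 256 MiB = 1.25 GiB of total log capacity, well short of the 4GB the test writes.)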

MichaelBrim avatar Aug 05 '20 18:08 MichaelBrim

@MichaelBrim I am testing it with these settings in my job script:

[logio]

export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 \* 1048576)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 4300 \* 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 \* 1048576)
export UNIFYFS_LOGIO_SPILL_DIR=$MY_NODE_LOCAL_STORAGE_PATH

So, trying to get a bit more than 4GB in the shmem size, and no spill file.
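
(A quick sanity check on the arithmetic: 4300 * 1048576 = 4,508,876,800 bytes, about 4.2 GiB, so the shmem region alone should comfortably hold the 4GB of data.)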

clmendes avatar Aug 06 '20 04:08 clmendes

@clmendes , do you see errors reported in the server log?

I see the client is reporting it got a return code of 12 from the server:

@ invoke_client_sync_rpc() [margo_client.c:561] invoking the sync rpc function in client
@ invoke_client_sync_rpc() [margo_client.c:570] Got response ret=12
@ unifyfs_sync() [unifyfs-fixed.c:206] failed to flush write index to server for gfid=1953131809

That maps to ENOMEM:

/usr/include/asm-generic/errno-base.h
#define ENOMEM          12      /* Out of memory */

I'm guessing a likely problem is that we may have exhausted the limit on the slice count when creating key/value pairs to insert into MDHIM, perhaps here? https://github.com/LLNL/UnifyFS/blob/6b422c3954cb988007293cf863f2b974d601cd5c/server/src/unifyfs_request_manager.c#L1523
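
A quick back-of-the-envelope count (a hypothetical standalone check, assuming the default 1 MiB slice size) shows how many slice key/value pairs a 4GB file implies:

#include <stdio.h>

int main(void)
{
    long long file_size  = 4LL * 1024 * 1024 * 1024;   /* the 4 GiB the test writes */
    long long slice_size = 1024 * 1024;                /* assumed default slice size: 1 MiB */

    /* one key/value pair per slice an extent touches */
    printf("slices spanned by the whole file: %lld\n", file_size / slice_size);            /* 4096 */
    printf("slices spanned per 32 MB write:   %lld\n", (32LL * 1024 * 1024) / slice_size); /* 32 */
    return 0;
}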

adammoody avatar Aug 06 '20 05:08 adammoody

@adammoody I do not see a clear error message in the server log. This is what I see near the end of that log, just around the time the code exits (notice the jump in the timestamps):

. . .
2020-08-10T14:26:38 tid=61805 @ rm_cmd_exit() [unifyfs_request_manager.c:1460] unlocking RM[1511587981:2] state
2020-08-10T14:26:38 tid=62167 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2050] RM[1511587981:2] got work
2020-08-10T14:26:38 tid=62167 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2054] unlocking RM[1511587981:2] state
2020-08-10T14:26:38 tid=62167 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2068] request manager thread exiting
2020-08-10T14:26:38 tid=61805 @ rm_cmd_exit() [unifyfs_request_manager.c:1439] locking RM[1511587981:3] state
2020-08-10T14:26:38 tid=61805 @ rm_cmd_exit() [unifyfs_request_manager.c:1460] unlocking RM[1511587981:3] state
2020-08-10T14:26:38 tid=62168 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2050] RM[1511587981:3] got work
2020-08-10T14:26:38 tid=62168 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2054] unlocking RM[1511587981:3] state
2020-08-10T14:26:38 tid=62168 @ rm_delegate_request_thread() [unifyfs_request_manager.c:2068] request manager thread exiting
2020-08-10T14:26:40 tid=61793 @ exit_request() [unifyfs_server.c:152] exit requested
2020-08-10T14:26:40 tid=61793 @ main() [unifyfs_server.c:392] starting service shutdown
. . . 

clmendes avatar Aug 10 '20 22:08 clmendes

@adammoody, as suggested by @kathrynmohror in our latest call, I tested a slightly modified version of this example using two ranks on two processors (on two distinct nodes, hence two Unify servers), such that each rank executed every other write operation and the same data went to the file. To do that, I replaced the original MPI_File_write with MPI_File_write_at, using an explicit offset so the same data would go to the same place in the file. After the execution, I compared the resulting file byte-by-byte with the 4GB file produced under plain MPI, and the contents matched precisely.
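
The modified write loop looked roughly like this (a sketch, not the exact TEST4 code; variable names match the test above, and the file is opened on MPI_COMM_WORLD so both ranks share it):

    for (i = 0; i < NTIMES; i++) {
        if (i % nranks != myrank)
            continue;             /* ranks alternate: each executes every other write */
        for (j = 0; j < SIZE; j++)
            buf[j] = i * SIZE + j;
        /* explicit byte offset places the data exactly where the one-rank version would */
        MPI_Offset offset = i * (MPI_Offset)SIZE * sizeof(long long);
        err = MPI_File_write_at(fh, offset, buf, SIZE, MPI_DOUBLE, &status);
        if (err != MPI_SUCCESS) {
            fprintf(stderr, "MPI_File_write_at returned error\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
    }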

This modified code version worked fine: it produced a 4GB file that was correctly staged out, and from the client log we can see that the two clients took turns in writing to the file.

Thus, the problem seems to exist when one server has to handle alone the 4GB file. (Adam: this is in my subdir TEST4)

I also had another version, under Unify and one processor, that instead of using MPI-IO, uses fopen/fwrite/fclose (subdir TEST3). This also fails when we try to stage out the 4GB file!
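
For reference, the core of that stdio version is just the following (a sketch; the MPI/unifyfs_mount boilerplate is the same as in the test above, and the filename is illustrative):

    long long *buf = malloc(SIZE * sizeof(long long));
    FILE *fp = fopen("/unifyfs/datafile-u", "w");
    for (i = 0; i < NTIMES; i++) {
        for (j = 0; j < SIZE; j++)
            buf[j] = i * SIZE + j;
        fwrite(buf, sizeof(long long), SIZE, fp);   /* write 32 MB */
    }
    fclose(fp);
    free(buf);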

clmendes avatar Aug 19 '20 13:08 clmendes

As verified by @adammoody during the debug session at today's call, the error was caused by the server being unable to handle a 4GB file as a single extent split into slices of size 1MiB. I then added the following configuration setting in the submission script:

UNIFYFS_META_RANGE_SIZE=4000000

The default value for this variable is 1MiB. With the new setting above, the program now works fine, and the file can be retrieved without any problem. The resulting file's contents also match what plain MPI generates without Unify.
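
(For comparison: with the default 1MiB slices, the 4,294,967,296-byte file spans 4096 slices, while with UNIFYFS_META_RANGE_SIZE=4000000 it spans only ceil(4294967296 / 4000000) = 1074.)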

clmendes avatar Aug 20 '20 03:08 clmendes

This past week I finally had a chance to run the large_file test program again on one processor of the LASSEN system (IBM), and it worked just fine, even without any extra setting for UNIFYFS_META_RANGE_SIZE. I used a previous Unify build that I had done a while ago, based on PR#619 and the non-optimized Argobots build.

The Unify-based test is in the shared area /usr/workspace/scr/mendes3/LASSEN/ROMIO-TESTS/UNIFY/Large_file/ (hopefully with group permissions set correctly), but I removed the resulting data file (ofile) because it was big, 4GB. The contents of this data file did match what was produced by a non-Unify execution of the same program, though. The stderr from this execution is in the file 'err'; besides the Unify log entries, it contains these lines indicating a correct execution:

This program creates a 4 Gbyte file. Don't run it if you don't have that much disk space!
file size = 4294967296 bytes
Data read back is correct

clmendes avatar Aug 02 '21 00:08 clmendes