
Example to create and access a PHDF5 dataset under Unify

clmendes opened this issue 5 years ago · 17 comments

I'm creating this example to document a simple test that creates and accesses a small dataset with Parallel HDF5 under Unify.

The original example is taken from the HDF5 website: https://support.hdfgroup.org/ftp/HDF5/examples/parallel/Dataset.c

Based on that example, the changes made for Unify are:

a) Change the filename from SDS.h5 to ufs:/unifyfs/SDS.h5
b) Call a function that mounts /unifyfs right after MPI_Init
c) Call unifyfs_unmount() at the end

On Quartz, after "module load hdf5-parallel", I build this example with

h5pcc -o prog-gotcha mpi_prog.c -I${UNIFYFS}/include -L${UNIFYFS}/lib -lunifyfs_gotcha ${UNIFYFS}/lib64/libgotcha.so

This example, run with 8 processes on 2 nodes, used to work fine with the "new-margotree" branch of Unify. It no longer works with the recent dev version (as of Nov. 20, 2020); it only works on one node.

This is the modified source code:

/*
 *  This example writes data to the HDF5 file.
 *  Number of processes is assumed to be 1 or multiples of 2 (up to 8)
 */

#include "hdf5.h"
#include "stdlib.h"

#define H5FILE_NAME     "ufs:/unifyfs/SDS.h5"
#define DATASETNAME     "IntArray"
#define NX     8                      /* dataset dimensions */
#define NY     5
#define RANK   2

int
main (int argc, char **argv)
{
    /*
     * HDF5 APIs definitions
     */
    hid_t       file_id, dset_id;         /* file and dataset identifiers */
    hid_t       filespace;      /* file and memory dataspace identifiers */
    hsize_t     dimsf[] = {NX, NY};                 /* dataset dimensions */
    int         *data;                    /* pointer to data buffer to write */
    hid_t       plist_id;                 /* property list identifier */
    int         i;
    herr_t      status;

    /*
     * MPI variables
     */
    int mpi_size, mpi_rank;
    MPI_Comm comm  = MPI_COMM_WORLD;
    MPI_Info info  = MPI_INFO_NULL;

    /*
     * Initialize MPI
     */
    MPI_Init(&argc, &argv);
    my_unify_mount();    /* mount UnifyFS right after MPI_Init */
    MPI_Comm_size(comm, &mpi_size);
    MPI_Comm_rank(comm, &mpi_rank);

    /*
     * Initialize data buffer
     */
    data = (int *) malloc(sizeof(int)*dimsf[0]*dimsf[1]);
    for (i=0; i < dimsf[0]*dimsf[1]; i++) {
        data[i] = i;
    }
    /*
     * Set up file access property list with parallel I/O access
     */
    plist_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(plist_id, comm, info);

    /*
     * Create a new file collectively and release property list identifier.
     */
    file_id = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
    H5Pclose(plist_id);


    /*
     * Create the dataspace for the dataset.
     */
    filespace = H5Screate_simple(RANK, dimsf, NULL);

    /*
     * Create the dataset with default properties and close filespace.
     */
    dset_id = H5Dcreate(file_id, DATASETNAME, H5T_NATIVE_INT, filespace,
                        H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    /*
     * Create property list for collective dataset write.
     */
    plist_id = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

    /*
     * To write dataset independently use
     *
     * H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);
     */

    status = H5Dwrite(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                      plist_id, data);
    free(data);

    /*
     * Close/release resources.
     */
    H5Dclose(dset_id);
    H5Sclose(filespace);
    H5Pclose(plist_id);
    H5Fclose(file_id);
    unifyfs_unmount();
    MPI_Finalize();

    return 0;
}

/* --------------------------------------------------------------- */

int my_unify_mount(void)
{
    int ret = 0, rank, nranks;

    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    ret = unifyfs_mount("/unifyfs", rank, nranks, 0);
    if (ret) {
        printf("[%d] unifyfs_mount failed (return = %d)\n", rank, ret);
        exit(-1);
    }

    return ret;
}

/* --------------------------------------------------------------- */

clmendes avatar Nov 23 '20 19:11 clmendes

@clmendes , after you gather the logs and add those, I'll work with you to get a debug build of MPI. That would be especially good if you have a corresponding MPI I/O problem that fails in a similar way.

Each time we hit a problem like this, it's a good opportunity to use TotalView for debugging. It's a bit of a learning curve, but getting good with that tool has a huge future payoff.

On a related note, were you building HDF from source before? I'd like to start putting together the commands needed to build HDF in full debug mode (-g -O0).

adammoody avatar Nov 23 '20 21:11 adammoody

@CamStan , we should keep spinning you up on using TotalView as a go-to tool, as well.

adammoody avatar Nov 23 '20 21:11 adammoody

@adammoody , I was not using my own HDF5 build; I've been using the system version (from module hdf5-parallel). Hence my comment that reproducing this in the tests by @CamStan would be a good way to check whether the problem is my environment (e.g. my UnifyFS build) or not -- Cameron can use the same HDF5 version, thus eliminating one source of differences between us.

Attachments: clients-log.txt, server-log-1.txt, server-log-0.txt

I just ran again with P=2 on 2 Quartz nodes, and the problem persists: for this code, I get a crash that seems to be in one of the clients (see clients-log.txt, server-log-0.txt, and server-log-1.txt).

-rw------- 1 mendes3 mendes3 59887616 Nov 23 13:19 quartz12-progdataset-got-44337.core

These are the env-vars that I was using:

export UNIFYFS_DAEMONIZE=off
# [log]
export UNIFYFS_LOG_DIR=$jobdir/logs
export UNIFYFS_LOG_VERBOSITY=5
# [logio]
export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 \* 65536)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 64 \* 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 \* 1048576)
export UNIFYFS_LOGIO_SPILL_DIR=$MY_NODE_LOCAL_STORAGE_PATH
# [meta]
export UNIFYFS_META_DB_PATH=$MY_NODE_LOCAL_STORAGE_PATH
# [runstate]
export UNIFYFS_RUNSTATE_DIR=$MY_NODE_LOCAL_STORAGE_PATH
# [sharedfs]
export UNIFYFS_SHAREDFS_DIR=$mydir

I'm attaching to this entry the files for the two server logs and for the client log (the client log indeed has a bunch of error messages from HDF5 too).

clmendes avatar Nov 23 '20 22:11 clmendes

And as a reference, when I build the same program with the previous 'new-margotree' branch of UnifyFS, everything works fine. This is the client log (good-client-log.txt) that is obtained, using a job script similar to the previous one, i.e. running with 2 processes on 2 Quartz nodes.

good-client-log.txt

clmendes avatar Nov 23 '20 22:11 clmendes

Related issue, but not the cause. I stumbled into a deadlock while testing for this problem: https://github.com/HDFGroup/hdf5/issues/118

adammoody avatar Nov 24 '20 21:11 adammoody

There also seems to be a bug in ROMIO in adio/common/ad_resize.c:

void ADIOI_GEN_Resize(ADIO_File fd, ADIO_Offset size, int *error_code)
{
    int err, rank;
    static char myname[] = "ADIOI_GEN_RESIZE";

    MPI_Comm_rank(fd->comm, &rank);

    /* first aggregator performs ftruncate() */
    if (rank == fd->hints->ranklist[0]) {
        ADIOI_Assert(size == (off_t) size);
        err = ftruncate(fd->fd_sys, (off_t) size);
    }

    /* bcast return value */
    MPI_Bcast(&err, 1, MPI_INT, fd->hints->ranklist[0], fd->comm);

    /* --BEGIN ERROR HANDLING-- */
    if (err == -1) {
        *error_code = ADIOI_Err_create_code(myname, fd->filename, errno);
        return;
    }
    /* --END ERROR HANDLING-- */

    *error_code = MPI_SUCCESS;
}

In the above, ftruncate returns -1, so that after the MPI_Bcast, err=-1 on all ranks. However, error_code gets set to a non-zero value only on rank 0, while it is set to 0 on all other ranks. That's because errno=2 on rank 0 but errno=0 on the other ranks.

Perhaps add a line to bcast the value of errno?

    if (err == -1) {
        MPI_Bcast(&errno, 1, MPI_INT, fd->hints->ranklist[0], fd->comm);
        *error_code = ADIOI_Err_create_code(myname, fd->filename, errno);
        return;
    }

Opened PR for MPICH here: https://github.com/pmodels/mpich/pull/4939

The MPICH PR has been merged.

adammoody avatar Nov 25 '20 00:11 adammoody

I found that ftruncate was returning error code 2 (ENOENT) from the server. Long story short: it looks like our tree-based truncate is using mismatched types for its input arguments. The root of the tree fills out a truncate_bcast_in_t struct with what appear to be all correct values. However, the handler on the child process receives it as a truncate_in_t struct. It then misinterprets the gfid field and gets the value 0 instead of the actual gfid of the file. The child then immediately returns ENOENT because it doesn't have any file entry for gfid=0.
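To illustrate the failure mode, here is a small stand-alone C sketch. The struct names and layouts below are simplified stand-ins, not the actual UnifyFS/Mercury definitions; the point is only that decoding a message with the wrong input struct shifts every field, so the receiver reads the leading field where it expects the gfid:

/* Illustration only: two RPC input structs with different leading fields.
 * If the sender packs one type and the receiver unpacks the other,
 * the fields land in the wrong place. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef struct {
    int32_t  root;      /* rank of the broadcast-tree root */
    int32_t  gfid;      /* global file id */
    uint64_t filesize;  /* new size */
} bcast_in_t;           /* hypothetical stand-in for truncate_bcast_in_t */

typedef struct {
    int32_t  gfid;      /* global file id */
    uint64_t filesize;  /* new size */
} p2p_in_t;             /* hypothetical stand-in for truncate_in_t */

int main(void)
{
    bcast_in_t sent = { .root = 0, .gfid = 1234, .filesize = 2208 };
    p2p_in_t   recv;

    /* the child decodes the incoming bytes with the wrong struct type */
    memcpy(&recv, &sent, sizeof(recv));

    /* recv.gfid picks up 'root' (0) instead of the real gfid (1234) */
    printf("expected gfid 1234, decoded gfid %d\n", (int) recv.gfid);
    return 0;
}

Running this prints a decoded gfid of 0, which matches the gfid=0 the child server handler was seeing.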

adammoody avatar Nov 25 '20 05:11 adammoody

I think this fixes the UnifyFS bug: https://github.com/LLNL/UnifyFS/pull/578

We were invoking the wrong truncate rpc when calling from a parent to a child in the broadcast tree.

adammoody avatar Nov 25 '20 05:11 adammoody

Just FYI, I was able to reproduce the truncate failure running the HDF5 example on Summit.

MichaelBrim avatar Nov 25 '20 16:11 MichaelBrim

@clmendes , so far my 4 process runs are working on catalyst. I've tried both 4 procs on 2 nodes and 4 procs on 4 nodes. There must be some other environmental difference between us.

adammoody avatar Nov 25 '20 17:11 adammoody

I think I have a lead on why WRITE_SYNC=1 was helping before. There is no explicit sync in this program. In fact, by placing a breakpoint in our sync wrappers, I can verify that it never calls fsync when WRITE_SYNC=0.

The single H5Dwrite call in the program causes rank 0 to write 80 bytes starting at offset 2048, and rank 1 writes 80 bytes starting at offset 2128. So rank 1 writes the last bytes of the file.

Then during the H5Fclose call, each process issues three more write operations, but all of those are at lower offsets. After those writes, but still while in H5Fclose, rank 0 queries the file size, expecting the file to be 2128+80=2208 bytes. Because there was no sync, it does not see the trailing bytes that were written by rank 1. Instead it sees the file as 2048+80=2128 bytes long, which accounts only for the highest extent that rank 0 itself wrote. Because of that mismatch, HDF then calls MPI_File_set_size to force the file back to the expected size of 2208 bytes. It's the corresponding ftruncate that then fails, because of the bug we had in UnifyFS. Bugs in ROMIO and HDF then cause things to go astray when trying to handle the failed ftruncate call, which eventually leads to the HDF crash.

When setting WRITE_SYNC=1, rank 0 does see the expected file size of 2208 bytes, since rank 1 flushed its data. In this case, HDF skips the call to MPI_File_set_size, which avoids the buggy truncate path.
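For reference, here is a minimal stand-alone sketch of that visibility effect using plain POSIX calls through the UnifyFS client. The filename is made up and the offsets just mirror the numbers above; it assumes a running server, a mounted /unifyfs, and exactly 2 ranks. Without the fsync on rank 1, rank 0's size query may report only 2128 bytes; with it, rank 0 should see the full 2208 bytes:

#include <mpi.h>
#include <unifyfs.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, nranks;
    char buf[80];
    memset(buf, 'x', sizeof(buf));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    unifyfs_mount("/unifyfs", rank, nranks, 0);

    /* rank 0 writes [2048,2128), rank 1 writes [2128,2208) */
    int fd = open("/unifyfs/size_test.tmp", O_CREAT | O_WRONLY, 0600);
    off_t off = (rank == 0) ? 2048 : 2128;
    pwrite(fd, buf, sizeof(buf), off);

    if (rank == 1) {
        fsync(fd);   /* without this, rank 0 below may see 2128 instead of 2208 */
    }
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        struct stat sb;
        stat("/unifyfs/size_test.tmp", &sb);
        printf("rank 0 sees file size %lld\n", (long long) sb.st_size);
    }

    close(fd);
    unifyfs_unmount();
    MPI_Finalize();
    return 0;
}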

adammoody avatar Nov 25 '20 18:11 adammoody

@adammoody , this example is now working for me too, with either 2 or 4 processors, and 2 nodes. I only got it to work yesterday after carefully looking at the run script that @CamStan had passed me the day before. In his script, after starting the server, Cameron has a "sleep 20" line.

My own script had "sleep 12" between starting the server and running the client. After I modified this to "sleep 15", my job then worked correctly: no execution errors, and a good SDS.h5 file produced.

I'm sure I had hit this problem in the past, but my recollection is that there was a hard-coded 10-second wait in the source, so waiting 12 seconds in the script should have been enough, and indeed it was when I ran with only 2 processors. But perhaps Catalyst, being a slower machine, presents some issues.

Anyway, I'm now back to the real HDF5 test, which is still giving me trouble in the return from the dataset creation function, so I think there might indeed be another bug somewhere, as Adam feared.

clmendes avatar Nov 27 '20 13:11 clmendes

I've spent most of this Friday working on some of the HDF5 tests from the chunkN collection (N=2,3,4,5), using both Unify from the old "new-margotree" branch and the current dev version created by PR #578. With "new-margotree", everything is still working as before (noting that for some tests, I have to insert the H5Fflush call in the app to get success). With the current dev version, some of the tests work, but cchunk5 still has problems when running on two nodes: it works fine on one node, but on two nodes, even with "sleep 20" in the job script, it produces an error when trying to create a dataset.

I have isolated that "cchunk5" test into an independent program, which reproduces the same kind of error that I get with the testsuite. This is all in just 3 C files, which should make it easier to debug than the entire testsuite with its full collection of 51 tests.

Until we get this error debugged in Unify's current version, I'll limit myself to using the "new-margotree" branch, which has limitations (e.g. it requires the H5Fflush call in the apps) but at least works correctly across multiple nodes!

clmendes avatar Nov 28 '20 06:11 clmendes

Given that this simple HDF5 example now works, I'll create another issue specific to the cchunk5 example, which is still failing (in a different way). That will allow closing this issue #577; the example seems to work in Cameron's tests too, so it has served its purpose.

I also used this example to report a problem I'm facing in the latest version of Recorder (after v.2.1.6), but that has nothing to do with Unify.

clmendes avatar Nov 29 '20 16:11 clmendes

@clmendes - I don't think that my bugfix will help your problem with the cchunk5 test, but it's possible I'm wrong. Can you try building the version of HDF5 from this PR: https://github.com/HDFGroup/hdf5/pull/138 ? It's mainly targeted at datasets with compressed chunks, so I'm not very hopeful, but let me know if it helps, or helps to track things down.

qkoziol avatar Nov 30 '20 04:11 qkoziol

@qkoziol , thanks for pointing me to this. I just gave it a try, but I still get the same error as before (i.e. the VRFY just after the call to H5Dcreate2 fails on ranks 0 and 1).

I downloaded your new branch, made sure that I had the correct H5Dmpio.c file, rebuilt HDF5, and rebuilt my example with it, but I still have the same problem. My example right now just invokes the very first call to coll_chunktest(), and it already fails there.

As I reported in our issue #580, the very first read operation (invoked inside the call to H5Dcreate2) goes fine on ranks 2 and 3, but it fails on ranks 0 and 1, for some still-unknown reason.

clmendes avatar Nov 30 '20 19:11 clmendes

Related issue, but not the cause. I stumbled into a deadlock while testing for this problem: HDFGroup/hdf5#118

@qkoziol , I meant to bring this topic up on one of our phone calls this week but managed to forget both times. I think there are a few spots where MPI I/O errors can cause HDF to deadlock or lead to other divergent behavior, where rank 0 takes a different path than the other ranks. This is low priority, since it is unlikely for those calls to ever fail, but I've hit this a couple of times since they are more likely to fail with unifyfs.

adammoody avatar Dec 05 '20 06:12 adammoody