
pthread readlock segfault

Open adammoody opened this issue 4 years ago • 5 comments

While searching for a simpler reproducer for another problem, which may be related, I hit a new segfault. I've written a test where processes interleave writing blocks of 1000 bytes, and each process writes 10,000 such blocks. Each process then reads back the entire file, reading in consecutive chunks of 1000 bytes.

Running a four process client job on two nodes, the server throws a segfault with the following stack trace:

Stack Trace
C    extent_tree_rdlock,  FP=2000680009d0 <-- same pthread lock (now called twice by same physical pthread?)
C    extent_tree_get_chunk_list, FP=200068000a60
C    unifyfs_inode_get_extent_chunks, FP=200068001b30
C    unifyfs_inode_resolve_extent_chunks, FP=200068002c80
C    find_extents_rpc,    FP=200068003e50 <-- new user-level thread spawned for find_extents_rpc
C    _wrapper_for_find_extents_rpc, FP=200068003ee0
C    ABTD_thread_func_wrapper_thread, FP=200068003f40
     make_fcontext,       FP=200068003f40
C    extent_tree_get_chunk_list, FP=000090 <-- blocked waiting on pthread read lock

From what I can see so far under TotalView, the server process was blocked waiting on a pthread read lock in extent_tree_get_chunk_list on this line:

https://github.com/LLNL/UnifyFS/blob/6010b5404071b4128b83328296865df309c0fdc0/server/src/extent_tree.c#L660

which calls pthread_rwlock_rdlock:

https://github.com/LLNL/UnifyFS/blob/6010b5404071b4128b83328296865df309c0fdc0/server/src/extent_tree.c#L469

Then it seems a user-level thread was started to service a find_extents_rpc request, which called the same extent_tree_get_chunk_list and attempted to acquire the same pthread read lock.

It seems that taking that lock a second time from the same physical pthread caused the segfault?

Maybe we need to be using Argobots locks instead of, or in addition to, the pthread locks?

Still just guessing on this.
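
To illustrate what I mean by Argobots locks, here is a rough sketch of an extent tree guarded by an Argobots read-write lock instead of a pthread one. The struct and function names are placeholders, not the actual extent_tree API. The idea is that ABT_rwlock_rdlock blocks only the calling ULT and yields to the scheduler, so other ULTs on the same execution stream can keep running and eventually release the lock, whereas blocking in pthread_rwlock_rdlock stalls the whole underlying pthread.

#include <abt.h>

/* placeholder struct, not the real UnifyFS extent_tree */
struct ext_tree {
    ABT_rwlock rwlock;   /* ULT-aware reader/writer lock */
    /* ... tree nodes ... */
};

int ext_tree_init(struct ext_tree* tree)
{
    /* returns ABT_SUCCESS (0) on success */
    return ABT_rwlock_create(&tree->rwlock);
}

void ext_tree_rdlock(struct ext_tree* tree)
{
    /* a waiting ULT yields, so other ULTs on this execution
     * stream can run and release the lock */
    ABT_rwlock_rdlock(tree->rwlock);
}

void ext_tree_unlock(struct ext_tree* tree)
{
    ABT_rwlock_unlock(tree->rwlock);
}

void ext_tree_destroy(struct ext_tree* tree)
{
    ABT_rwlock_free(&tree->rwlock);
}

int main(void)
{
    ABT_init(0, NULL);

    struct ext_tree tree;
    ext_tree_init(&tree);

    ext_tree_rdlock(&tree);
    /* ... walk the tree ... */
    ext_tree_unlock(&tree);

    ext_tree_destroy(&tree);
    ABT_finalize();
    return 0;
}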

Here is the client reproducer code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>

#include "mpi.h"
#include "unifyfs.h"

int main (int argc, char* argv[])
{
    int i;
    char mountpoint[] = "/unifyfs";

    MPI_Init(&argc, &argv);

    int rank, ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ranks);

    int ret = unifyfs_mount(mountpoint, rank, ranks, 0);
    if (ret != 0) {
        fprintf(stderr, "unifyfs_mount failed (rc=%d)\n", ret);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char file[256];
    sprintf(file, "%s/testfile", mountpoint);

    size_t bufsize = 1024*1024;
    char* buf = (char*) malloc(bufsize);

    size_t chunksize = 1000;
    int num_chunks = 10000;

    /* write phase: each rank writes its chunks at interleaved offsets */
    int fd = open(file, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
        for (i = 0; i < num_chunks; i++) {
            off_t offset = ((off_t)i * (off_t)ranks + (off_t)rank) * (off_t)chunksize;
            pwrite(fd, buf, chunksize, offset);
        }
        fsync(fd);
        close(fd);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    /* read phase: every rank reads back the whole file in consecutive chunks */
    fd = open(file, O_RDONLY);
    if (fd >= 0) {
        for (i = 0; i < num_chunks * ranks; i++) {
            off_t offset = (off_t)i * (off_t)chunksize;
            pread(fd, buf, chunksize, offset);
        }
        close(fd);
    }

    free(buf);

    unifyfs_unmount();

    MPI_Finalize();

    return 0;
}

To build against a UnifyFS install in ${install} that was built with bootstrap.sh:

mpicc -g -O0 -o lotsoreads lotsoreads.c -I${install}/include -L${install}/lib -Wl,-rpath,${install}/lib -L${install}/lib64 -Wl,-rpath,${install}/lib64 -lunifyfs_gotcha -L`pwd`/deps/install/lib64 -Wl,-rpath,`pwd`/deps/install/lib64

To run this, I reduced the UnifyFS log chunk size (since there are 10,000 extents from each process) and used these settings:

export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 \* 65536)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 64 \* 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$((9 * 2 ** 30))
export UNIFYFS_LOGIO_SPILL_DIR=/dev/shm/unifyfs

adammoody, Apr 29 '21 01:04

This one should go away with my svcmgr_offload branch work. In that code, all reqmgr and svcmgr requests are handled by their respective threads, not in the Margo RPC handler.
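
Roughly, the pattern looks like the sketch below. This is not the actual branch code (names and types are illustrative); it just shows an RPC handler pushing work onto a pool that a dedicated Argobots execution stream drains, so the extent lookups never run inside the handler itself.

#include <abt.h>
#include <stdio.h>

static ABT_pool svc_pool;        /* work queue drained by the service stream */
static ABT_xstream svc_xstream;  /* dedicated execution stream */

/* body of one offloaded request (request id is illustrative) */
static void service_find_extents(void* arg)
{
    int reqid = *(int*)arg;
    printf("servicing request %d on the dedicated thread\n", reqid);
    /* ... look up extents and respond to the rpc here ... */
}

int main(void)
{
    ABT_init(0, NULL);

    /* create a pool plus an execution stream that pulls from it */
    ABT_pool_create_basic(ABT_POOL_FIFO, ABT_POOL_ACCESS_MPMC,
                          ABT_TRUE, &svc_pool);
    ABT_xstream_create_basic(ABT_SCHED_DEFAULT, 1, &svc_pool,
                             ABT_SCHED_CONFIG_NULL, &svc_xstream);

    /* what an rpc handler would do: enqueue the request and return */
    int reqid = 42;
    ABT_thread_create(svc_pool, service_find_extents, &reqid,
                      ABT_THREAD_ATTR_NULL, NULL);

    /* wait for the service stream to drain its pool and shut down */
    ABT_xstream_join(svc_xstream);
    ABT_xstream_free(&svc_xstream);

    ABT_finalize();
    return 0;
}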

MichaelBrim, Apr 29 '21 13:04

@MichaelBrim, oh cool. That would be slick.

adammoody, Apr 29 '21 19:04

So far, with my svcmgr_offload branch, I cannot reproduce the segfault (as expected). However, this test really demonstrates a particularly bad performance case for us (small writes plus everybody reading the whole file). On two Summit nodes (4 ppn), it takes roughly 40 minutes to get through all 80,000 reads. We should think about adding some sort of server read cache to speed this up.

MichaelBrim, May 04 '21 21:05

See draft PR https://github.com/LLNL/UnifyFS/pull/627.

MichaelBrim, May 04 '21 21:05

Awesome! That's great news.

Good note about the performance, too. I don't know how many apps might be doing this, but this is trying to emulate the pattern in the HDF test case.

In this particular test case, lamination would help if we added it. That would eliminate a bunch of server-to-server find_extents RPCs in the read path.

Lamination could also help simplify read caching logic, though of course we don't have anything implemented for that yet.

adammoody, May 04 '21 22:05

@MichaelBrim, would this have been resolved by your Argobots threads work?

CamStan, Oct 31 '23 22:10

Yes, we no longer use pthread read-write locks on the structures that maintain file extents. Closing as resolved.

MichaelBrim, Nov 01 '23 13:11